Skip to content

Posts from the ‘OpenCore’ Category

Adaptively Controlling Apache Cassandra Client Request Processing

In Performance Measurement Truths – Rarely Pure and Never Simple and Reality, Reactivity, Relevance and Repeatability in Java Application Profiling we looked at the benefits obtained using adaptive measurement techniques in the performance analysis of Apache Cassandra. For this article we keep with the adaptive theme but this time shifting focus to how we can adaptively manage the performance of such systems rather than just monitor them, especially under high load stress (surge) conditions. Before proceeding it should noted that whilst many solutions claim to manage performance what they really offer is monitoring, which is seen as a way to influence the management of the system by way of human intervention (on a completely different time scale to the execution and lifecycle of the system).

Note: Managing performance in this article refers to the direct and immediate influence on the execution of the software by adaptive control valves.

Lets start with the script used to generate load on the client machine (Intel Core i7) via the sending of thrift remote procedure requests to the server running on a separate machine (Intel Core i7).

for i in {1..20}
do
   /Applications/datastax-cassandra-1.2.5/tools/bin/cassandra-stress -d 10.0.1.2 -o insert -t 20 -n 1000000 &
   sleep 15
done

Below is the performance model following completion of the benchmark. If you recall the previously reported averages for some of the probes listed below you will notice that across the board there has been a noticeable increase. For example StorageProxy.mutate had an average of 211 microseconds whereas its average here is 5,419 microseconds. This level of increase is present in nearly all hotspot classified probes. Clearly the client load placed the system under significant stress in relation to available processing capacity and additional co-ordination and memory management effort.

apache-cassandra-ace-insert-novalve

To help the system better absorb (and adjust to) the disturbance caused by high concurrent client request processing we can attach an adaptive control valve to one (or more) of the measurement entry points. The control valve will then adaptively manage the level of concurrent execution flowing through the methods (measurement points) associated with it. There is no need to blindly fiddle with thread pool sizes. Instead the valve will dynamically adjust in real-time the degree of concurrency based on direct observations and analysis of the measured execution behavior.

Here is how a “throughput” adaptive control valve can be configured in the jxinsight.override.config file to attach itself to the org.apache.thrift.ProcessFunction.process measurement point.

j.s.p.tvalve.enabled=true
j.s.p.tvalves=pf
j.s.p.tvalve.pf.name.groups=org.apache.thrift.ProcessFunction.process
j.s.p.tvalve.pf.lower=6
j.s.p.tvalve.pf.interval=2000

Running the load stress script with adaptive control management in place makes a significant difference in the overall processing clock time. The execution time within the server has dropped down from 232K seconds to 170K seconds. The average time for TBaseProcessor.process has dropped from 11.6 ms to 8.5 ms.

Note: The clock.time metering total is an aggregation of wall clock time measures across threads. With 100 threads executing simultaneously the same method, each for 10 secs, the total will be recorded as 1000 (1K) seconds.

apache-cassandra-ace-insert-tvalve

What you might find strange is that the average for ProcessFunction.process has remained roughly the same whilst many of the other hotspot probes have dramatically reduced their average clock time. The reason for this discrepancy is that the valve introduces “managed” delay of thread execution at the ProcessFunction.process measurement point, which is one of the methods called by TBaseProcessor.process, this in turn reduces the concurrent load and resource contention experienced by those probes (methods) called, directly or indirectly,  sequential or asynchronously. The other method called by TBaseProcessor.process for each of its invocations is TBinaryProtocol.readMessageBegin which has dropped from 71K seconds to 12K seconds. The drop here is because threads that call this method are temporarily suspended in the ProcessFunction.process method by the valve. In controlling the completion of the ProcessFunction.process method we also control the execution of TBinaryProtocol.readMessageBegin, because one follows the other in the execution of TBaseProcessor.process.

The delay that is dynamically introduced serves to stabilize and optimize the execution flow within the system. It should be noted that stability and optimization can at times be at odds with each other.

How about adaptively managing the execution of requests based on the measured response time at different workload levels? The “response time” adaptive control valve allows us to do just this with a very similar configuration as above. Simply replace tvalve with rvalve in the configuration.

j.s.p.rvalve.enabled=true
j.s.p.rvalves=pf
j.s.p.rvalve.pf.name.groups=org.apache.thrift.ProcessFunction.process
j.s.p.rvalve.pf.lower=6
j.s.p.rvalve.pf.interval=2000

With the “response time” valve there is a further reduction in the overall server request time processing. This is to be expected considering this particular adaptive control valve looks to reduce the average response whereas the previous valve can trade increased response time for increased throughput up to a point as both are intertwined.

apache-cassandra-ace-insert-rvalve

The average response time for TBaseProcessor.process is 8.1 ms compared to the initial baseline 11.6 ms. Again we see that the average response time for ProcessFunction.process has remained relatively high in comparison with other hotspot classified probes as delay is dynamically introduced by the valve to ensure greater stability in the average response time per second.

What of the client times? Well for the non-managed benchmark run it took a total of 14 mins 5 seconds to complete. With the “throughput” adaptive control valve the benchmark took 11 minutes 11 seconds. For the “response time” adaptive control valve completion took a tiny bit longer at 11 minutes 14 seconds.

Below is the CPU profile for the client machine. From the charts you can see that with no control there is far more jitter observed across the many concurrent client processes.

apache-cassandra-ace-insert-client-cpu

Apache Cassandra is optimized for very fast and highly available data writing. That said this does incur insignificant allocation and co-ordination costs and can result in relatively high garbage collection cycles if workload is not managed effectively. Adaptive control valves are extremely effective if not a necessity in any large scale deployment but what of read operations, which are generally relatively cheap in most NoSQL solutions? Can adaptive control valves be as just as effective with data reads?

/Applications/datastax-cassandra-1.2.5/tools/bin/cassandra-stress -d 10.0.1.2 -o insert -t 20 -n 1000000
for i in {1..20}
do
   /Applications/datastax-cassandra-1.2.5/tools/bin/cassandra-stress -d 10.0.1.2 -o read -t 20 -n 1000000 &
   sleep 15
done

Here is the performance model following completion of the above benchmark without any adaptive control valves in place within the Apache Cassandra server runtime. The benchmark took 8 minutes and 23 seconds to finish. The average time for a request processing cycle within the server took 4.2 ms. This is under high concurrent load that heavily taxes all available computing resources on the server machine.

apache-cassandra-ace-read-novalve

With the “throughput” adaptive control valve installed the benchmark finished in 8 minutes 5 seconds. A wall clock savings of 18 seconds. The actual clock time processing on the server was reduced by only 5K seconds. What is interesting is the average clock time for ProcessFunction.process increased compared to the baseline whilst all other hotspot classified probes saw a substantial reduction. Basically we have the choice of incurring delay at one particular managed execution point or at all possible execution points that are further down in the execution path and could potential hold onto resources (memory, monitors) much longer than we would like.

apache-cassandra-ace-read-tvalve

The “response time” valve would appear to be far more effective at reducing overall processing time with a drop of 13K seconds in overall server request processing. From the client perspective the savings was far greater with the benchmark finishing in 7 minutes 47 seconds compared to 8 minutes and 23 seconds. A 36 second reduction in wall clock time. For the TBaseProcess.process method the average has dropped from 4.2 ms to 3.6 ms. The average time for ProcessFunction.process is still higher than the non-managed benchmark run but lower than the “throughput” one.

apache-cassandra-ace-read-rvalve

Performance Measurement Truths – Rarely Pure and Never Simple

In Reality, Reactivity, Relevance and Repeatability in Java Application Profiling we looked at the multiple realities proposed by different performance profiling solutions and performance measurement techniques. Hopefully in that article you would have seen how difficult it can be for a novice to discern which of the measurement models most closely resembles the truth though the degree of measurement overhead is a good starting point. Naturally we think that some approaches to measurement, such as ones based intelligent adaptive technologies, are far better but even then we still must be careful in how we execute the benchmark and how we configure the adaptive measurement as these can give us multiple realities even with the very same measurement solution. In this article we look at the impact of different adaptive measurement thresholds as well as the impact that slow or fast ramp up of concurrent workload has on a benchmark and its performance model.

The Apache Cassandra project just this week announced their latest release, version 1.2.5, which we have opted to use for the benchmark so that our findings can be put to good use by the NoSQL community behind this project. Unlike the previous benchmark, the client stress load driver will be run on a separate machine and the client insert thread count increased from 1 up to 20.

Here is the first script used.

for i in {1..20}
do
./tools/bin/cassandra-stress -d $DSC_HOST -o insert -t $i -n 1000000
done

By default the JXInsight/OpenCore adaptive measurement (metering) engine uses a hotspot extension based on a concept of scorecard which determines whether the engine continues to measure a particular probe (method) by way of credit balance that is adjusted based on timing threshold comparisons. The default threshold values are set with the typical web/enterprise application in mind so the first thing that needs to be done is to drop down the thresholds (in microseconds) from 100 and 5 to 10 and 1 respectively.

j.s.p.hotspot.threshold=10
j.s.p.hotspot.threshold.inherent=1

Just like in Reality, Reactivity, Relevance and Repeatability in Java Application Profiling, we only instrumented the following packages, though in the results presented below we have filtered the view to only include those probes in the org.apache.cassandra. The reason for this is that inclusion of additional packages in the instrumentation provides us with more accurate inherent (self) totals but when it comes to what can be changed the focus should be on the projects own code base.

org.apache.casssandra.
org.apache.thrift.
org.github.jamm.

Below is a relatively short listing of the hotspots found with the measurement thresholds set to 10 (inclusive) and 1 (exclusive). From this listing we get an initial outline of the main execution points in the processing of a request from the client stress tool. Even without a call tree, but instead looking at the Count, Total and Total (Inherent) column values, it is easy to form an educated guess that RowMutation.apply (directly or indirectly) calls Table.apply which in turns calls ColumnFamilyStore.apply. Don’t worry if this is not immediately obvious because it is not of great importance in the rest of this article though understanding the difference between inclusive and exclusive in terms of measurement time in a caller-to-callee chain would help when we see how the inherent total (and average) shifts as thresholds are changed.

cassandra.1.2.5.rampup.t10.it1

What if we reduce the threshold? What impact will this have on the reality proposed by the measurement engine? First of all we must be careful not to introduce significant measurement overhead that radically alters the execution dynamics of the software, especially in a highly concurrent environment. There is no magic (though self-adaptation can at times appear so) that eliminates measurement overhead entirely, that only every happens in sales & marketing literature of vendors who think that everyone runs PetStore style apps in production or have never tuned a SQL statement before using standard database tooling. More importantly, at least in the context of this article, is that any changes in thresholds can result in some probes, previously dynamically disabled remaining measured until the very end of the benchmark and in doing so alter (reduce) the inherent total and inherent averages of their immediate callers who previously assumed such timings in their own totals. This alteration (reduction) of the caller inherent total means that some probes will drop down in sorted table listing and in some cases cause the callers to be disabled though setting the inherent threshold to 1 mitigates this somewhat.

Below is a revised listing with the property, j.s.p.hotspot.threshold, set to 5 microseconds. There are a number of new probes appearing still enabled and at the top of the metering table following completion of the benchmark, which performed 20 iterations of the scripted loop shown above. The most notable new probes include CassandraServer.createMutationList, PrecompactedRow.merge, SSTableWriter.append and CommitLogSegment.write. You will notice how StorageProxy.mutate and CassandraServer.batch_mutate have dropped down the table listing because of measurement occurring far longer at lower levels in the caller-to-callee chain.

Note: The number of probes measured with the hotspot threshold set to 10 microseconds was 470 million. With the threshold set to 5 this number increased to 782 million.

cassandra.1.2.5.rampup.t5.it1

Dropping the threshold down to 2 microseconds, which is what we currently regard as the borderline before needing to switch to something entirely different such as Signals, offers up more potential hotspots such as AtomicSortedColumn.addAllWithSizeDelta, MergeIterator$ManyToOne.computeNextOnDiskAtom$Serializer.deserializeFromSSTable as well as LatencyMetrics.addMicro. The last one is interesting in that it is actual instrumentation that has been built directly into Apache Cassandra since version 1.2 that delegates call execution to an open source Metrics libraries covered our article series titled “How (not) to design a Metrics API”.

Note: With the threshold set to 2 the number of probes measured (metered) increased from to 1,320 million.

Comparing this model with the previous model we not only have additional probes listed, we also have a number of probes that have moved down significantly in the table ranking such as PrecompactedRow.merge and CommitLogSegment.write, which previously had a near equivalent total and inherent total column value.

cassandra.1.2.5.rampup.t2.it1

All of the above models reflect the truth to some degree. Many of the items listed above do appear in all three hotspot probe sets. What we have traded in setting different thresholds is coverage, cost (accuracy) and comprehension. This changes are good starting points for performance tuning in the caller-to-callee chain

Measurement will always contaminate (or distort) the underlying software performance execution to some degree. This also applies to instrumentation that is already present within the application code base (a reason why some less informed individuals falsely claim that DTrace has little or no overhead). The skill is in managing this overhead via runtime intelligence and measurement control set points. The art is in knowing which information quality attributes to trade and at what point in the performance investigation and ongoing monitoring of application and its services. It is natural (though not always productive) to want to collect more information in any performance assessment, but we must be careful in trying to get close to such hotspots to not get pulled into some reality distortion field. It can be far more efficient, both online and offline, to just simply identify suitable and safe landing points within the code base from which to explore the remaining and deeper terrain especially within a static plane such as an code editor. Again success depends on getting close enough, but not too close.

cassandra.benchmark.opencore.adaptive.approach

Up to now we’ve looked at how performance models can be impacted in the setting of different measurement control set points (thresholds) within the adaptive measurement engine. What about the nature of the benchmark load itself? Does the slow incrementally ramp up of load from 1 thread to 20 threads, with each scripted loop iteration, create a different performance model than when instead we start at, and remain at, 20 threads throughout the benchmark?

for i in {1..20}
do
./tools/bin/cassandra-stress -d $DSC_HOST -o insert -t 20 -n 1000000
done

Below is the performance model derived following execution of our altered benchmark script with the threshold and inherent threshold set points reverted back to 10 and 1 respectively. This is much more different than the very first hotspot probe listing presented above that used the same threshold set points. In fact it has some of the additional hotspots uncovered when the threshold was lowered from 10 to 5 and then 2 microseconds.

cassandra.1.2.5.no.rampup.t10.it1

Why would the performance model be so different depending on whether we slowly ramp-up concurrency or go directly to the highest level defined for the test? There are a number of possible causes but the primary one being the adaptive measurement engine itself. With the slow ramp up of load many of the probes listed below as hotspots would have been measured under optimal (if not unrealistic) workload conditions with no contention for resources such as the processor, object monitors and the memory system. The measurement timings would have been understated and the adaptive measurement engine would have disabled many of the probes, which in time and under more realistic production workloads represent significant contributors to the response timings and throughput. With high load very early on in the benchmark we quickly tax the memory and processor systems and thus those probes with high allocation or memory/cpu costs rise to the top much quicker.

Lowering the threshold from 10 to 5 reveals a newer reality to our performance model.

cassandra.1.2.5.no.rampup.t5.it1

Even with the most advanced measurement solution in the world, performance tuning is at best a complex challenge that cannot be easily simplified. Unless you are the one who created the project (and the problem) it can be extremely difficult reasoning about the software execution behavior of distributed systems when you know that it is simply impractical if not impossible to obtain the exact information you need both to identify and resolve the problem (which can be two very different tasks)….each and every time.

Whilst we strongly believe that technologies such as Signals, QoS for Apps and Adaptive Control can actively manage the performance and improve resilience of distributed systems with very little human intervention we don’t believe the same applies to the performance analysis of such systems. And yes this runs counter to nearly everything else that is promoted by other vendors in the application performance management space, but then again it is not like these vendors are trying to sell you the truth.

Commentary: Since the publication of our benchmark challenge not a single application performance monitoring vendor has come forward with any sort of attempt at beating or matching our performance analysis results and low overhead.

Commentary: It is because of these multiple sensed realities (reality is itself a computation performed within our own minds) that we hold little regard for other application performance monitoring solutions that claim to “put out fires in production within 15 minutes”. Unless you are dealing with extremely low hang fruit, day in and day out (which begs another far more serious and troubling question), such statements represent sheer utter marketing nonsense and completely discredit the work of professionals and scientists in the performance monitoring and management field that are actively (re)searching for new ways to solve the hard task of monitoring and management ever increasing complexity in production systems especially in the cloud.

Appendix A – (Some) Hotspot Code Listings

cassandra.1.2.5.codelisting.1

cassandra.1.2.5.codelisting.2

cassandra.1.2.5.codelisting.3

cassandra.1.2.5.codelisting.4

cassandra.1.2.5.codelisting.5

cassandra.1.2.5.codelisting.6

JXInsight/OpenCore 6.4.9 Released

We are pleased to announce the ninth update to JXInsight/OpenCore 6.4. In this release we have added a new visualization to the metering table when the queue probes provider is enabled within the connected metered application runtime. The live “queue” visualization, which is displayed at both the global and selected probe levels, tracks the number of threads concurrently executing an invocation of any metered method (probe) or of a particular method (selected probe).

jxinsight.opencore.6.4.9.queue.liveview

This visualization, based on a sampling of the queue size every 1 second, is extremely useful in understanding the concurrent production workload patterns within your application.

jxinsight.opencore.6.4.9.queue.liveview.line

The probe level visualization will automatically appear in the console when the queue probes provider is enabled by adding the following property to the jxinsight.override.config file.

jxinsight.server.probes.queue.enabled=true

In shorthand form this can be written as

j.s.p.queue.enabled=true

For the global level visualization you must also enable metering group aggregates, which we recommend only doing so in a test environment or in production within a Simz runtime for applications with high frequency and concurrency.

j.s.p.aggregates.enabled=true

The latest update can be downloaded from our developer site.

JXInsight/OpenCore 6.4.8 Released

We are pleased to announce the eight update to JXInsight/OpenCore 6.4. In this release we have added an exciting new visualization to the metering table when the stack probes provider is enabled within the connected metered application runtime. The “stack” visualization, which is displayed at both the global and selected probe levels, makes it fast and easy to observe hung or stalled threads.

jxinsight.opencore.6.4.8.stack.liveview

This unique visualization is only possible due to the metering engine creating a distinct identifier for each stack frame that is pushed (or popped) within a thread’s execution context. No other application performance monitoring solution can offer this today.

jxinsight.opencore.6.4.8.stack.liveview.line

Each column bar in the visualization represents a live thread. When colored brown, a thread probe stack has remained unchanged (no new identifiers) across two samples which by default are taken every 1 second. When colored orange, the stack has been present across two samples but has also had frames (identifiers) added or removed. When grey colored, a brand new stack (though it could be a similar call path) has been observed. A colored white, indicates no stack collected. A cream colored column indicates a stack was not collected but previously this thread had a repeat observation.

Note: To enable the stack probes provider simply add j.s.p.stack.enabled=true to the jxinsight.override.config used by the metered runtime.

We have also enhanced the adaptive mechanism used by the wvalve probes provider in this release as well as adding some miscellaneous performance optimizations to the metering engine.

The latest update can be downloaded from our developer site.

Be Informed Xperience: Service Management for an Adaptive Business Platform

Yesterday at the Be Informed Xperience 2013 event, held in the Louwman Car Museum Den Haag, we launched a special edition of our adaptive software monitoring and management platform for the Be Informed business process platform. During the event we also introduced a new branding of our innovative technology suite and previewed some new and exciting platform add-ons in development, which are scheduled for release later in the year.

beinformed.xperience.2013.collage

Both co-founders of JINSPIRED, Rod Peacock and William Louth, were present to discuss the importance of visibility and shortened feedback loops in the push for greater business agility, which is driving the adoption of model driven development technologies and adaptive case management solutions such as the Be Informed platform.

The custom instrumentation extension we have built, in co-operation with Be Informed and Taxonic BV, uses the JXInsight/OpenCore Probes Open API to dynamically instrument the Java bytecode of the business processing platform and then at runtime extract the case model context that is executed by the platform in response to a web application request or business event (rule) trigger. The model context, which typically consists of the case type and other model elements such as event handlers, is then used in place of the actual Java class/method name to represent a metered activity (probe). The names of the probes reflect the labeling that is defined by the architects and developers of the model within the Be Informed modeling studio. Importantly any changes in the model are immediately reflected in the activity metering model within the monitoring console.

jxinsight.opencore.beinformed.case.mgmt.live

Because most of our solutions and technologies are based on (or augment) the OpenCore activity metering engine, once the instrumentation is in place customers of the Be Informed platform can not only monitor the performance and scalability of their model driven applications but also control them via our “Adaptive Control in Execution” and “QoS for Apps” technologies. What is especially useful is that the control valves and QoS classifications can be attached to metered activity (probe) execution based on the labels (and language) used in the case model. Be Informed customers can also record and playback the metered execution of their case models and simulate the behavior across many application instances in near real-time all from within a single managed runtime.

jxinsight.opencore.beinformed.case.mgmt.txtree

If you are interested in learning more about adaptive case management and the importance of near real-time feedback loops we recommend the following books.

masteringtheunpredictable.front.coverthetwosecondadavantage.front.cover

Mirror, Mirror within the (Simulated) Shell. Who is the Fairest of Them All?

Have you ever wondered what underlying software execution occurs as you interact in real-time with a software interface. For example, an interactive shell that allows you to write and run programming language scripts or code snippets. Some tools have in the past offered a debug or trace facility but one problem with such facilities is the use of the same output real estate that is being used both to input and output the script lines and the results of such line executions. Sometimes you want the output of the normal script to appear in a separate window pane so you can perform more easily comparisons across different execution outputs without the input mingled and mangled with the output. When it comes to trace/debug output this becomes a necessity. Another problem with sharing the same output stream and same target window pane is that it can have a performance impact on the execution of a script. Generally most observations of such interactions don’t need immediate output feedback in the range of single digit microseconds.

Well near real-time simulation makes all this extremely simple to achieve and without any changes to existing code. To demonstrate how easy this truly is a screen video has been created that shows the replaying of the underlying software execution that occurs when simple script evaluations are performed interactively within two dynamically instrumented shells. You will be amazed at how much work is actually done when 1+1 is evaluated in both shells.

jxinsight.simz.mirror.mirror.in.the.shell.overview

To output the instrumented calls that are relayed from the shell JVM to the Simz JVM a very simple ProbesInterceptor implementation class has been created, that prints a line for each call entry and exit as well as the clock time post completion of the call. This metering extension could be installed in the actual shell runtimes but instead here it is installed within the Simz runtime. Irrespective of target runtime the code below would not have to change in anyway as both the real (or virtual) machine and simulated machine appear identical to the metering extension.

jxinsight.simz.mirror.mirror.in.the.shell.code

Code Isolation
The implementation above is not at all efficient. I could have used a char[] buffer across calls to minimize the amount of garbage collection behavior that would have been driven by the creation of a StringBuilder on every call entry and exit point. Such an impact would be completely unacceptable in the real application runtime (though many loggers do far far more damage) but then again I did not install the interception code in the actual application runtime. I installed it within the simulated runtime, whose (mirrored) reflective behavior (of the real application) is asynchronous. In fact I could have simply recorded the metering feed and then played it back at another time. This is the immense power of simulation and playback. We can change both time and space.

Here is an example of the output produced by the above code.

--->org.jruby.internal.runtime.methods.DynamicMethodFactory.newJittedMethod
---->org.jruby.internal.runtime.methods.JittedMethod.<init>
----->org.jruby.compiler.impl.StandardASMCompiler.getStaticMethodArgs
-----<org.jruby.compiler.impl.StandardASMCompiler.getStaticMethodArgs [6]
----<org.jruby.internal.runtime.methods.JittedMethod.<init> [10619]
---<org.jruby.internal.runtime.methods.DynamicMethodFactory.newJittedMethod [10725]

To install the custom probes interceptor into the Simz service metering runtime add the following system properties to the jxinsight.override.config file used by it.

j.s.p.interceptor.enabled=true
j.s.p.interceptors=print
j.s.p.interceptor.print.factory.class=\
    org.jinspired.jxinsight.probes.ext.interceptor.print.InterceptorFactory

In the screen video recording you will see the Simz service launched in the bottom terminal window and then various code snippets executed in the two tabs within the top window pane. Prior to various interactions with the interactive shells the Simz window pane is cleared so as to give you a better understanding and appreciation of how much code is executed in performing such primitive interactive operations and evaluations.

jxinsight.simz.mirror.mirror.in.the.shell.screen

Our Simz technology has been designed to handle production loads across thousands of threads and hundreds of Java runtimes. But scaling is not just a one way street. Well engineered solutions go in both directions when needed. It is always rewarding to see how a novel solution can scale on many dimensions including versatility and applicability.

The screen video recording is also available on YouTube and ideally should be viewed in full screen HD mode.

JXInsight/OpenCore 6.4.7 Released

We are pleased to announce the seventh update to JXInsight/OpenCore 6.4. In this release we have devoted considerable engineering and testing effort to enhancing all adaptive (software processing) valve extensions (tvalve, rvalve, and qos-valve) with new adaptive control mechanisms and hill climbing algorithms that offer improved robustness as well as faster and smarter adaptation to very volatile workloads passing through valves. Many new system properties have been added that control the rate of ascent and descent as well as injecting (artificial) disruptions following a period of stability and stagnation. In addition we have changed the defaults of various common system properties across valve extensions based on our experience in the field applying this technology to diverse execution environments.

We are also introducing a brand new adaptive control valve metering extension, wvalve, providing an universal adaptive execution controller that optimizes concurrent processing through a valve based on a sampling of the average “work” time for completion of current calls and the expected wait time of enqueueing calls. This new adaptive control extension optimize both response time and to some degree throughput through its measurement of queue busyness (or lack of).

An advantage this adaptive control metering extension has over the rvalve extension is that you don’t need to have the clock.time meter included in the metering model. In fact you can turn off metering of probes completely using the following system property in the rare case that this is required.

jxinsight.server.probes.wvalve.forwarding.enabled=false

or in short form

j.s.p.wvalve.forwarding.enabled=false

Note” This non-metering behavior is only applicable to the probes matching the name.groups property.

Unlike the rvalve metering extension this adaptive control extension does not include the wait time incurred by the adaptive control mechanism within the valve in its assessment of the call response time. The response time reflects more of the actual work (service) time as well as the work rate of the valve itself.

We have added numerous control set points to the wvalve probes provider to allow fine tuning of nearly every adaptive aspect of its operation but for the most part one need only specify the following system properties.

jxinsight.server.probes.wvalve.enabled=true
jxinsight.server.probes.wvalves=${v},${v},...
jxinsight.server.probes.wvalve.${v}.name.groups=...

or in short form

j.s.p.wvalve.enabled=true
j.s.p.wvalves=${v},${v},...
j.s.p.wvalve.${v}.name.groups=...

The latest update can be downloaded from our developer site.

JXInsight/OpenCore 6.4.6 Released

We are pleased to announce the sixth update to JXInsight/OpenCore 6.4. This release adds two new “live” visualization to the metering table view that track the active metered thread count and average probe count per thread in 1 second time slices at both the runtime as well as the selected probe levels.

Below is a sample screenshot taken during the execution of the following benchmark script against an DataStax (Apache) Cassandra server.

for i in {1..12}
do
  ./bin/cassandra-cli -f benchmark.cli
  sleep 2
  ./tools/bin/cassandra-stress -o insert -t $i -n 100000
done

From the timeline visualization in row 3 you can observe, from right (past) to left (present), the gradual increase in the active metered thread count at the runtime level. But in the timeline visualization in row 8 the count remains constant at 32 threads for the selected probe, Table.apply, even as the client thread concurrency is increased and consequently the duration of the burst period shortened (also shown beautifully below). Even with just one thread specified in the client benchmark driver, work is still shared across the 32 threads as determined from our 1 second time slices. There does not appear to be any affinity in the underlying worker thread pooling implementation with regard to recent (hot) request processing.

jxinsight.opencore.6.4.6.thread.timelines

The latest update can be downloaded from our developer site.

Parallels in Big Data Processing – LHC at CERN

CERN, the European Organization for Nuclear Research, has published a wonderful animated video showing how it tackles big data processing challenges in their probing of the fundamental structure of the universe. Below are some frames taken for the must-see video that callout the significant challenges as well as major architectural components of the solution. What is of interest in the context of application performance monitoring and management is that many of the technical aspects bear a strong resemblance to the approaches we have taken in metering applications, such as intelligent activity metering for dynamic runtime event selection and discrete event simulation for both online and offline playback, observation and analysis.

“The Large Hadron Collider (LHC) produces millions of collisions every second in each detector, generating approximately one petabyte of data per second. None of today’s computing systems are capable of recording such rates, so sophisticated selection systems are used for a first fast electronic pre-selection, only passing one out of 10,000 events. Tens of thousands of processor cores then select 1% of the remaining events for analysis.”

cern.lhc.big.data.parallels

If you have followed our Scala Compiler case study you would have seen how we started with 16.5 billion metered calls and then intelligently whittled this down to 1 million metered calls using various activity metering extensions. That is 1 event in every 16,500 events. This “sophisticated selection system”, that processes and selects event data as close to the source as is possible, ensures that we can meter the activity of the most demanding applications existing today including low latency trading and (online) gaming. This capability is never more evident in our recent benchmark analysis of Java performance profiling solutions.

Note: The primary reasons for “time machine” application debuggers failing to scale beyond simple HelloWorld/PetStore style applications are (1) they collect state which is generally an “all or nothing (really useful)” proposition and (2) they don’t learn to adapt instrumentation and measurement during the running of an application.

What about event storage and its distribution? In the official announcement of JXInsight/OpenCore 6.4 we declared, and independently verified with assistance from Qualogy, how OpenCore managed to transmit metering event feed to a single JXInsight/Simz runtime service at a rate of 20 million metered calls (40 million events) per second across a cluster of 20 application runtimes – all running on the Oracle Exalogic stack. The bandwidth needs for this throughput rate are 440MB/s and our goal this year is to nearly double that, hitting close to the limit you can reasonably expect to achieve on a 10G network. The recorder metering extension for JXInsight/OpenCore has also demonstrated very impressive local disk write throughput rates though we still feel there is room for even more impressive performance and throughput so watch this space.

But the most striking similarity in all of this is that all this big data processing is used to drive simulations that are run from remote sites and analyzed by users. Users are not necessarily querying large databases to extract information but instead using the event data feed to drive simulations which answer proposed questions. The simulated playback of observed event behavior is far more important and useful than the data itself – what we like to call application memories.

cern.lhc.big.data.parallels.simulation

Reality, Reactivity, Relevance and Repeatability in Java Application Profiling

“Instrument error refers to the combined accuracy and precision of a measuring instrument, or the difference between the actual value and the value indicated by the instrument (error)…a way to deal with instrument error may be to reduce the reactivity of the system to being measured…reactivity is a phenomenon that occurs when individuals (systems) alter their performance or behavior due to the awareness that they are being observed….accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity’s actual (true) value…precision of a measurement system, also called reproducibility or repeatability, is the degree to which repeated measurements under unchanged conditions show the same results.” – Wikipedia

This article was written in response to a message thread on the LinkedIn “Java Performance Professional” group in which the effectiveness of instrumentation based Java Profilers solutions was called into question in profiling well tuned systems including high frequency low latency message processing platforms. Here is a comment from a performance analyst at a very big and blue IT management software and services company

“On a sufficiently large java application, many profilers either have prohibitively large overhead; some simply cannot handle the volume of data.”

Every time a performance optimization is proposed and temporarily applied to our codebase we go through a series of evaluation tests and questions that need to be answered before deciding whether the change stays and gets published. First we run a number of micro-benckmarks to see whether a reduction in our measurement overhead can indeed be observed and whether this reduction is present in both single threaded and multi-threaded benchmark runs. We also look to see whether the optimization is fair to all measurement points and flows (paths). We need to understand whether it introduces any hotspot bias especially as our metering engine is highly adaptive and sensitive to measurement changes that impact the degree of “inherent” (self) metering assignment to probes (methods); it is this very aspect which makes our solution effective, efficient and exact. We then move onto more realistic and slightly less repeatable benchmarks in which we try to determine how much of an improvement does the optimization achieve in reality. Part of this effort involves comparing performance models (snapshots) taken with previous versions of the metering engine software with the new, and hopefully more optimized, version. We use various internal tools, built on top of our Probes Snapshot Open API, to access and compare the data but sometimes simply opening up both snapshots and looking at visualization can already instill some degree of confidence; accepting that most realistic benchmarks are never entirely reproducible or highly adaptive systems predictable. Below is a screenshot comparison of two snapshots created by different releases of our metering runtime. A great deal of similarity in the visualizations.

cassandra.benchmark.sidebyside

Of course this type of comparison does not mean the measurement solution is anymore accurate. It just means that the measurement error, which can be systematic or random, is consistent (or stable). So occasionally it is good practice to compare your measurement results with results generated with other performance measurement solutions and techniques. Whilst this sounds simple, in practice the reality is very different; in fact you face the prospect of multiple realities and then deciding which one is the more accurate (or less wrong) version. Before proceeding any further ask yourself when was the last time you used more than one performance profiling solution to determine the validity of a profilers results. If you have been fortunate to always get called in to identify “low hanging fruit” then this thought might not have ever crossed your mind. But if you have been tasked with measuring and tuning an existing well optimized software you might have opened your eyes to the fact that all measurements invariably have some degree of error (inaccuracy and imprecision); some more than others.

Note: In our Scala compiler case study this aspect of performance measurement accuracy (and relevance) was tackled through the creation of multiple performance models each one with different strengths and weaknesses.

In the rest of the article we compare the results generated by our adaptive metering engine with two other popular but largely non-adaptive Java profiling solutions. But first lets present the steps needed to run the benchmark that was used to generate the results presented.

1. Download the DataStax Community Edition of Apache Cassandra. We used version 1.1.6 though 1.2 is now available.

2. Create a file named benchmark.cli in the top directory of the Cassandra install with the following contents.

drop keyspace Keyspace1;

3. Create a file named benchmark.sh in the same directory as the benchmark.cli file with the following script code.

for i in {1..20}
do
  ./bin/cassandra-cli -f benchmark.cli
  sleep 5
  ./tools/bin/cassandra-stress -o insert -t 1 -n 500000
done

4. Open up a terminal window and launch the Apache Cassandra server with ./bin/cassandra

5. Open up a new terminal window and start the benchmark with ./benchmark.sh. You should see the following printed out in 20 repeated cycles.

total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
74446,7444,7444,1.184482712301534E-4,10
170825,9637,9637,9.477168262795837E-5,20
267589,9676,9676,9.498367161340994E-5,30
365831,9824,9824,9.234339691781519E-5,40
459426,9359,9359,9.840269245151984E-5,50
500000,4057,4057,9.589885148124415E-5,55

6. To run the Apache Cassandra server with a profiling agent you can restart the server with the JVM_OPTS environment set. Here is how this is done with JXInsight/OpenCore.

export JVM_OPTS="-agentpath:/OpenCore/bin/osx-32/libjxinsight.jnilib=prod -javaagent:/OpenCore/bundle/java/default/opencore-ext-aj-javaagent.jar -server"

We performed 4 benchmark runs. The first without any profiling agent installed, which we have labeled as the baseline. We used the latest version of JXInsight/OpenCore as well as the latest releases for JP and NBP. For each profiler we limited method level instrumentation to the following packages:

org.apache.casssandra.
org.apache.thrift.
org.github.jamm.

For NBP we unchecked the option that limited thread profiling to 32 threads as the Apache Cassandra server creates a large number of threads in processing the requests generated by the stress tool.

For JXInsight/OpenCore we created a jxinsight.override.config file in the same directory as the benchmark.sh script file with the following contents. Note we measure and report in microseconds.

# changed default hotspot threshold from 50 to 10
j.s.p.hotspot.threshold=10
# changed default hotspot inherent threshold from 5 to 1
j.s.p.hotspot.threshold.inherent=1

All profilers used instrumentation based measurement of the clock time with as many extraneous options turned off. After trying to run the benchmark with JP or NBP we had to increase the sleep time in the script to 20 and 30 seconds because the Apache Cassandra server would generate an error if it had not finished the drop keyspace command before the stress tool was run. We also had to change the for loop for both JP and NBP dropping down the count from 20 to 4 because of the long duration of each benchmark cycle.

From the results outputted by the stress tool we took a sample of the top N throughput results in each stress cycle. Here is a statistical summary.

cassandra.benchmark.thruput.combo.chart

So how exactly did JXInsight/OpenCore outperform the other solutions by such a huge margin? Well just like the JVM hotspot compiler our metering engine is adaptive and uses the measurement data to decide whether to continue measuring a particular probe. It also moves the instrumentation and measurement up to the most appropriate point in the call execution to measure the software execution from; it does this very intelligently, ensuring what continues to be metered is most relevant. Here is a graphic that can help understand this process. If you want more background information please check out this article.

cassandra.benchmark.opencore.adaptive.approach

In the remaining part of this article we will simply present the different realities offered up by each Java profiling solution, for now ignoring the fact some of these tools have changed the throughput of the system so drastically that it could be viewed as an entirely different system. Which “reality” do you think more accurately reflects the places in the code that represent the execution hotspots? Which “reality” do you think more accurately reflects the places in the code that represent the locations from which performance optimizations should be driven? These two need not necessarily be the same. Well if you think you have the right answer to both of these questions then please create a posting on your own blog and link to this article. We will track the links and if the analysis seems reasonable and proven we will link back.

Below is the metering table generated after the benchmark run with JXInsight/OpenCore, listing “hotspot” probes still being metered following completion. The “Total” column is the cumulative clock time total for a method including the time spent executing other methods some of which may and may not be measured. The “Total (I)” is the cumulative clock total for a method excluding the time spent executing other instrumented methods that are still being metered. If a method is adaptively disabled for measurement purposes then the subsequent inherent clock time readings will be assigned to its immediate callers, unless they too are also dynamically disabled. The package names have been abbreviated in the view.

jxinsight.opencore.6.4.5.cassandra.benchmark.more

Here are the hotspots identified by the JP profiling solution. You will notice that the invocation for SimpleCondition.await method is only 2 million compared to 10 million with JXInsight. That is because we only looped for 4 iterations because of the significant slowdown in execution of the benchmark. Without the cumulative clock time we found this data result incredibly hard to understand and relate immediately to our results and the execution behavior of the system.

Note: That some of the methods listed with high inherent times don’t actually execute as part of the service request processing.

jp.7.2.2.cassandra.benchmark

Here is the hotspots identified by the NBP profiling solution. Even though both JP and NBP are non-adaptive and performed relatively similar on initial inspection the results look to be different. As was the case for JP we had to change the for loop to only iterate for 4 cycles so you will see 2 million instead of 10 million listed as a frequency against db model related methods. An interesting observation is that nowhere is there a hint to what was actually being performed at the Apache Cassandra model level. The NBP results seem to flag IO as the bottleneck, which is not a revelation, but we miss important insights into which callers are the most relevant in driving this bottleneck behavior – the appropriate call points where we should look to eliminate IO.

nbp.7.3.cassandra.benchmark

We are not going to go over the many reasons why call stack sampling is not terribly effective in understanding and managing the performance of systems in production. We have covered this at great length on this site. But as an exercise we did run the benchmark with call stack sampling. Below the throughput results are compared with the baseline and our instrumentation based solution which collects far more useful data related to frequency and latency.

cassandra.benchmark.thruput.sampling.chart

Here are the hotpot methods identified by the JP call stack sampling approach. Which of the above realities above do you think most closely resembles this version, ignoring those method which we know from above don’t get executed every time an insert is performed?

jp.7.2.2.cassandra.benchmark.sampling

In our final test we employed our IMRR technique to generate a revised instrumentation set for our agent. Here is a comparison of the throughput results with the baseline. We managed to get under that magical 5% overhead even for a well tuned system with very low latency and high frequency.

cassandra.benchmark.thruput.combo.chart.oconly

Here are the metering results for the revised agent. You can see a slight reduction in some of the totals and averages. The SimpleCondition.await method, whose execution time is a good indicator of system performance, has dropped down from 33 microseconds to 29 microseconds.

jxinsight.opencore.6.4.5.cassandra.benchmark.gen.1

Again we would welcome any independent external performance analysis of the results above that would help shed light on which reality is the closet one to the actual in which the software is not observed and measured. We have a very good idea which one that but if you wish you can always check out the sources and then look at the hotspots identified by each of the solutions presented here.

Hardware
2.6 GHz Intel Core i7
8 GB 1600 MHz DDR3
500 GB Flash Storage
Operating System
Apple OS X 10.8.3
Java Runtime
Java HotSpot(TM) 64-Bit Server VM; 1.6.0_43; 20.14-b01-447; mixed mode

Update – 16th April 2013
We have done two new benchmarks with YK with an “adaptive” flag set to true and false. Here is the statistical summary chart.

cassandra.benchmark.thruput.combo.chart.yk

Here is the YK hotspot table for the non-adaptive benchmark run. Like the JP and NBP tests runs only 4 cycles instead of 20 were performed because of the significant time duration.

yk.12.0.4.cassandra.benchmark.nonadaptive

Here are the hotspots for adaptive YK benchmark run. Again only 4 cycles because of slowdown. What is interesting is that the adaptive option disabled the SequentialWriter.flushData() method which was classified by both OpenCore and JP as an actual performance hotspot.

yk.12.0.4.cassandra.benchmark.adaptive

JXInsight/OpenCore 6.4.5 Released

We are pleased to announce the fifth update to JXInsight/OpenCore 6.4. This release includes new performance optimizations at the instrumentation and metering layers that improve on the high scalability of our solution in extreme high frequency and highly concurrent probe environments.

We will soon publish a new performance overhead benchmark comparison report, based on a stress test of Apache Cassandra, that unequivocally demonstrates the effectiveness of our superior adaptive approach over existing commercial and open source Java profiling solutions.

jxinsight.opencore.6.4.5.cassandra.benchmark.more

Spot the Difference: Lifelike Simulated Runtime vs Real Application Runtime

Lifelike is an adjective that relates to anything that simulates real life, in accordance with its laws. Its goal is to immerse individuals into what is called a lifelike experience. It gets as close as possible to real life behavior, appearance, senses, etc. therefore enabling its subject to experience what is happening as if it were real. In other words, simulating reality with its physical laws is the objective of lifelike experience.” – Wikipedia

Simulation is the imitation of the operation of a real-world process or system over time. The act of simulating something first requires that a model be developed; this model represents the key characteristics or behaviors of the selected physical or abstract system or process. The model represents the system itself, whereas the simulation represents the operation of the system over time. Simulation is also used when the real system cannot be engaged, because it may not be accessible, or it may be dangerous or unacceptable to engage, or it is being designed but not yet built, or it may simply not exist.” – Wikipedia

Digesting the above definitions; to create a truly lifelike application simulation we need to first define a model of the system that is then simulated over time in accordance with laws, such as the passage of time. The simulation must give the appearance of real life; a model that is simulated with realistic appearance.

System Model
In Simz, and OpenCore for that matter, the model that underlies an application simulation is primarily made up of Context, EnvironmentProbe and Meter. A Context represents a flow of (potential) execution. The Environment provides a means for the Context to introspect its execution container (world). A Probe represents a named application activity within a Context. And a Meter represents a managed resource that is consumed during the begin and end phase of one of more chained activities.

jxinsight.opencore.6.4.4.spotthedifference.model

Model? Check! ✓

System Simulation
Both Simz and OpenCore provide a means to simulate a system model. OpenCore is able to simulate a real application runtime by playing back a metering recording in which threads are created and started, methods are called and chained within threads, and meters are adjusted as execution flow and time progresses. The method themselves don’t actually change state as in the real runtime but other than that they are operated by the simulation in much the same way. The timesync metering extension as demonstrated in Simulation and Time Synchronization of Application Memories ensures the simulation obeys a fundamental law – time itself.

Simz operates near identical to OpenCore but does not need to concern itself so much with time synchronziation as its metering recording is feed in real-time from the application runtime, as threads are started and methods are executed. Execution and simulation proceed in tandem. In addition Simz allows the simulation to span multiple application runtimes. It is even possible to record a distributed simulation and then have it played back by OpenCore as if it were a single runtime recording.

Simulation? Check! ✓

System Appearance
Finally we come to the requirement of a simulation to provide a lifelike experience that is close as possible to real life behavior and appearance. This is much harder in the context of a simulation that is representing the execution of code within an application runtime, because generally applications are  viewed as a black boxes except to monitoring tools which instrument code and introspect threads and their execution stack frames. So a good test would be to run both the application and simulation side by side and monitor each one from a different instance of a monitoring console that is completely unaware of what the application runtime represents: simulated (virtual) or reality.

jxinsight.opencore.6.4.4.spotthedifference.overview

Well this is exactly what we have done and the results are pretty convincing especially when you consider that each monitoring console is sampling actual live thread metering data every second and not just every minute, which is the the typical (and lowest) time granularity offered by other APM solutions. In addition many of the timeline charts are dynamically scaled in terms of a maximum observed (sampled) value which makes it even hard for two monitoring consoles to appear the same even when connected to the very same runtime.

The simulation is able to execute (playback) and measure (meter) in tandem the very same software execution behavior that is occurring in the live application runtime. There is currently one visualization in the monitoring console that hints at the nature of the monitored execution, but before explaining more why not check out the screen recording video and find out whether you can see the difference as the Apache Cassandra processes inbound requests from an insert benchmark script. We have also uploaded the video to YouTube.

Below is a video frame I have taken during the video playback with the monitoring console connected to the real application overlaying the monitoring console connected to the simulation. The small visual difference is present in the top most visualization which displays the heap usage within the runtime. For the simulation this heap usage is as little as 8MB whereas in the Apache Cassandra runtime it is as high as 400MB. The simulation does not need to concern itself with the whole state space within the application only the metered behavior and meter usage.

jxinsight.opencore.6.4.4.spotthedifference.sidebyside

Note: The slight discrepancy in the bottom visualization (see blue bar spacing) is because this chart reflects the snapshot measurement model that was retrieved when the Refresh toolbar button was pressed which in the video was NOT done simultaneous.

Realistic? Check! ✓

Simulation and Time Synchronization of Application Memories

“Learning is experience. Everything else is just information.” -Albert Einstein

Experiencing is Learning
Experiential learning is the process of making meaning from direct experience, i.e., ”learning from experience”. The experience can be staged or left open. The process involves a direct encounter (experience) with a phenomena of interest rather than merely thinking about such, or only considering the possibility of doing something about it.

jxinsight.opencore.timesync. experiential.learning

Reflection is a crucial part of the process of learning. It is an iterative process as learning builds and develops with further reflection and experience. In reflecting on an experience 5 key questions can be asked to improve understanding:

  • Did I notice (observe and recognize) what happened?
  • Why did that happen (and in what manner and form)?
  • Does that happen (often and does it matter)?
  • Why does that happen (beyond the particular event experienced)?
  • How can I use that (to alter my behavior in the future)?

These questions should be proposed and answered in attempting to understand application performance problems. But how can we experience the past and in such a way that it will feel real and to some degree uncertain at each point in time (in order to tune our intuition). Here is were simulated playback of past application memories can be a crucial part of learning to manage and operate complex software systems.

Experiencing The Past with Playback Simulations
In JXInsight/OpenCore 6.4.4 we have released a new metering extension, timesync, that controls the execution of the metering runtime by ensuring the time differences between two events reflect the difference in elapse time. This difference typically occurs when a metering recording is played back at full speed, without any delay other than the time incurred in changing the meters to reflect the recorded readings as well the metering of the probe itself. This aspect of playback has been covered in How to Execute Software Behavior Faster than (Wall Clock) Time.

The timesync metering extension is similar to the plugin introduced in The Power of Slow in Application Performance Analysis – Record & Playback, but far more efficient and scalable as well as being built directly into the metering runtime. The extension allows us to observe events in the past with the same degree of passage time as if we were actually observing the event occur in the present. This capability is extremely useful for IT operations. Instead of simply looking at a static timeline chart or log listing we can train ourselves to recognize events unfold, anticipate what is likely to follow and then more importantly check whether we recognize and anticipate correctly. This is flight simulation training for IT operations and more as we can go back in time and experience the whole timeline sequence again and again as well as pause time itself. We can step through time and assess each point without necessarily having any clues (or knowledge) of what comes next (in the replay), which is invariably the case with post-mortem management reports that visualize the whole monitoring timeline data on a single static chart. Knowing what is coming next changes our view of data points and actions performed. It limits (or biasses) our learning. Would you read a novel after reading the last few pages and then fully appreciate the experience it aims to give. No. Well that is exactly what many post-mortem analysis reports offer.

To enable the timesync extension simply add the following to the jxinsight.override.config used by the tools/opencore-playback.jar library.

jxinsight.server.probes.timesync.enabled=true

Seeing is Believing
A screen recording has been created that demonstrates the power of playback time synchronization. In the video a DataStax (Apache) Cassandra server is started and then a couple times an “insert” benchmark test is run against the instance. During this time the metering engine within the Cassandra runtime is in recording mode, creating a file named probes.log. It is this binary metering recording log that is then played back later when the Cassandra runtime is shutdown. What is important to note during the video is that the OpenCore console, which connects automatically to both runtimes, is completely oblivious to the fact that one of the runtimes is in fact a simulation engine of a past application memory. This runtime looks and acts just like the Cassandra runtime. It has the same number of threads, each doing what they previously did except for not actually do the table inserts execution code but simply mimicking the execution flow by firing the instrumentation probes in order and in context to what was previously fired in the real runtime. A way to think of this is that every method that was instrumented and metered in the real runtime has a no-op implementation in the simulated runtime except for calling other methods that are also no-op implementations and so on.

jxinsight.opencore.timesync.cassandra.diagram

The screen video recording is also available on YouTube and ideally should be viewed in full screen HD mode.

JXInsight/OpenCore 6.4.4. Released

We are pleased to announce the fourth update to JXInsight/OpenCore 6.4. This release add a new visualization to the metering table view that provides a live (dynamically updated) version of the existing table overview. This enhancement makes it extremely easy to visually compare a snapshots metering model with the liver model with a production server.

jxinsight.opencore.6.4.4.livemeteringtableoverview

This release also includes a new metering extension, timesync, that can be used to synchronize the replaying of a metering recording with the current passage of time. We have also made it much simpler to execute the playback of a metering recording with the addition of a tools/opencore-playback.jar library that can be launched using the -jar JVM option.

The latest update can be downloaded from our developer site.

JXInsight/OpenCore 6.4.3 Released

We are pleased to announce the third update to JXInsight/OpenCore 6.4. This is a console only update with color changes and rendering corrections to the live visualizations within the default (flat) metering table view.

jxinsight.opencore.6.4.3.cassandra.liveviews

The latest update can be downloaded from our developer site.

We have created a screen (no audio) video recording to demonstrate the real-time nature of the visualizations monitoring an Apache Cassandra server running an insert stress test.

In the video there are two monitoring data models:  the snapshot model, which is persistent and largely static, and the live model, which is transient and dynamic. The snapshot data model is obtained when a server (metered application runtime) is initially connected or data is retrieved by clicking on the “Refresh” labeled toolbar button. The snapshot data resides in the table display with the values of various rows and columns visualized immediately above it. The snapshot model directs the performance investigation and importantly does not update automatically as we have found in practice dynamic table updates break flow and concentration.

jxinsight.opencore.6.4.3.metering.table.0

The live model is what is displayed in the visualizations above the table. There are two primary sections: the runtime section, which visualizes live monitoring data at the process and thread level, and the selection section, visualizing live monitoring data for the selected probe (instrumented method) in the table view. In the runtime section there are three charts. The first is a heap timeline. The second is a metering frequency timeline, moving left (present) to right (past), that tracks the metering count across hotspots and non-hotspots probe labels. This allows you to observe the instrumentation and measurement intelligence built into the metering engine as irrelevant probes become automatically disabled. The last chart displays the metering frequency at a thread level. Like all other displays, rendering is normalized to the maximum value observed between refreshes or resets. A white color indicates no recorded metering activity. A lightly shaded column indicates metering activity but not in the last one second. A slightly darker column indicates metering activity in the last second but maybe not substantially enough when compared with the maximum one second delta. A green column indicates substantial metering activity in the last one second. In the picture below we can see that one thread is doing far more work (frequency) than others.

jxinsight.opencore.6.4.3.metering.table.1

In the selection section there are three charts. The first chart is a timeline depicting the metering count (green), total (yellow), average (blue) and running average (white) for the currently selected probe in the metering table. This is across all threads. The next row chart breaks down the frequency for the selected probe at the thread level. Like the runtime chart, green indicates the level of activity in the last one second. In the picture below we can see 32 threads firing the selected probe at approximately the same frequency. The last chart gives a breakdown of the metering total for the selected probe per thread. This has similar visualization rules as the frequency chart above, though it displays not just total (yellow) but inherent (self) total (orange) which in the case of the selected meter is wall clock time. The white circles are the cumulative metering totals since live observation of the selected probe started.

jxinsight.opencore.6.4.3.metering.table.2

In the charts above whilst frequency is evenly spread the total is not, with one thread consuming more time than others. This could indicate that occasionally in executing the selected method some house keeping chores must be done which takes up more time than its normal profile. In the screen video recording we can see this behavior move across threads as time progresses. The cumulative totals in white would indicate that outliers are not confined to one particular thread but move around.

Note: This visual pattern could also be indicative of resource contention if the frequency changed in cohort or following each single thread spike multiple threads spiked simultaneous.

Some Additional Observations
At 3:25 into the screen recording video, after the insert stress test has completed, you can see the past frequency and metering totals for the threads used to execute the Table.apply method.

jxinsight.opencore.6.4.3.metering.table.3

Then at 3:26 we see the number of thread alive within the process shrink dramatically.

jxinsight.opencore.6.4.3.metering.table.4

When the insert stress test is started again at 3:32 we see newly created threads appear.

jxinsight.opencore.6.4.3.metering.table.5

Throughout the insert stress test we can see one particular thread (at a time) dominating the metering frequency at the runtime level and at the Table.apply probe level multiple threads; indicating a single feeder of inbound request work to multiple worker threads.

jxinsight.opencore.6.4.3.metering.table.6

JXInsight/OpenCore 6.4.2 Released

We are pleased to announce the second update to JXInsight/OpenCore 6.4. In this release the console probes provider has been updated to not by default export the metering information for non-hotspots probes in a snapshot. A system property has been added which enables customers to revert this behavior if need be.

The default metering table view in the console has been improved with new live visualizations that help compare the metering activity of non-hotspots probes with those labeled as hotspots depicting the intelligence in action within the metering engine as probes are dynamically disabled as well as observe current metering behavior across threads at a global level and for the currently selected probe.

jxinsight.opencore.6.4.2.annotated.charts

The latest update can be downloaded from our developer site.

We have created a screen video recording to demonstrate the real-time nature of the visualizations monitoring an Apache Cassandra server running a stress test.

JXInsight/OpenCore 6.4.1 Released

We are pleased to announce the first update to JXInsight/OpenCore 6.4. In this release the console probes provider has been updated to not by default export the metering information for disabled probes in a snapshot. A system property has been added which enables customers to revert this behavior if need be.

The default metering table view in the OpenCore console has been improved with new beautiful visualizations that help (1) quickly identify particular metering behavioral patterns across multiple probes in terms of count, mark count, avg, mark avg, total, mark total, inherent avg, mark inherent avg, inherent total and mark inherent total, (2) observe current metering behavior of a selected probe within a connected application runtime, and (3) navigate namespace relationships as well as same last named probes.

opencore.console.6.4.1.apache.tomcat.table

The latest update can be downloaded from our developer site.

Going Beyond Actor Process Supervision with Simulation and Signaling

jinspired.logo“Nature, under my supervision, creates all animate and inanimate objects; and thus the creation keeps on going” – Bhagavad Gita

In Simz – The Promise of Near Real-Time Application Simulation in the Cloud I described how Simz can create something out of nothing when no process exists for an application that is dynamically managed in the cloud. The same article also showed how Simz can make multiple application runtimes look like a single one. But in the early inception of Simz we had another equally powerful use case in mind and that was to have a runtime simulation for every actual application runtime.

Simz in this context resembles a (black box) flight recorder, but instead of storing sampled sensor data it replays in unison the metered execution behavior, recorded and transmitted from the real runtime via a live metering feed. If the application runtime suddenly terminates, and closes the connection feed, we can still acquire runtime diagnostics from the simulation. More importantly the stack of each simulated thread will reflect what was in the process of being executed, but not completed, before an abrupt termination. This also addresses the issue where diagnostic routines built into the runtime, called via shutdown hooks, are not able to execute in time before the process is completely ended.

simz.supervision.one.to.one

We can take this further by incorporating elements of behavioral supervision and behavioral memorization.

Behavioral Supervision
Process level supervision is a familiar concept for those working with actor based languages and runtimes such as Erlang/OTP.

“In general application programming, robust server deployments include an external “nanny” that will monitor the running operating-system process and restart it if it fails. The restarted process reinitializes itself by reading its persistent state from disk and then resumes running. Any pending operations and volatile state will be lost, but assuming that the persistent state isn’t irreparably corrupted, the service can resume. The Erlang version of a nanny is the supervisor behaviour. A supervisor process spawns a set of child processes and links to them so it will be informed if they fail. A supervisor uses an initialization callback to specify a strategy and a list of child specifications. A child specification gives instructions on how to launch a new child. The strategy tells the supervisor what to do if one of its children dies: restart that child, restart all children, or several other possibilities. If the child died from a persistent condition rather than a bad command or a rare heisenbug, then the restarted child will just fail again.” – Erlang for Concurrent Programming

This all sounds good until you realize that supervision in Erlang, and much like other actor systems including Scala/Akka, is largely focused on process lifecycle management and not the internal workings of the worker process (or actor). The supervisor does not actually monitor, or is aware of, the software execution behavior of the worker process other than whether it has terminated and the cause of such termination. It is not at all like a metering supervisor.

Simz on the other hand pairs a simulation runtime with the application that is for all intents and purposes the application under supervision. The supervisor is the worker in terms of metered execution and resource consumption tracking. The simulated runtime assumes the behavior patterns of the real application runtime and thus supervision is self reflective. Instead of asking the question “What is the worker process doing?” it asks “What am I doing?”. In this context you can think of Simz as the “brain in a vat”. Its perception of reality is via the metering feed. Here the worker takes the role of the body of which we can have many over time.

simz.supervision.brain

In Erlang the supervisor is responsible for starting, stopping and monitoring its child processes. The basic idea of a supervisor is that it should keep its child processes alive by restarting them when necessary. Life and death but nothing in between. Simz today can support this level of supervision via a metering plugin which could terminate (or restart) a connected worker runtime in the event its detected unusual behavior and not just at the process or actor mailbox level. In fact this will work whether you are using actors or not. And you can chain Simz runtime services, moving from a local to global context.

simz.supervision.many.to.one

In Simz 2.0 we plan to go further by offering a bi-directional link between both runtimes that can relay signals generated within the supervisory runtime over to the worker runtime. Both runtimes would have our Signals technology embedded within them, but the worker runtime would be able to process signals that originate within itself as well as within the supervisory runtime via metering plugins. The source of a signal would be completely transparent to the application. Note signals are not your typical process level notification primitives. Signals drive self adaptation and self regulation within a runtime. They are specific to the application behavior and its activities. We are also going to make the metering runtime signal aware allowing us to remotely influence adaptive control valves without needing for the application to explicitly use our Signals technology.

Behavioral Memorization
One concern with restarts of a worker process by a supervisor is that the process starts with a clean slate. A clean slate can be good when the failure, which triggered the restart, was caused by corrupted stated now lost or discarded. But it also means that the worker process is none the wiser for why it failed previously. It might be more useful if the worker has the ability to recollect the most recent events and to make some internal adaptations to mitigate the risk this may occur again, somewhat like self self learning. We can achieve this by accessing the behavioral metering model in the Simz simulated runtime that survived the worker process termination. We can also use the supervisor to taint the behavior of the initializing worker by way of signals. The signals are enclosed in a conversational signal boundary maintained in the supervisor runtime. You can think of it as a worker downloading an upgrade from within the Matrix.

“Of course. The new Sam and I will be back to our programming as soon as I finished rebooting.” – Moon (2009)

Note: We can configure the metered application runtime to write the metering feed to a file instead of a network socket and then replay the simulated behavior offline but that would be more so postmortem (root cause) analysis and not runtime supervision in which the simulation is able to influence, even manage the lifecycle, of the application runtime.

Related Reading
Introducing Signals – The Next Big Thing in Application Management
Simz – The Promise of Near Real-Time Application Simulation in the Cloud
How to Execute Software Behavior Faster than (Wall Clock) Time
The Power of Slow in Application Performance Analysis – Record & Playback
Many Parallel Worlds

Lies, Damned Lies, and Logging

jinspired.logoAlmost 16 years ago when I started working on GSM billing systems I came across a piece of C code that had a comment written in German. I turned to my Dutch colleague and asked whether he could assist with a basic translation of the comment. His response was “I don’t know what this code does or how it works”. My immediate reaction was “Just tell me what the comment is saying”. He laughed and said “That was the translation!”.

I suspect my conservative approach to logging, and to some extent inline code commentary, has been tainted by this event. Logging, the way I have seen it applied, has always seemed like commentary that a developer wanted to echo back by way of an expensive file write. And much like the comment in my short dialogue, written in a different language to the code itself, and a reflection of what the developer thought the code did rather than what it really did, which in this case was not entirely understood.

A good few years later when I was called into a bank to help identify a performance issue with a particular transaction I was reminded of this first encounter with logging as a form of fiction writing. During the course of the performance investigation the problem was found in a different transaction than had been indicated. It was in someway similiar to the one that had been identified by operations but not the same. The initial reaction to my findings when presented to management was that my performance monitoring product was broken (the joys of performance consultancy). When asked how they had come to associate the performance problem with a different transaction the response was “It is written in the log file.”. Instead of asking to see the code for the other “problem transaction” I instead asked for the code related to the transaction I had identified. There in the code was a log statement that looked to have been directly copied over from the other transaction without any change in the labeling. A similar issue recently occurred as a result of a log statement causing misclassification of a transaction label via the stringifying (Object.toString) of state prior to the proper setup and execution of a transaction. In this case we did not have to look at the code because our metering change set showed clearly that the behavior logged was incorrect .

Here are a few observations I’ve made over the years which I believe contribute to the disconnect that can occur between what a log file says occurred and what actually happened.

  • Maintenance: Logging is treated much like commenting drifting from fact to fiction as code changes are experimented with and then committed.
  • Translation: Logging uses natural language instead of the host programming language to describe what is happening. It is unstructured.
  • Hypothesis: Logging is a based on a developers understanding of the code behavior and its side effects; it is susceptible to misconception bugs.
  • Copy & Paste: Logging statements can be copied from one source file to another in the course of copying code, thereby invalidating an enclosing context.
  • Verification: Whilst code can be thoroughly tested by a team, rarely does one find anyone looking at and verifying how factual what is logged.
  • Validation: Testing the production of certain log statements is rare in practice. It is not seen as a side effect such as in the case of a database transaction.
  • Locality: Many statements reflect an expectancy of not just what the immediate code does but also what its direct and indirect callers do.
  • Prematurity: Developers tend to place logging at the start of a code block expecting the block will execute to completion and not fail.
  • Disconnectivity: Log statements can be written and then later on invalidated by another statement without traceability.

There are other issues related to logging including poor performance, poor signal to noise ratio (lost in space), unstructured and non-standard formatting.

Most of the above mentioned concerns can be remedied if we eliminate the coding of many log statements within the code itself. Instead logging is dynamically injected into the runtime by way of an extension to an activity metering runtime and/or adaptive signaling runtime. Both approaches offer a mechanism to generate a change set reflecting execution behavior between two execution points. Each change set contains a concise history of execution behavior by way of change delta analysis on the thread specific metering model or signal boundary model. Logging is then called based on a rule based classification of the behavioral model represented by the change set. This is coded and thus can be maintained and tested independent of changes in the actual application code (to some degree). It is actual rather than assumed or premature. It is deep rather than local and shallow in its observation and classification. It it contained (within the change set) rather than dispersed. It is efficient and allows filtering out of noise. It has structure and is exposed allowing it to be used by the application runtime for self-regulation or self-adaptation purposes.

lies.damnedlies.and.logging.probes.signals

What is more we can have log statements written post the execution of code, in a separate runtime, using playback and simulation to reconstruct and replay the software execution behavior recorded. This can be used to correct mistakes in previous log rules as well as test new log rules, especially when a particular recorded behavior can be used as a predictor of abnormal and detrimental execution (or imminent termination).

lies.damnedlies.and.logging.probes.recorder

Finally if you ever wanted to log a statement (record) based on a condition that spanned multiple application runtimes you can us the same techniques above along with Simz, which operates as a metered application runtime, simulating in real-time the metered behavior across multiple processes and threads. This is also useful in removing a runtime dependency on a particular log library and version from your applications, as well as isolating apps from changes to a log level, which now only has to be applied to a single application runtime, be it a simulated one.

lies.damnedlies.and.logging.probes.simz

It is time we revisited logging and looked at other models and approaches to collection and communication of observations and events.

Further Reading
Introducing Signals – The Next Big Thing in Application Management
Going Beyond State/Structure Reflection in Java with Behavioral Tracking
From Anomaly Detection to Root Cause Analysis via Self Observation
Simz – The Promise of Near Real-Time Application Simulation in the Cloud

The Power of Slow in Application Performance Analysis – Record & Playback

jinspired.logo“It is acknowledged that tools help with the “perception” of knowledge. The magnifying glass, the telescope, and the microscope were objects that opened up an entirely new world…Manipulating the recording of information through time helped us to perceive an entirely new frontier which led to more ideas such as; that the perception of time is not a “hands on” concept.” – An Evolution of Learning: Have the Tools of Science Superceded “Hands-On” Exploration and Education?

In my previous post, How to Execute Software Behavior Faster than (Wall Clock) Time, I demonstrated the fast simulated replay of a software execution recording and then described some of the benefits such as allowing new questions to be framed and answered as if present at the time of its occurrence and within the runtime itself. But sometimes to know what is truly the right question to be answered we must first expand our understanding of the behavior. This is hard because when we look at any performance model what we are presented typically reflects a particular point in time. Even when performance measurements are presented in a timeline visualization we miss important behavioral clues because the dynamics are lost. Observing the dynamics is also hard because computers execute activity far to fast for us to observe, perceive and track. But there is a technique that can address these issues and deliver greater clarity to our understanding of code execution that can’t be visualized or conceptualized so easy in the flat static plane of a code editor – slow motion replay.

Below is the code that was referenced in the How to Execute Software Behavior Faster than (Wall Clock) Time post. Before clicking through to the post and looking at the screen video try to visualize in your head what you would expect to see if you were able to observe in real-time the pushing and popping of thread stack frames. Did it match?

metering.playback.slowmotion.code.1

Even if you were successful in imagining and visualizing the behavior you might still have missed some important observations with regard to the processing degree and timing at each stage in the execution because of the speed of execution (especially in playback). What if we could slow things down? Could we learn and understand more, even with such a simple example? One way to achieve a slowing down of the software execution would be to introduce a delay at the entry or exit points of a call site or method invocation, but there are many things that can gone wrong with this approach if applied to the actual real application itself. It is far better to let the application run at full speed and record its execution for later playback in which a delay is introduced whilst keeping measurements and the passing of time the same. That is exactly what the following class does via the JXInsight/OpenCore Probes Open API.

metering.playback.slowmotion.code.2

The above code controls the replaying of the execution at the begin and end metering points of a methods invocation. The actual method is not executed instead the dynamic bytecode instrumentation that was fired and recorded in the real application runtime is replayed along with the synchronization of meter readings. In the call back methods the Interceptor class pauses the execution of the current executing thread by waiting for the virtual time set point, managed by the Clock class, to elapse. The speed at which time passes is controlled by the FACTOR field which is set within the init() method of the InterceptorFactory class. A value greater than 1 slows down time.

Here is how to install and configure the custom extension within a jxinsight.override.config file.

metering.playback.slowmotion.config.1

Click here to see a screen video recording of the above slow motion metering extension in action. In the video recording you first see the actual application transmit its metering data to a Simz service which has a matrix visualization plugin enabled. During the actual real execution the metering is also recorded in a binary log file. Next the playing back of the metering log is done. This also transmits its metering data to the Simz service for visualization purposes. Playback happens very quick…too quick in fact that we can’t make out what has happened. Finally the interceptor extension is enabled and the playback performed three times with the speed.factor environment variable set to 1, 10 and 100. Did you notice how the execution races downwards (increasing stack depth) quickly but unwinds much slower? Not something that is immediately obvious from reading the code?

Note: Simz 2.0 “TotalReCall” will deliver an impressive set of time lapse and motion visualizations to aid rapid understanding and explorative analysis.