Adaptively Controlling Apache Cassandra Client Request Processing

In “Performance Measurement Truths — Rarely Pure and Never Simple” and “Reality, Reactivity, Relevance and Repeatability in Java Application Profiling” we looked at the benefits of adaptive measurement techniques in the performance analysis of Apache Cassandra. This article keeps the adaptive theme but shifts the focus to how we can adaptively manage the performance of such systems, rather than just monitor them, especially under high load stress (surge) conditions. Before proceeding it should be noted that whilst many solutions claim to manage performance, what they really offer is monitoring, which influences the management of the system only by way of human intervention (on a completely different time scale to the execution and lifecycle of the system).

Note: Managing performance in this article refers to the direct and immediate influence on the execution of the software by adaptive control valves.

Let's start with the script used to generate load on the client machine (Intel Core i7) by sending Thrift remote procedure requests to the server running on a separate machine (Intel Core i7).

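# Start 20 stress clients 15 seconds apart; each runs in the background with 20 threads inserting 1,000,000 keys
for i in {1..20}; do
   /Applications/datastax-cassandra-1.2.5/tools/bin/cassandra-stress -d -o insert -t 20 -n 1000000 &
   sleep 15
done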

Below is the performance model following completion of the benchmark. If you recall the previously reported averages for some of the probes listed below, you will notice an increase across the board. For example, StorageProxy.mutate previously had an average of 211 microseconds whereas its average here is 5,419 microseconds, roughly 26 times higher. This level of increase is present in nearly all hotspot-classified probes. Clearly the client load placed the system under significant stress relative to its available processing capacity, with additional co-ordination and memory management effort on top.


To help the system better absorb (and adjust to) the disturbance caused by high concurrent client request processing we can attach an adaptive control valve to one (or more) of the measurement entry points. The control valve will then adaptively manage the level of concurrent execution flowing through the methods (measurement points) associated with it. There is no need to blindly fiddle with thread pool sizes. Instead the valve dynamically adjusts the degree of concurrency in real time, based on direct observation and analysis of the measured execution behavior.
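To make the idea concrete, here is a minimal sketch of a concurrency limiter that adjusts itself from observed throughput. It is an illustration only: the class name `ThroughputValve`, its methods, and the simple hill-climbing rule are assumptions made for this example, not the algorithm or API used by the valve described in this article.

```python
import threading
from contextlib import contextmanager


class ThroughputValve:
    """Hypothetical hill-climbing concurrency limiter (illustrative only)."""

    def __init__(self, limit=8, min_limit=1, max_limit=64):
        self._cond = threading.Condition()
        self._in_flight = 0       # calls currently inside the managed method
        self._completed = 0       # calls finished since the last sample
        self.limit = limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self._step = 1            # current search direction: +1 or -1
        self._last_rate = 0.0     # throughput seen in the previous interval

    @contextmanager
    def admit(self):
        """Delay the caller while the valve is at its concurrency limit."""
        with self._cond:
            while self._in_flight >= self.limit:
                self._cond.wait()
            self._in_flight += 1
        try:
            yield
        finally:
            with self._cond:
                self._in_flight -= 1
                self._completed += 1
                self._cond.notify_all()

    def take_completed(self):
        """Return and reset the number of calls completed since the last sample."""
        with self._cond:
            done, self._completed = self._completed, 0
            return done

    def adjust(self, rate):
        """Feed one throughput sample (completions per interval); return the new limit."""
        with self._cond:
            if rate < self._last_rate:
                self._step = -self._step   # the last move hurt throughput: reverse
            self._last_rate = rate
            proposed = self.limit + self._step
            self.limit = max(self.min_limit, min(self.max_limit, proposed))
            self._cond.notify_all()        # wake waiters in case the limit grew
            return self.limit


valve = ThroughputValve(limit=4)
for sample in (100, 120, 90, 95):
    print(sample, "->", valve.adjust(sample))
```

In use, the managed entry point would be wrapped as `with valve.admit(): ...`, and a background timer would call `valve.adjust(valve.take_completed())` once per sampling interval. The delay is incurred at a single gate, so everything called beneath it runs under a bounded level of concurrency.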

Here is how a “throughput” adaptive control valve can be configured in the jxinsight.override.config file to attach itself to the org.apache.thrift.ProcessFunction.process measurement point.


Running the load stress script with adaptive control management in place makes a significant difference in the overall processing clock time. The execution time within the server has dropped from 232K seconds to 170K seconds. The average time for TBaseProcessor.process has dropped from 11.6 ms to 8.5 ms. Both are reductions of roughly 27%.

Note: The clock.time metering total is an aggregation of wall clock time measures across threads. With 100 threads simultaneously executing the same method, each for 10 seconds, the total will be recorded as 1,000 (1K) seconds.
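The arithmetic behind that note can be written out as a short sketch. The function name `aggregate_clock_time` is made up for this illustration; it simply sums the per-thread measurements.

```python
def aggregate_clock_time(per_thread_seconds):
    """Sum wall clock time measured independently on each thread."""
    return sum(per_thread_seconds)


# 100 threads, each spending 10 seconds in the same method at the same time.
total = aggregate_clock_time([10] * 100)
elapsed = 10  # wall clock seconds that actually passed

print(total)            # 1000
print(total / elapsed)  # 100.0, the average number of threads in the method
```

This is why an aggregated total can be far larger than the elapsed time of the benchmark itself.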


What you might find strange is that the average for ProcessFunction.process has remained roughly the same whilst many of the other hotspot probes have dramatically reduced their average clock time. The reason for this discrepancy is that the valve introduces “managed” delay of thread execution at the ProcessFunction.process measurement point, which is one of the methods called by TBaseProcessor.process. This in turn reduces the concurrent load and resource contention experienced by the probes (methods) called from it, directly or indirectly, sequentially or asynchronously. The other method called by TBaseProcessor.process on each invocation is TBinaryProtocol.readMessageBegin, which has dropped from 71K seconds to 12K seconds. The drop here is because threads that call this method are temporarily suspended in the ProcessFunction.process method by the valve. In controlling the completion of the ProcessFunction.process method we also control the execution of TBinaryProtocol.readMessageBegin, because one follows the other in the execution of TBaseProcessor.process.

The delay that is dynamically introduced serves to stabilize and optimize the execution flow within the system. It should be noted that stability and optimization can at times be at odds with each other.

How about adaptively managing the execution of requests based on the measured response time at different workload levels? The “response time” adaptive control valve allows us to do just this with a configuration very similar to the one above. Simply replace tvalve with rvalve in the configuration.


With the “response time” valve there is a further reduction in the overall server request processing time. This is to be expected, considering that this particular adaptive control valve looks to reduce the average response time, whereas the previous valve can trade increased response time for increased throughput up to a point, as the two are intertwined.


The average response time for TBaseProcessor.process is 8.1 ms compared to the initial baseline 11.6 ms. Again we see that the average response time for ProcessFunction.process has remained relatively high in comparison with other hotspot classified probes as delay is dynamically introduced by the valve to ensure greater stability in the average response time per second.

What of the client times? Well, the non-managed benchmark run took a total of 14 minutes 5 seconds to complete. With the “throughput” adaptive control valve the benchmark took 11 minutes 11 seconds. With the “response time” adaptive control valve completion took a tiny bit longer at 11 minutes 14 seconds.
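As a quick check on the size of those savings, the three client timings can be converted to seconds and compared. The helper names `to_seconds` and `reduction_pct` are invented for this sketch; the timings are the ones quoted above.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds duration to whole seconds."""
    return minutes * 60 + seconds


def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


baseline = to_seconds(14, 5)         # no valve
throughput_run = to_seconds(11, 11)  # "throughput" valve
response_run = to_seconds(11, 14)    # "response time" valve

print(baseline, throughput_run, response_run)            # 845 671 674
print(round(reduction_pct(baseline, throughput_run), 1))  # 20.6
print(round(reduction_pct(baseline, response_run), 1))    # 20.2
```

Both valves cut the client-side completion time of the write benchmark by about a fifth, and the difference between them is only 3 seconds.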

Below is the CPU profile for the client machine. From the charts you can see that with no control there is far more jitter observed across the many concurrent client processes.


Apache Cassandra is optimized for very fast and highly available data writing. That said, this does incur significant allocation and co-ordination costs and can result in relatively frequent garbage collection cycles if the workload is not managed effectively. Adaptive control valves are extremely effective, if not a necessity, in any large scale deployment, but what of read operations, which are generally relatively cheap in most NoSQL solutions? Can adaptive control valves be just as effective with data reads?

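# Populate the data set with a single insert run
/Applications/datastax-cassandra-1.2.5/tools/bin/cassandra-stress -d -o insert -t 20 -n 1000000
# Then start 20 read clients 15 seconds apart, each running in the background
for i in {1..20}; do
   /Applications/datastax-cassandra-1.2.5/tools/bin/cassandra-stress -d -o read -t 20 -n 1000000 &
   sleep 15
done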

Here is the performance model following completion of the above benchmark without any adaptive control valves in place within the Apache Cassandra server runtime. The benchmark took 8 minutes and 23 seconds to finish. A request processing cycle within the server took 4.2 ms on average. This is under high concurrent load that heavily taxes all available computing resources on the server machine.


With the “throughput” adaptive control valve installed the benchmark finished in 8 minutes 5 seconds, a wall clock saving of 18 seconds. The actual clock time processing on the server was reduced by only 5K seconds. What is interesting is that the average clock time for ProcessFunction.process increased compared to the baseline whilst all other hotspot-classified probes saw a substantial reduction. Basically we have the choice of incurring delay at one particular managed execution point or at all possible execution points that are further down in the execution path and could potentially hold onto resources (memory, monitors) much longer than we would like.


The “response time” valve would appear to be far more effective at reducing overall processing time, with a drop of 13K seconds in overall server request processing. From the client perspective the saving was also greater, with the benchmark finishing in 7 minutes 47 seconds compared to 8 minutes and 23 seconds, a 36 second reduction in wall clock time. For the TBaseProcessor.process method the average has dropped from 4.2 ms to 3.6 ms. The average time for ProcessFunction.process is still higher than in the non-managed benchmark run but lower than in the “throughput” one.