Determining whether an Application Performance Hotspot is a Performance Limiter
In a recent post titled A Preliminary Performance Analysis of the Vert.x (vs Node.js) HTTP Benchmark, a number of latency hotspots were identified. In this post we introduce a novel approach, using Quality of Service (QoS) for Apps, to determining how much of an impact a potential change (performance tuning or degradation) would have on the overall throughput of an application under test, even in an environment that may be oversubscribed in terms of resource capacity.
Note: A significant benefit of doing so is that we don’t have to spend hours or days hunting for potential (narrowly focused) performance gains, only to find out in testing that such changes make no measurable difference.
Let's first revisit the latency hotspots identified after numerous refinements of the performance instrumentation, each based on metering data from the previous run, in which we impressively whittled down the number of methods metered from 650 to just 13 without ever looking at the source code or relying on developer guesswork.
Here is a chart of the HTTP request throughput rate reported by the Vert.x counter process on an iMac Intel Core i7 (4 Core, 2.8 GHz) with Oracle’s Java 1.7. All processes were run on the same machine with both client and server runtimes having 4 set as the
To see whether any change in the performance of a hotspot impacts the overall throughput of the system, we will introduce a number of different timed delays using the following Quality of Service (QoS) for Apps configuration, which specifies that a particular probe (a metered method) be enhanced with QoS capabilities.
j.s.p is shorthand for jxinsight.server.probes, which is recognized by our measurement runtimes.
What will happen is that on startup the main method will immediately grab and hold the only unit of capacity in the QoS resource associated with both QoS services defined. Then, when the action method in the DefaultFileSystem$11 class is executed (method entered), it will request a resource reservation of 1 unit, time out after 5 microseconds because none is available in the pool associated with the resource, and then proceed with its normal execution. All of this behavioral change comes with no code changes, as the JXInsight/OpenCore agent need only instrument the specified classes at class load time.
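The starve-then-time-out mechanism described above can be approximated in plain Java. This is only a sketch, assuming a `java.util.concurrent.Semaphore` as a stand-in for the QoS resource pool. The class and method names (`StarvedResourceSketch`, `starve`, `meteredCall`) are hypothetical and are not part of the JXInsight/OpenCore API, which applies this behavior through instrumentation rather than hand-written code.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a starved QoS resource; NOT the JXInsight/OpenCore API.
final class StarvedResourceSketch {

    // Pool with a single unit of capacity, as in the description above.
    private final Semaphore pool = new Semaphore(1);

    // Plays the role of the main method: takes the only unit and never returns it.
    void starve() {
        pool.acquireUninterruptibly();
    }

    // Plays the role of the metered method: try to reserve one unit, give up
    // after timeoutMicros, then run the body whether or not a unit was obtained.
    // Returns true only if the reservation succeeded.
    boolean meteredCall(long timeoutMicros, Runnable body) {
        boolean reserved = false;
        try {
            reserved = pool.tryAcquire(timeoutMicros, TimeUnit.MICROSECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt, proceed anyway
        }
        try {
            body.run();
        } finally {
            if (reserved) {
                pool.release();
            }
        }
        return reserved;
    }

    public static void main(String[] args) {
        StarvedResourceSketch qos = new StarvedResourceSketch();
        qos.starve();
        long start = System.nanoTime();
        boolean reserved = qos.meteredCall(5, new Runnable() {
            public void run() { /* normal method body would execute here */ }
        });
        long elapsedMicros = (System.nanoTime() - start) / 1000;
        System.out.println("reserved=" + reserved + " elapsed~" + elapsedMicros + "us");
    }
}
```

Because the waiting thread is parked rather than spinning, the delay costs wall-clock time but little processor time. On a stock JVM the actual wait may be noticeably longer than the requested few microseconds, so the timeout is best read as a lower bound.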
Below is the new request throughput rate (in orange) following this change to the agent's jxinsight.override.config file. Surprisingly, there is an actual increase in the overall throughput (which we will come back to later).
Note: With waiting caused by reservation on a starved virtual QoS resource we are introducing a time delay but not necessarily consuming precious processor time.
Here is a re-run of the test, this time with the (delay) timeout set to 10 microseconds. The green line shows a drop-off in the throughput, but not yet below the initial baseline.
With the timeout set to 100 microseconds the throughput now drops significantly, from approximately 45K/s to 20-25K/s.
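As a rough cross-check on these numbers, here is a back-of-envelope model of what a purely serial delay would do to throughput. The model is an assumption, not something taken from the measurements: it supposes that each of the 4 instances handles one request at a time, that the delayed method runs once per request, and that the delay equals the configured timeout exactly. The class and method names are hypothetical.

```java
// Hypothetical back-of-envelope model; the serial, once-per-request assumptions
// are illustrative and are not derived from the metering data.
final class SerialDelayModel {

    // Expected requests per second if every request pays an extra serial delay.
    static double throughputWithDelay(double baselinePerSec, int instances, double delayMicros) {
        // Per-request service time implied by the baseline, in microseconds.
        double serviceMicros = instances * 1_000_000.0 / baselinePerSec;
        return instances * 1_000_000.0 / (serviceMicros + delayMicros);
    }

    public static void main(String[] args) {
        double[] delays = { 5, 10, 100 };
        for (double d : delays) {
            System.out.printf("delay=%.0fus -> %.0f req/s%n",
                    d, throughputWithDelay(45_000, 4, d));
        }
    }
}
```

Under these assumptions a 100 microsecond delay would give roughly 21K/s from a 45K/s baseline, which falls within the observed range. The same model predicts small drops at 5 and 10 microseconds (to about 42.6K/s and 40.4K/s), so it cannot account for the increase seen with the 5 microsecond delay.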
We can repeat the above set of experiments for another identified hotspot this time the
The orange line shows that with a small time delay of 5 microseconds there is already an observable difference in the throughput.
Increasing the time delay to 10 microseconds resulted in a further reduction in throughput, though the additional drop was smaller than the one the 5 microsecond delay caused relative to the original baseline.
When the timeout is increased to 100 microseconds the overall throughput is greatly reduced – by more than two thirds of the baseline. This is the greatest reduction of the two primary hotspots though it should be noted that the frequency of the
transferTo method is twice that of the
A similar set of test runs was conducted with time delays introduced into the ReplayingDecoder.callDecode method. Here is the final chart, which indicates that changes to this method have very little impact on the throughput of the system under test, including the processing capacity (which was fully utilized). In fact, the timeout setting had to be raised to 1,000 microseconds before a noticeable drop in the throughput was observed.
Granted, this method is executed at 1/4 of the frequency of the transferTo method, but even at 10,000 microseconds the throughput was still close to the transferTo rate at a 100 microsecond timeout. How can this be happening? Well, the first clue is that when the instance count in both client and server is dropped down to 1 the throughput rate remains relatively high at 42K/s. The second clue is that overall processor utilization at this instance count level is just over 50%. Clearly the delays introduced above did not always impact throughput as much as expected because there was already implicit queuing and contention built up with the instance count at 4. This also explains why, when a small delay was introduced, there was a slight (counterintuitive) increase in the throughput: we indirectly eased contention on these implicit and explicit queues.
Because of the asynchronous nature of the underlying execution, the methods classified above as latency hotspots are not exactly hotspots in the classical (response time) sense, as their execution is not performed sequentially within a critical path. These methods are best classified as performance limiters in that they feed work to other hotspot methods (and worker threads) via queues. As long as such queues are never completely depleted the throughput is not adversely impacted. As seen from the above charts, though, some methods and their associated queues have a much more pronounced impact on the overall throughput than others, and this is not always reflected in their latency profile.
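The point about queues never running dry can be illustrated with a toy, tick-based model of a feeder stage handing work to a slower worker stage. It is a deliberately simplified sketch: the class name, the fixed periods and the unbounded queue are all assumptions made for illustration, and none of it is derived from the Vert.x internals.

```java
// Hypothetical two-stage pipeline: a feeder enqueues work, a worker drains it.
final class QueueFedPipeline {

    // Number of items the worker completes within `ticks` time units when the
    // feeder enqueues one item every feederPeriod ticks and the worker can
    // complete one item every workerPeriod ticks.
    static int completed(int feederPeriod, int workerPeriod, int ticks) {
        int queued = 0; // current queue depth
        int done = 0;   // items completed by the worker
        for (int t = 1; t <= ticks; t++) {
            if (t % feederPeriod == 0) {
                queued++;              // feeder hands over one unit of work
            }
            if (t % workerPeriod == 0 && queued > 0) {
                queued--;              // worker finishes one unit, if any is waiting
                done++;
            }
        }
        return done;
    }

    public static void main(String[] args) {
        int[] feederPeriods = { 1, 2, 8 };
        for (int p : feederPeriods) {
            System.out.println("feederPeriod=" + p + " completed="
                    + completed(p, 4, 1000));
        }
    }
}
```

In this model, doubling the feeder's per-item time from 1 to 2 ticks changes nothing, because the worker (4 ticks per item) still always finds work waiting. Throughput only falls once the feeder becomes slower than the worker and the queue starts to empty, which is the sense in which the feeder acts as a limiter rather than a classical latency hotspot.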
Note: When I first started writing this post I used “performance inhibitor” in the title, but following discussions with Dr. Neil J. Gunther, a leading expert in performance capacity planning, I switched to “performance limiter”, which he used in describing a similar behavior in his highly recommended Guerrilla Capacity Planning book, in particular in the sections discussing his universal scalability law (USL), which is likely to become increasingly relevant with increasing parallelism in request processing.