Optimal Application Performance and Capacity Management via QoS for Apps
We are often called in to help companies solve application performance (management) problems that are in fact capacity (management) problems or, to be more specific, resource management problems. This generally entails profiling, protecting, policing, prioritizing and predicting resource consumption requests (or reservations). In such cases resolving the performance bottleneck goes hand in hand with managing application level resource capacity, because removing one bottleneck commonly introduces a much greater one further downstream once the flood gates have been opened upstream. Unfortunately, knowing that such constraints exist within an application (or system) is the relatively easy part. The hard part is deciding how to control the (work) flow that elicits the related resource consumption and execution behavior.
Note: A performance bottleneck is used here to refer to a point in the execution at which throughput is decreased and/or latency increased.
Somewhat counter-intuitively, we generally need to introduce delays (choke/throttle/shaping points) or buffers in order to meet overall performance objectives per some service level agreement (SLA). But again, setting the parameters for such controlled delays is not straightforward, and the result is not always (near) optimal, at least not initially. In this article I will show how our activity based resource metering technology helps alleviate some of the trial and error (and possible waste) involved in introducing and configuring such control points, using the Quality of Service (QoS) for Apps technology in JXInsight/OpenCore and combining it with self adjusting feedback loops.
To demonstrate the application of QoS in the management of performance via the management of (resource) capacity I have constructed a test application that introduces one of the most common problematic constraints (bottlenecks) in many enterprise Java applications – sub-optimal heap memory management (capacity & usage) leading to frequent and possibly prolonged garbage collection cycles and other related issues (context switching and associated memory costs).
The Worker class listed at the bottom of this article mimics the servicing of a request, in particular the generally high allocation rates of many small objects (strings, collections). For each invocation of
BigDecimal instances are added to a
Below is a chart depicting the total units of work done (throughput) across different numbers of concurrent threads, from 1 up to 64 (in powers of two). The tests were performed on an iMac Intel Core i7 (4 cores, 2.8 GHz) with Java 1.6.0. The throughput peaks at 2 threads, drops slightly at 4 and then plummets (buckles) at 8.
One way to address the problem would be to limit the concurrency level upstream in the request servicing or in the case of our simple application at the primary execution point –
This is very easy to do with JXInsight/OpenCore. We simply install the instrumentation agent, which dynamically instruments loaded application classes and methods with metering probes. Then we enhance a smaller subset of these instrumentation points (probes) with QoS for Apps. Here is the jxinsight.override.config file I used to do this.
With the above configuration, every time the worker QoS (virtual) service, mapped to the doWork method, executes it will reserve a single unit of capacity from the cpu named QoS (virtual) resource. This reservation will be held until the worker service finishes its execution (the doWork method completes). In setting the capacity of the cpu named resource to 4 we have restricted the degree of concurrent execution of the doWork method beyond the immediate entry into it, at which our metering instrumentation has been injected.
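For readers without the product at hand, the effect of a capacity of 4 with one unit reserved per call can be approximated in plain Java with a counting semaphore. This is only a sketch of the idea, not the QoS for Apps implementation: CpuResource, LimitDemo and peakConcurrency are hypothetical names, the body of doWork is a placeholder, and the code uses Java 8 syntax, which is newer than the Java 1.6 runtime used in the tests.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the "cpu" resource: a fixed pool of capacity units.
final class CpuResource {
    private final Semaphore units;

    CpuResource(int capacity) {
        this.units = new Semaphore(capacity);
    }

    // Reserve one unit, run the work, release the unit once the work completes.
    void execute(Runnable work) {
        units.acquireUninterruptibly();
        try {
            work.run();
        } finally {
            units.release();
        }
    }
}

final class LimitDemo {
    // Pushes `threads` threads through the resource and reports the peak concurrency seen inside it.
    static int peakConcurrency(int capacity, int threads) {
        CpuResource cpu = new CpuResource(capacity);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        Runnable doWork = () -> {
            peak.accumulateAndGet(active.incrementAndGet(), Math::max);
            Thread.yield(); // placeholder for the real doWork body
            active.decrementAndGet();
        };
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int n = 0; n < 100; n++) cpu.execute(doWork);
            });
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return peak.get();
    }
}
```

Calling LimitDemo.peakConcurrency(4, 64) should never report more than 4, however many threads are started. Unlike this sketch, the approach in the article requires no change to the application code because the control point is attached to an existing probe.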
Note: QoS for Apps is extremely innovative in its approach to resource management, which involves modeling the system dynamics in terms of services (flows) and resources (stocks), then mapping these logical/virtual modeling elements to named probes (qualified packages, classes or methods as well as web URLs or SQL strings) and combining them with feedback loops built around metering measurements.
Here is the revised work unit throughput chart from 8 concurrent threads up to 64. The sharp drop off in throughput at 8 concurrent threads has vanished and we did not have to change any code or introduce an explicit dependency on an elaborate work/resource management framework.
The above was a good start at introducing some late binding execution control, but how can we be sure that 4, which I based on the number of cores on the machine, was the right setting? What if we had another workload characteristic that allowed higher concurrent thread levels? Well, if you look back at the first chart you will see it was possible to get just over 10% more throughput with only 2 concurrent threads.
Ideally we need a way to introduce a feedback loop into the reservation system that automatically adjusts up and down the required reservation units based on the performance of the worker service execution. Here the units don’t equate to threads but time itself. Knowing the average time is less than 1,000 microseconds for a worker service request the capacity is now set at 4,000 on the cpu named resource. And instead of using the default "one" reservation strategy the resource is configured to use the "meter" reservation strategy, which predicts the reservation requirements of a service wishing to proceed based on the metering profile of the probe associated with the service at that particular point in the thread's execution.
With the feedback loop, if the initial reservations are understated we will get a build up in contention for the resource (constraint), which will over time increase the latency of the servicing (which is metered). This in turn increases the required reservation for future services, leading to faster depletion of the resource capacity and more introduced delay as threads wait for reserved capacity to be released by in-progress probes. Likewise, if predicted reservations are temporarily overstated the response time will drop with less contention, leading to a reduction in the reservation requirements and further increasing the level of concurrency. Here is the system diagram.
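The same loop can be expressed as a rough sketch in code. Everything here is an assumption made for illustration: MeteredResource is a hypothetical class, the prediction is a simple exponentially smoothed average (weight 0.2) of the measured service time, and the clock starts before the wait so that queueing delay feeds back into the next reservation. The real "meter" strategy works from the probe's metering profile and may differ in every one of these details.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model: reservations are sized from the metered service time.
final class MeteredResource {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();
    private final long capacity;   // total units, e.g. 4,000 (microseconds)
    private long available;
    private double averageMicros;  // smoothed metered service time

    MeteredResource(long capacity, double initialEstimateMicros) {
        this.capacity = capacity;
        this.available = capacity;
        this.averageMicros = initialEstimateMicros;
    }

    // Predicted reservation: the current average, at least 1 unit, at most the whole pool.
    long predictReservation() {
        lock.lock();
        try {
            return Math.max(1L, Math.min(capacity, Math.round(averageMicros)));
        } finally {
            lock.unlock();
        }
    }

    // Feedback step: fold a new measurement into the average (assumed weight 0.2).
    void record(long elapsedMicros) {
        lock.lock();
        try {
            averageMicros = 0.8 * averageMicros + 0.2 * elapsedMicros;
        } finally {
            lock.unlock();
        }
    }

    long availableUnits() {
        lock.lock();
        try {
            return available;
        } finally {
            lock.unlock();
        }
    }

    void execute(Runnable work) {
        long start = System.nanoTime(); // started before the wait so queueing delay is metered
        long units;
        lock.lock();
        try {
            units = predictReservation();
            while (available < units) released.awaitUninterruptibly();
            available -= units;
        } finally {
            lock.unlock();
        }
        try {
            work.run();
        } finally {
            long elapsedMicros = (System.nanoTime() - start) / 1000L;
            lock.lock();
            try {
                available += units;     // give the reservation back
                record(elapsedMicros);  // slower service means a larger next reservation
                released.signalAll();
            } finally {
                lock.unlock();
            }
        }
    }
}
```

With a capacity of 4,000 and an estimate near 1,000 microseconds, roughly four callers fit at once. As contention pushes the measured time up the predicted reservation grows and fewer callers fit, which is the balancing behavior described above.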
The resulting throughput has increased for test runs with 8 and 16 threads but there is a slight dip at 32 and 64 which will be addressed shortly.
The dip above was more than likely caused by a large increase in the estimated (predicted) reservation units, leaving some capacity in the pool (stock) but just not enough for another thread to proceed with its execution. Over a much longer time window (> 1 min) this would probably have corrected itself automatically. But it can be remedied by putting a maximum limit on the reservation that a worker service can make of the cpu resource, ensuring we always have at least 2 concurrent threads.
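The arithmetic behind the cap is worth spelling out. The sketch below assumes the capacity of 4,000 and the maximum of 2,000 used in this article. ReservationCap and its methods are hypothetical helpers, not part of the product.

```java
// Hypothetical helper showing how a cap on the reservation bounds concurrency.
final class ReservationCap {
    // Clamp a predicted reservation to the configured maximum.
    static long capped(long predicted, long max) {
        return Math.min(predicted, max);
    }

    // Number of services that can hold a reservation at once when each one is at the cap.
    static long guaranteedConcurrency(long capacity, long max) {
        return capacity / max;
    }

    public static void main(String[] args) {
        long capacity = 4000, max = 2000;
        for (long predicted : new long[] {800, 1900, 2600, 3500}) {
            long uncapped = capacity / predicted;               // holders that fit without the cap
            long withCap = capacity / capped(predicted, max);   // holders that fit with the cap
            System.out.println(predicted + " units: " + uncapped + " uncapped, " + withCap + " capped");
        }
    }
}
```

For a prediction of 2,600 units the uncapped pool admits a single service and strands the remaining 1,400 units, while with the cap the same pool admits two.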
Here is the revised system diagram with the "max=2000" influence on the reservation flow included.
The change in the configuration has delivered the desired (if near optimal) throughput results.
Here is a bar chart comparing the results across all test runs. The baseline represents the behavior of the system without any QoS enhancement.
Up to now the focus has been on throughput, so what of the response time? Well, during the testing I collected metering quantization data. Here is the distribution for the single threaded baseline execution.
Here is the baseline execution with 64 concurrent threads. The distribution has changed significantly, with many of the requests above 500 microseconds and the frequency reduced to less than 25% of the previous.
Here is the service time distribution with 64 concurrent threads using the first QoS configuration. Much much better with a clustering around 256 and 512 microseconds.
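The clustering around 256 and 512 microseconds suggests the distributions are bucketed on powers of two. That is my assumption here rather than something the quantization data itself states. Under that assumption, a minimal sketch of such a quantization follows, using hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical power-of-two quantization of service times in microseconds.
final class Quantization {
    // Lower bound of the power-of-two bucket containing the value (values <= 0 map to 0).
    static long bucket(long micros) {
        return micros <= 0 ? 0 : Long.highestOneBit(micros);
    }

    // Count of samples per bucket, ordered by bucket lower bound.
    static Map<Long, Integer> histogram(long[] samples) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long sample : samples) {
            counts.merge(bucket(sample), 1, Integer::sum);
        }
        return counts;
    }
}
```

A request taking 300 microseconds lands in the 256 bucket and one taking 600 lands in the 512 bucket, so a shift of the whole distribution to the right shows up as counts moving into higher buckets.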
Note: The metering measurements include any possible delay introduced by the QoS runtime while waiting on reservation requests, which had a maximum timeout of 1000 milliseconds in the above configuration, after which a kind of barging occurs without reservation or release by the service.
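The timeout-then-barge behavior in this note can be sketched with a timed semaphore acquire. This only illustrates the semantics as described and is not the QoS runtime. TimedReservation is a hypothetical class and the timeout is a constructor argument rather than a configuration property.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: wait for a unit up to a timeout, then proceed without one.
final class TimedReservation {
    private final Semaphore units;
    private final long timeoutMillis;

    TimedReservation(int capacity, long timeoutMillis) {
        this.units = new Semaphore(capacity);
        this.timeoutMillis = timeoutMillis;
    }

    int availableUnits() {
        return units.availablePermits();
    }

    // Returns true if the work ran holding a reservation, false if it barged after the timeout.
    boolean execute(Runnable work) {
        boolean reserved;
        try {
            reserved = units.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            reserved = false;
        }
        try {
            work.run();
        } finally {
            if (reserved) {
                units.release(); // a barging caller never reserved, so it must not release
            }
        }
        return reserved;
    }
}
```

Skipping the release on the barging path matters: releasing a unit that was never reserved would silently grow the pool beyond its configured capacity.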
"meter" reservation strategy and with the capping of the reservation requirements we get extremely close to the single thread execution behavior for a 64 thread test.
Here is the Worker class used in the testing.
And here is the remaining snippet of the class code used to generate the test workload.