Performance Instrumentation & Monitoring of an Efficient Runtime – Akka Part 2
In this part we look at the round trip (wall clock) time distribution for a request-response style message sequence exchange between two local Akka actors –
PongActor. In the course of the performance analysis we uncover an unexpected performance behavior, which should be given due consideration in the application and configuration of the Akka runtime in certain application environments with particular workload patterns and latency requirements. Whether the performance observations are applicable in a real world context is something that can’t be so easily ascertained without further measurements under actual workload conditions. At this stage in our investigation it would appear that the out-of-box configuration of actors and their mailboxes might not always give the optimal results in terms of call latency or CPU consumption. Like most software engineering endeavors there are trade-offs to be made, which need to be understood and reasoned on in light of expected or actual workloads, resource constraints, and performance objectives.
Note: Much like other component technologies such as Enterprise JavaBeans (EJB) and CORBA, careful consideration must be given to the promotion of a normal Java/Scala class/object interaction to an Actor (mailbox) interaction. Done liberally (overzealously) at design/code time, rather than dynamically at runtime when much more information is available and management needs better defined, is likely to cause many more downstream problems than is hoped to be solved by the adoption of the technology.
Timing the interval between a message dispatch to
PingActor and then the receipt of the return result from
PingActor is somewhat tricky because of the decoupling of the execution via each actors mailbox (and dispatcher). Here is the call flow diagram introduced in part 1 depicting this partitioned execution behavior.
One way to measure the round trip time is to attach a timestamp value (in nanoseconds) to the message payload that is sent and returned by both actors.
Instead of the
PongActor class sending back a
Pong message object it now sends back the very same message object that it received. We tried to eliminate as much as possible object allocation which is an all too common occurrence (and performance problem) with actor runtimes and functional programming.
Then in the
PingActor.onReceive method a metering counter,
rtt.time, is updated with the difference between the current system timer value and the timestamp within the
Ping message it received. The JXInsight/OpenCore metering runtime will read the counter value on entry into the
onReceive method, before the update, and then read it again on exiting the method. The change in the counter will then be assigned to the
onReceive method even though it reflects a time interval between last line executed in a previous call and the first line executed in this current execution frame.
PingActor class need only be instrumented with metering probes so the following line placed in a
jxinsight.aspectj.filters.config file is sufficient.
Lastly the roundtrip time counter,
rtt.time, is mapped to a meter in a
The default enabled
clock.time meter was disabled as it is not needed here.
Below is the metering frequency distribution for 100 seconds of execution with
PongActor continuously calling back and forth. The vast majority of the round trip calls take 1 microsecond though it should be noted that all timings were rounded down in the code, in the conversion from nanoseconds to microseconds. But there are some performance outliers, with a bunch of them in the 512-1,024 microsecond latency band.
Note: These results would pale in comparison to the frequency and low latency of normal caller-to-callee method invocation in Java, within the same thread context.
What was not shown in the figures above was the extreme and excessive CPU consumption on the machine. In performing 80 million round-trip calls 4 cores (8 threads) were kept 100% busy for the whole 100 seconds of the test run.
Naturally we were concerned that the excessive CPU consumption was the cause of the high latency outliers. Our first attempt to fix this was to put in a
Thread.yield() call to throttle ever so slightly the rate of the pinging and ponging, which seemed to be causing extreme levels of thread signaling and subsequent mailbox (queue) scans. The call was inserted into the code so that it would impact the call frequency but not the actual round trip timing.
We were surprised to find out that after re-running the 100 seconds benchmark that many of the calls had been pushed into higher latency bands.
There was a slight drop in the CPU consumption, but still excessive considering the number of core threads we would expect to be kept busy would be 2 at max.
Next we replaced the
Thread.yield() call with
Thread.sleep(0). Pretty much the same. Actually very similar in shape.
We then replaced the
Thread.sleeep(0) call with a
LockSupport.parkNanos(1l) call. Though we expected a much bigger drop in frequency than previously, the pushing up of the distribution into higher latency bands was initially surprising until we looked at the CPU consumption.
With the temporary parking of the thread, which is not guaranteed to be the value specified, the CPU consumption had halved from our initial test but this had come at a cost of higher average latency and far more latency outliers though not as high latency when consuming excessive amounts of CPU.
parkNanos() parameter value to 100,000 (100 microseconds) continued the trend of increasing the latency of the round trip calls, even though the call itself has no bearing on the actual measurements made. The majority of calls now fall in the 8-16 microseconds instead of 1-2 microseconds.
Here are the CPU consumption charts.
parkNanos() parameter value to 1,000,000 (1 ms) resulted in the majority of calls now falling into the 32-64 microseconds latency band.
At 10 ms more and more calls are pushed further up in higher latency bands. It would appear that to maintain any sort of low latency, throughput must be maintained throughout the execution of the test, otherwise we get a hit with migrating work from one active thread to another thread that maybe parked.
At 100 ms we reached our upper limit though the bands are not anywhere near as big as those when all CPU were spinning but throughput was relatively high.
Update 17th Dec 2012 – Less is sometimes More
The default configuration of Akka applications sets the minimum thread pool size of the
Executor service associated with the default
Dispatcher to 8. Changing this addresses the CPU spinning problem.
# Min number of threads to cap factor-based parallelism number to
parallelism-min = 1
# Parallelism (threads) ... ceil(available processors * factor)
parallelism-factor = 0
The above configuration change has made a world of difference with the throughput more than doubled, the latency now between 0-2 microseconds (note we round down), and only 1 core thread fully utilized.
Unfortunately the above configuration also restricts scalability of the application by limiting message processing to 1 (concurrent) thread, irrespective of the number of actors created. We must change both the code and the application actor configuration in step to obtain increasing scalability.
Note: In our opinion a truly scalable solution is one that scales both up and down, preferably automatically in response to changes in demand and available capacity.
Below is the revised
throughput = 1
Now all that is needed is to set two system properties for each test run.
With 2 core/threads fully utilized the overall throughput has increased though most times have been pushed up into the 1-2 microsecond latency band.
Changing both system property values to 3 increases the throughput with more calls pushed up into the 2-4 microsecond latency band.
With a value of 4 for both system properties throughput again increases with 2-4 microseconds being the largest latency band though the average within this band is just over 2.
At a thread pool size of 5 the 2-4 latency band now dominates the distribution but still the average is just over 2 microseconds. There has been a noticeable increase in the 4-8 latency band but still very small relative to the overall total. What is interesting is the small consistent frequency spike in the 512-1024 latency band.
A similar pattern of increases appears with both system properties set with a value of 6.
Increasing the thread pool to 7 has the same effect with the average still hanging around 2 microseconds.
Finally a thread pool of size 8 maxing out all core/threads on the machine.
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
OS X 10.8.2 2.6 Ghz Core i7
8 GB 1600 MHz DDR3