Performance Instrumentation & Monitoring of an Efficient Runtime – Akka Part 1
The Actor Model: The actor programming model is being revisited as a means to efficiently use the growing computational capacity available today in large multi-core processor systems. At the heart of this model is message passing between actors (components), which can be local or remote. Actors receive messages and in turn send messages to other actors, which they may have created, and can change their own behavior in response to future messages received. An important aspect of the model is the decoupling of the sender from the receiver, allowing asynchronous communication that is typically buffered, either locally or remotely. The destination of a message, the target actor, is identified by an address, which can be a reference handle or a namespace identifier that can in turn be resolved to a handle. In this regard it is very similar to the Java EE enterprise component model. The message passing capabilities of the model are ideally suited to increasing the parallelism of work, thereby reducing latency and improving the scalability of the software. But like most engineering designs and efforts, these benefits come with trade-offs (in some cases) with regard to observability, traceability, complexity, comprehension, optimization and manageability, especially when placed within runtimes designed specifically for the object-oriented, thread-execution-stack programming model.
Have you ever wondered what happens underneath an application runtime such as Akka? How does its introduction into an application change the underlying execution model and call semantics? What are the runtime (overhead) costs of adopting this new programming model (actors and mailboxes)? How does it change how applications are monitored and managed? Well, a number of our customers have been asking such questions, in particular how they should instrument such runtimes so that the measurements collected remain relevant alongside the more significant instrumentation and measurement performed above and below the runtime. When such questions are posed we ask ourselves whether there is any real value to be obtained in instrumenting core classes within a runtime that cannot already be gleaned from the application (caller) code base, which in nearly all cases is a much more efficient approach due to the adaptive nature of our metering runtime. With the wrong instrumentation and measurement approach you can very easily restrict the usage of the runtime to a particular class of applications and services – those less sensitive to performance and throughput.
To get a better understanding of what goes on under the hood, and of what is possible in terms of latency and throughput, the smallest, simplest and hopefully speediest sample application will be run: two actors playing ping-pong. Here is the PingActor, which implements the onReceive() method in which a count is incremented for every Pong message object received.
Here is the PongActor, which receives the Ping message object sent by the above PingActor calling tell on the PongActor's ActorRef.
Here is the code that bootstraps the actor system and kicks off the ping-and-pong repeating call sequence.
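The original listings are not reproduced here, but a minimal sketch of the two actors and the bootstrap, written against the Akka 2.x Java API of the period, might look like the following. The class and message names are assumptions, not the article's actual code.

```java
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;

// Plain marker messages exchanged between the two actors.
class Ping {}
class Pong {}

// Counts every Pong received and replies with another Ping.
class PingActor extends UntypedActor {
  private long count;

  @Override
  public void onReceive(Object message) {
    if (message instanceof Pong) {
      count++;
      getSender().tell(new Ping(), getSelf());
    } else {
      unhandled(message);
    }
  }
}

// Replies to every Ping with a Pong, keeping the exchange going.
class PongActor extends UntypedActor {
  @Override
  public void onReceive(Object message) {
    if (message instanceof Ping) {
      getSender().tell(new Pong(), getSelf());
    } else {
      unhandled(message);
    }
  }
}

public class PingPongMain {
  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("pingpong");
    ActorRef ping = system.actorOf(new Props(PingActor.class), "ping");
    ActorRef pong = system.actorOf(new Props(PongActor.class), "pong");
    // Seed the exchange: the first Ping is sent to PongActor on PingActor's behalf.
    pong.tell(new Ping(), ping);
  }
}
```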
Let's first start in development mode by collecting the caller-to-callee chains, using the probe.count meter as a good estimate of the average number of direct and indirect callees for each caller (and call path).
From the tracking tree below we can see that it takes on average 37 method calls within the akka.* codebase for the PingActor to tell the PongActor it received its message. More importantly, PingActor.onReceive() never calls the PongActor.onReceive() method from the same thread call stack. Instead an instance of Envelope is created and enqueued on a Mailbox instance associated with the target actor.
Here is a simplified call flow showing the separate thread execution flows: (1) the dispatching of a message to another actor, and (2) the message being invoked on the actor instance behind the actor reference dispatched to.
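The decoupling in that call flow can be modelled with plain JDK classes. In the sketch below, tell() only wraps the message in an envelope and enqueues it, while a pool thread later drains the mailbox on a different call stack; all names are illustrative and not Akka's actual internals.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MailboxSketch {

  // Flow (1): the caller wraps the message and returns immediately.
  static final class Envelope {
    final Object message;
    Envelope(Object message) { this.message = message; }
  }

  // Flow (2): a dispatcher thread dequeues and invokes the actor.
  static final class Mailbox implements Runnable {
    final ConcurrentLinkedQueue<Envelope> queue = new ConcurrentLinkedQueue<>();
    final ExecutorService dispatcher;
    final AtomicInteger processed = new AtomicInteger();

    Mailbox(ExecutorService dispatcher) { this.dispatcher = dispatcher; }

    void tell(Object message) {
      queue.offer(new Envelope(message)); // enqueue on the target mailbox
      dispatcher.execute(this);           // schedule a mailbox run, then return
    }

    public void run() {
      Envelope e;
      while ((e = queue.poll()) != null) {
        processed.incrementAndGet();      // stand-in for Actor.onReceive()
      }
    }
  }

  static int demo() {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    Mailbox box = new Mailbox(pool);
    for (int i = 0; i < 3; i++) {
      box.tell("ping-" + i);              // returns before processing happens
    }
    pool.shutdown();
    try {
      pool.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException ex) {
      Thread.currentThread().interrupt();
    }
    return box.processed.get();
  }

  public static void main(String[] args) {
    System.out.println("processed " + demo() + " envelopes");
  }
}
```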
With regard to performance there are at least three stages in the execution flow that may require measurement and monitoring: (1) the time interval between an ActorRef.tell() call and the enqueuing of the associated Envelope at the target Mailbox; (2) the time interval between the enqueuing of an Envelope and its dequeuing (removal) from its enclosing Mailbox; (3) the time interval between the removal of an Envelope, the unpacking of its payload, and its arrival at the Actor by way of its onReceive() method.
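As a rough illustration of how the second interval could be measured, a timestamp can be taken at enqueue time and compared against the clock at dequeue time. TimedEnvelope below is a hypothetical name, not an Akka class.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class MailboxDelaySketch {

  // An envelope that records when it entered the mailbox.
  static final class TimedEnvelope {
    final Object message;
    final long enqueuedAtNanos = System.nanoTime();
    TimedEnvelope(Object message) { this.message = message; }
  }

  // Enqueue one message, dequeue it, and report its mailbox residence time.
  static long measureOnce() {
    ConcurrentLinkedQueue<TimedEnvelope> mailbox = new ConcurrentLinkedQueue<>();
    mailbox.offer(new TimedEnvelope("ping"));
    TimedEnvelope e = mailbox.poll();
    return System.nanoTime() - e.enqueuedAtNanos;
  }

  public static void main(String[] args) {
    System.out.println("mailbox delay: " + measureOnce() + " ns");
  }
}
```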
Let's now investigate the latency of some of the methods in this call pipeline to see whether it is indeed feasible to instrument and measure (meter) such activity within the framework itself. Here is the initial jxinsight.aspectj.filters.config file I used during benchmarking.
I used the latest early access of JXInsight/OpenCore 6.4 with the following configuration.
By the end of the 100-second benchmark run only 3 high frequency probes (methods) were still being metered; the rest had been disabled, and pretty early on in the run judging by the low counts. The cumulative average for ActorCell.invoke is less than 1 microsecond. For Mailbox.run the cumulative average is 2 microseconds, though it should be noted that this will be overstated due to early adaptations by the metering engine and some instrumentation residue left over in the dynamic compilation of code by the HotSpot JVM.
These very low numbers, though not low relative to a normal method dispatch cost in Java, do give cause for concern if instrumentation were naively applied to the core Akka classes instead of the actual application Actor implementation classes. It would be far more efficient to apply adaptive instrumentation to the application code, especially if the performance profile is very diverse across actors and their message receipts. There would be no adaptation if ActorCell.invoke were instrumented – it would be all or nothing irrespective of the target, though the decision to measure could still be made dynamically.
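To illustrate what adaptation means here, the toy probe below disables itself once the running average cost of the metered code falls under a threshold, so cheap hot methods stop paying measurement overhead. The names and mechanism are illustrative only, not the JXInsight implementation.

```java
// A toy adaptive probe, not a real metering API.
public class AdaptiveProbe {
  private final long thresholdNanos;
  private final int warmup;
  private long totalNanos;
  private int count;
  private volatile boolean enabled = true;

  AdaptiveProbe(long thresholdNanos, int warmup) {
    this.thresholdNanos = thresholdNanos;
    this.warmup = warmup;
  }

  // Returns a start timestamp, or 0 when the probe has switched itself off.
  long begin() {
    return enabled ? System.nanoTime() : 0L;
  }

  void end(long token) {
    if (!enabled) return;
    totalNanos += System.nanoTime() - token;
    count++;
    // Adapt: after a warm-up window, stop metering cheap code paths.
    if (count >= warmup && totalNanos / count < thresholdNanos) {
      enabled = false;
    }
  }

  boolean isEnabled() { return enabled; }

  public static void main(String[] args) {
    AdaptiveProbe probe = new AdaptiveProbe(1_000_000L, 100);
    for (int i = 0; i < 10_000 && probe.isEnabled(); i++) {
      long t = probe.begin();
      // cheap metered work would go here
      probe.end(t);
    }
    System.out.println("probe still enabled: " + probe.isEnabled());
  }
}
```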
Note: There is a case for some level of instrumentation within Akka related to mailbox delay as well as different Actor dispatch configurations but that is a topic for an upcoming article.
Because both the JVM and our metering engine are adaptive you rarely ever get exactly the same results across multiple runs, but it is always good to do at least a second run to be sure there is some similarity.
As with any middleware runtime that places queues between caller and callee, especially one that decouples the flow of execution and applies dynamic thread scheduling, the call cost will be subject to slightly more variance than normal. To see whether this was indeed the case, and also to eliminate the measurement of measurement (nesting of metered probes), I did another benchmark run with the spotcheck metering extension enabled in addition to the hotspot metering extension. Because the spotcheck metering extension takes a random sample (by default 1 out of every 100), I needed to reduce the upper settings by a factor of 10.
The sampled metering counts and totals look relatively similar to our previous measurements in light of the spotcheck metering extension's sampling. Here is the second benchmark run with the spotcheck metering extension.
In this first part of the series the focus has been on identifying suitable observation points in the runtime from which to monitor the system dynamics. Looking at the above code it would be reasonable to assume that only one thread, maybe two, was busy dispatching messages back and forth between the two actors, but this is in fact not the case. The number of threads busy consuming resources (CPU, thread lease time) was much higher, which can in some cases be good for scalability and even response time (queue wait time), though to be honest it came as a bit of a surprise to find so many cores spinning, performing scans in a custom fork of the Java ForkJoin library classes. Here is the CPU history following the start of the above benchmark code. It is certainly not what we expected or would like to happen, and it highlights a problem with new runtimes and programming models that reduce the transparency between the code, its execution and its consumption of resources.
Confused by the excessively high CPU consumption, and thinking that somehow there was more messaging going back and forth between actors than originally designed, the queue metering extension was enabled to ascertain the degree of concurrency (threads) flowing through each probe (method).
Here is the jxinsight.aspectj.filters.config file used. The queue metering extension was enabled in the jxinsight.override.config, and the inherent threshold was also lowered to zero.
Here is a metering snapshot taken during the running of the benchmark, showing that the maximum concurrent thread execution was 1 or 2 for most probes (methods), many of which were disabled early on during the benchmark run. You might be surprised to see that ActorCell.tell has a concurrency of 2, but remember that in theory a target actor can already have completed the processing of a message sent to it by another actor before the return from the ActorCell.tell method invocation – a very small window of overlap, granted. What is more interesting is the Mailbox.run probe (method) queue with a maximum of 5 (threads).
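What such a queue meter tracks can be sketched with two atomic counters: the number of threads currently inside a probe, and the maximum observed so far. Again, the names below are illustrative rather than the actual extension's API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// A toy concurrency ("queue") meter for a single probe.
public class ConcurrencyMeter {
  private final AtomicInteger active = new AtomicInteger();
  private final AtomicInteger max = new AtomicInteger();

  // Called on probe entry: bump the active count and fold it into the max.
  void enter() {
    int now = active.incrementAndGet();
    max.accumulateAndGet(now, Math::max);
  }

  // Called on probe exit.
  void exit() {
    active.decrementAndGet();
  }

  int maxConcurrency() {
    return max.get();
  }

  public static void main(String[] args) {
    ConcurrencyMeter meter = new ConcurrencyMeter();
    // Simulate two overlapping entries into the same probe.
    meter.enter();
    meter.enter();
    meter.exit();
    meter.exit();
    System.out.println("max concurrency: " + meter.maxConcurrency());
  }
}
```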
Using the tracking information previously collected, the jxinsight.aspectj.filters.config file was changed as follows.
Here is the new metering model, which points to the cause of the excessive CPU consumption that is not directly related to the application actor code. I suspect the fast generation and delivery of messages to the mailbox wreaks havoc with the signal co-ordination algorithm used within the forked FJ library, causing all CPU cores (and threads) to spin without doing any useful application work.
And just to be sure that the scan method highlighted above was being re-entered repeatedly by multiple threads, and not just by one, the stack metering extension was enabled.
The high stack frame identifiers and counts for parent callers across multiple threads confirm the validity of the observations up to this point.
In the next part the instrumentation will be narrowed down, eliminating much of the need for adaptive instrumentation and measurement, to focus more on the latency and throughput aspects that cannot be so easily discerned from the typical caller-to-callee profiling.
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)