A Preliminary Performance Analysis of the Vert.x (vs Node.js) HTTP Benchmark
Tim Fox recently posted the somewhat controversial results of a “simple http” benchmark shootout between Vert.x and Node.js. Though Tim does put forward a disclaimer at the top of the posting, we think the benchmark itself serves a useful purpose, even if only to encourage the development teams behind competing offerings to give much greater attention to performance and scalability.
Below you will find our preliminary performance analysis of the Vert.x HTTP benchmark, conducted over a few hours. It is not as comprehensive as our Scala compiler case study, but that reflects the benchmark itself being narrow in focus – the performance of the underlying software on bare metal.
Note: The tests were conducted on an iMac Intel Core i7 (4 cores, 2.8 GHz) with Oracle’s Java 1.7 developer preview. Both the client and server had the instance parameter set to 4.
Note: We never looked at any of the source code (Vert.x or Netty) other than the two application classes representing the client and server. We like to see how far our intelligent activity metering engine can get us without the clues a perusal of a code base would give, though naturally the two should go hand in hand (at some stage) in any performance investigation.
Here is the single application class used to serve client HTTP requests.
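The class itself appears as an image in the original post, so we do not reproduce it here. As a rough stand-in, a minimal server of the same shape, using the JDK's built-in HttpServer rather than the Vert.x API Tim used, might look like this:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Stand-in for the benchmark's server class: answer every request with a
// small static body, much as the Vert.x server answered with a static file.
// This is an illustrative sketch, not the code from the original post.
public class PerfServer {
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] body = "hello".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start(); // serves requests on background threads
        return server;
    }
}
```

Passing port 0 lets the OS pick a free port, which is convenient when exercising the server locally.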
Here is a sample of a clock.time metering snapshot for the above class, without our intelligent activity metering hotspot strategy enabled, using a threshold of just 5 microseconds, which is generally extremely low but acceptable at this limited level of class instrumentation.
With the metrics extension enabled in the metering engine, the throughput delta of the handle method proved to be stable.
Here is the same data as above, this time with the metrics chart using a baseline of the minimum value rather than zero. This accentuates the variation but also reveals a trend: the rate of change in the clock.time cumulative metric total declines whilst the throughput count metric remains relatively steady.
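Computed naively, a throughput delta is just the difference between successive cumulative samples. A minimal sketch of that arithmetic (our own illustration, not the metering engine's code):

```java
// Compute per-interval deltas from a cumulative (monotonically increasing)
// metric series, e.g. a clock.time total or a call count sampled every
// 10 seconds. Assumes at least one sample is present.
public class ThroughputDelta {
    public static long[] deltas(long[] cumulative) {
        long[] out = new long[cumulative.length - 1];
        for (int i = 1; i < cumulative.length; i++) {
            out[i - 1] = cumulative[i] - cumulative[i - 1];
        }
        return out;
    }
}
```

A flat delta series indicates steady throughput; a declining one is the trend visible in the chart above.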
Next we instrumented the client process generating the workload. Here is a snippet of the code that does this. Note that the makeRequest method can make zero or more client.getNow() calls depending on the available credits.
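The snippet is likewise shown as an image in the post; the credit mechanic it describes can be sketched as follows (the CreditPacer name is our own, and the IntConsumer stands in for the client.getNow() call):

```java
import java.util.function.IntConsumer;

// Sketch of credit-based request pacing: makeRequest fires zero or more
// requests, one per available credit, and each completed response hands
// a credit back. Illustrative only, not Tim's client code.
public class CreditPacer {
    private int credits;
    private int inFlight;
    private final IntConsumer send; // stand-in for client.getNow(...)

    public CreditPacer(int initialCredits, IntConsumer send) {
        this.credits = initialCredits;
        this.send = send;
    }

    // Issue as many requests as there are credits; may issue none.
    public int makeRequest() {
        int issued = 0;
        while (credits > 0) {
            credits--;
            inFlight++;
            send.accept(issued++);
        }
        return issued;
    }

    // Called from the response handler: the credit is returned.
    public void onResponse() {
        inFlight--;
        credits++;
    }

    public int inFlight() { return inFlight; }
}
```

The point of the design is back-pressure: the client never has more requests outstanding than it has credits, so the server is loaded at a bounded concurrency.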
Below is a sample of a metering snapshot made during a benchmark run.
Note: The latency measurements do not accurately reflect the response time (and execution cost) for an HTTP request, due to the asynchronous nature of the execution.
The metrics charts with a zero baseline show relatively stable performance behavior, with metric sampling performed at 10-second intervals.
Here are the same metric charts with a minimum-sample-value baseline, highlighting the same performance improvement trend seen in the server results.
Next we instrumented the complete server code base to see what happens above and below the application call execution. We used the following configuration, accepting that there would initially be some “overhead” noise in the benchmark run due to the instrumentation and measurement cost incurred during the early stages of our agent’s intelligent activity metering, as well as the threshold being set so low.
Note: The top section of this page explains how the average and inherent-average values change over the course of multiple performance models.
Next we used the hotspot classification of probes within the metering model to generate (via a simple command-line utility) a refined instrumentation plan for subsequent benchmark runs. After several iterations of instrumentation refinements and benchmark runs, the agent settled on the following performance model (within our parameters).
The configuration was then changed to eliminate from the refined instrumentation plan those probes with an average inherent cost of less than 1 microsecond.
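This refinement step can be pictured as a simple filter over the plan. The Probe type and refine method below are hypothetical stand-ins for the metering model's actual types, used only to make the arithmetic concrete:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the refinement described above: drop from the instrumentation
// plan any probe whose average inherent cost falls below a threshold
// (1 microsecond in this run). Probe is a hypothetical record type.
public class PlanRefiner {
    public static class Probe {
        public final String name;
        public final long inherentTotalNanos;
        public final long count;

        public Probe(String name, long inherentTotalNanos, long count) {
            this.name = name;
            this.inherentTotalNanos = inherentTotalNanos;
            this.count = count;
        }

        // Average inherent cost per firing, in microseconds.
        public double avgInherentMicros() {
            return (inherentTotalNanos / 1000.0) / count;
        }
    }

    public static List<Probe> refine(List<Probe> plan, double thresholdMicros) {
        List<Probe> kept = new ArrayList<>();
        for (Probe p : plan) {
            if (p.avgInherentMicros() >= thresholdMicros) kept.add(p);
        }
        return kept;
    }
}
```

Probes below the threshold cost more to meter than the insight they return, so pruning them reduces measurement noise in the next run.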
Here is the final performance model after 2 refinements and benchmark runs.
Further benchmarking and refinement was done with the threshold increased to 2 microseconds.
Finally, one last run was conducted with the strategy extension disabled but using the refined instrumentation set. Below is a sample of a metering snapshot made during this run.
The top 2 execution hotspots in terms of inherent latency are the DefaultFileSystem$11.action and HashedWheelTimer$Worker.waitForNextTick methods, with most of the workload coming from the underlying I/O.
The server appears to be mostly N(et/)IO workload bound, which, though expected considering the benchmark, does indicate that Vert.x introduces very little additional overhead (at least none observable with the current settings), making extremely efficient use of the underlying stack, including Netty. That said, we have made observations, not presented here (yet), indicating that the throughput of this benchmark can indeed be bettered without any code changes by using Quality of Service (QoS) for Apps to throttle (delay) the processing of traffic at key points in the execution, so that there is some contention for resources (memory, processor) beyond IO.
Enabling the metering tracking extension, we can see the call path amongst the hotspots. This view is also useful in allowing us to see how the path frequency changes between caller and callee.
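Caller-to-callee path frequency tracking can be modeled as counting traversals of each call edge. A toy sketch (our own illustration; the method names in the usage assertions are hypothetical, not taken from the snapshot):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of call-path frequency tracking: count how often each
// caller -> callee edge is traversed, so that shifts in path frequency
// between runs become visible.
public class PathTracker {
    private final Map<String, Long> edges = new HashMap<>();

    public void record(String caller, String callee) {
        edges.merge(caller + " -> " + callee, 1L, Long::sum);
    }

    public long frequency(String caller, String callee) {
        return edges.getOrDefault(caller + " -> " + callee, 0L);
    }
}
```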
Note: We could probably improve on this performance model by re-enabling the nonrecursive metering extension and eliminating the metering of the nested write call under the outer write.
Next we did a benchmark run with metering quantization enabled to better understand the latency distribution of the two main entry point probes. Here is the distribution for processEventQueue, indicating that a significantly large section of the measured population has very little latency.
The distribution for the other entry point shows a clustering in the 32-128 microsecond range.
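Quantization of this kind typically groups samples into power-of-two buckets, which is how ranges such as 32-128 microseconds arise. A minimal sketch of such a scheme (an assumption on our part, not the metering engine's actual implementation):

```java
// Bucket a latency sample into power-of-two microsecond ranges, e.g. the
// 32-64 and 64-128 microsecond buckets where the second entry point
// clusters. Returns the lower bound of the bucket containing the sample.
public class LatencyQuantizer {
    public static long bucketFloorMicros(long micros) {
        if (micros <= 0) return 0; // sub-microsecond samples share one bucket
        return Long.highestOneBit(micros);
    }
}
```

Power-of-two bucketing keeps the histogram compact while preserving relative resolution across many orders of magnitude of latency.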
The following shows the throughput delta for each of the methods listed above over a measurement window of a few minutes. From this we can see the ramp-up in the rate for both of the methods.
One issue raised on the original benchmark posting concerned how the instance parameter passed to the server process changes the utilization of cores, especially considering Node.js is single-threaded. Here is the metering queue analysis with the instance parameter set at 4.
With the instance parameter set at 1 we get the following queue analysis. Of course this does not mean Vert.x only ever uses one core (it is multi-threaded), but that in servicing requests it executes them using only a single worker thread at a time.
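This single-worker behavior can be illustrated with plain java.util.concurrent (a stand-in for the idea, not Vert.x internals): with one instance every request handler lands on the same thread, while the process as a whole remains multi-threaded.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// With "instances" worker threads, count how many distinct threads end up
// servicing a batch of request-handler tasks. One instance => one thread.
public class WorkerDemo {
    public static int distinctHandlerThreads(int instances, int requests)
            throws InterruptedException {
        ExecutorService loop = Executors.newFixedThreadPool(instances);
        Set<String> threads = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < requests; i++) {
            loop.submit(() -> threads.add(Thread.currentThread().getName()));
        }
        loop.shutdown();
        loop.awaitTermination(10, TimeUnit.SECONDS);
        return threads.size();
    }
}
```

With instances set to 4, up to four threads share the work, which is why the queue analysis above shows multiple cores being utilized.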