OutOfMemoryErrors
 



Introduction
Matt from JavaLobby recently asked a question in his weekly newsletter section regarding effective tools for hunting down memory leaks causing out of memory exceptions. He was unable to resolve this problem with JProfiler, a well known and one of the better code level developer profiling tools on the market today.

"I’ve tried attaching to a profiled version of the server with JProfiler, but that tends of lock up after a few minute of profiling (too soon to show me a leak, I need a day or so). The few minutes I do get to profile, everything looks normal. "

This reflects my own experience with many of the code level profilers on the market including Borland OptimizeIt and Quest JProbe. I have never been able to get such tools to memory profile a "real world" J2EE application even with a single user request thread. The JVM memory growth when using such tools can be enormous resulting in JVM crashes well before a use case is completed. This is not necessarily a reflection on the quality of the tools themselves but the fact that the applications being profiled are generally large hence the need for the profiler and that aspects of the JVMPI API have not necessarily been designed with such demands in mind.


What is an OutOfMemoryError?

Lets first start with the class hierarchy.

Object
Throwable
Error
VirtualMachineError
OutOfMemoryError

An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch. Most such errors are abnormal conditions. A VirtualMachineError is thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating. An OutOfMemoryError is thrown when the Java Virtual Machine cannot allocate an object because it is out of memory, and no more memory could be made available by the garbage collector.

So the error is a memory capacity issue and not necessarily an memory leakage. Of course a memory leakage will reduce available capacity but one cannot equate all OutOfMemoryError's with a leakage. A more common reason for such errors at least in production environments, where one would expect the application to have already been tested, is a surge in request traffic and data volume per request. Here the JVM has not enough memory available at a particular point in time to service all the concurrent requests. The JVM is unable to grow the process memory and all of the heap is strongly referenced by threads actively servicing requests. When thrown most J2EE application servers will catch the error and append additional information to a log rather than crash out especially when the error happens on a request handling thread.

Note
: Catching an OutOfMemoryError can be somewhat dangerous in an unmanaged environment if the error occurred during the access and manipulation of a shared data structure or resource. Some actual home grown servers have even worse error handling whereby the JVM stays up and running after the error but with all request reader and processor threads disabled. For a monitoring tool the JVM process looks to be available but in fact from a client perspective the server is in a coma.


The standard approach to resolving capacity issues is to reduce the number of worker threads at each layer creating a threading funnel effect. That said determining an appropriate configuration of the various thread pools at each layer is problematic and is unlikely to guarantee optimal performance under varying workloads. Before starting to change configuration parameters it is best to first acquire an insight into the actual resource consumption patterns over a representative period in a pre-production or production environment.


The Customer Problem
JInspired was recently contacted by a global technology and services company that was experiencing OutOfMemoryError occurrences with a J2EE/CORBA application server in a production that had crashed a number of times during the same period of daily operation. The company which had tried various solutions and spent considerable resource time in trying to resolve the problem had resorted to a temporary measure of restarting the server each day before the troublesome period. In their testing labs they were unable to reproduce the problem partly because they could not replicate the same workload and still have a code profiler stay alive and be usable. They needed to find a solution that could be introduced into a production environment that allowed dynamic configuration of profiling and tracing data capture during specified periods in the day that did not require real-time console to server interaction. Matt at JavaLobby had similar ideas in how to resolve his own production problems.

"I think that what I really need is a way to replay the log files for two days and profile the local copy instead of trying to profile a remote one. My major problem is that I can’t find anything to handle the large log files I have and replay those..."

The company informed us that occasionally the JVM stayed alive and was able to service requests post the OutOfMemoryError occurrence but on other occasions the JVM completely shutdown. They needed a solution that could still capture sufficient profile and contextual information to ensure fast problem resolution even when the JVM process terminated unexpectedly.


The JXInsight Solution
JXInsight has some unique capabilities that can help resolve this problem in a production environment.

Terminal: A scripting tool allowing for dynamic stopping and starting of profiler as well as snapshot resetting and exporting

Trace Extensions
: Flexible and extensible distributed profiling and tracing

Low Overhead:
The event capturing performed by JVM agent ensures the lowest profiling overhead.

Multiple JVMPI Counters
Per Interval: Recording of clock time, cpu time, gc time, object allocation sizes, thread waiting and blocking times.

Memory Consumption:
JXInsight has the ability to capture JVM object allocation events and record the bytes sizes per trace, transaction, or interception recording interval with configurable aggregation at the cluster, host, process, thread group or thread levels.

UI Console
: Powerful and effective visualizations

Agent Architecture
: No monolithic management server. Loosely coupled (and potentially disconnected) distributed agents.

Data Collection Modes
: The profiler supports different data collection modes - Metric, Profile and Timeline.

Shutdown Hooks
: By default the profiler will export all active profiler mode snapshots on shutdown of the JVM

Snapshot Merging
: All snapshots can be merged in on-line or off-line modes allowing for flexible clustering analysis


The Customer Solution
After discussing the many features and options applicable the company decided to execute the following performance engineering strategy:

Production Deployment: JXInsight would be deployed into the production environment on one particular node running a relatively small cluster of J2EE servers

Hibernate Mode:
The profiler would be disabled on startup via the system property -Djdbinsight.server.profiler.autostart=false. The JVMPI agent was loaded into the JVM with default settings with the JVM argument -Xrunjdbinsight

JVMPI Events
: Object allocation JVMPI events would be turned on and off to capture the actual memory consumption of the IIOP and EJB request patterns. This would provide insight into the consumption patterns while at the same time ensuring that there would be little performance degradation. If you are looking for a memory leak that grows rapidly a good first step is to find the components responsible for the actual object creation. Then look at the surrounding components that consume or manage (cache) the data.

Integrate CORBA and EJB Trace Extension
: The customer deployed JXInsight's VisiBroker™ CORBA and J2EE™ EJB trace extensions capturing complete request processing resource consumption metrics per request pattern.

Scheduled Script:
Two OS command scripts where created and scheduled to dynamically turn on and off the JVMPI events and profiling data collection modes as well as exporting snapshots before and after the period of operation.

The scheduled start script executed
> jdbiterminal monitor -events gbwa -profile start -timeline start -h VFONE -p 1515

The scheduled stop script executed
> jdbiterminal script -execute -f c:\JXInsight\memoryleak.script

The script file, memoryleak.script, referenced by the stop script contained the following terminal commands
snapshot -create -type pr -f MemoryLeak_pr.snp -d c:\JXInsight\snapshots -host VFONE -p 1515
snapshot -create -type tl -f MemoryLeak_tl.snp -d c:\JXInsight\snapshots -host VFONE
-p 1515
snapshot -create -type mx -f MemoryLeak_mx.snp -d c:\JXInsight\snapshots -host VFONE
-p 1515
monitor -events bw -profile stop -timeline stop -h VFONE
-p 1515

Collection Modes: Profile, Timeline and Metric snapshots would be collected and exported either via the terminal or a JVM shutdown hook.

Analyze Exported Snapshots
: As this was a serious production problem JInspired would provide a rapid snapshot analysis service - OneHour Snapshot Processing.

Within minutes of receiving the snapshot I was able to pinpoint the actual cause of the "MemoryLeak" problem. It was not a memory leak but a memory capacity issue caused by a very large data processing request that had been routed from a messaging backbone and integrated into the application via IIOP. The following screen shot shows the Traces table view in the Traces perspective for an exported Profile mode snapshot. The screen shot has been altered to conceal the component names and method signatures at the request of the company.



So how did I come to this conclusion so quickly? The first obvious problem was the two EJB home finder methods consuming 2GB and 1GB of memory (Alloc Total (I)) and taking 4 minutes and 2 minutes to execute (Clock Total). This implied that the data being returned from the underlying JDBC/SQL query was extremely large in terms of rows, columns, and data within columns. One anomaly within the traces table was the zero count and clock total times on the visibroker.operation entry point which can happen if the IIOP request had not completed before the snapshot was taken or either the VisiBroker ORB had failed to delivery an exception or completion notification to the JXInsight trace extension. Looking at the timeline snapshot that also was provided by the company I could see that 5 minutes following the last finder method that the VisiBroker ORB had not dispatched the notification indicating an error rather than an exception had occurred during processing.

Note: The object allocation sizes reported do not necessarily imply that the underlying data collection returned from the finder methods consumed 2GB. A significant portion of the allocation sizes could be attributed to temporary object creation during the executing of the SQL, processing of the JDBC resultset, the creation of EJB beans for each row, the data conversion for column data and the internal growth in the data structure. The screen shot substantiates this by showing a large percentage (35%) of GC time during the EJB home finder. This collection and analysis capability within JXInsight is unique to the application performance management industry and a subject for a future Insight article. One last item to note is that GC events consume CPU time as the JVM searches for potential eviction candidates.





William Louth, JXInsight's Product Architect