Matt from
JavaLobby recently asked a question in his weekly
newsletter section regarding effective tools for hunting
down memory leaks causing out of memory exceptions.
He was unable to resolve this problem with JProfiler,
a well known and one of the better code level
developer profiling tools on the market today.
"I’ve tried attaching to a profiled
version of the server with JProfiler, but that
tends of lock up after a few minute of profiling
(too soon to show me a leak, I need a day or
so). The few minutes I do get to profile, everything
looks normal. "
This reflects my own experience with many of
the code level profilers on the market including
Borland OptimizeIt and Quest JProbe. I have
never been able to get such tools to memory
profile a "real world" J2EE application
even with a single user request thread. The
JVM memory growth when using such tools can
be enormous resulting in JVM crashes well before
a use case is completed. This is not necessarily
a reflection on the quality of the tools themselves
but the fact that the applications being profiled
are generally large hence the need for the profiler
and that aspects of the JVMPI API have not necessarily
been designed with such demands in mind.
What is an OutOfMemoryError?
Lets first start with the class hierarchy.
Object
Throwable
Error
VirtualMachineError
OutOfMemoryError
An
Error is a subclass of
Throwable that indicates serious problems that a reasonable
application should not try to catch. Most such
errors are abnormal conditions. A
VirtualMachineError is thrown to indicate that the Java Virtual
Machine is broken or has run out of resources
necessary for it to continue operating. An
OutOfMemoryError is thrown when the Java Virtual Machine cannot
allocate an object because it is out of memory,
and no more memory could be made available by
the garbage collector.
So the error is a memory capacity issue and
not necessarily an memory leakage. Of course
a memory leakage will reduce available capacity
but one cannot equate all
OutOfMemoryError's with a leakage. A more common reason for such
errors at least in production environments,
where one would expect the application to have
already been tested, is a
surge in request traffic and data volume per request.
Here the JVM has not enough memory available
at a particular point in time to service all
the concurrent requests. The JVM is unable to
grow the process memory and all of the heap
is strongly referenced by threads actively servicing
requests. When thrown most J2EE application
servers will catch the error and append additional
information to a log rather than crash out especially
when the error happens on a request handling
thread.
Note: Catching an OutOfMemoryError
can be somewhat dangerous in an unmanaged environment
if the error occurred during the access and
manipulation of a shared data structure or resource.
Some actual home grown servers have even worse
error handling whereby the JVM stays up and
running after the error but with all request
reader and processor threads disabled. For a
monitoring tool the JVM process looks to be
available but in fact from a client perspective
the server is in a coma.
The standard approach to resolving capacity
issues is to reduce the number of worker threads
at each layer creating a threading funnel effect.
That said determining an appropriate configuration
of the various thread pools at each layer is
problematic and is unlikely to guarantee optimal
performance under varying workloads. Before
starting to change configuration parameters
it is best to first acquire an insight into
the actual resource consumption patterns over
a representative period in a pre-production
or production environment.
The Customer Problem
JInspired was recently contacted by a global
technology and services company that was experiencing
OutOfMemoryError occurrences
with a J2EE/CORBA application server in a production
that had crashed a number of times during the
same period of daily operation. The company
which had tried various solutions and spent
considerable resource time in trying to resolve
the problem had resorted to a temporary measure
of restarting the server each day before the
troublesome period. In their testing labs they
were unable to reproduce the problem partly
because they could not replicate the same workload
and still have a code profiler stay alive and
be usable. They needed to find a solution that
could be introduced into a production environment
that allowed dynamic configuration of profiling
and tracing data capture during specified periods
in the day that did not require real-time console
to server interaction. Matt at
JavaLobby had similar ideas in how to resolve his own
production problems.
"I think that what I really need is
a way to replay the log files for two days and
profile the local copy instead of trying to
profile a remote one. My major problem is that
I can’t find anything to handle the large
log files I have and replay those..."
The company informed us that occasionally the
JVM stayed alive and was able to service requests
post the
OutOfMemoryError occurrence
but on other occasions the JVM completely shutdown.
They needed a solution that could still capture
sufficient profile and contextual information
to ensure fast problem resolution even when
the JVM process terminated unexpectedly.
The JXInsight
Solution
JXInsight has some unique capabilities that
can help resolve this problem in a production
environment.
| Terminal: A scripting tool
allowing for dynamic stopping and starting
of profiler as well as snapshot resetting
and exporting
Trace Extensions: Flexible
and extensible distributed profiling
and tracing
Low Overhead: The event capturing
performed by JVM agent ensures the lowest
profiling overhead.
Multiple JVMPI Counters Per Interval: Recording
of clock time, cpu time, gc time, object
allocation sizes, thread waiting and
blocking times.
Memory Consumption: JXInsight
has the ability to capture JVM object
allocation events and record the bytes
sizes per trace, transaction, or interception
recording interval with configurable
aggregation at the cluster, host, process,
thread group or thread levels.
UI Console: Powerful and effective
visualizations
Agent Architecture: No monolithic management server. Loosely coupled (and potentially
disconnected) distributed agents.
Data Collection Modes: The
profiler supports different data collection
modes - Metric, Profile and Timeline.
Shutdown Hooks: By default
the profiler will export all active
profiler mode snapshots on shutdown
of the JVM
Snapshot Merging: All snapshots
can be merged in on-line or off-line
modes allowing for flexible clustering
analysis
|
The Customer Solution
After discussing the many features and options
applicable the company decided to execute the
following performance engineering strategy:
| Production
Deployment: JXInsight would
be deployed into the production environment
on one particular node running a relatively
small cluster of J2EE servers
Hibernate Mode: The profiler
would be disabled on startup via the
system property -Djdbinsight.server.profiler.autostart=false.
The JVMPI agent was loaded into the
JVM with default settings with the JVM
argument -Xrunjdbinsight
JVMPI Events: Object allocation
JVMPI events would be turned on and
off to capture the actual memory consumption
of the IIOP and EJB request patterns.
This would provide insight into the
consumption patterns while at the same
time ensuring that there would be little
performance degradation. If you
are looking for a memory leak that grows
rapidly a good first step is to find
the components responsible for the actual
object creation. Then look at the surrounding
components that consume or manage (cache)
the data.
Integrate CORBA and EJB Trace Extension:
The customer deployed JXInsight's VisiBroker™
CORBA and J2EE™ EJB trace extensions
capturing complete request processing
resource consumption metrics per request
pattern.
Scheduled Script: Two OS command
scripts where created and scheduled
to dynamically turn on and off the JVMPI
events and profiling data collection
modes as well as exporting snapshots
before and after the period of operation.
The scheduled start script executed
> jdbiterminal
monitor -events gbwa -profile start
-timeline start -h VFONE -p 1515
The scheduled stop script executed
> jdbiterminal
script -execute -f c:\JXInsight\memoryleak.script
The script file, memoryleak.script,
referenced by the stop script contained
the following terminal commands
snapshot
-create -type pr -f MemoryLeak_pr.snp
-d c:\JXInsight\snapshots -host VFONE -p
1515
snapshot -create -type tl -f MemoryLeak_tl.snp
-d c:\JXInsight\snapshots -host VFONE -p
1515
snapshot -create -type mx -f MemoryLeak_mx.snp
-d c:\JXInsight\snapshots -host VFONE -p
1515
monitor -events bw -profile stop -timeline
stop -h VFONE -p
1515
Collection
Modes: Profile, Timeline and
Metric snapshots would be collected
and exported either via the terminal
or a JVM shutdown hook.
Analyze Exported Snapshots:
As this was a serious production problem
JInspired would provide a rapid snapshot
analysis service - OneHour Snapshot
Processing.
|
Within minutes of receiving the snapshot I was
able to pinpoint the actual cause of the "MemoryLeak"
problem. It was not a memory leak but a memory
capacity issue caused by a very large data processing
request that had been routed from a messaging
backbone and integrated into the application
via IIOP. The following screen shot shows the
Traces table view in the
Traces perspective for an exported Profile mode snapshot.
The screen shot has been altered to conceal
the component names and method signatures at
the request of the company.

So
how did I come to this conclusion so quickly?
The first obvious problem was the two EJB home
finder methods consuming 2GB and 1GB of memory
(
Alloc Total (I)) and taking 4 minutes
and 2 minutes to execute (
Clock Total).
This implied that the data being returned from
the underlying JDBC/SQL query was extremely
large in terms of rows, columns, and data within
columns. One anomaly within the traces table
was the zero count and clock total times on
the
visibroker.operation entry point
which can happen if the IIOP request had not
completed before the snapshot was taken or either
the VisiBroker ORB had failed to delivery an
exception or completion notification to the
JXInsight trace extension. Looking at the timeline
snapshot that also was provided by the company
I could see that 5 minutes following the last
finder method that the VisiBroker ORB had not
dispatched the notification indicating an error
rather than an exception had occurred during
processing.
Note: The object allocation sizes reported do not
necessarily imply that the underlying data collection
returned from the finder methods consumed 2GB.
A significant portion of the allocation sizes
could be attributed to temporary object creation
during the executing of the SQL, processing
of the JDBC resultset, the creation of EJB beans
for each row, the data conversion for column
data and the internal growth in the data structure.
The screen shot substantiates this by showing
a large percentage (35%) of GC time during the
EJB home finder. This
collection and analysis capability within JXInsight
is unique to the application performance management
industry and a subject for a future Insight
article. One last item to note is that GC events
consume CPU time as the JVM searches for potential
eviction candidates.
William Louth, JXInsight's
Product Architect