Lately I have been spending time helping some of our large customer installations resolve real-world performance problems that could not be solved by tools that provide a simple (and primitive) JMX metrics collection and analysis UI. One such problem that proved elusive for other database and JEE/Java performance management solutions was detected within minutes when JXInsight's timeline analysis was turned on across a cluster of 4 application servers. Before reading the details of the performance tests please take a moment to look at the screen shot below. The top section shows the thread/transaction activity across the cluster. The bottom section contains a view port into the shaded region in the top section.
Can you see the pattern hinted at by the vertical lines superimposed on the top section of the graphic?
Performance Investigation
During the applications busy period a 20 minute JXInsight timeline transaction tracing was recorded for each server within the cluster. The snapshots from the 4 server processes were then merged into a single timeline snapshot for comprehensive database transaction and SQL concurrency analysis. The purpose of this exercise was to understand the database behavioral patterns with concurrent data access from
different usage patterns across users, sessions and servers.
The timeline visualization shows the transactions executed during the monitored period. The top graphic displays the complete timeline with lines draw representing distinct transactions. Each row within the graphic represents a thread within a server process. The length of the transaction line reflects the duration: from the first executed statement to either
rollback or commit call. The color of a transaction line reflects the ratio of the transaction time to the maximum transaction time. The maximum transaction clock time is colored a dark red. The bottom graphic displays the transactions and SQL statements as well as bar charts indicating the level of concurrency for the slightly shaded segment within the top graphic. The graphic has been enhanced with vertical lines to show the behavioral pattern of the database under high levels of concurrency.
The lines clearly show there is a strong relationship with the completion times for concurrent transactions. This is very evident in the left most cluster of concurrent transactions where the ending of the longest running transaction coincides with the ending of the other concurrent transactions indicating possible lock contention for pages or tables. Of the 25 concurrent transactions 4 were single row updates and 2 single row inserts.
One INSERT statement took 38 seconds. The remaining transactions were read-only with relatively simple SELECT operations querying on primary and foreign keys.
Investigation of the SQL statements executed revealed a pattern of repeated execution with a LIKE clause on the ServiceCall.Description text attribute. Tracking the statement in the timeline indicated that its execution had a negative performance impact on other concurrent transactions specifically INSERT and UPDATE. Analysis of the concurrent SQL statements showed that
transactions most impacted referenced similar tables in the longest running concurrent transaction. One particular SQL LIKE query was executed every 5 minutes. This was later attributed to a user application associated with an advanced filter containing LIKE and NOT LIKE conditions with automatic refresh enabled. It should be noted that in every concurrent peak the longest running transaction was started before other impacted transactions. Also the impacted transactions had previously executed in significant less (a factor of 10) time outside of peak times. The maximum transaction concurrency across the 4 servers was 26.
The relatively long running OLAP request had greatly impacted the concurrently OLTP requests resulting in them be placed in a wait state until the request had released all its locks.
Testing Preliminary Findings
Dynamically reducing the connections transaction isolation for read-only operations would limit the acquisition of page and table level locks by long running queries. MS SQLServer provides various ways to configure this including embedding hints within SQL statements. Setting the transaction isolation to READ_UNCOMMITTED was the quickest and easiest way to achieve this. To test this we provided the ability in JXInsight 4.0 to override the transaction isolation on a JDBC connection when
Connection.setTransactionIsolation is called by the containers connection pool management. This is configurable via the following server system properties.
-Djxinsight.server.jdbc.connection.transactionisolation.override=-1
-Djxinsight.server.jdbc.connection.readonly.transactionisolation.override=-1
Valid system property values:
0 [TRANSACTION_NONE]
1 [TRANSACTION_READ_UNCOMMITTED]
2 [TRANSACTION_READ_COMMITTED]
4 [TRANSACTION_REPEATABLE_READ]
8 [TRANSACTION_SERIALIZABLE]
Note: The second system property is applicable for JXInsight trace extensions that have awareness of the actual type of work being performed and can tag the current thread trace activity as READONLY.
Executing the longest running transaction of a single INSERT statement with the default transaction isolation set to TRANSACTION_READ_COMMITTED from two different workstations revealed that the completion of the INSERT SQL statement was tied to the execution of the SELECT statement – as soon as the SELECT statement had completed the INSERT statement would also complete. When the transaction isolation was changed to TRANSACTION_READ_UNCOMMITTED the execution of the INSERT statement was immediate. The mixing of OLAP and OLTP work against the same SQLServer database with default transaction isolation levels was the cause of the unacceptable performance times. The SELECT during its execution had obtained a table level lock after acquiring a large number of page level locks.
Conclusion
OLTP workloads are characterized by many small transactions, with an expectation of very quick response time from the user. OLAP and reporting workloads are characterized by a few long-running operations that might consume more resources and cause more contention. The long-running operations are caused by locking and by the underlying physical sub-system. By default
SQLServer enforces read consistency by simply locking pages and preventing the rows from being read by other users. In a large-scale environment with mix workloads (OLAP/OLTP) this represents a major bottleneck. Long-running reports will cause transaction-processing users to wait for completion. As multiple users compete for common tables, they may block one another.
Releasing locks quickly for applications that support high numbers of users is more important in SQLServer than in Oracle. Releasing locks quickly means keeping transactions as short as possible. The SQL statements should be tuned to fetch data as quickly as possible because un-fetched data scans can hold shared locks at the server and thus block updaters
(INSERT/UPDATE/DELETE statements).
William Louth, JXInsight's Product Architect