When it comes to persisting service and subsystems data (i.e. JMS), WebLogic server offers customers a choice: filesystem or relational database using JDBC. Persistence store has implications on WLS performance and systems management. In this post, I will provide an unofficial JMS performance comparison using different persistence stores. PLEASE NOTE: This is not an official Oracle performance benchmark. […]
Several new technical whitepapers on WebLogic Server 12.1.2 and Coherence 12.1.2 have recently been released. In case you missed them, this post provides links to some of the more interesting ones. You’ll need to log into Oracle.com to download these. Oracle WebLogic: 12.1.1 Business Overview – PDF download This whitepaper provides an overview of […]
by Robert Patrick and Sabha Parameswaran WebLogic Server clusters form a loosely-federated group of managed servers that provide a model for applications to leverage for achieving scalability, load balancing, and failover. Each managed server manages its view of what servers are in the cluster and its own cluster-wide view of the JNDI tree. The cluster […]
Introduction Have you ever wondered how to control/configure the metricdump log files that accumulate in your WebLogic Server domains configured with Fusion Middleware? This blog will help to answer that question. Main Article When running WebLogic Server instances configured with Fusion Middleware (e.g., SOA Suite), there are Dynamic Monitoring Service (DMS) metrics that are gathered every 3 […]
If you have a number of WebLogic Server instances running, or applications that write a lot of information into the logs, you might find that your log files, and your stdout files start to eat up a lot of space … Continue reading →
Monitoring and optimizing BPEL Thread Pool utilization (and other metrics) is one of the key activities in performance tuning of BPEL/SOA based integrations. Although EM console provides some basic monitoring of the BPEL engine statistics, it is limited regarding the update interval, detail and the recording interval and cannot display historic data. Of course you can setup Grid Control 11g with its Repository, but this is in most cases too complex to setup just for monitoring during performance and load testing.
So, the idea came to create a tool which can easily record these statistics and export them to MS Excel or OpenOffice to create charts for the thread pool utilization over a time period (for example a whole load test execution).
All values of WLS or the SOA engine can be queried using the JMX MBean framework. I have designed the JMXClient to be able configure which MBeans should be queried by using a property file (beans.properties). I decided to connect in this first release to only one Managed Server of WLS to record / export data. This means that if you have a WLS cluster, you need to start multiple JMX Clients to record the values of each node. (In a later release the JMX Client could be optimized to query all nodes automatically).
JMXClient can be used by downloading from the project page at Sourceforge. (including JDeveloper 18.104.22.168 project and sources)
After that you need to configure
- your connection properties, JAVA_HOME and WLS_HOME of your WLS managed server of SOA in jmxclient.bat (or jmxclient.sh)
- the MBeans names, WLS Server name and the attributes to record in classes/beans.properties (you can find the MBean names form the System MBean Browser in EM)
The syntax in jmxclient.bat is
java -cp classes;%INCLUDE_LIBS% jmxclient.JMXClient <server> <user> <password> -monitor 1000
java -cp classes;%INCLUDE_LIBS% jmxclient.JMXClient 192.168.56.101 7001 weblogic welcome1 -monitor 1000
“1000” specifies the interval in milliseconds between the recording.
Then you can run it with
jmxclient > out.txt
Then simple import this text file using Excel or OpenOffice and a comma “,” as delimiter and create a line chart using line 2 as titles and lines 3 to end as data.
Let me first show a couple of results using JMXClient using the properties to record the BPEL thread pool statistics:
The following chart shows a scenario where the invoke thread pool is much too low (20) so that the queue of scheduled invocations waiting for a free thread is growing rapidly:
The second example shows a scenario where invoke and callback threads are within normal limits:
In the next posts I will show how to use JMXClient to record the BPEL process execution times or the number of messages in the AIA 11g JMS queues by simply exchanging the beans.properties file…..!
DISCLAIMER: JMXClient is provided for free use “as is” without any support or warranty. Please provide enhancements or modifications you make yourself. Feedback is welcome using the comments feature of this blog.
PS: for the experts: the format of the beans.properties file:
Every line contains 3 items separated by semicolon:
- the name of the MBean to query
- the attribute to query
- the title string which should be displayed for the column
Example for the bpel thread pools:
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/invoke,type=soainfra_bpel_requests;active_maxValue;Invoke Active Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/invoke,type=soainfra_bpel_requests;scheduled_maxValue;Invoke Scheduled Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/invoke,type=soainfra_bpel_requests;scheduled_value;Invoke Scheduled Current
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/invoke,type=soainfra_bpel_requests;active_value;Invoke Active Value
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/invoke,type=soainfra_bpel_requests;threadCount_value;Invoke Threads Value
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/invoke,type=soainfra_bpel_requests;threadCount_maxValue;Invoke Threads Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/system,type=soainfra_bpel_requests;active_maxValue;System Active Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/system,type=soainfra_bpel_requests;scheduled_maxValue;System Scheduled Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/system,type=soainfra_bpel_requests;scheduled_value;System Scheduled Current
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/system,type=soainfra_bpel_requests;active_value;System Active value
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/system,type=soainfra_bpel_requests;threadCount_value;Invoke Threads Value
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/system,type=soainfra_bpel_requests;threadCount_maxValue;Invoke Threads Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/engine,type=soainfra_bpel_requests;active_maxValue;Engine Active Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/engine,type=soainfra_bpel_requests;scheduled_maxValue;Engine Scheduled Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/engine,type=soainfra_bpel_requests;scheduled_value;Engine Scheduled Current
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/engine,type=soainfra_bpel_requests;active_value;Engine Active value
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/engine,type=soainfra_bpel_requests;threadCount_value;Engine Threads Value
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/engine,type=soainfra_bpel_requests;threadCount_maxValue;Engine Threads Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/audit,type=soainfra_bpel_requests;active_maxValue;Audit Active Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/audit,type=soainfra_bpel_requests;scheduled_maxValue;Audit Scheduled Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/audit,type=soainfra_bpel_requests;scheduled_value;Audit Scheduled Max
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/audit,type=soainfra_bpel_requests;active_value;Audit Active value
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/audit,type=soainfra_bpel_requests;threadCount_value;Audit Threads Value
oracle.dms:Location=AdminServer,name=/soainfra/engines/bpel/requests/audit,type=soainfra_bpel_requests;threadCount_maxValue;Audit Threads Max
Recently, I have been involved in a number of discussions with people who are setting up clusters of various Fusion Middleware products, often on an Exalogic machine. These discussions have led me to feel that it would be worth sharing … Continue reading →
ThreadLogic version 0.95 is available for public download now. Additional Features Biggest change is support for externalizing the Advisories and group definitions. Users can use the pre-defined AdvisoryMap.xml to come up with custom …
Eric Gross and Sabha Parameswaran, from Oracle FMW Architects Team (The A-team) are happy to introduce ThreadLogic, an open source Thread Dump Analyzer tool, to Middleware community.Motivation behind ThreadLogic The current set of TDA tools (Sam…
In this section, we will see a new version of TDA (Thread Dump Analysis) tool developed by the author of this blog (Sabha Parameswaran in collaboration with his colleague, Eric Gross, also from Oracle A-Team) and its capabilities as well as some real world samples of thread dump analysis before concluding the series.
TDA A-Team – an Enhanced TDA (version 3.0)
As mentioned earlier, the current TDA tools (Samurai/TDA) don’t mine the data inside thread dumps or provide a more detailed view of what each thread is doing while just limiting themselves to reporting the state (locked/waiting/running) or the lock information. No mention of type of activity within a thread, should it be treated as normal or deserving a closer look? Can a pattern or anti-pattern be applied against them? Any possible optimizations? Are there any hot spots? Any classification of threads based on their execution cycles?
In collaboration with Oracle A-team colleague, Eric Gross, the author decided to enhance the existing open source TDA version 2.2 instead of reinventing the wheel and leveraging the capabilities of base TDA to handle base parsing of the thread dump, the UI and Reporting. Eric Gross enhanced the TDA to fully support Sun Hotspot, JRockit (support was partial for JRockit in base TDA v2.2) and IBM JVM Thread dumps as well as adding support for custom categories in the TDA tool. Sabha Parameswaran enhanced the tool’s analytics capabilities – grouping of threads based on functionality and tagging of threads with advisories using pre-defined rules and patterns which can be extended to handle additional patterns.
We wish to thank Ingo Rockel, Robert Whitehurst and numerous others who had contributed to the original TDA which allowed us build on their work in delivering a more powerful tool for the entire Java community.
Once a thread dump is parsed and threads details are populated, each of the thread is then analyzed against matching advisories and tagged appropriately. The threads are also associated with specific Thread Groups based on functionality or thread group name.
|Thread Advisory Name||Large # of WLS Muxer Threads|
|Description||Large number of WLS Muxer Threads|
|Advice||Reduce number of WLS Muxer Threads to under 4, use -Dweblogic.SocketReaders=NoOfThreads flag in command line|
|Thread Advisory Name||Stuck Thread|
|Description||Thread is Stuck, request taking very long time to finish|
|Advice||Check why the thread or call is taking very long?? Is it blocked for unavailable or bad resource or contending for Lock? Can be ignored if it is doing repeat work in a loop (like adapter threads polling for events in a infinite loop)…|
|Thread Advisory Name||Hot Spots|
|Description||Multiple Threads executing same code path|
|Advice||Ensure there are no blocking locks or bottlenecks, sufficient resources are available,remote service being invoked is responsive and scaling well to handle increased load|
Each of the advisory gets triggered based on either call execution patterns observed in the thread stack or presence of other conditions (thread blocked or multiple threads blocked for same lock can trigger BlockedThreads Advisory). Sometimes a thread might be tagged as IGNORE or NORMAL based on its execution logic or might be tagged more specifically as involved in JMS send or receive client or a Servlet thread.
The health levels (in descending of severity) are FATAL (meant for Deadlocks, STUCK, Finalizer blocked etc), WARNING, WATCH (worth watching), NORMAL and IGNORE. Based on the highest criticality of threads within a group, that health level gets promoted to the Thread Group’s health level and same is repeated at the thread dump level. There can be multiple advisories tagged to a Thread, Thread Group and Thread Dump.
Snapshot of Advisory Map
Snapshot of Threads tagged with advisories in the thread dump
|Threads in a thread dump tagged with Advisories/Health Levels|
Thread Groups Summary
The threads are associated with thread groups based on the functionality or thread names. Additional patterns exists to tag other threads (like iWay Adapter, SAP, Tibco threads) and group them. The summary page reports on health level of the group, total number of threads, threads that are blocked, critical advisories found etc.
|Thread Groups Summary|
Critical Advisories per thread group
The critical advisories (at Warning/Fatal health levels) found in individual threads are then promoted to the the parent thread group and reported in the thread group summary page.
|Critical Advisories for Thread Group|
One can see the thread groups are divided into two buckets – WLS and non-WLS related threads. The JVM threads, Coherence, LDAP and other unknown custom threads go under the non-WLS bucket while all the WLS, Muxer, Oracle, SOA, JMS, Oracle Adapter threads are all under the WLS bucket.
Individual Thread tagging with Advisories
Clicking on the individual threads will display the advisories and thread stack.
|Advisories and details at thread level|
The details of the advisory will pop up on mouse over on the advisory links.
The Advisories are color coded and details can be highlighted.
|Colored advisories for individual threads|
Sub-groups are also created within individual Thread Groups based on Warning Levels, Hot call patterns (multiple threads executing same code section), threads doing remote I/O (socket or db reads) etc.
Following snapshot shows example of a Hot call pattern where multiple threads appear to be executing the same code path (all are attempting to update the MBean Server with a new mbean).
|Hot Call Pattern – multiple threads exhibiting similar code execution|
Merging of threads across multiple thread dumps and reporting of progress in the thread state
Merge has been enhanced to report on the progress of the thread across the thread dumps. Based on the order of the thread dumps, the thread stack trace is compared for every consecutive thread dump.
|Merged view showing progress information for individual threads|
Merged reporting of individual thread stack traces (exists in base TDA version 2.2)
|Merged Thread stack traces across thread dumps|
Merging can also be done across multiple thread dump log files (like in case of IBM which creates new log file containing the thread dump every time a request is issued).
Usability benefits of new TDA A-Team
Thanks to the advisories and health levels, its easy for users to quickly understand the usage patterns, hot spots, thread groups, as well as highlight the patterns or anti-patterns already implemented in the advisory list.
For example of an anti-pattern: a Servlet thread should not be waiting for an event to occur as this will translate to bad performance for end user. Similarly usage of synchronous jms consumers might be less performant compared to using async consumers. Too many WLS Muxer threads is not advisable. If WLS Muxer or Finalizer threads are blocked for unrelated locks, this will be a fatal condition. It would be okay to ignore STUCK warning issued by WLS Server for Pollers like AQ Adapter threads but not for other threads that are handling servlet request.
The thread groups help in bunching together related threads; so SOA Suite users can see how many BPEL Invoke and Engine threads are getting used, B2B users can see number of JMS consumers/producers, WLS users can look at condition and health of Muxer threads, similarly for JVM/Coherence/LDAP/other thread groups.
The merged report lets the user see at a glance the critical threads and check if they are progressing or not instead of wading through numerous threads and associated thread dumps.
We hope this tool can really help both beginners and experts do their jobs more quickly and efficiently when it comes to thread dumps.
Real World Sample of Thread Dump Analysis
A Customer was planning on a POC (Proof of Concept) of WebLogic Portal Server (WLP) on the new Oracle Exalogic Platform and measure its performance relative to their existing configurations. The test configuration used a cluster of WLP servers, communicating via Multicast for cluster membership and other service announcements. As soon as the servers were brought up, the servers became unresponsive. The server instances could not serve any request – neither traffic to Portal Application nor WLS Admin Server requests for status from the server instances. Following errors get getting thrown in the server:
There were also frequent log messages of p13n cache getting updated and flushed repeatedly. The p13n cache is the Personalization cache used internally by Portal for handling Portal personalization. Each of the servers in the cluster are kept of changes in cache via multicast broadcasts. When the customer disabled the p13n cache, the server regained normalcy and responsiveness. But the p13n cache is a requirement for portal application functionality and its not possible to have it disabled.
The FullQueueException essentially indicated that the WLS’s internal request queue was getting overwhelmed with requests and its starting to ignore/drop new requests as its already has 64K requests waiting to be executed. These requests can be internal (like transaction timeouts, schedulers, async processing) or externally generated (like client http requests). The storm of requests coupled with p13n being a key presence for the issue indicated p13n cache is triggering an internal flood of requests to the server somehow. It was decided to take thread dumps to understand what the p13n cache is executing.
Analyzing the thread dumps threw light on a weird behavior.
at java/lang/Thread.sleep(J)V(Native Method)
^-- Holding lock: weblogic/cluster/MulticastFragmentSocket@0x1d4afd4c8[thin lock]
^-- Holding lock: weblogic/cluster/MulticastSender@0x1c7ce3628[thin lock]
The above thread stack indicates the MulticastReceiver used by WLS internally to handle multicast announcements got some request/announcement that gets handled by the p13n cache. It appears to trigger a flush of local keys in the cache. But then it gets modified as a flush of cluster keys which then results in a MulticastSender sending out a cluster broadcast.
This essentially means for every announcement that lands on a server instance, it will resend the announcement to other members whcih then gets replicated for every member across the cluster leading to an infinite looping of the same message and contributing to a never ending multicast packet storm.
Analyzing the product code against the stack trace, it appeared that there was a mismatch between the CacheManager.flushKeys() implementation as the method code lines didn’t match. So, it was apparent that there was some variant form of CacheManager that was getting picked by the application while handling the cluster announcements.
at java/lang/Thread.sleep(J)V(Native Method)
REPEAT MULTICAST TRAFFIC - CREATE INFINITE FLOOD
^-- Holding lock: weblogic/cluster/MulticastFragmentSocket@0x1d4afd4c8[thin lock]
^-- Holding lock: weblogic/cluster/MulticastSender@0x1c7ce3628[thin lock]
LOCAL FLUSH ONLY
RECEIVE MULTICAST MESSAGE
Using WLS ClassLoader Analysis Tool (CAT), we were able to drill down to where this particular CacheManager class was getting loaded from. It was packaged inside a jar that got added to the Portal EAR application libraries. Removing that particular jar containing the CacheManager resolved the problem. Later investigation revealed that the jar came from an older version of the Portal Product and should not have been used with the latest version of Portal that was being tested on Exalogic. The analysis of a single thread stack trace helped pin point the root cause for all the hang conditions in this scenario.
Limitations of Thread Dumps
Although Thread dumps can throw light on multiple things, there are also some limitations.
- It is a snapshot of a running system
- The system should have some representative load or exhibit symptoms when capturing thread dumps
- The dumps are of no value if none of the above
- It cannot show details on memory usage
- It can show what is the pattern executed but not how many times it was executed previously (unless there are multiple threads dumps captured at periodic intervals).
- It cannot show details of the request that is driving the code execution or which remote service is slow
- Same set of framework code can get executed for SOA Suite BPEL Processes or OSB Proxies
- we wont be able to determine which external service is blocking the response for a thread
- Would require analysis of server logs/proxies/access logs and or heap dumps to identify slow services and then drill down from there or other forms of monitoring (EM/OSB Console monitoring statistics)
- Finalizer doing heavy work (too many finalizable objects to clean up) in each successive thread dump would indicate memory handling issues (and possibly overuse/misuse of finalize – lazy cleanup or recreating additional objects during finalize instead of active deleting/cleaning up)
- Busy GC Threads would indicate memory thrashing
- Threads stuck on socket reads would indicate slow reads or backends not responding in a timely fashion or socket closures due to firewall or other bad socket handling
- Threads stuck in socket connect indicates the backend service itself is unavailable or not listening at that address or wrong connection detail or connection recreation every time.
- Threads stuck in database sql reads/executes can indicate db response issue or sql tuning issue
- Threads blocked in opening files, jars, zips or checking on file attributes can indicate problem in reading the application bits or files on a remote storage or network mount point that’s will slow down server startup/deployment. One customer problem was related to server instance accessing WLS install & deployed application bits from a remote site that was about a hundred miles away.
- Threads blocked on locks for synchronized code imply bottlenecks
- Resolve the bottleneck by increasing the resources under contention
- Avoid the synchronized calls entirely (ex: logging might be excessively synchronized, reduce logging levels or turn off logging ) whenever possible
- Optimize the call execution to reduce lock time
- Some of the issues might then require further investigation into environment (process, cpu, memory, io, network etc) or services/partners/actors/clients to resolve the issues
- For cluster related or distributed system issues, thread dumps should be captured and analyzed from all related parties (cluster/services).