This section of the "OSB, Service Callouts and OQL" blog posting will delve into thread dump analysis of OSB server and detecting threading issues relating to Service Callout and using Heap Dump and OQL to identify the related Proxies and Business services involved. The previous section dealt with threading model used by OSB to handle Route and Service Callouts.
There have been numerous customer situations where the performance and response times starts degrading under heavy load in OSB and users are unable to identify the cause for such slowdown and areas of bottleneck.There can be appearance of STUCK Thread notifications also. Taking multiple thread dumps at short intervals (10 - 15 seconds) from the OSB server is the first step towards identifying the problem area. Next analyze the threads for patterns - are they waiting for database response, remote invocation responses? etc.
One way to identify OSB related threads is look for com.bea.wli package name in the thread stacks. There are numerous OSB related patterns and advisories included in ThreadLogic thread dump analysis I had blogged previously. Using ThreadLogic will make the thread dump analysis part lot easier as it can parse multiple dumps as well as identify thread progress across successive dumps.
Some OSB related advisories packaged in ThreadLogic:
From the above list, we can see ThreadLogic will attempt to identify threads involved in inbound Http Proxy, Java Callouts, Proxy in WAIT state for a response (this can be for Service Callout or Sync Publish action), Service response cache lookup (using in-built Coherence Cluster Cache), Publish action, Session activation, Web Service Callout, response handling etc and mark them with a related health level.
Keep in mind, with Route actions, even if the remote services are slow, these wont show up in thread dumps (unless it uses Exactly-Once QoS) as there is no thread actually waiting for the response and threads are only used during the request invocation and actual response processing as discussed earlier.
But for the Service Callouts, two threads needs to be used (at time of actual response hand-off). For an OSB server under heavy load and exhibiting slowness or STUCK (thread has been executing one request for more than 10 minutes in WebLogic), ThreadLogic will report something similar in the Thread Groups Summary node:
Under load, OSB related threads using service callouts will appear in following state in ThreadLogic analysis. The threads might appear as hung waiting for a response notification from the remote service endpoint.
The above snapshot shows ThreadLogic detecting and marking OSB threads involved in wait for Service Callout response (as well as Webservice callout and STUCK). The overall OSB server instance appears to show multiple threads waiting for Service response. The FATAL health level is due to threads appearing in STUCK State.
In the ThreadLogic's Merge or Diff view, one can see multiple threads executing the Service Callout, some progressing, while others remaining stuck in same code execution between dumps.
In the Merge/Diff view of ThreadLogic, the green column entries with "Progress" indicate thread progressing between dumps while the "No Change" indicates absence of thread progress. If the thread is in a bad state (WATCH or WARNING or FATAL) in a previous dump and not progressing in the next successive dump, its marked in brown background while those in IGNORE or NORMAL and not showing any progress would use yellow background.
Now we have detected OSB threads which are waiting for service responses. The corrective action would be to ensure the remote service can respond in time under increasing loads along with implementing better thread management in OSB to handle the response.
Before we get into the solution implementation, we still need to identify those specific business services that are contributing to the hang situation, as the solution/remedial actions have to be applied against those specific services.
If there are very few external services that OSB invokes via Service Callout or Publish action, then we can easily identify or detect which services are contributing to the slowdown. But if there are numerous services, all invoked as mix of Service Callouts/Publish actions, then it becomes difficult to identify the related remote business services and apply the corrective actions.
The thread dumps can indicate the call pattern, but cannot provide any information on which specific Proxy or Business services were involved as the OSB framework code is generic and executes same code paths for all proxy services or Service Callouts (similar to a JDBC code stack trace that does not provide data about the SQL being executed). The OSB Console Management Dashboard returns monitored statistics for services that have completed execution, but not for those still executing or hanging. So, we would have to start analyzing web traffic patterns (access logs or web proxy logs) or network interactions (like connections to remote side) to understand incoming load but might still not get an entire picture which services are really hanging.
In such situations, analyzing the heap dump of the Server JVM instance can provide a gold mine of information. Heap dumps contain a complete snapshot of every object instance loaded in memory and details of threads executing code or acting on objects when the heap dump is generated. Useful data can be retrieved as long as the heap dump can be analyzed by any of the commonly available Java Heap Dump Analyzers (Eclipse MAT, JVisualVM, IBM HA, Sun JHAT, YourKit etc.).
Note: There are two separate versions of Eclipse MAT - a 32 bit and 64 bit version. You cannot just switch the JVM to 64 bit and use Eclipse MAT 32 bit version to analyze big heap dumps (in excess of 4GB heap size). Best to download and use the 64 bit version of MAT.
Most jvms allow capturing heap dump from a running jvm instance. Depending on the vm version and options used, some of the dump formats might not be readable from vendor-agnostic tools like Eclipse MAT or VisualVM. Always try to capture the heap dump in HPROF format, so its not vendor specific. JRockit versions pre-R28 cannot generate heap dumps. JRockit R28+ versions allow heap dumps in hprof format.
For Sun Hotspot:
jmap is the utility (packaged within jdk bin folder) to generate heap dumps given a Java Process ID.
jmap -dump:format=b,file=heap.bin <pid>
format=b option tells to dump the heap in binary format instead of just string/text format.
Use jrcmd (or JRMC) to generate heap dumps in hprof format (JRockit version should be R28+)
jrcmd <PID> hprofdump filename=<pathNameToFile>
Normally jrockit heap dumps get generated in the process's current working directory.
Doing a kill -3 on a IBM JVM process id generates a textual representation of the JVM process heap dump.
Please refer to IBM JVM documentation for details on generating heap dump using dump agents or programmatically (also requires some additional JVM command line arguments to dump heap in the right format).
Most users will use heap dumps mainly to detect memory leaks or usage. But there is lot more that can be researched from heap dumps using Object Query Language, referred as OQL. OQL provides SQL like syntax to navigate, parse and retrieve data from java heap dump. Navigating object relationship and class hierarchies is very simple in tools like Eclipse MAT. There are numerous OQL Tutorials and a detailed documentation available within Eclipse MAT on OQL syntax and usage.
How can we use OQL in the OSB Service Callout situation? From the thread dumps, the code pattern involved in Service callouts is "com.bea.wli.sb.pipeline.PipelineContextImpl$SynchronousListener" from the Stuck threads stack trace. Lets use OQL to select all instances of this class.
select * from instanceof com.bea.wli.sb.pipeline.PipelineContextImpl$SynchronousListener
Select the OQL option at the top (4th icon under the heap dump in MAT) and execute the OQL.
Also, as a general recommendation, always use "instanceof <type>" to select all derived types just to be safer (although there is no actual derived type from PipelineContextImpl$SynchronousListener). Eclipse MAT will execute the OQL query against the Heap Dump and return the results for all pending responses. The Heap Dump snapshot should have been captured from an OSB instance when it was actively executing Service Callout or Publish invocations.
Sample snapshot for the select com.bea.wli.sb.pipeline.PipelineContextImpl$SynchronousListener OQL in Eclipse MAT.
Now pick any instance from the results and the MAT Inspector tab (shows up under Window option) should provide details of the instance - member variables, references, values etc. Now the remaining steps of navigating the instances are mostly guess work - a mix of trial and errors as we don't really know the class structure or relationship between members or references, that might or might not contain useful data.
One can see there is a _service reference within the PipelineContextImpl$SynchronousListener and its of type ..BusinessServiceImpl. Lets drill into it. Expanding it shows another member reference "_ref".
Clicking on the "_ref" shows more interesting data about the "_service". It shows the actual Business Service name and path in its "fullname" attribute. Now we know which services are being invoked via Service Callout or Publish action.
Use OQL and the navigation logic to get list the Business services invoked that are still pending for response.
SELECT toString(s._service._ref.fullname) FROM INSTANCEOF com.bea.wli.sb.pipeline.PipelineContextImpl$SynchronousListener s
Note: Use toString(variable) to get the actual String content instead of a reference to the String in Eclipse MAT. Use ref.toString() in JVisualVM. Other primitives will be reported directly.
There is also a "this$0" reference for the PipelineContextImpl$SynchronousListener. Drilling into it yields "_proxy". This is the actual reference to the calling proxy. Drill into "_proxy" shows its own "_ref" member attribute. The "fullname" of the "_ref" provides the name of the Proxy that was invoking the Service Callout.
So, now we have traversed from instance of PipelineContextImpl$SynchronousListener to "this$0" to "_proxy" to "_ref" to "fullname" to arrive at Proxy name. Similarly, we navigated PipelineContextImpl$SynchronousListener instance to "_service" to "_ref" to "fullname" to identify the Business Service. Now lets run a single OQL to get both the calling Proxy and the Business Service executed via Service Callout or Publish action that blocks for response.
SELECT toString(s.this$0._proxy._ref.fullname) AS ProxyService, toString(s._service._ref.fullname) AS BusinessService FROM INSTANCEOF com.bea.wli.sb.pipeline.PipelineContextImpl$SynchronousListener s
The sample output should appear like:
To analyze the same heap dump on JVisualVM (under hotspot jdk/bin folder), use following OQL:
select s.this$0._proxy._ref.fullname.toString() , s._service._ref.fullname.toString() from com.bea.wli.sb.pipeline.PipelineContextImpl$SynchronousListener s
The OQL syntax (select/from/instanceof are all case sensitive in JVisualVM), columns display and navigation/drill down are much more easier with Eclipse MAT compared to JVisualVM.
Lets run a more detailed OQL to report on type of remote invocation - whether its a Service Callout or Publish action. If there is a response Handler for the proxy, then its a Service Callout; if null, its a Publish action.
SELECT toString(s.this$0._proxy._ref.fullname) AS InvokingProxy, toString(s._service._ref.fullname) AS OutboundService, s._responseHandler AS PublishOrWSCallout FROM com.bea.wli.sb.pipeline.PipelineContextImpl$SynchronousListener s
We have now analyzed the call patterns for the OSB Server threads blocked for remote response and the remote services involved using OQL on the Heap Dump file. Next would be to focus on the correction action. Before that, we need to briefly discus about Work Managers in WLS.
Hope this section in the "OSB, Service Callouts and OQL" series gave some pointers on identifying issues with Service Callouts using Thread Dump and Heap Dump Analysis. The next section will go more into WebLogic Server Work Manager concept and the corrective actions to solve OSB Service Callout related hangs.