In a previous blog post, we saw a few useful tips to help us quickly monitor the health of resequencer components in a soa system at runtime. In this blog post, let us explore some tips to diagnose mediator resequencer issues. During the diagnosis we will also learn some key points to consider for Integration systems that run Mediator Resequencer composites.
Please refer to the Resequencer White paper for a review of the basic concepts of resequencing and the interplay of various subsystems involved in the execution of Resequencer Mediator composites.
In this blog post we will refer to the AIA Communications O2C Pre-Built Integration pack (aka O2C PIPs) as an example for understanding some issues that can arise at runtime with resequencer systems and how we can diagnose the cause of such issues. The O2C PIP uses resequencing-enabled flows. One such is the UpdateSalesOrder flow between OSM and Siebel. It is used to process the OSM status of Sales Orders in proper time sequence within the Siebel system.
Many a times the resequencer health check queries point us to an issue occurring only in one server within the soa cluster. While the Database queries mentioned here give us the containerId of the specific server, it does not specify the server name. This is because mediator uses a GUID to track a runtime server.
Trace Log messages generated by the Mediator can help us correlate this GUID to an individual server running in the cluster at runtime. The oracle.soa.mediator.dispatch runtime logger can be enabled from the FMW EM console to TRACE:32 level. Figure below shows the screenshot.
Enabling this logger just for a few minutes will suffice and one can see messages such as below in soa servers’ diagnostic logs, once every lease refresh cycle. The default refresh cycle is 60s apart.
[APP: soa-infra] [SRC_METHOD: renewContainerIdLease] Renew container id [34DB0F60899911E39F24117FE503A156] at database time :2014-01-31 06:11:18.913
It implies, the server which logged the above message is running with a containerId of 34DB0F60899911E39F24117FE503A156 !
When one observes excessive messages piling up with a status of GRP_STATUS=READY and MSG_STATUS=READY, it usually indicates that the locker thread is not locking the groups fast enough for processing the incoming messages. This could be due to Resequencer Locker thread stuck or performing poorly. For instance the locker thread could be stuck executing updates against the MEDIATOR_GROUP_STATUS table.
It is generally useful to isolate the server which is creating the backlog using health check queries and then isolate the server name by using the logger trace statements as described in previous section. Then a few thread dumps of this server, could throw more light on the actual issue affecting the locker thread.
Usually thread dumps show a stack such as below for a resequencer Locker thread.
"Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms " id=330 idx=0x1b0 tid=28794 prio=10 alive, sleeping, native_waiting, daemon at java/lang/Thread.sleep(J)V(Native Method) at oracle/tip/mediator/common/listener/<strong>DBLocker.enqueueLockedMessages</strong>(DBLocker.java:213) at oracle/tip/mediator/common/listener/DBLocker.run(DBLocker.java:84) at oracle/integration/platform/blocks/executor/WorkManagerExecutor$1.run(WorkManagerExecutor.java:120) at weblogic/work/j2ee/J2EEWorkManager$WorkWithListener.run(J2EEWorkManager.java:184) at weblogic/work/DaemonWorkThread.run(DaemonWorkThread.java:30) at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
In the above thread the Locker is enqueing messages from locked groups into the in memory queue for processing by the worker threads.
During times of any issue, the Locker thread could be seen stuck doing database updates. If this is seen across thread dumps with no progress made by the thread, then it could point to a database issue which needs to be attended.
A poor performance of the locker query on the database side will adversely impact the Resequencer performance and hence decrease the throughput of the integration flow that uses Resequencers.
Recollect that the Locker thread runs an update query continuously attempting to lock eligible groups. Below shown is a sample FIFO Resequencer Locker query as seen database AWR reports.
update mediator_group_status a set a.status=7 where id in ( select id from (select distinct b.id, b.lock_time from mediator_group_status b, mediator_resequencer_message c where b.id=c.owner_id and b.RESEQUENCER_TYPE='FIFO' and b.status=0 and b.CONTAINER_ID=:1 and c.status=0 and b.component_status!=:2 ORDER BY b.lock_time) d where rownum<=:3 )
The Database AWR reports can also very useful to check the average Elapsed Time and other performance indicators for the locker query.
Huge data volume due to no proper purging strategy for Mediator tables is a common reason for deteriorated Locker query performance. Regular data purging, partitioning, statistics gathering and creation of required indexes on MEDIATOR_GROUP_STATUS will usually ensure good performance of locker query.
Note that there is only one Resequencer Locker thread running per server at runtime. Any database issue that impacts the locker thread will impair all the Mediator Composites that use the same resequencing strategy. The mediator resequencer uses database for storage, retrieval of messages to implement the reordering and sequencing logic. Hence, the proper and timely maintenance of SOA database goes a long way in ensuring a good performance.
Recollect that Worker threads are responsible for processing messages in order. There are multiple worker threads per server to parallel-process multiple groups, while ensuring that each group is exclusively processed by only one worker thread to preserve the desired sequence. Hence, the number of worker threads configured in Mediator properties (from FMW EM console) is a key parameter for optimum performance.
Below sample snippets from server thread dumps show Resequencer Worker threads. The first stack shows a worker thread which is waiting for messages to arrive on the internal queue. As and when Locker thread, locks new eligible groups, such available worker threads will process the messages belonging to the locked groups.
"Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms " id=208 idx=0x32c tid=26068 prio=10 alive, parked, native_blocked, daemon at jrockit/vm/Locks.park0(J)V(Native Method) at jrockit/vm/Locks.park(Locks.java:2230) at jrockit/proxy/sun/misc/Unsafe.park(Unsafe.java:616)[inlined] at java/util/concurrent/locks/LockSupport.parkNanos(LockSupport.java:196)[inlined] at java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)[optimized] at java/util/concurrent/<strong>LinkedBlockingQueue.poll</strong>(LinkedBlockingQueue.java:424)[optimized] at oracle/tip/mediator/common/listener/<strong>AbstractWorker.run</strong>(AbstractWorker.java:63) at oracle/integration/platform/blocks/executor/WorkManagerExecutor$1.run(WorkManagerExecutor.java:120) at weblogic/work/j2ee/J2EEWorkManager$WorkWithListener.run(J2EEWorkManager.java:184) at weblogic/work/DaemonWorkThread.run(DaemonWorkThread.java:30) at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method) -- end of trace
The next partial stack shows a worker thread which is processing a message from a group that has been locked by the Locker.
…. at oracle/tip/mediator/service/BaseActionHandler.requestProcess(BaseActionHandler.java:75)[inlined] at oracle/tip/mediator/service/OneWayActionHandler.process(OneWayActionHandler.java:47)[optimized] at oracle/tip/mediator/service/ActionProcessor.onMessage(ActionProcessor.java:64)[optimized] at oracle/tip/mediator/dispatch/MessageDispatcher.executeCase(MessageDispatcher.java:137)[optimized] at oracle/tip/mediator/dispatch/InitialMessageDispatcher.processCase(InitialMessageDispatcher.java:500)[optimized] at oracle/tip/mediator/dispatch/InitialMessageDispatcher.processCases(InitialMessageDispatcher.java:398)[optimized] at oracle/tip/mediator/dispatch/InitialMessageDispatcher.processNormalCases(InitialMessageDispatcher.java:279)[inlined] at oracle/tip/mediator/dispatch/resequencer/ResequencerMessageDispatcher.processCases(ResequencerMessageDispatcher.java:27)[inlined] at oracle/tip/mediator/dispatch/InitialMessageDispatcher.dispatch(InitialMessageDispatcher.java:151)[inlined] at oracle/tip/mediator/dispatch/resequencer/ResequencerMessageHandler.handleMessage(ResequencerMessageHandler.java:22)[optimized] at oracle/tip/mediator/resequencer/<strong>ResequencerDBWorker.handleMessag</strong>e(ResequencerDBWorker.java:178)[inlined] at oracle/tip/mediator/resequencer/ResequencerDBWorker.process(ResequencerDBWorker.java:343)[optimized] at oracle/tip/mediator/common/listener/AbstractWorker.run(AbstractWorker.java:81) at oracle/integration/platform/blocks/executor/WorkManagerExecutor$1.run(WorkManagerExecutor.java:120) at weblogic/work/j2ee/J2EEWorkManager$WorkWithListener.run(J2EEWorkManager.java:184) at weblogic/work/DaemonWorkThread.run(DaemonWorkThread.java:30) at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method) -- end of trace
It should be noted that all further processing of the message until the next transaction boundary happens in the context of this worker thread. For example, the diagram below shows the O2C UpdateSalesOrder Integration flow, from a threads perspective. Here, the BPEL ABCS processing, the calls to AIA SessionPoolManager, as well as the Synchronous invoke to the Siebel Webservice, all happen in the resequencer worker thread.
"Workmanager: , Version: 0, Scheduled=false, Started=false, Wait time: 0 ms " id=299 idx=0x174 tid=72518 prio=10 alive, in native, daemon at jrockit/net/SocketNativeIO.readBytesPinned(Ljava/io/FileDescriptor;[BIII)I(Native Method) at jrockit/net/SocketNativeIO.socketRead(SocketNativeIO.java:32)[inlined] at java/net/SocketInputStream.socketRead0(Ljava/io/FileDescriptor;[BIII)I(SocketInputStream.java)[inlined] at java/net/<strong>SocketInputStream.read</strong>(SocketInputStream.java:129)[optimized] at HTTPClient/BufferedInputStream.fillBuff(BufferedInputStream.java:206) at HTTPClient/BufferedInputStream.read(BufferedInputStream.java:126)[optimized] at HTTPClient/StreamDemultiplexor.read(StreamDemultiplexor.java:356)[optimized] ^-- Holding lock: HTTPClient/StreamDemultiplexor@0x1758a7ae0[recursive] at HTTPClient/RespInputStream.read(RespInputStream.java:151)[optimized] …. …. at oraclele/tip/mediator/dispatch/resequencer/ResequencerMessageDispatcher.processCases(ResequencerMessageDispatcher.java:27) at oracle/tip/mediator/dispatch/InitialMessageDispatcher.dispatch(InitialMessageDispatcher.java:151)[optimized] at oracle/tip/mediator/dispatch/resequencer/ResequencerMessageHandler.handleMessage(ResequencerMessageHandler.java:22) at oracle/tip/mediator/resequencer/<strong>ResequencerDBWorker.handleMessage</strong>(ResequencerDBWorker.java:178)[inlined] at oracle/tip/mediator/resequencer/ResequencerDBWorker.process(ResequencerDBWorker.java:343)[optimized] at oracle/tip/mediator/common/listener/AbstractWorker.run(AbstractWorker.java:81) at oracle/integration/platform/blocks/executor/WorkManagerExecutor$1.run(WorkManagerExecutor.java:120) at weblogic/work/j2ee/J2EEWorkManager$WorkWithListener.run(J2EEWorkManager.java:184) at weblogic/work/DaemonWorkThread.run(DaemonWorkThread.java:30) at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method) -- end of trace
If this thread remains at the same position across thread dumps spanning few minutes, it would indicate that the worker thread is blocked on external Webservice application. If such external system issues block a significant number of worker threads from the pool of available worker threads, it will impact the overall throughput of the system. There will be fewer workers available to process all the groups that are being locked by the Locker thread, across all composites that use resequencers. When the rate of incoming messages during such time is high, this issue will show up as a huge backlog of messages with status GRP_STATUS=LOCKED and MSG_STATUS=READY in the resequencer health check query.
Note that JTA timeout will not abort these ‘busy’ threads. Such threads may eventually return after the JTA transaction has rolled back, or in some cases depending on how sockets are handled by the external system, may not return at all.
For such integration flows, it is advisable to configure HTTP Connect and Read timeouts for Webservice calls at the composite’s Reference properties. Figure below shows a screenshot of the properties. This will ensure that worker threads are not held up due to external issues and affect processing of other components that rely on worker threads.
The below loggers can be enabled for trace logging to gather diagnostic information on specific parts of the Mediator/resequencer.
- Logger oracle.soa.mediator.dispatch for Initial message storage, Group Creation, Lease Renew, Node failover
- Loggers oracle.soa.mediator.resequencer and oracle.soa.mediator.common.listener for Resequencer Locker, Resequencer Worker, Load Balancer
We have explored into how problems at various different layers can manifest at the resequencer in an Integration system and how the cause of these issues can be diagnosed.
We have seen
- Useful pointers in diagnosing resequencer issues and where to look for relevant information
- How a good SOA database maintenance strategy is important for resequencer health
- How timeout considerations play a role in resequencer performance