Resequencer Health Check

11g Resequencer Health Check

This post presents a few useful queries for monitoring and diagnosing the health of Resequencer components running in a typical SOA/AIA environment.

The first query is a snapshot of the current count of Resequencer messages in their various states and group_statuses.

Query1: Check current health of resequencers

select to_char(sysdate,'YYYY-MM-DD HH24:MI:SS') time, gs.status group_status, m.status msg_status, count(1), gs.component_dn 
from mediator_group_status gs, mediator_resequencer_message m
where m.group_id = gs.group_id
and gs.status <= 3
and gs.component_status!=1
group by gs.status, m.status, gs.component_dn
order by gs.component_dn, gs.status;

The table below shows representative output of the above query from a running SOA environment containing Resequencers, collected at 12:04:50.

Query 1 sample output

For our analysis, let us collect the same data again after a few seconds.

Query 1 sample output (second snapshot)

Refer to the appendix for a quick glossary of the Resequencer group_status and message_status state values.

Let us dive a bit deeper into each of the above state combinations, their counts and what they imply.

1. GRP_STATUS=0/MSG_STATUS=0 – READY

These are the messages that are ready for processing and eligible to be locked and processed by the Resequencer. In a healthy system this number stays low, because messages are locked and processed continuously. Once messages stop arriving, this count should drop to zero.

A high count for this combination suggests that the Resequencer is not locking enough groups for the rate at which messages are arriving. Increase the Mediator property “Resequencer Maximum Groups Locked” so that groups are locked at a higher rate.

Refer here to see how this property can be changed from EM Console
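
To see which components carry the READY backlog, Query1 can be narrowed to the 0/0 combination. The following is only a sketch built from the tables and columns already used in Query1; the column aliases are made up for illustration, and the component_status filter should be dropped if that column is not available in your environment.

```sql
-- READY messages in READY groups (GRP_STATUS=0/MSG_STATUS=0) per component,
-- largest backlog first. Aliases ready_groups/ready_msgs are illustrative.
select gs.component_dn,
       count(distinct gs.group_id) ready_groups,
       count(1) ready_msgs
from mediator_group_status gs, mediator_resequencer_message m
where m.group_id = gs.group_id
and gs.status = 0
and m.status = 0
and gs.component_status != 1
group by gs.component_dn
order by ready_msgs desc;
```

Running it twice a few seconds apart gives a rough idea of whether the backlog for a given component is draining or growing.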

2. GRP_STATUS=0/MSG_STATUS=2 – PROCESSED

This count is the number of processed messages, and it grows over time. A very high count (more than 1 million in the above example) indicates that the Resequencer purge is due and should be run soon to delete the processed messages.

 

3. GRP_STATUS=0/MSG_STATUS=5 – ABORTED

This count shows the number of messages that have been manually aborted by the administrator. Refer here for how Resequencer messages can be aborted using the SOA EM Console.

4. GRP_STATUS=1/MSG_STATUS=0 – LOCKED

This combination of states shows the messages within groups that are currently being processed. In a healthy system this number stays low, because messages belonging to locked groups are processed continuously by the Resequencer worker threads. Once messages stop arriving, this count should drop to zero.

A high count for this combination suggests that there are not enough worker threads to process messages at the rate at which groups are being locked. Increase the Mediator property “Resequencer Worker Threads” to boost the message processing rate.

Refer here to see how this property can be changed from EM Console
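
Before changing the thread count, it can help to see how many groups each node currently holds locked and how many messages are still pending inside them. This is a rough sketch that assumes the container_id column of mediator_group_status (used in Query3 of this post) identifies the node holding the lock; the aliases are illustrative.

```sql
-- Locked groups and their still-unprocessed messages
-- (GRP_STATUS=1/MSG_STATUS=0), per container.
select gs.container_id container,
       count(distinct gs.group_id) locked_groups,
       count(1) pending_msgs
from mediator_group_status gs, mediator_resequencer_message m
where m.group_id = gs.group_id
and gs.status = 1
and m.status = 0
group by gs.container_id
order by pending_msgs desc;
```

Many pending messages spread over only a few locked groups would point at slow processing within those groups rather than at the locking rate.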

 

5. GRP_STATUS=1/MSG_STATUS=2 – LOCKED

This count shows the number of messages that have already been processed within groups that are still locked. It is a transient state: once all messages for a locked group are processed, they move to GRP_STATUS=0/MSG_STATUS=2.

 

6. GRP_STATUS=3 – ERRORED

These are the messages belonging to errored groups, that is, messages that failed processing due to various errors. They need to be recovered manually from the EM Console or with the AIA Resubmission tool. If the messages are recovered and processed successfully, they transition to GRP_STATUS=0/MSG_STATUS=2. If the errors are not recoverable, the messages can be aborted from the EM Console, and they then move to GRP_STATUS=0/MSG_STATUS=5.

Refer to my earlier blog here for details on recovery of resequencer errors.
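
To size the recovery work, the errored groups can be counted per component together with the state of the messages inside them. This is a sketch using only the tables and columns from Query1; the aliases are illustrative.

```sql
-- Errored groups (GRP_STATUS=3) per component, split by message status.
select gs.component_dn,
       m.status msg_status,
       count(distinct gs.group_id) errored_groups,
       count(1) msgs
from mediator_group_status gs, mediator_resequencer_message m
where m.group_id = gs.group_id
and gs.status = 3
group by gs.component_dn, m.status
order by gs.component_dn, m.status;
```

Messages with msg_status 0 in an errored group are typically the ones waiting behind the failed message and will not be processed until the group is recovered or the failed message is aborted.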

 

Query2: Check ContainerID health

select * from MEDIATOR_CONTAINERID_LEASE;

The table below shows sample output of the above query from a 2-node clustered SOA installation.

Query 2 sample output

 

 

It shows the time when each node last renewed its Mediator container ID. These container ID renewals serve as heartbeats for the Mediator/Resequencer. They are vital for balancing the message load among the nodes and for failing over groups and messages that were allocated to nodes whose lease has expired.


Query3: Load Balance between cluster nodes

select to_char(sysdate,'YYYY-MM-DD HH24:MI:SS') time, gs.container_id container, gs.status group_status, m.status msg_status, count(1)
from mediator_group_status gs, mediator_resequencer_message m
where m.group_id = gs.group_id
and gs.status in (0,1)
and gs.component_status!=1
group by gs.container_id, gs.status, m.status
order by gs.container_id, gs.status;

The above query can be used to monitor how messages are balanced between the nodes of a cluster. The sample output below is from a 2-node clustered SOA environment.

Query 3 sample output

This sample output shows that the ready and locked messages are roughly evenly distributed across the cluster. If a major skew towards a specific container is observed, further analysis may be required. Thread dumps and diagnostic logs of the slower node may point to the cause of the skew.
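
The skew can also be expressed as a percentage per container. The sketch below reuses the join from Query3 and Oracle's RATIO_TO_REPORT analytic function; it counts only unprocessed messages (msg_status 0), and the aliases are illustrative. Drop the component_status filter if that column is not available in your environment.

```sql
-- Share of pending messages (ready or locked groups, msg_status 0)
-- held by each container, as a percentage of the cluster total.
select gs.container_id container,
       count(1) pending_msgs,
       round(100 * ratio_to_report(count(1)) over (), 1) pct_of_total
from mediator_group_status gs, mediator_resequencer_message m
where m.group_id = gs.group_id
and gs.status in (0,1)
and m.status = 0
and gs.component_status != 1
group by gs.container_id
order by pct_of_total desc;
```

In a 2-node cluster, values close to 50 for both containers would indicate an even distribution.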

 

Appendix:

The tables below list the important status values of the MEDIATOR_GROUP_STATUS and MEDIATOR_RESEQUENCER_MESSAGE tables and how the values can be interpreted.

Status value tables for MEDIATOR_GROUP_STATUS and MEDIATOR_RESEQUENCER_MESSAGE

Comments

  1. Here is an updated version of the “Query1: Check current health of resequencers”
    It shows decoded Strings for the grp_status and msg_status integer values.

    ===========================
    select to_char(sysdate,'YYYY-MM-DD HH24:MI:SS') time,
    decode(gs.status ,0,'READY',1,'LOCKED',3,'ERRORED',4,'TIMED_OUT',6,'GROUP_ERROR',null) GROUP_STATUS,
    decode(m.status ,0,'READY',2,'PROCESSED',3,'ERRORED',4,'TIMED_OUT',5,'ABORTED',null) MSG_STATUS,
    count(1), gs.component_dn
    from mediator_group_status gs, mediator_resequencer_message m
    where m.group_id = gs.group_id
    and gs.status <= 3
    -- and gs.component_status!=1
    -- uncomment the above clause only if the patch for bug fix 16289110 is applied
    group by gs.status, m.status, gs.component_dn
    order by gs.component_dn, gs.status;
    =================================

    sample output:
    =================================
    TIME GRP_STATUS MSG_STATUS COUNT COMPONENT_DN
    2014-07-24 11:24:04 READY PROCESSED 171045 default/UpdateSalesOrderOSMCFSCommsJMSConsumer!1.0/Consume_UPDSO_RS
    2014-07-24 11:24:04 READY ERRORED 6 default/UpdateSalesOrderOSMCFSCommsJMSConsumer!1.0/Consume_UPDSO_RS
    2014-07-24 11:24:04 LOCKED READY 20 default/UpdateSalesOrderOSMCFSCommsJMSConsumer!1.0/Consume_UPDSO_RS
    2014-07-24 11:24:04 LOCKED PROCESSED 23 default/UpdateSalesOrderOSMCFSCommsJMSConsumer!1.0/Consume_UPDSO_RS
    2014-07-24 11:24:04 ERRORED READY 574 default/UpdateSalesOrderOSMCFSCommsJMSConsumer!1.0/Consume_UPDSO_RS
    2014-07-24 11:24:04 ERRORED PROCESSED 14 default/UpdateSalesOrderOSMCFSCommsJMSConsumer!1.0/Consume_UPDSO_RS
    2014-07-24 11:24:04 ERRORED ERRORED 222 default/UpdateSalesOrderOSMCFSCommsJMSConsumer!1.0/Consume_UPDSO_RS

    =============================================
