Integration flows can fail at run-time with a variety of errors. The cause of these failures could be either Business errors or System errors. When Synchronous Integration Flows fail, they are restarted from the beginning. On the other hand, Asynchronous Integration flows when they error can potentially be resubmitted/recovered from designated/pre-configured milestones within the flow. These milestones could be persistence points like queues topics or database tables, where the state of the flow was last persisted. Recovery is a mechanism whereby a faulted Asynchronous Flow can be rerun from such a persistence milestone.
The SOA Suite 11g and AIA products provides various Automated and Manual recovery mechanisms to recover from asynchronous fault scenarios. They differ based on the SOA component that encounters the error. For instance, recovering from a BPEL fault may be quite different than recovering from a Resequencer fault. In this blog, we look at the various Manual Recovery mechanisms and options available to an end user. Manual recovery mechanisms require an Admin user to take appropriate action on the faulted instance from the Oracle Enterprise Manager Fusion Middleware Control [EM FMWC Console]
The intention of this blog is to provide a quick reference for Manual Recovery of Faults within the SOA and AIA contexts. It aims to present some of the valuable information regarding Manual recovery in one place. These are currently available across many sources such as SOA Developers Guide, SOA Admin Guide, AIAFP Developers Guide and AIAFP Infrastructure and Utilities Guide.
Next we look at the various Manual recovery mechanisms available in SOA Suite 11g and AIA, starting with the BPEL Message Recovery.
To understand the BPEL Message Recovery, let us briefly look into how BPEL Service engine performs asynchronous processing. Asynchronous BPEL processes use an intermediate Delivery Store in the SOA Infrastructure Database to store the incoming request. The message is then picked up and further BPEL processing happens in an Invoke Thread. The Invoke Thread is one among the free threads from the ‘Invoke Thread Pool’ configured for BPEL Service Engine. The processing of the message from the delivery Store onwards until the next dehydration in the BPEL process or the next commit point in the flow constitutes a transaction. Figure below shows at a high level the Asynchronous request handling by BPEL Invoke Thread.
Any unhandled errors during this processing will cause the message to roll back to the delivery Store. The Delivery Store acts as a safe milestone for any errors that cause the asynchronous BPEL processing to rollback. In such scenarios these messages sitting in the delivery store can be resubmitted for processing using the BPEL Message Recovery.
It is quite similar in case of Callback messages that arrive for in-flight BPEL process instances. The Callback messages are persisted in the Delivery Store and a free thread from the Engine Thread Pool will perform correlation and asynchronously process the callback activities. Callback messages from Faulted activities are available at the Delivery Store for Recovery.
Refer to the section from FMW Administrator’s Guide here - http://docs.oracle.com/cd/E28280_01/admin.1111/e10226/bp_config.htm#CEGFJJIF for details on configuring the BPEL Service Engine Threadpools.
The Recovery of these Invoke/Callback messages can be performed from the Oracle Enterprise Manager Fusion Middleware Control [EM FMWC Console] [SOA->Service Engine->BPEL->Recovery]. The admin user can search for recoverable messages by filtering based on available criteria on this page. The figure below shows the BPEL Engine Recovery page where the messages eligible for recovery are searched based on the message type and state.
During Recovery of these messages, the end user cannot make any modifications to the original payload. The messages marked recoverable can either be recovered or aborted. In the former case, the original message is simply redelivered for processing again.
The BPEL Configuration property ‘MaxRecoverAttempt’ determines the number of times a message can be recovered manually or automatically. Messages go to the exhausted state after reaching the MaxRecoverAttempt. They can be selected and 'Reset' back to make them available for manual/automatic recovery again.
The BPEL Service Engine can be configured to automatically recover failed messages, either on Server startup or during scheduled time periods. Refer to the section from FMW Administrator’s Guide here - http://docs.oracle.com/cd/E28280_01/admin.1111/e10226/bp_config.htm#CDEHIIFG for details on setting up auto recovery.
The Fault Handling for invocations from SOA Components can be enhanced, customized and externalized by using the Fault Management Framework (FMF). We will not go into the details of Fault Management Framework here. Refer to this a-team blog post here - http://www.ateam-oracle.com/fault-management-framework-by-example for insights into the FMF.
In short FMF, allows a Fault Policy with configurable Actions to be bound to SOA Component. These can be attached at the Composite, Component or Reference levels. The configured Actions will be executed when the invocation fails. The available Actions could be retry, abort, human intervention, custom java callout, etc. When the Action applied is human intervention the faults become available for Manual Recovery from the Oracle Enterprise Manager Fusion Middleware Control [EM FMWC Console]. They show up as recoverable instances in the faults tab of ‘SOA->faults and rejected messages’ as shown in the figure below
During the recovery, the Admin user can opt for one among Retry, Replay, Abort, Rethrow or Continue as the Recovery option. For Retry options, the EM User has access to the payload. The payload can be changed and resubmitted during recovery. While this is a useful feature, it could pose audit/security issue from an administrative perspective if it is not properly controlled using Users/Roles.
The retry action can also chain the execution into a custom java callout to do additional processing after a successful Retry. This is selected from the ‘After Successful Retry’ option during Retry. The custom java callout should be configured in the Fault Policy file attached to the composite.
Mediator Resequencer groups which end up in Errored or Timed Out states can be recovered from the EM Console by an Admin user. In fact Resequencer faults do not have other automated recovery mechanisms and rely on only Manual recovery by the admin for their remediation. Mediator Resequencer faults can be searched and filtered from the faults page of the Mediator component. Figure below shows a search of faults by the Resequencing Group.
An Errored group can be recovered by either choosing to Retry or Abort. Retry will reprocess the current failed message belonging to the faulted Group. In case of abort, the current failed message is marked as failed and processing will resume from the next in sequence available message for the faulted Group. In both cases the group itself is unlocked and set to ready so that it can process further messages. As can be seen in the figure below, the Admin user can modify the request payload during this recovery.
In case of Standard Resequencer, groups can end up as Timed Out when the next in sequence message does not arrive until the timeout period. Such groups can be recovered by skipping the missing message. Figure below shows such a recovery. In this case the processing of the group will continue from the next message rather than wait for the missing sequence id.
This section deals with Integrations built using the AIA Foundation Pack. Refer to the AIA Concepts and Technologies Guide at http://docs.oracle.com/cd/E28280_01/doc.1111/e17363/toc.htm to familiarize with the AIA concepts. Let us see a common design pattern employed in AIA Integrations. The Figure below is from the AIA Foundation Pack Developers Guide Document available and shows an architecture used for Guaranteed Message Delivery between Source and Target applications with no intermediate persistence points. The blocks shown are SOA Suite Composites. The Source and Target milestones are persistence points such as Queue, Topics or Database tables. The same design can also be enhanced to have multiple intermediate milestones in case of more complex flows.
Such Flows are commonly seen in AIA Pre Built Integrations which use Asynchronous flows to integrate systems. E.g. Communications O2C Integration Pack for Siebel, BRM and OSM. Refer to the Pre Built Integrations Documentation here- http://www.oracle.com/technetwork/apps-tech/aia/documentation/index.html
The salient points of this design are
The Faults can be recovered using the AIA Error Resubmission Utility. In AIA Foundation Pack 184.108.40.206.0 the AIA Error Resubmission Utility is a GUI utility and can be used for single or bulk fault recoveries. The Resubmission Utility can be accessed from AIA home-> Resubmission Utility of the AIA application, as shown in figure below. Earlier versions of AIA foundation Pack 11g only have a command line utility for Error Resubmission. This is available at <AIA_HOME>/util/AIAMessageResubmissionUtil.
Any fault within the flow will roll back to the previous milestone or recovery point and enable resubmission from that point. The milestones could be Queues, Topics or AQ destinations. The Queues and Topics designed to be milestones are associated with corresponding Error Destinations. This is where the faulted messages reside. The AIA Resubmission Utility simply redelivers the messages from the fault destination back to the milestone destination for reprocessing in case of Queue or Topic.
In the case of Resequencer errors, the Resequencer is the Recovery point and holds the message for resubmission. Note that Resequencer is not typically designed as a milestone in the flow but acts as a recovery point for Resequencer errors. For such errors, the AIA Resubmission utility recovers the failed Resequencer message and also unlocks the faulted Resequencer Group for further processing.
It is important to note here that the AIA Error Handling and Resubmission Mechanism is a designed solution. It relies on the fact that the Integration implements the principles and guidelines of AIA Foundation Pack and AIA Guaranteed Message Delivery pattern for its accurate functioning.
Refer to the AIA Foundation Pack Infrastructure and Utilities Guide at http://docs.oracle.com/cd/E28280_01/doc.1111/e17366/toc.htm for details of the AIA Error Handling Framework and AIA Resubmission utility. Refer to the AIA Foundation Pack Developers Guide at http://docs.oracle.com/cd/E28280_01/doc.1111/e17364/toc.htm for implementing AIA Error Handling and Recovery for the Guaranteed Message Delivery Pattern.
Let us next look at a use case from one of our customer engagements. It is a Custom Integration developed using AIA Guaranteed Message Delivery pattern and employing the AIA Resubmission utility for recovery. We can see how the above recovery mechanisms offer different options when designing a typical integration flow
Without going deep into the details, the figure below shows at a high level the design used for delivering messages to 3 End Systems using a Topic and 3 Topic Consumers. The BPEL ABCS components consume the canonical message, convert it to the respective Application specific formats and deliver to the End Systems. The requirement was the guarantee delivery of the message to each of the 3 systems within the flow, which the design achieves under normal circumstances.
However issues were observed at run-time for failure cases. When the message delivery fails for one of the Systems e.g. System B, the design caused a rollback of the message to the previous milestone which in this case is the Topic. The rolled back message residing in the Error destination is then redelivered to the Topic. The message is picked for processing again by all 3 Topic Consumers causing duplicate message delivered to Systems A and C.
This issue can be addressed in a few ways;
1) Introducing an intermediate milestone after the message is consumed off the topic. For instance we could introduce a queue to hold the converted messages. (Indicated by point 1 in the figure)
2) Use separate queues instead of a topic to hold the canonical messages.
In case of failures, only the message in the failed branch would have to be recovered using AIA Message Resubmission as seen in section above.
However, both these options introduce additional queues which need to be maintained by the operations team. Also if in future an additional end systems were to be introduced, it would necessitate adding new queues in addition to new JMS consumers and ABCS components.
3) Introduce a transaction boundary: This can be done by changing the BPEL ABCS component use an Asynchronous One Way Delivery Policy. In this case, any failures cause the message to rollback not to the topic but to the internal BPEL Delivery store. (Indicated by point 2 in the figure) These messages can then be recovered manually using Bpel Message Recovery as we saw in the first section above. The recovery is limited only to the faulted branch of the integration.
4) Another option is to employ Fault Policies. We can attach a Fault Policy to the BPEL ABCS component. The policy invokes the human intervention action for faults encountered during end system invoke. The message can then be manually recovered from the EM FMWC Console as seen in the SOA Fault Recovery Section above. This would apply only to the faulted branches and hence avoid the duplicate message delivery to the other End Systems.
Also another issue seen was that the end systems would lose messages that arrived when the consumers are offline. This problem can be addressed by configuring durable subscription for the Topic consumers. In the absence of Durable subscriptions, a Topic discards a message once it has been delivered to all the active subscribers. With Durable Subscribers, the message is retained until delivered to all the registered durable subscribers hence guaranteeing message delivery. Refer to the Adapters Users Guide here - http://docs.oracle.com/cd/E28280_01/integration.1111/e10231/adptr_jms.htm for details on configuring Durable Subscriptions for Topic Consumers.
Table below summarizes the different Manual recoveries that we have seen and their main characteristics
In this blog, we have seen the various Manual Fault Recovery mechanisms provided by SOA Suite and AIA 11g versions. We have also seen the security requirements for the Admin user to perform recovery and the variety of options available to handle the faults. This knowledge should enable us to design and administer robust integrations which have definite points of recovery.