This is part 1 of a four-part blog series explaining how the BPM engine functions under the covers when "faults" occur, be they unhandled technical faults or failures at the engine level.
Part 1 - will set the scene by explaining timeouts and their values & fault handling
Part 2 - will explain how the BPM engine handles messages, threads & transactions
Part 3 - will explain how & when the BPM engine rolls back transactions
Part 4 - will show how BPM messages can be recovered after a rolled back transaction
The BPM engine by its very nature will contain many long-running process instances, so it is essential that BPM project & operational teams understand how instances are handled inside the engine, how faults can be handled, how & why transaction rollbacks happen and how instances can be recovered.
Within Oracle Support, the A-Team, PTS and PM, we frequently hear of customers who have process instances that are “stuck”, or who have had a server failure and wonder where their instances have gone. In this document we will try to understand what has happened and how to recover cleanly.
One of the most important concepts to understand with the BPM engine is the level at which timeouts can occur: the JTA transaction, the BPM engine EJBs or the individual resource.
The broadest level of timeout inside SOA is the JTA (Java transaction) timeout. It can be set in the WebLogic Server Administration Console for the relevant domain.
Timeouts at the JTA level cannot be caught by the BPM process, either by a “catch” activity within the process or as part of an overall fault policy. The instances will, however, roll back to the last dehydration point (see later).
The BPM engine itself uses a number of EJBs to control threads. These also have timeout values, which should be set; they can be found in the WebLogic Server Administration Console under the soa-infra deployment.
Note that at the time of writing (PS6) BPMNActivityManagerBean does not have a timeout property, so it is necessary to apply a patch in order to set this value.
As with the JTA timeout, timeouts at the EJB level cannot be caught in the process or by a fault policy, but instances will roll back to the previous dehydration point.
The most local level of timeout is the resource timeout, e.g. a call to a database or a call to a web service times out. The timeout value can be set on the resource itself in the Enterprise Manager Console, e.g. for a database adapter in a composite.
...note that if the property does not appear it can be added as follows....
The general rule of thumb that should be followed is:
JTA Timeout > BPM EJB Timeout > Resource Timeout
Following this rule means the resource timeout expires before the EJB or JTA timeout can, so timeouts can be handled at the local level, i.e. caught by a “catch” activity within the process or by a fault policy.
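To make the rule concrete, here is a minimal Python sketch that checks a set of timeout values against this ordering and reports which level would expire first for a call that never returns. It only models the reasoning above. The class, the function names and the numbers are illustrative assumptions, not Oracle BPM APIs or recommended settings, and it assumes all three timers start at the same moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    """Timeout values in seconds (hypothetical numbers, for illustration only)."""
    jta: int       # global Java transaction timeout
    bpm_ejb: int   # timeout on the BPM engine EJBs
    resource: int  # timeout on the adapter or web service call


def follows_rule_of_thumb(t: Timeouts) -> bool:
    """True when JTA Timeout > BPM EJB Timeout > Resource Timeout."""
    return t.jta > t.bpm_ejb > t.resource


def first_to_fire(t: Timeouts) -> str:
    """Name the level whose timeout expires first for a call that never returns.

    Modelling choice: on a tie the broader level is reported, since the
    local fault can no longer be relied on to surface first.
    """
    levels = [("JTA", t.jta), ("BPM EJB", t.bpm_ejb), ("Resource", t.resource)]
    # min() returns the first minimal element, so ties resolve to the broader level.
    return min(levels, key=lambda level: level[1])[0]


def catchable_in_process(t: Timeouts) -> bool:
    """Only a resource-level timeout can reach a catch activity or fault policy."""
    return first_to_fire(t) == "Resource"


good = Timeouts(jta=600, bpm_ejb=300, resource=60)
bad = Timeouts(jta=30, bpm_ejb=300, resource=60)

print(follows_rule_of_thumb(good), first_to_fire(good), catchable_in_process(good))
print(follows_rule_of_thumb(bad), first_to_fire(bad), catchable_in_process(bad))
```

With the second set of values the JTA timeout expires before the resource call can time out, so the fault never reaches the process and the instance simply rolls back.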
Fault handling is covered in great detail both within the official Oracle SOA Suite documentation and in numerous blog entries elsewhere, so there will only be a cursory overview here.
As a general guideline, technical faults, such as a remote exception, should be caught by an appropriate policy in the fault policy framework. Business faults, such as “no account found”, should be caught within the process itself, either by a boundary catch activity or by a process-level catch activity. In either case the actions following a caught fault will probably follow a pattern similar to: retry “x” times with a “y” backoff and, if this still fails, direct to manual intervention. With the fault policy framework this results in the instance being recoverable in Enterprise Manager; with a catch within the process itself it results in a redirect to a manual activity. It is worth noting that in both cases it will be possible to manipulate the message data itself.
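The "retry x times with a y backoff, then manual intervention" pattern can be sketched in plain Python as follows. This is an illustration of the control flow only, not how the fault policy framework or a BPM catch activity is configured. `call_with_retry`, `ManualInterventionRequired` and `flaky_service` are hypothetical names, `ConnectionError` stands in for a technical fault, and doubling the wait between attempts is an assumption about what "backoff" means here.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ManualInterventionRequired(Exception):
    """Raised when all retries are exhausted and a person has to step in."""


def call_with_retry(
    action: Callable[[], T],
    retries: int,
    backoff_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`; on a technical fault retry up to `retries` times.

    The wait starts at `backoff_seconds` and doubles after every failed attempt.
    """
    if retries < 0:
        raise ValueError("retries must be zero or greater")
    delay = backoff_seconds
    for attempt in range(retries + 1):  # first attempt plus `retries` retries
        try:
            return action()
        except ConnectionError as fault:  # stand-in for a technical fault
            if attempt == retries:
                # Out of retries: hand over to a human instead of failing silently.
                raise ManualInterventionRequired(
                    f"still failing after {retries} retries"
                ) from fault
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")  # loop always returns or raises


attempts: List[int] = []
waits: List[float] = []


def flaky_service() -> str:
    """Hypothetical remote call that fails twice and then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("remote service unavailable")
    return "ok"


# Record the waits instead of really sleeping so the example runs instantly.
result = call_with_retry(flaky_service, retries=3, backoff_seconds=1.0, sleep=waits.append)
print(result, len(attempts), waits)
```

A business fault such as “no account found” would not go through this loop at all; retrying cannot fix it, which is why it is better routed straight to a catch activity in the process.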
Also worth noting is the Alter Flow functionality inside Oracle BPM, which allows business users to reposition the currently active business activity within the process instance and also to manipulate the instance data. This can be particularly useful when the instance is in a "suspended" state, possibly due to a "selection failure" caused by unassigned XML elements in the payload. In this case "Alter Flow" can be the only option for recovery. It is not covered as part of this series of blogs.
In this first part in the series we have covered some groundwork necessary for understanding BPM engine faults, rollback & recovery, primarily the various timeout values and the role of fault handling. In the next part we will take some typical BPM process patterns and show how the BPM engine handles messages, threads and transactions.