Here’s a cold, hard truth for sysadmins: in enterprise computing environments, no matter how stable, large, fast, or powerful the hardware or network, the probability of a hardware failure or a network glitch can never reach zero. The guaranteed-never-to-fail enterprise computing environment has not been invented yet. One strategy for making systems as highly available as possible is to design them with redundant components so that there is no single point of failure. This strategy is effective and in widespread use for mission-critical applications, but there is much more work involved than merely adding hardware and network components to a system. To overcome failures as quickly and seamlessly as possible, while protecting data consistency and providing users with an uninterrupted experience, application systems need to be designed and configured to recover in a coordinated manner when hardware or network components fail.
Oracle Fusion Middleware (FMW), the platform that acts as the foundation for Fusion Applications, incorporates several features to support “high availability” or “HA”, a term used here to describe components designed to keep application service levels as close to 24x7 as possible. It also provides reliable built-in recovery mechanisms designed to prevent transactional data from being lost or corrupted if normal application processing is interrupted for any reason. Automated failover and migration options both rely on duplicate hardware running redundant server instances. For high service levels, FMW allows managed servers to be set up in clusters in an active-active topology, so that if one server goes down, another is ready and able to pick up the pieces and carry on.
The objective of this post is not to cover all of the HA features built into Fusion Middleware. The published Fusion Middleware (FMW) and Fusion Applications (FA) Enterprise Deployment Guides (EDGs) cover most of the options and the details of how to build and configure high-availability architectures. A quick analysis of customer SRs, however, shows some confusion around setting up whole server migration, especially the operating-system- and network-specific pieces. Those topics will be the focus here.
Whole server migration (WSM) is the mechanism built into Oracle WebLogic Server that makes it possible to move an entire server instance, along with all of the services it owns, to a different physical machine when triggered by a platform failure event. Although it is possible, migrating a complete server instance in this manner is overkill for most service requirements: in the vast majority of cases, services can be set up to run on all managed server instances in a cluster, and failover of services from one server to another is transparent and automatic, without the need to migrate an entire server instance to another platform. A few critical services, however, with JMS and JTA transactions being the prime examples, have a different relationship with managed server clusters. These services cannot be pulled out of a failed server instance and restarted in a different, operational server instance; the server instance, IP address and all, has to be migrated together with them. Such services normally fall under the purview of SOA managed servers.
A tightly-orchestrated sequence of events with participation from WebLogic, Coherence, O/S, network, and database components gets triggered if a SOA server goes down in a Fusion Applications infrastructure that has been configured with whole server migration capabilities. Not so coincidentally, this is why the Fusion Apps Enterprise Deployment Guides recommend WSM for all SOA server clusters in an HA architecture. Here, at a high level, is a graphical breakdown of the migration process, assuming a three-node SOA server cluster with Managed Server 1 and Managed Server 2 instances running on Machine B and Machine C nodes, respectively, with an Admin Server instance running on Machine A and another machine, Machine D, participating in the cluster:
Of special interest here are steps 3, 4, and 5, and the Node Managers’ involvement with the operating system in migrating floating IP addresses from a failed server to a running one. If the cluster manager can contact the Node Manager for the failed managed server instance, that Node Manager can try to restart the server instance in place. If those restart attempts are unsuccessful, Node Manager brings down the floating IP network interface on the failed node by calling the wlsifconfig.sh script with the “-removeif” parameter. After the last attempt to restart the failed server, the Node Manager on an operational cluster member brings up the floating IP address on its node by calling wlsifconfig.sh with the “-addif” parameter. Finally, Node Manager attempts to restart the failed server instance on the migration target node. After a successful migration and restart, the soa-infra console and/or the Admin Server console (Domain -> Monitoring -> Migration) will show where the migrated SOA server is running.
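At the operating-system level, the “-removeif” and “-addif” steps boil down to a handful of ifconfig and arping calls, which wlsifconfig.sh wraps. Here is a minimal dry-run sketch of that sequence; the interface name, floating IP address, and netmask are illustrative assumptions, and the script only echoes what it would run:

```shell
#!/bin/sh
# Dry-run sketch of the network operations behind whole server migration.
# The interface, IP, and netmask values are assumptions for illustration.
IFACE=eth0
FLOAT_IP=10.0.0.50
NETMASK=255.255.255.0

CMDS=0
run() {                  # echo instead of execute; swap in "$@" (as root,
  echo "WOULD RUN: $*"   # or via sudo) to perform the operations for real
  CMDS=$((CMDS + 1))
}

# Equivalent of "-removeif" on the failed node: take the floating IP down
run /sbin/ifconfig "$IFACE:1" down
# Equivalent of "-addif" on the target node: bring the floating IP up...
run /sbin/ifconfig "$IFACE:1" "$FLOAT_IP" netmask "$NETMASK" up
# ...then send gratuitous ARP so the network learns the IP's new MAC address
run /sbin/arping -q -U -c 3 -I "$IFACE" "$FLOAT_IP"
```

The gratuitous ARP at the end matters: without it, switches and peers may keep sending traffic for the floating IP to the dead node’s MAC address until their ARP caches expire.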
The ability of Node Manager to migrate the floating IP address from a failed to an operational node is essential for whole server migration to work correctly. Therefore, setting up the necessary O/S-level permissions to run network configuration commands (which would normally be restricted to root) is a strict requirement. Here are the necessary steps for setting up the O/S user with sudo access to the WebLogic scripts and the O/S-level commands embedded in the wlsifconfig.sh script:
/etc/sudoers configuration (or equivalent) per operating system, assuming "oracle" is the O/S user installing Fusion Applications:

OEL 5:
    Defaults:oracle !requiretty
    oracle ALL=NOPASSWD: /sbin/ifconfig,/sbin/arping

OEL 4:
    oracle ALL=NOPASSWD: /sbin/ifconfig,/sbin/arping

Solaris:
    I have not found a definitive source of Solaris-specific sudo setup options, but will add any pertinent information here as it’s discovered. Note that Solaris also offers other options, notably using the security profile database to add administrative permissions to a user profile.

AIX:
    oracle ALL=(root) NOPASSWD: /sbin/ifconfig,/sbin/arping
    I haven’t found any AIX-specific directives for wlsifconfig.sh setup, which does not necessarily mean that there isn’t anything specific to do for this O/S.

Windows:
    sudo is not applicable to the Windows operating system. As an alternative, run Node Manager as Administrator.
The above examples are meant as guidance only. In most cases it will be necessary for Fusion Apps deployment teams to work with O/S system administrators in their organizations to comply with organization-specific security policies.
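As a quick sanity check before testing migration itself, the sudoers entries can be grepped for the two commands Node Manager needs. The sketch below parses an inline copy of the OEL 5 fragment above so it can run anywhere; in practice, run “sudo -l” as the oracle user on each node and inspect its output instead:

```shell
#!/bin/sh
# Sanity-check a sudoers fragment for the passwordless entries whole server
# migration needs. The fragment mirrors the OEL 5 example above; in practice,
# inspect the output of "sudo -l" run as the oracle user.
sudoers_fragment='Defaults:oracle !requiretty
oracle ALL=NOPASSWD: /sbin/ifconfig,/sbin/arping'

missing=0
for cmd in /sbin/ifconfig /sbin/arping; do
  if printf '%s\n' "$sudoers_fragment" | grep -q "NOPASSWD:.*$cmd"; then
    echo "OK: $cmd allowed without a password"
  else
    echo "MISSING: $cmd must be added to sudoers" >&2
    missing=$((missing + 1))
  fi
done
```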
Once sudo permissions for the O/S user are set up, it should be possible to run a component test of the wlsifconfig.sh script to ensure that all prerequisites are satisfied. Perform each of the following tasks on every node in the SOA server cluster:
1. Set the network interface and netmask used by the script:
       Interface=<correct network interface (e.g. eth0)>
       NetMask=<netmask (e.g. 255.255.255.0)>
2. Bring the floating IP interface up:
       ./wlsifconfig.sh -addif eth0
3. Take the floating IP interface back down:
       ./wlsifconfig.sh -removeif eth0
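To confirm that the “-addif” step actually bound the floating IP, the interface can be checked with ifconfig. A minimal sketch of that check follows; it uses canned sample output so the parsing logic can be shown without root, and the interface name and address are assumptions. In practice, pipe the output of “/sbin/ifconfig” for the floating interface instead:

```shell
#!/bin/sh
# Sketch: confirm a floating IP is bound after "-addif". The ifconfig output
# below is canned sample data (Linux-style); the interface and IP address are
# assumptions for illustration.
FLOAT_IP=10.0.0.50
sample_output='eth0:1    Link encap:Ethernet  HWaddr 00:16:3E:12:34:56
          inet addr:10.0.0.50  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1'

if printf '%s\n' "$sample_output" | grep -q "inet addr:$FLOAT_IP "; then
  echo "floating IP $FLOAT_IP is up"
  bound=1
else
  echo "floating IP $FLOAT_IP not found on the interface" >&2
  bound=0
fi
```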
Debugging O/S, network, and permissions issues on this level of granularity is far simpler and isolates problems far more quickly than simulating a failure and waiting for Node Manager to either succeed or fail with the IP address migration. Of course, after successfully running the wlsifconfig.sh script in isolated mode, it is still necessary to test the complete whole server migration sequence end-to-end.
If there is a requirement for an Oracle Fusion Applications pillar to be available 24x7, whole server migration is one of several alternatives in FMW to address the unpredictability of enterprise hardware and network resources. Configuring whole server migration is far more efficient and less prone to error if the O/S-specific and network-specific setup tasks are tested independently from the process as a whole.
For more information, check these MOS Knowledge Base articles: