Floating IP Addresses and Whole Server Migration in Fusion Applications

How to Configure and Test Whole Server Migration Network Components


Here’s a cold, hard truth for sysadmins: in enterprise computing environments, no matter how stable, large, fast, or powerful the hardware or network, the probability of a hardware failure or network glitch never reaches zero.  The guaranteed-never-to-fail enterprise computing environment has not been invented yet.  One strategy for making systems as highly available as possible is to design them with redundant components so that there are no single points of failure.  This strategy is effective and in widespread use for mission-critical applications, but it involves much more work than merely adding hardware and network components to a system.  If the potential failures that challenge mission-critical application availability are to be overcome quickly and seamlessly, while protecting data consistency and giving users an uninterrupted experience, then application systems must be designed and configured to recover in a coordinated manner when hardware or network components fail.

Oracle Fusion Middleware (FMW), the platform that acts as the foundation for Fusion Applications, incorporates several features to support “high availability” or “HA”, a term used here to describe components designed to keep application service levels as close to 24×7 as possible.  It also provides reliable built-in recovery mechanisms designed to prevent transactional data from being lost or corrupted if normal application processing is interrupted for any reason.  Automated fail-over and migration options both rely on duplicate hardware running redundant server instances as a key part of the mix.  For high service levels, FMW allows managed servers to be set up in clusters in an active-active topology, so that if one server goes down, another is ready and able to pick up the pieces and carry on.

The objective of this post is not to cover all of the HA features built into Fusion Middleware.  The published Fusion Middleware (FMW) and Fusion Applications (FA) Enterprise Deployment Guides (EDGs) cover most of the options and details on how to build and configure high-availability architectures.  But a quick analysis of customer SRs shows that there has been some confusion around setting up whole server migration, especially with the operating system and network-specific pieces.  Those topics will be the focus here.

Main Article

Whole server migration (WSM) is the mechanism built into Oracle WebLogic server that makes it possible to move an entire server instance, along with all of the services it owns, to a different physical machine when triggered by a platform failure event.  Although it is possible, migrating a complete server instance in this manner is probably overkill for most service requirements, because in the vast majority of cases, services can be set up to run on all managed server instances in a cluster, and failover of services from one server to another is transparent and automatic without the need to migrate an entire server instance to another platform.  A few critical services, however, with JMS and JTA transactions being the prime examples, have a different relationship with managed server clusters.  These services cannot be pulled out of a failed server instance and restarted in a different operational server instance; the server instance, IP address and all, has to be migrated together with these special services.  Such services normally fall under the purview of SOA managed servers.

A tightly-orchestrated sequence of events with participation from WebLogic, Coherence, O/S, network, and database components gets triggered if a SOA server goes down in a Fusion Applications infrastructure that has been configured with whole server migration capabilities.  Not so coincidentally, this is why the Fusion Apps Enterprise Deployment Guides recommend WSM for all SOA server clusters in an HA architecture.  Here, at a high level, is a breakdown of the migration process, assuming a three-node SOA server cluster with Managed Server 1 and Managed Server 2 instances running on Machine B and Machine C nodes, respectively, with an Admin Server instance running on Machine A and another machine, Machine D, participating in the cluster:

  1. Machine C node fails.
  2. Cluster manager, reading database leasing table, detects lease expiration for Managed Server 2.
  3. Cluster manager tries, but fails, to contact Node Manager for Machine C.
  4. Cluster manager contacts Machine D’s Node Manager service.
  5. Machine D Node Manager starts Managed Server 2.
  6. Managed Server 2 starts, obtaining configuration from the Admin Server.
  7. Managed Server 2 caches configuration.
  8. Managed Server 2 obtains server lease through leasing db interface.

Of special interest here are steps 3, 4 and 5, and the Node Managers’ involvement with the operating system in getting floating IP addresses migrated from a failed server to a running server.  If the cluster manager can contact the Node Manager for the failed managed server instance, then Node Manager can try to restart the server instance.  Assuming unsuccessful restart attempts, Node Manager will then bring down the floating IP network interface on the failed node by calling the wlsifconfig.sh script with the “-removeif” parameter.  After the last attempt to restart the failed server, Node Manager on an operational server cluster member will bring up the floating IP address on the functioning node by calling the wlsifconfig.sh script with the “-addif” parameter.  Finally, Node Manager will attempt to restart the failed server instance on the migration target node.  After a successful migration and restart, the soa-infra console and/or Admin Server console (Domain -> Monitoring -> Migration) will verify where the migrated SOA server is running.
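Conceptually, the “-removeif”/“-addif” handoff boils down to dropping a floating IP alias on one host, raising it on another, and sending gratuitous ARP so the rest of the network learns the new MAC address. Here is a dry-run sketch of that idea; the interface, alias, IP, and netmask values are illustrative placeholders, and the real wlsifconfig.sh script handles many more cases:

```shell
#!/bin/bash
# Dry-run sketch of the floating-IP move performed during whole server
# migration. DRYRUN defaults to echo, so the commands are only printed;
# set DRYRUN= (empty) to execute for real (requires the sudo grants
# described below). All interface/address values are examples only.
DRYRUN=${DRYRUN:-echo}
IFACE=eth0            # physical interface hosting the floating IP
ALIAS=$IFACE:1        # logical sub-interface used for the floating IP
FLOAT_IP=10.0.0.50
NETMASK=255.255.255.0

remove_if() {   # run on the failed node (by its Node Manager, if reachable)
  $DRYRUN sudo /sbin/ifconfig "$ALIAS" down
}

add_if() {      # run on the migration target node
  $DRYRUN sudo /sbin/ifconfig "$ALIAS" "$FLOAT_IP" netmask "$NETMASK" up
  # Gratuitous ARP so neighboring ARP caches update to the new MAC
  $DRYRUN sudo /sbin/arping -q -U -c 3 -I "$IFACE" "$FLOAT_IP"
}

remove_if
add_if
```

Running the sketch as-is just prints the commands, which is useful for seeing the sequence without touching any interfaces.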

The ability of Node Manager to migrate the floating IP address from failed to operational nodes is essential for whole server migration to work correctly.  Therefore, setting up the necessary O/S-level permissions to run network configuration commands (normally restricted to root) is a strict requirement. Here are the steps for setting up the O/S user with sudo access to the WebLogic scripts and the O/S-level commands embedded in the wlsifconfig.sh script:

  1. Ensure that the PATH environment variable for the NodeManager process includes the directories housing the wlsifconfig.sh and wlscontrol.sh scripts and the nodemanager.domains configuration file:
    1. For FA, wlsifconfig.sh is located in the ../config/domains/<HostName>/<DomainName>/bin/server_migration directory.
    2. wlscontrol.sh is located in the ../products/fusionapps/wlserver_10.3/common/bin directory.
    3. nodemanager.domains is located in the ../config/nodemanager/<HostName> directory.
  2. Generic sudo configuration (see table below for OS-specific setups):
    1. Configure sudo to work without prompting for a password.
    2. Grant sudo to the O/S user (“oracle”) with no password, including execute privilege on the /sbin/ifconfig and /sbin/arping executables.  If Node Manager receives a password prompt after attempting to run sudo /sbin/ifconfig, the migration will fail.
    3. Include the “!requiretty” parameter in the /etc/sudoers file when required by the O/S, although this could have negative security side-effects.
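One way to sanity-check the no-password requirement from the oracle account before involving Node Manager is to ask sudo in non-interactive mode whether the commands are permitted; the -n flag makes sudo fail immediately instead of prompting, which is exactly the failure mode that breaks a migration. A minimal sketch, using the Linux command paths referenced by wlsifconfig.sh:

```shell
#!/bin/bash
# Ask sudo, without ever prompting for a password, whether a command is
# permitted for the current user. "sudo -n -l <cmd>" exits non-zero if a
# password would be required or the command is not granted.
check_passwordless_sudo() {
  local cmd=$1
  if sudo -n -l "$cmd" >/dev/null 2>&1; then
    echo "OK: passwordless sudo for $cmd"
  else
    echo "FAIL: sudo would prompt for (or denies) $cmd"
  fi
}

for cmd in /sbin/ifconfig /sbin/arping; do
  check_passwordless_sudo "$cmd"
done
```

Run this as the oracle user on each candidate migration node; any FAIL line means the sudoers entries below still need attention.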

Sudo Setup Parameters for Fusion Apps-Certified Operating Systems

Per-OS /etc/sudoers configuration (or equivalent), assuming “oracle” is the O/S user installing FA:

OEL 5:
  Defaults:oracle !requiretty
  oracle ALL=NOPASSWD: /sbin/ifconfig,/sbin/arping

OEL 4:
  oracle ALL=NOPASSWD: /sbin/ifconfig,/sbin/arping

Solaris:
  I have not found a definitive source of Solaris-specific sudo setup options, but will add any pertinent information here as it’s discovered.  In Solaris there are other available options, notably using the security profile database to add administrative permissions to a user profile.  Known wlsifconfig.sh adjustments:
  1. Replace $SUDO with /opt/bin/sudo, or make sure sudo is in the default path.
  2. For Solaris 5.10, change the shebang from #!/bin/sh to #!/bin/bash.
  3. Replace line ~907 with Interface=vnet0 or xnf0 (or as applicable).

AIX:
  oracle ALL=(root) NOPASSWD: /sbin/ifconfig,/sbin/arping
  I haven’t found any AIX-specific directives for wlsifconfig.sh setup, which does not necessarily mean that there isn’t anything specific to do for this O/S.

Windows:
  sudo is not applicable to the Windows operating system.  As an alternative, run Node Manager as Administrator.

The above examples are meant as guidance only. In most cases it will be necessary for Fusion Apps deployment teams to work with O/S system administrators in their organizations to comply with organization-specific security policies.

Once sudo permissions for the O/S user are set up, it should be possible to run a component test on the wlsifconfig.sh script to ensure that all prerequisites are satisfied. Each of these tasks should be performed on each node in the SOA server cluster:

  1. Use /sbin/ifconfig to check that floating IP network interfaces have been added correctly.
  2. Append these parameters to the nodemanager.properties file:

    Interface=<correct network interface (e.g. eth0)>
    NetMask=<netmask>

  3. Confirm proper operation of floating IP migration:

    export ServerDir=/tmp
    cd ../../bin/server_migration
    ./wlsifconfig.sh -addif eth0
    ./wlsifconfig.sh -removeif eth0

Debugging O/S, network, and permissions issues on this level of granularity is far simpler and isolates problems far more quickly than simulating a failure and waiting for Node Manager to either succeed or fail with the IP address migration. Of course, after successfully running the wlsifconfig.sh script in isolated mode, it is still necessary to test the complete whole server migration sequence end-to-end.
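The component test above also lends itself to a small wrapper script that exercises the full add/verify/remove cycle in one shot. This is a dry-run sketch: RUN defaults to echo so it only prints the commands; clear it and run from the server_migration directory to execute for real, and adjust the interface name for your environment:

```shell
#!/bin/bash
# Dry-run wrapper for the wlsifconfig.sh component test. RUN defaults to
# echo, so commands are printed rather than executed; set RUN= (empty)
# and run from the .../bin/server_migration directory to execute them.
RUN=${RUN:-echo}
IFACE=eth0                 # interface carrying the floating IP (example)
export ServerDir=/tmp

component_test() {
  $RUN ./wlsifconfig.sh -addif "$IFACE"      # bring the floating IP up
  $RUN /sbin/ifconfig "$IFACE"               # verify the alias is visible
  $RUN ./wlsifconfig.sh -removeif "$IFACE"   # tear it back down
}

component_test
```

If any step fails when executed for real, the sudo and PATH prerequisites above are the first things to re-check.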


If there is a requirement for an Oracle Fusion Applications pillar to be available 24×7, whole server migration is one of several alternatives in FMW to address the unpredictability of enterprise hardware and network resources. Configuring whole server migration is far more efficient and less prone to error if the O/S-specific and network-specific setup tasks are tested independently from the process as a whole.

For more information check these MOS KnowledgeBase articles:

  • 1504291.1 Whole Server Migration fails to start after a whole server migration
  • 1491667.1 WLS Whole Server Migration – you must have a tty to run sudo.
  • 1333851.1 Whole Server Migration Failing on Windows
  • 1404243.1 Debug Tracing of Whole Server Migration
  • 1401331.1 FAQ Service/Server Level Migration for SOA and OSB 11g
  • 1175789.1 Master Note for SOA 11g Clustering and HA


  1. Sridhar Yenamandra says:

    Thanks for such a wonderful article demystifying some of the WSM concepts. While the article explains the need for WSM and the way to achieve it, I would like to understand the need for WSM in a more realistic fashion. Consider the cases below. I am really not finding the answers to these anywhere, and they are really getting tough to test too. So please see if you can answer them.

    Service A is long running asynchronous process with some orchestration logic.
    Service B is a long running async process that puts a message on to a distributed JMS queue.
    Service C is a long running async process with some orchestration logic.
    All of these services are deployed to the SOA cluster with two managed servers: WLS_SOA1 and WLS_SOA2.

    What is going to happen in each of these cases a) without whole server migration enabled, and b) with whole server migration enabled?

    CASE 1: Server stopped abruptly while the transaction that started the Service A process has not yet committed.

    1. Client has called Service A using the load balancer. WLS_SOA1 is picked (by LB) for servicing the client request.
    2. Service A has started with a new transaction T1 (initiated by dispatcher).
    3. WLS_SOA1 has suddenly stopped (while T1 is being processed).

    CASE 2: Server stopped abruptly while the process is still running after committing the initial transaction.

    1. Client has called Service A using the load balancer. WLS_SOA2 is picked (by LB) for servicing the client request.
    2. Service A has started with a new transaction T1.
    3. Service A has reached a dehydration point in the process. T1 is committed. T2 is started.
    4. WLS_SOA1 has suddenly stopped (while T2 is being processed).

    CASE 3: Server stopped after a JMS message has been posted by a Service that is still in process.

    1. Client has called Service B using the load balancer. WLS_SOA1 is picked (by LB) for servicing the client request.
    2. Service B has started with a new transaction T1 (initiated by dispatcher).
    3. Service B has called the JMS Adapter (asynchronously) and published the message onto a Distributed Queue. The message lands in queue A of WLS_SOA1.
    4. Service B continues its flow for the rest of the orchestration activities. Transaction T1 is still running.
    5. WLS_SOA1 has suddenly stopped (while T1 is being processed).

    CASE 4: Server stopped while the child process is still running.

    1. Client has called Service B using the load balancer. WLS_SOA2 is picked (by LB) for servicing the client request.
    2. Service A has started with a new transaction T1 (initiated by dispatcher).
    3. Service A calls Service C (with the transaction property set as requiresNew, i.e., the default one). T1 is suspended.
    4. A new transaction T2 gets started to process the Service C flow. Service C is processing (T2 is running), and then...
    5. WLS_SOA1 has suddenly stopped (while T2 is being processed; T1 is in suspended state).

    The cases are in the context of failover vs. whole server migration in high availability.
    So please let me know what you think the possible outcome would be in each of these cases, with and without WSM.
