Recently a colleague of mine (Kumar) and I faced a scenario where we needed to provide disaster recovery (DR) functionality between two OCI regions. Supporting some form of DR between different geographical locations is a fairly common requirement. It has been standard practice in traditional enterprise data centers for years, and that has not changed with the adoption of cloud technologies. Indeed, as cloud consumption has risen, so has the need for enterprises to achieve higher levels of redundancy in their architectures. The agility the cloud offers lends itself to increased resiliency for the applications residing there.
While this sounds great in theory, it’s not always easy to implement. Comparing the growth of traditional data center networking with where the cloud is today, there are definite gaps and functional disparities that we have to work around. In many scenarios, achieving like-for-like functionality with on-premises counterparts is feasible, but accomplished in a vastly different manner in each environment. Take virtual IP addresses (VIPs, also called “floating” IP addresses): VIPs can be implemented both in the cloud and on-premises, but the way they’re facilitated is very different (on-premises usually relies on gratuitous ARP (GARP), while OCI, like many clouds, relies on API interactions).
NOTE: This document does not go into the actual configuration of Oracle Dyn or OCI, but rather focuses on the high-level theory of how a solution might be engineered. Please refer to the documentation for each specific product/feature for more information.
In this article we’re going to look at how to achieve DR functionality in an OCI environment. We’ll focus specifically on a true active/passive scenario (not active/active, which is sometimes desired but not covered here).
Within our focus, we’ll be looking at how external users coming in over the public Internet can access our applications hosted in OCI, utilizing DR between two OCI regions. While there’s value in discussing how internal users might access the same resources over FastConnect (or VPN), that’s outside the scope of this article. Let’s focus on external users accessing resources over public-facing interfaces (the Internet).
Traditional on-premises data center networking addresses DR scenarios in multiple ways. From a network infrastructure perspective, we’ll often advertise different BGP prefixes over a registered autonomous system number (ASN) to multiple Internet service providers, configuring the BGP path attributes so as to prefer one location (or path) over another. Sometimes BGP communities are used, but AS-path prepending is one of the most common and portable ways to manipulate inbound traffic flows. A step higher in the stack, global server load balancing (GSLB) solutions are often used. GSLB operates at the DNS level, receiving DNS queries and crafting DNS responses based on configured parameters. Sometimes GSLB sends different responses based on the geolocation of the client (directing the client to the physically closest IP address) or other parameters of interest to the organization. Ultimately, GSLB allows traffic to be routed to different IPs based on criteria the organization configures.
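To make the GSLB idea concrete, here’s a minimal sketch of the decision a GSLB makes at the DNS level: given a client’s region and the health state of each site, pick the IP to place in the DNS response. All site names, regions, and IP addresses here are invented for illustration; this is not any vendor’s implementation.

```python
# Invented site inventory: each site has a public IP and a health state
# (in a real GSLB, health would come from active monitoring probes).
SITES = {
    "frankfurt": {"ip": "203.0.113.10", "healthy": True},
    "ashburn":   {"ip": "198.51.100.10", "healthy": True},
}

# Which site is "closest" for a given client region (illustrative only).
PREFERRED = {"eu": "frankfurt", "us": "ashburn"}

def gslb_answer(client_region: str) -> str:
    """Return the A-record value a GSLB might hand back for this client."""
    primary = PREFERRED.get(client_region, "frankfurt")
    if SITES[primary]["healthy"]:
        return SITES[primary]["ip"]
    # Preferred site is down: answer with any other healthy site.
    for name, site in SITES.items():
        if name != primary and site["healthy"]:
            return site["ip"]
    raise RuntimeError("no healthy sites available")
```

A European client would normally be answered with the Frankfurt IP, but if Frankfurt’s health check fails, the same query gets the Ashburn IP instead; the routing change happens entirely in the DNS response.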
Cloud environments can be fronted by traditional GSLBs; however, that’s not always necessary (or desirable), particularly if we want a complete cloud solution with no external dependencies such as traditional, on-premises GSLBs. Let’s explore how this can be done using some of the tools in Oracle’s technology stack!
For our fictitious example, we have two regions in use: Frankfurt and Ashburn. Let’s say we’re hosting a public-facing web application used by both employees and outside business partners; it’s accessible over the Internet. We need to ensure the application experiences minimal downtime from any sort of disruption to the OCI regions we rely on. Frankfurt is the primary; however, if the application experiences any outage there (and is not accessible), we need to fail over to Ashburn.
We can assume both regions use some sort of load balancer (LB) as the front-end (with a public IP address associated with it). Whether traffic hits an LB or goes directly to a server doesn’t really matter in this solution, so long as we have two public IP addresses accessible to the outside world. With the basic premise out of the way, how do we make the failover happen? The answer is right within the Oracle portfolio: Oracle Dyn.
Before we start on the solution, meet Oracle Dyn, a product within the Oracle edge services portfolio. Oracle Dyn offers web application security, DNS management, DDoS mitigation, and a variety of other useful features, all aimed at guarding the edge of modern networks (cloud or on-premises).
Using Oracle Dyn, here's a very high-level diagram of what we built out as a solution:
Failover Scenario Using Oracle Dyn
In our scenario, Oracle Dyn "owns" the DNS FQDN used for our web application. This will allow us to let Oracle Dyn functionality shine and help us "route" traffic to the ideal site.
There are at least two ways of facilitating this design using Oracle Dyn: Active Failover and Traffic Director.
Let’s look at each in greater detail.
Active Failover is a feature within Oracle Dyn that allows you to associate a primary IP address (or fully qualified domain name (FQDN)) with a secondary (backup, or failover) IP address or FQDN.
To set this up, pick a zone in your Oracle Dyn management portal, then select an existing node or create a new one. Within the node settings, add the Active Failover service to the node. There’s no need to define an A-record for it (or any other record), as Active Failover will effectively handle this for you. Once in the Active Failover configuration for your node, you may set the TTL for the record, as well as the primary and failover IPs (or FQDNs). Health monitoring may be configured, letting you choose from several protocols to monitor the primary IP/FQDN.
You may set up different notification/alerting services and view log entries, as well as control the behavior in a failover scenario (such as whether the service should automatically fail back to the primary IP/FQDN or require manual user intervention).
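The essential behavior just described can be sketched as a small state machine: probe the primary, and after enough consecutive failed probes, start answering DNS queries with the failover address (and optionally fail back automatically once the primary recovers). This is purely an illustration of the concept, assuming an automatic fail-back policy; it is not the Oracle Dyn implementation or its API, and the IPs and threshold are invented.

```python
class ActiveFailover:
    """Conceptual model of an active-failover DNS service."""

    def __init__(self, primary_ip: str, failover_ip: str, max_failures: int = 3):
        self.primary_ip = primary_ip
        self.failover_ip = failover_ip
        self.max_failures = max_failures  # consecutive failed probes before failing over
        self.failures = 0

    def record_probe(self, primary_ok: bool) -> None:
        """Feed in the result of one health-check probe of the primary.

        A successful probe resets the counter, which models automatic
        fail-back to the primary once it recovers."""
        self.failures = 0 if primary_ok else self.failures + 1

    def current_answer(self) -> str:
        """The IP to serve in DNS responses right now."""
        if self.failures >= self.max_failures:
            return self.failover_ip
        return self.primary_ip
```

With a three-strike threshold, a single flapping probe doesn’t trigger failover, but a sustained outage of the primary shifts all new DNS answers to the failover IP.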
Traffic Director is a robust feature within Oracle Dyn that allows much more granular control over the routing of traffic by controlling DNS responses; in effect, it is a cloud-based GSLB.
With Traffic Director (TD), we’re able to define more levels of failover: rather than a primary-and-secondary-only model, we can support tertiary and quaternary failover targets in our environment! TD also allows us to specify different rulesets and set the response based on geolocation (this is optional, but a nice feature).
We also have a much greater level of control over which IPs are used, as we can weight different IPs as well as monitor them. Speaking of monitoring, TD can monitor the primary, secondary, tertiary, etc. IPs.
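The combination of ordered failover tiers and per-IP weights can be sketched as follows: walk the pools in order (primary, secondary, tertiary, …) and, within the first pool that has a healthy member, pick an answer biased by weight. The pools, weights, and IPs below are invented for illustration; this is a conceptual model, not Traffic Director’s actual data model or API.

```python
import random

# Ordered failover pools: earlier pools are preferred; each member
# carries a weight and a health state (invented example data).
POOLS = [
    [  # primary pool (e.g. two Frankfurt front-ends, weighted 2:1)
        {"ip": "203.0.113.10", "weight": 2, "healthy": True},
        {"ip": "203.0.113.11", "weight": 1, "healthy": True},
    ],
    [  # secondary pool (e.g. Ashburn)
        {"ip": "198.51.100.10", "weight": 1, "healthy": True},
    ],
]

def select_answer(pools=POOLS, rng=random) -> str:
    """Return the IP to answer with: first pool with a healthy member wins,
    and weights bias the choice within that pool."""
    for pool in pools:
        healthy = [m for m in pool if m["healthy"]]
        if healthy:
            weights = [m["weight"] for m in healthy]
            return rng.choices(healthy, weights=weights, k=1)[0]["ip"]
    raise RuntimeError("all failover pools exhausted")
```

While the primary pool is healthy, answers come only from Frankfurt (roughly two-thirds to the heavier-weighted IP); once every primary member is marked unhealthy, answers shift to the secondary pool, and additional pools could extend this to tertiary and quaternary tiers.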
My description of TD is fairly short, but don’t let that fool you: TD is by far the more robust and feature-rich of the two solutions. There are a lot more “nerd knobs” that can be adjusted with TD. At the same time, it’s still very fast and easy to get a functional solution going with it.
In scenarios where external users need redundant access to a publicly-accessible IP address via FQDN (with DR support), I highly encourage considering Oracle Dyn for the solution.
If you need a simple active/standby solution, consider using Active Failover. If you have more than two regions or need more granular controls (such as geolocation-based routing, weighting, etc.), then Traffic Director may be the better fit.
Both of these Oracle Dyn features allow for a cloud-based solution that supports inter-region DR. This is a great fit for the many customers hosting an application in one of our OCI regions who need automatic failover to another region, should the application become unavailable.
Neither of these solutions provides instantaneous failover; it can take up to several minutes for the health monitoring to recognize a failure, depending on how it’s configured. For many organizations, sub-second failover is desirable but not necessary, so a couple of minutes of downtime during a catastrophic failure (such as an application becoming unavailable in a region) is acceptable.
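The rough arithmetic behind that “several minutes” is worth spelling out: the monitor must observe some number of consecutive failed probes before declaring the primary down, and clients may then hold a cached DNS answer for up to the record’s TTL before re-resolving. The numbers below are hypothetical, purely to show the shape of the calculation.

```python
def worst_case_failover_seconds(probe_interval: int,
                                failures_required: int,
                                record_ttl: int) -> int:
    """Rough upper bound on DNS-based failover time:
    detection time (probe interval x consecutive failures required)
    plus the record TTL clients may still have cached."""
    detection = probe_interval * failures_required
    return detection + record_ttl

# Hypothetical settings: 60 s probes, 3 strikes, 30 s TTL
# -> up to 60*3 + 30 = 210 seconds of disruption
```

This also shows the lever you control: lowering the TTL and probe interval shortens failover time, at the cost of more DNS queries and health-check traffic.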