This document describes how applications that are built on Oracle Kubernetes Engine (OKE) can continue operating even if an entire geographic region of Oracle Cloud Infrastructure (OCI) is lost. A basic knowledge of OCI is assumed.
Each scenario builds upon the previous ones, describing the incremental design differences that arise in more challenging scenarios.
Scenarios start with a simple active-passive stateless application, and then add the requirements of stateful services and active-active operation.
Some applications are so critical that processing must be able to continue even if all availability domains within a region are lost. Disaster Recovery (DR) focuses on minimizing data loss and ensuring the ability to resume processing in the event of the loss of a geographic region. While basic DR can be achieved via offsite backups (and a potentially lengthy recovery process), the simplest way to achieve DR while maintaining application availability in the event of the loss of an entire region is to configure an active-passive multi-region deployment.
For an OKE-based application, this will require the following components:
In this model, the state of the database ultimately defines which region is active and which region is passive. The role of the global load balancer is to route each client request to the active region, minimizing the probability that an external client sees a request failure. The Kubernetes cluster remains active in both regions, but only receives requests when the load balancer sends them, and can only successfully complete requests in the active region, since the passive region's database will reject updates.
It is assumed that there is no preference for which region is active and which is passive. If that is not the case, after a region failover occurs, a "failback" can be proactively initiated to return to the original configuration.
Diagram: Summary of architectural components
We assume that:
As a result, one should manually verify the failover from an active region to a standby region. Further, while it is possible to create "one click" scripts to perform failover at each region, it is usually better to have a documented set of steps that an operator will follow, in case any unanticipated events occur (e.g. errors when bringing a database online from a standby state).
To ensure a reasonable recovery time, rapid detection of region failure is critical. This can be done at multiple levels, including the global load balancer if it supports alerting. Availability probes should ensure that all critical portions of the application are running, especially the database. Once an alert is received it should be manually verified, and if there is indeed a regional outage, the failover process should begin.
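The detection logic described above can be sketched as a small classifier over individual availability checks. This is an illustrative sketch, not an OCI API: the component names and the three health states are assumptions, and the key design point is that the database is treated as the critical component whose failure marks a region as a candidate for (manually verified) failover.

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    component: str   # e.g. "frontend", "api", "database" (assumed names)
    healthy: bool

def classify_region(results):
    """Classify a region's health from individual availability checks.

    The database is treated as the critical component: if it is down,
    the region is a suspected outage even if the stateless tier still
    responds. Per the text, this classification triggers an alert that
    an operator must manually verify before beginning failover.
    """
    by_name = {r.component: r.healthy for r in results}
    if not by_name.get("database", False):
        return "suspected-outage"   # candidate for manual failover verification
    if all(by_name.values()):
        return "healthy"
    return "degraded"               # investigate, but do not fail over yet
```

In practice each `CheckResult` would come from an external probe (ideally one running outside the monitored region, so that a region-wide outage does not also take down the monitoring).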
Finally, one should verify that service limits in the DR site are not an issue:
The database must achieve two objectives to support disaster recovery between regions:
We will assume the use of Oracle's "Database as a Service" (DBaaS) offering that is part of OCI, but other technologies can be substituted as necessary.
Oracle Data Guard is integrated into DBaaS and can be configured to asynchronously replicate to a remote database instance that is in standby. The use of asynchronous replication improves availability of the active region by avoiding the impact of network outages. To replicate to another region, the preferred approach is to use peering to connect the two regions' virtual cloud networks (VCNs). Not all combinations of regions support peering; see the OCI documentation for more details. During a failover or failback, the previously active database must be taken offline (if the region outage has not already done so), and the previously passive database must be switched to active mode. Once the previously active region has been brought back online (as a passive region), Data Guard can be configured to replicate to that database to synchronize its state.
If a single database is used at each region (within a single availability domain), then the loss of a database will result in the effective loss of that region. Further, after a failover event, there may be a period where there is only a single database. Since database replication is not meant as a replacement for database backups, this may be acceptable. It may be advisable to use additional replication to another availability domain (in one or both regions), though this will complicate operations.
More details on database replication (and high availability in general) can be found at the following links:
The application logic hosted in OKE is assumed to be completely stateless, meaning that it has no in-memory state (e.g. HTTP sessions) and no persistent state (e.g. persistent volume claims).
As the Kubernetes environment can neither complete requests on its own (assuming all requests flow through to the database) nor initiate requests (a role filled by the global load balancer and ultimately external clients), it is possible to keep the Kubernetes environments in an active mode in both the active and passive regions. Since there is no federation of the two Kubernetes clusters, any scripts that apply to the active region must also be applied to the passive region.
While it is desirable to minimize the amount of resources sitting idle in the passive region, in most cases it will make sense to keep a minimal Kubernetes environment running, and then increase resources in the event of a failover event. While Kubernetes itself can autoscale, the underlying compute instances used by OKE cannot be autoscaled. The process to increase these resources will need to be accounted for in the documented failover process.
The responsibility of the global load balancer is to ensure that requests are routed to the appropriate region. Because this is a simple active-passive model, the load balancer only needs to ensure that requests are routed to the active region, whichever one that might be -- there is no need for regional load balancing or geographic affinity.
Oracle offers a DNS-based global load balancer through Dyn. This relies on frequently updated DNS entries (short TTL) to ensure that the application URL always resolves to the appropriate region.
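Because the failover is DNS-based, clients that cached the old answer just before the record is updated can keep resolving to the failed region for up to one TTL. A rough upper bound on client-visible downtime is therefore the sum of detection, manual verification, database/application failover, and the DNS TTL. The sketch below illustrates this arithmetic; all of the numeric values are assumptions for illustration, not measured OCI figures.

```python
def worst_case_client_downtime(detection_s, verification_s, failover_s, dns_ttl_s):
    """Rough upper bound (in seconds) on how long an external client may
    see failures during a region failover with DNS-based load balancing.

    Clients that cached the old DNS answer just before the record was
    updated keep resolving to the failed region for up to one TTL.
    """
    return detection_s + verification_s + failover_s + dns_ttl_s

# Illustrative (assumed) numbers: 60 s detection, 300 s manual verification,
# 600 s database/application failover, 30 s DNS TTL.
print(worst_case_client_downtime(60, 300, 600, 30))  # 990
```

This is why a short TTL matters: with a long TTL (say 3600 s), the DNS term dominates the entire recovery window regardless of how fast the failover itself is executed.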
It is impossible for any global load balancing solution to ensure that no requests are sent to a passive site, so the application must handle this gracefully. Given that the passive region's database will be in standby mode, application correctness is generally guaranteed. But it may also be necessary to propagate that database error back to the global load balancer (perhaps via an HTTP error code). It is also possible to achieve a similar result by blocking the region's load balancer, though this increases operational complexity.
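The "propagate the database error back to the global load balancer" idea above can be sketched as a small translation layer in the request path. This is a hedged sketch: `ReadOnlyDatabaseError` is a hypothetical stand-in for whatever driver-specific exception a standby (read-only) database actually raises on a write attempt.

```python
class ReadOnlyDatabaseError(Exception):
    """Hypothetical stand-in for the driver-specific error a standby
    (read-only) database raises when a write is attempted."""

def handle_request(do_write):
    """Execute a write and translate a standby-database rejection into an
    HTTP status code the global load balancer's health checks can act on.
    """
    try:
        do_write()
        return 200
    except ReadOnlyDatabaseError:
        # 503 Service Unavailable signals that this region cannot serve
        # traffic right now, without implying a client-side error; the
        # global load balancer can use it to route subsequent requests
        # to the active region.
        return 503
```

Returning 503 (rather than a generic 500) is a deliberate choice: many load balancers treat 503 as "try another backend," which is exactly the desired behavior for a request that leaked to the passive region.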
Kubernetes is most commonly used for microservices, which are expected to be stateless. But it can also host stateful microservices, services that are in the process of being refactored into a microservices architecture, or even traditional applications.
A stateful application may include one or more types of internal state. Each type of state has different considerations in the context of surviving the potential loss of an OCI region.
Most commonly, when the term “stateful application” is used, it refers to conversational state (such as HTTP sessions or Java EE stateful session beans). This state is often held in memory, and may optionally be replicated to another nearby server for fault tolerance. This state is not directly visible to Kubernetes, so it is the responsibility of the application (whether in the application logic or the application framework) to replicate state for region-level disaster recovery. Most applications that rely on in-memory state do not have mechanisms for remote replication. However, in-memory state is almost always either a read-only cache (which can always be recovered from the underlying database) or conversational state (which may not need to be protected from region-level failures). While protecting this type of data from the loss of an availability domain is sometimes desirable, it is rarely necessary to protect it from region-level failures given the severity and infrequency of a region-level loss. If it is necessary, it is an application-level concern and must be managed as part of the application. As an example, if the application uses HTTP sessions within WebLogic, it could use Oracle Coherence*Web to replicate the HTTP session state to the remote region.
It is also important to note that a stateful application may require an ingress controller to make sure that requests are routed to the appropriate pods where the in-memory state is held. To tolerate inter-region failover, not only must the state be replicated, but the routing information as well.
OKE provides a mechanism to allow hosted applications to persist state via persistent volume claims. These persistent volumes are persisted in OCI block storage, which is accessed via iSCSI and is local to a single OCI availability domain (though accessible from any availability domain within the same region). It is an application’s responsibility to protect this state from both availability domain failures and region failures; neither OKE nor OCI provides a means to do this (especially in a continuous manner). As an example, if the application is managing a MySQL database in a persistent volume, it could use MySQL replication to the remote region for DR purposes.
OCI provides services that can be used to manage state, including the upcoming State Service and Redis Service. Both of these provide fault-tolerance to availability domain failure. Replication to another region is not supported. Applications that rely on these services should be able to tolerate loss of the state stored in them should a region be lost.
Conversational state can also be stored in an external database, and can benefit from the same DR strategy used for that database. An application can use this strategy to be semantically stateful while remaining internally stateless (from the perspective of Kubernetes).
When considering an active-active application (versus active-passive), there are usually two primary goals:
A secondary goal is often to avoid having idle resources in the passive region. Given the cost and complexity of developing an active-active application, along with the overhead of managing concurrency between two regions, pursuing this secondary goal is usually counterproductive. We will therefore focus on the two primary goals.
Designing an active-active application is generally very complex, primarily due to the requirement to manage the consistency of data between two geographic regions. The details of architecting an active-active application are beyond the scope of this document, but this section attempts to touch on the primary considerations.
At present, Kubernetes Federation is a work-in-progress, and as a result, it is not yet supported in OKE. Most of the other OCI PaaS offerings are also single-region. As a result, when managing a Kubernetes application that spans multiple regions, each region will be independently scaled and operated. Even when multi-region federation is supported, this may still remain the case depending on specific application requirements.
The scaling of application resources can be driven proactively by scaling the application tier (Kubernetes and Database resources) and allowing the global load balancer to reactively balance traffic based on the new capacity of each region. This will tend toward a steady state since the capacity is statically scaled and the load balancer is the only dynamic component.
However, using auto scaling within the application introduces an additional dynamic component. With two dynamic components, each can begin to operate out of sync with the other, since there is always some delay in gathering metrics. This causes oscillations as the autoscaling continually overcorrects application capacity (both when scaling up and when scaling down). While OKE can autoscale pods, it cannot currently autoscale the underlying compute instances. If the application includes its own infrastructure for auto scaling compute instances, or a future release of OKE provides this ability, it is important to ensure that the interaction between the global load balancer and the local autoscaling does not introduce oscillations. Using a sufficiently large time window for auto scaling decisions will dampen those oscillations. For more details on how oscillation can occur and how it can be mitigated, search the web for the terms "oscillation" and "auto scaling".
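The time-window dampening described above can be sketched as scaling on a moving average of observed load rather than on the latest sample. This is an illustrative sketch only (the window size and per-instance capacity are assumed parameters, not OKE settings): a short traffic spike caused by the load balancer shifting traffic raises the average only partially, so the scaler does not immediately overcorrect.

```python
import math
from collections import deque

class DampedScaler:
    """Compute desired instance count from a moving average of load.

    Averaging over a window dampens oscillation: a transient spike
    (e.g. the global load balancer briefly shifting traffic) moves the
    average only partially, so the scaler avoids the overcorrection an
    instantaneous-load scaler would make.
    """

    def __init__(self, window, per_instance_capacity):
        self.samples = deque(maxlen=window)   # last `window` load samples
        self.cap = per_instance_capacity      # load one instance can absorb

    def desired_instances(self, load):
        self.samples.append(load)
        avg = sum(self.samples) / len(self.samples)
        # Round up so the averaged load never exceeds provisioned capacity,
        # and never scale below one instance.
        return max(1, math.ceil(avg / self.cap))
```

For example, with a window of 3 and a per-instance capacity of 100, a single sample of 400 after steady load of 100 yields an average of 250 and a target of 3 instances, rather than the jump to 4 (and likely subsequent drop) that reacting to the raw sample would produce.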
There are two basic approaches to targeting requests to specific regions: automatic or explicit. In the former case, each region is functionally equivalent to the other, and a global load balancer can be used to target requests. In some cases, it may be desirable for the end user to explicitly state which region they would like to connect to. Even in this case, a pair of global load balancers can be used, with each running in an active-passive model (providing two logical domain names that are able to independently failover between regions).
With regards to conversational state, most active-active applications benefit from several common constraints: there is usually session affinity to a single region, there is usually only a single writer for each session (the user), and some degree of loss is tolerated as most conversations can be replayed in the event of a catastrophic failure. As such, most applications can manage conversational state in the same manner as an active-passive application. If true active-active behavior is desired for conversational state, it can follow the same strategy as persistent state (described elsewhere in this document).
With active-passive configurations, there is only one active database, so the decision of where to place the standby databases (one in every availability domain or just one per region) is less critical. With an active-active application, that decision becomes more critical — will each availability domain have its own active database instance, or will all availability domains within a region share a single database?
The general recommendation is to have a single active-passive database pair within each region based on the following considerations:
Data mastering in an active-active application is typically an application-level concern, and is usually completely opaque to both the application platform (OKE) and the data platform (e.g. the database). However, it is worth discussing briefly since it is common to overlook the complexity of managing shared state in an active-active environment.
Ideally, each (and every) piece of data will be mastered by a single, statically assigned region. This is often the case when data is naturally local to a region, though any distribution function can be used to assign ownership based on entity identity (e.g. an even-odd hash of the primary key). If data is always single-mastered, then an active-active application can be viewed (in terms of data management) as a pair of complementary active-passive applications. If an entire region is lost, failing over that region's data naturally reduces the application to a single-region deployment.
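The "even-odd hash of the primary key" distribution function mentioned above can be sketched in a few lines. The region names are assumptions for illustration; the essential property is that the assignment is a pure, stable function of the entity's identity, so every server in every region computes the same owner without any coordination.

```python
import zlib

# Assumed region identifiers for illustration.
REGIONS = ("us-ashburn-1", "us-phoenix-1")

def master_region_for(entity_id: str) -> str:
    """Statically assign each entity to its mastering region via a stable
    hash of its primary key (the even-odd scheme from the text,
    generalized to any number of regions).

    crc32 is used rather than Python's built-in hash(), which is
    randomized per process and therefore not stable across servers.
    """
    return REGIONS[zlib.crc32(entity_id.encode("utf-8")) % len(REGIONS)]
```

Because ownership is deterministic, both regions agree on who masters `"order-1001"` without ever exchanging a message; after a region loss, the surviving region simply begins accepting writes for the lost region's keys as part of the documented failover.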
It is important to understand why statically mastering data is desirable. Splitting the data in half (across two regions) minimizes the degree to which the two regions must agree on their respective responsibilities (i.e., whether each region is active or offline). Because it is impossible to algorithmically agree on anything during a network failure between regions, minimizing the degree to which agreement is required is critical (simple timeouts are often sufficient). If ownership of data is dynamic, the complexity of agreeing on data ownership quickly becomes impractical.
If each piece of data is not statically mastered in a single location, then a strategy must be determined for how to manage data conflicts between regions. There are two basic approaches, and they can be mixed within a single application. The first, and most common, is to use an arbitrary conflict resolution algorithm such as last-write-wins, or a more intelligent resolver based on knowledge of the problem domain. The second is to use data structures that are insensitive to the order of mutations. Common examples are append-only structures (e.g. call logs and financial ledgers), where it may not strictly matter whether updates A and B arrive in the same order, since both are recorded as they arrive at each region, and higher-level business rules determine how the entries are gathered and evaluated.
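A minimal sketch of the last-write-wins resolver mentioned above is shown below. The `Version` shape and field names are assumptions for illustration. Two details matter in practice: wall-clock timestamps assume loosely synchronized clocks, and ties must be broken deterministically (here by region name) so that both regions resolve the same conflict to the same winner and converge.

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp: float   # wall-clock write time; assumes loosely synced clocks
    region: str        # region where the write originated

def last_write_wins(a: Version, b: Version) -> Version:
    """Resolve a cross-region conflict by keeping the newest write.

    Ties on timestamp are broken by region name so that both regions
    resolve the conflict identically, which is essential for the two
    copies of the data to converge.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.region))
```

A domain-aware resolver would replace the `key` function with business logic (e.g. "a cancellation always beats a later edit"), which is one reason the text recommends resolving conflicts in the application layer, where that context is available.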
There are several possible mechanisms for replicating the database. One of the options that is commonly investigated is direct active-active replication of the database, but in practice this is rarely done since it is usually the wrong level of abstraction (transaction-level and/or row-level, with no visibility into the originating application event). It generally makes more sense to handle active-active replication within the application, since all necessary context can be made available to the conflict resolution logic.
Within the application layer, a messaging layer can be used for replication, since replication between regions is asynchronous in nature. Within Oracle Cloud, the default option is to use Kafka (Oracle Event Hub Cloud Service). Oracle Coherence is also a common choice for performance-sensitive applications (using the federation features of Oracle Coherence Grid Edition and deployed on IaaS).