The Resilient Enterprise – Disaster Recovery Planning

Introduction

Organizations strive to operate their business continuously, with the goal of protecting their revenue, brand, and data assets. Every organization, throughout its existence, is prone to failures originating from within or without. A resilient organization, however, does two things really well. First, it implements measures to recover from these failures, and second, it institutes measures to prevent failures from occurring. In this article we focus on recovering from a class of failures known as “Disasters”, and specifically on the planning aspects of disaster recovery. Disaster Recovery is often referred to in conjunction with Business Continuance as BCDR (Business Continuance and Disaster Recovery). Business Continuance deals with the organizational aspects, such as the people and processes needed to maintain business continuity, while Disaster Recovery deals with restoring business operations with more of a systems focus.

In this article we will focus on how to think about and plan for disaster recovery by examining the following:

  • Disaster Recovery Concepts
  • Common Disaster Recovery Techniques and Patterns
  • Steps to understanding an organization’s Disaster Recovery needs

A thorough understanding of these three areas helps create cost-effective and comprehensive disaster recovery solutions.

Main Article

To understand Disaster Recovery planning, one has to first understand how failures impact an organization. Failures impact an organization in one of two ways.

  • Loss of Revenue – Revenue is lost when a failure in an organization interrupts services, such as order entry. Revenue can also be lost when a failure damages the organization’s brand or reputation, such as when customers lose faith in the organization’s ability to service their needs. An example of such a failure can be seen when stock trading applications are down.
  • Loss of Data – Data that is crucial for operating the business is lost due to a failure. Examples of such data include data that is imperative for an organization to gain a competitive advantage, or data, such as financial data, that is required for reporting to regulatory agencies, such as the SEC. These losses can occur due to natural calamities or due to accidental or malicious acts.

While a resilient enterprise strives to prevent these failures from occurring, it must be noted that not all failures can be cost effectively prevented – natural calamities such as floods or earthquakes fall in this category. The goal of DR planning is to cost effectively contain the impact and minimize the losses. To achieve this goal, DR planning must start with an understanding of the concepts of DR and its parameters. The following pragmatic definition identifies the pivotal elements of DR.

DR Defined

Disaster Recovery (DR) capability allows an enterprise to protect its data reasonably well, despite catastrophic failures in the primary operating environment and become fully or partially operational within a reasonable amount of time.

Thus, based on the above definition, the planning exercise can be broken up into the following areas:

  • DR Concepts – What is DR
  • Business Case for DR – Why do we need DR
  • Assets to protect – What should be covered by DR
  • Implementation Choices – How should DR be implemented
  • Verification – How do we verify what we set out to do

DR Concepts

The key aspects from the above definition are given below.

  • Primary operating environment
  • Catastrophic failures
  • Restore operations fully or partially
  • Protecting data reasonably well
  • Operational within a reasonable amount of time

These key concepts, expressed in qualitative terms, highlight the cornerstones of DR. Once these are understood, they can be further refined to create quantitative operational metrics that drive the design and implementation of the DR solution.

Primary operating environment: The primary operating environment is the location where most of the business is conducted. To design for DR we therefore need two operating environments, which typically means two data centers separated by a distance. There are real-world implementations where the primary and secondary operating environments exist within the same campus; however, these are few and far between and tend to cater to a small subset of failure scenarios or to specific ones. Most organizations, once they have decided to institute DR protection, take the opportunity to build resilience to a wide array of failures, and thus make better use of the baseline cost of the DR infrastructure.

Catastrophic failures: A catastrophic failure is one in which the primary operating environment either cannot be restored at all, such as during a flood, or cannot be restored within a reasonable amount of time. Thus a catastrophic failure renders the primary environment inoperable and, perhaps more importantly, by definition and by design does not affect the secondary operating environment.

Restore operations fully or partially:  Identifying the operations to be restored to the secondary environment must be done carefully, since a full restore of the primary operating environment may neither be necessary nor cost effective. This has to be consistent with the impact sustained by the organization. It is reasonable and perhaps even fiscally responsible, to choose to operate partially or under suboptimal conditions, following a catastrophic failure. This is an important aspect of DR planning.

Protecting data reasonably well: Just as there are considerations in restoring operations in the secondary environment, there are considerations in protecting data with the intent of restoring it to the secondary environment. These considerations have to be consistent with the importance of that data to the organization and the impact the failure has on that data. For example, regulatory agencies may require that you keep certain kinds of data for extended periods of time and be able to reproduce them upon request. Thus, data retention guidelines and operational requirements dictate how well the data must be protected or, more importantly, how much of it could be lost.

Operational within a reasonable amount of time: For an organization to operate meaningfully, it must be able to restore operations following a failure within tolerable limits. These tolerances must be consistent with the business domain, the impact to revenue or data, and any contractual or regulatory obligations. For example, a company might require that the order entry system not falter at all, or be up and running within a matter of seconds, while choosing not to restore the data warehouse system that predicts order placement over a 3-month period.

DR Metrics

Of the above key aspects, two tie to the organization’s operating requirements or Service Level Agreements (SLAs). An organization has such SLAs internally, with its customers and partners, with financial institutions, and with regulatory agencies. For DR the two key SLAs are:

  • Recovery Point Objective (RPO)
  • Recovery Time Objective (RTO)

Recovery Point Objective (RPO): RPO defines how much of the data could be lost without impacting the application or the enterprise’s data retention requirements. This is usually measured in seconds or minutes of data that could be lost.

Recovery Time Objective (RTO): RTO defines how long it takes before the application is up and running again. In general this is measured from the time the decision is made to institute DR for the application, since there may be some deliberation time after the failure that could vary by organization. A stricter interpretation of this metric measures from the time the failure is first detected or reported. Since the interpretation varies by organization, it has to be clearly articulated and consistently applied.
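
As a rough illustration of how these two metrics could be measured after an event, the sketch below derives them from four timestamps. The function names and the timeline are hypothetical, and the sketch assumes the failure is detected at the moment it occurs. RTO is computed from both starting points mentioned above so that the chosen interpretation is explicit.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_write: datetime, failure_time: datetime) -> timedelta:
    """Window of lost data: writes made after the last replicated write are gone."""
    return failure_time - last_replicated_write


def achieved_rto(clock_start: datetime, service_restored: datetime) -> timedelta:
    """Elapsed time from the agreed starting point until service is restored."""
    return service_restored - clock_start


# Hypothetical timeline of a single event (assumption: detection == failure time).
last_replicated_write = datetime(2024, 1, 15, 8, 59, 30)
failure_detected = datetime(2024, 1, 15, 9, 0, 0)
dr_declared = datetime(2024, 1, 15, 9, 20, 0)
service_restored = datetime(2024, 1, 15, 10, 5, 0)

rpo = achieved_rpo(last_replicated_write, failure_detected)
rto_from_decision = achieved_rto(dr_declared, service_restored)
rto_from_detection = achieved_rto(failure_detected, service_restored)

print("RPO achieved:", rpo)
print("RTO from DR decision:", rto_from_decision)
print("RTO from detection:", rto_from_detection)
```

In this made-up timeline the two RTO figures differ by the 20 minutes of deliberation, which is why the starting point needs to be agreed on in advance.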

What DR is not

It is just as important to understand what DR is not. DR is generally not the equivalent of a High Availability (HA) architecture. DR is traditionally concerned with protecting data, or more accurately, with restoring operations in a secondary operating environment following a catastrophic failure. In the spectrum of failures and recoveries, HA and DR offer different levels of protection against service interruptions or data loss. In the same vein, DR is not just backup and recovery (B&R). HA and B&R are generally associated with the primary operating environment, whereas DR depends on a secondary one. Implied in this arrangement is the fact that the secondary operating environment is isolated, and insulated, from the effects of failures affecting the primary environment; the secondary environment possesses a level of autonomy, i.e., it can become operational without any need for the primary operating environment.

In the past, due to technological limitations, high availability and backup and recovery remained distinct from DR: HA could not be performed across large distances, and the tape-based systems used for backups were generally slow and not conducive to restoring operations following a catastrophic failure. Today’s technologies provide another set of options, such as HA over extended distances using Geo Clusters. However, the emphasis is still on the fact that, following a catastrophic failure, the secondary environment must be both insulated from the effects of that failure and operable without access to the primary environment. In general, HA, which is based on a shared data architecture, cannot adequately cater to this need, while DR is specifically designed for it.

 

Business Case for DR

With this conceptual understanding of DR, the next step is to understand the failures that would impact your organization and the severity of their impact. Understanding the impact of the failures is the first step in determining whether a DR strategy is even required. It may not be. Implementing DR is never cheap. It requires an organization to commit resources on a sustained basis to implement, continually test, and enhance the solution. Therefore, many organizations choose to opt for a simpler Backup & Recovery approach. In making the business case for DR, the key is to understand the impact of not protecting the applications and data that the enterprise relies on or produces, and to weigh that against the cost of protecting those applications and data.
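
One way to frame this weighing is as simple arithmetic. The sketch below is only an illustrative model, not a prescribed method: it assumes the impact of a failure can be reduced to lost revenue per hour of outage plus a one-time cost of lost data, and that the likelihood of the failure can be expressed as a yearly probability. Every name and figure in it is hypothetical.

```python
def expected_annual_loss(annual_probability, outage_hours,
                         revenue_per_hour, data_loss_cost):
    """Expected yearly loss from one class of catastrophic failure."""
    return annual_probability * (outage_hours * revenue_per_hour + data_loss_cost)


def dr_is_justified(loss_without_dr, loss_with_dr, annual_dr_cost):
    """DR pays for itself when the loss it avoids exceeds what it costs per year."""
    return (loss_without_dr - loss_with_dr) > annual_dr_cost


# Hypothetical figures for a single application.
# Without DR: a 72-hour outage and a large one-time data loss.
loss_without_dr = expected_annual_loss(0.02, 72, 50_000, 2_000_000)
# With DR: a 4-hour outage and no data loss.
loss_with_dr = expected_annual_loss(0.02, 4, 50_000, 0)

justified_at_90k = dr_is_justified(loss_without_dr, loss_with_dr, 90_000)
justified_at_150k = dr_is_justified(loss_without_dr, loss_with_dr, 150_000)

print("Avoided loss per year:", loss_without_dr - loss_with_dr)
print("Justified at 90k per year:", justified_at_90k)
print("Justified at 150k per year:", justified_at_150k)
```

With these invented numbers the same application justifies DR at one price point and not at the other, which is the situation in which a simpler Backup & Recovery approach may be the better choice.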

Assets to protect

The enterprise has assets that allow it to conduct business operations while satisfying its various covenants and its contractual and regulatory obligations. Throughout their existence, these assets, the systems and data, come into contact with people and the elements. In doing so they are prone to failures. Classifying these assets based on the failures they could be subject to, and on the severity of the impact to the business, is important to DR planning. The key questions to ask of these assets are:

  • Does not having this application or data affect your business or your ability to fulfill any contractual obligations?
  • How seriously would an extended disruption of the application or data affect your business?
  • What kind of failures could this application/data be subjected to?

Answering these key questions will allow you to group your systems, applications, and data assets and apply different tiers of protection that are commensurate with the failures they can sustain and the impact the disruption would have. Examples of failures are:

  • Natural calamities such as floods, earthquakes, etc
  • Regional power failures
  • Files deleted intentionally or by accident
  • An update with incorrect criteria that changes data affecting financial statements
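
The grouping described above can be sketched as a small classification routine. This is a minimal illustration under stated assumptions: the `Asset` fields mirror the three key questions, while the tier names (`dr-high`, `dr-standard`, `backup-only`), the failure class labels, the severity scale, and the rule itself are all hypothetical and would differ for each organization.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    affects_business: bool       # question 1: does losing it affect the business?
    disruption_severity: int     # question 2: from 1 (minor) to 3 (severe)
    failure_classes: frozenset   # question 3: failures it could be subjected to


def protection_tier(asset):
    """Hypothetical rule mapping the three answers to a protection tier."""
    if not asset.affects_business:
        return "backup-only"
    if asset.disruption_severity >= 3 and "site-loss" in asset.failure_classes:
        return "dr-high"
    return "dr-standard"


def group_by_tier(assets):
    """Group asset names under the tier each asset is assigned to."""
    tiers = defaultdict(list)
    for asset in assets:
        tiers[protection_tier(asset)].append(asset.name)
    return dict(tiers)


assets = [
    Asset("order entry", True, 3, frozenset({"site-loss", "accidental-deletion"})),
    Asset("financial reporting data", True, 2, frozenset({"site-loss", "incorrect-update"})),
    Asset("internal wiki", False, 1, frozenset({"accidental-deletion"})),
]
tiers = group_by_tier(assets)
for tier, names in tiers.items():
    print(tier, names)
```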

Implementation Choices

Identifying the applications to protect and the method by which to protect them is an iterative process that seeks to balance the need for protection against the cost of doing so, while also taking into consideration the technology available to provide such protection. Examples of implementation choices include:

  • Database protected with Oracle Data Guard and applications protected via Storage Replication
  • Replicated Storage with Just in Time Server Provisioning
  • Active Passive Clusters
  • Geo Active-Active Clusters
  • VM replication

Depending on an organization’s core competency, cost considerations, and technological challenges, this could be approached in several ways:

  • One size fits all – In a one size fits all approach, applications and data in the enterprise are protected to the same degree, i.e., with the same RPO/RTO goals. In most cases this is also done using one solution that applies equally to all applications and data that belong to the enterprise. While a single solution is simple to operate and maintain, it may be technologically challenging for the organization to implement, and in most cases it is not cost effective.
  • Menu of RPO/RTO targets with corresponding implementations – In a menu based approach, a specific methodology is chosen for each RPO/RTO target, and applications and data are mapped to these targets based on their classification or importance to the organization. For example, the order entry system and the organization’s home page may each be protected with an active-active system. On the other hand, order fulfillment may be protected with an active-passive system. Critical data may be protected with synchronous replication, whereas non-critical data may be replicated only every 15 minutes.
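
The menu based approach can be sketched as a lookup that picks the least expensive entry satisfying an application's targets. The menu entries, their RPO/RTO figures, and the relative cost units below are hypothetical placeholders; only the 15 minute replication interval comes from the example above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class MenuEntry:
    name: str
    rpo: timedelta        # worst-case data loss this option is assumed to deliver
    rto: timedelta        # worst-case recovery time this option is assumed to deliver
    relative_cost: int    # arbitrary cost units, for comparison only


# Hypothetical menu; real figures depend on the technology and the organization.
MENU = [
    MenuEntry("active-active, synchronous replication",
              timedelta(0), timedelta(seconds=30), 10),
    MenuEntry("active-passive, synchronous replication",
              timedelta(0), timedelta(minutes=30), 6),
    MenuEntry("replication every 15 minutes, just in time servers",
              timedelta(minutes=15), timedelta(hours=8), 3),
]


def cheapest_fit(required_rpo, required_rto):
    """Return the lowest-cost menu entry meeting both targets, or None."""
    candidates = [e for e in MENU
                  if e.rpo <= required_rpo and e.rto <= required_rto]
    return min(candidates, key=lambda e: e.relative_cost, default=None)


order_fulfillment = cheapest_fit(timedelta(minutes=1), timedelta(hours=1))
reporting = cheapest_fit(timedelta(hours=1), timedelta(hours=24))
print(order_fulfillment.name)
print(reporting.name)
```

Returning `None` when nothing fits makes the gap visible: either the target is relaxed or a new entry is added to the menu.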

Application performance and availability considerations are another factor that dictates which choices are reasonable. For example, synchronous replication to a remote site every time the application issues a write to disk would provide excellent data protection. However, it may significantly impact the performance of the application, and should the connection between the primary and secondary sites be impaired or severed, the application would perform sub-optimally or be rendered inoperable. That said, a higher degree of protection may be warranted given the business domain, for example bank transactions and stock trades.
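
A back-of-the-envelope estimate shows why distance matters for synchronous replication. The sketch below assumes that each write waits for one network round trip to the secondary site and that signals travel through optical fiber at roughly 200 km per millisecond. Real links add equipment and routing delays, so the results are optimistic lower bounds. The function names and input figures are hypothetical.

```python
# Approximate propagation speed of light in optical fiber (about two thirds of c).
FIBER_KM_PER_MS = 200.0


def sync_write_latency_ms(local_write_ms, distance_km):
    """Local write time plus one round trip to the secondary site."""
    round_trip_ms = 2 * distance_km / FIBER_KM_PER_MS
    return local_write_ms + round_trip_ms


def max_serial_writes_per_second(write_latency_ms):
    """Upper bound on writes per second when each write must wait for the previous one."""
    return 1000.0 / write_latency_ms


same_campus = sync_write_latency_ms(local_write_ms=0.5, distance_km=1)
remote_site = sync_write_latency_ms(local_write_ms=0.5, distance_km=500)

print("Same campus latency (ms):", same_campus)
print("Remote site latency (ms):", remote_site)
print("Serial writes/s at remote site:", max_serial_writes_per_second(remote_site))
```

Under these assumptions, moving the secondary site from the same campus to 500 km away adds about 5 ms to every write, which is the kind of cost that has to be weighed against the stronger data protection.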

In the end, it is up to the organization to determine, based on its data and revenue profiles, how granular to make the replication mechanisms, bearing in mind the operational complexity and cost that finer granularity brings.

Verification

Finally, any planning and execution is only as good as the ability to verify the implementation. Testing the implementation serves two very important purposes as described below:

  • Verification demonstrates that the system can operate consistently with the assumptions made and meets the agreed upon SLAs (RTO and RPO).
  • Verification provides a mechanism, in the form of a dry run, with which to ensure the organization develops competency in instituting DR.
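
The first purpose lends itself to a simple check after each dry run. The sketch below compares measured results against the agreed targets and also flags applications that were not exercised at all. The application names, targets, and measurements are hypothetical.

```python
from datetime import timedelta


def sla_gaps(targets, drill_results):
    """List (application, reason) pairs where a drill fell short of the SLA.

    Both arguments map an application name to an (RPO, RTO) pair of timedeltas.
    """
    gaps = []
    for app, (rpo_target, rto_target) in targets.items():
        result = drill_results.get(app)
        if result is None:
            gaps.append((app, "not exercised in this drill"))
            continue
        rpo_measured, rto_measured = result
        if rpo_measured > rpo_target:
            gaps.append((app, "RPO missed"))
        if rto_measured > rto_target:
            gaps.append((app, "RTO missed"))
    return gaps


# Hypothetical targets and the results of one dry run.
targets = {
    "order entry": (timedelta(0), timedelta(seconds=30)),
    "order fulfillment": (timedelta(minutes=15), timedelta(hours=4)),
    "data warehouse": (timedelta(hours=24), timedelta(days=3)),
}
drill_results = {
    "order entry": (timedelta(0), timedelta(seconds=20)),
    "order fulfillment": (timedelta(minutes=10), timedelta(hours=6)),
}

gaps = sla_gaps(targets, drill_results)
for app, reason in gaps:
    print(app, "->", reason)
```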

DR is not just a technological problem. It is in fact mostly an organizational problem, one that requires an understanding of the policies, procedures, business model, business impact, classification, and taxonomy. The technical implementation is only a small part of it. Therefore, periodically verifying the implementation brings the multitude of groups within the organization together and exposes any gaps that may need to be addressed.

Conclusion

Implementing DR for an organization is a costly proposition that should not be undertaken without careful consideration and proper planning. Understanding the various dimensions of DR and its concepts allows an organization to discern why, or even whether, it needs DR. It also highlights what in the enterprise is worth protecting and how best to offer that protection in a cost-effective manner. In many respects an exercise in DR not only exposes the intra- and inter-organizational boundaries, it also provides further insight into valuable assets of the organization that might not otherwise have been discovered. Understanding these assets and the value they bring to the organization is the first step towards building a resilient enterprise.