Many companies underestimate the need to have disaster recovery strategies in place for their cloud-based applications. However, those who understand the issues sometimes find it difficult to put effective plans in place.
Unlike routine IT tasks, building these plans requires close collaboration and commitment from multiple parties. Many IT services now rely on multiple application components, some running in the cloud and others in data centers. Building an effective disaster recovery plan therefore requires a structured, cross-functional approach that focuses on the resilience of IT services as a whole, not just individual workloads.
Answer difficult questions
Disaster recovery planning starts with companies questioning their current approach, even when that raises uncomfortable questions. The exercise is valuable precisely because exposing gaps lets companies redirect their efforts and engage stakeholders who have overlooked risks.
When a workload fails, the service it supports is disrupted, impacting user productivity and damaging customer trust. Restoring the service requires coordination and, above all, speed, to limit the extent of the damage. Remember, moreover, that it is the responsibility of companies, not cloud service providers, to ensure that disaster recovery procedures are in place.
Develop a disaster recovery plan
Effective disaster recovery planning begins with assessing the impact of downtime on the business. This cross-functional exercise identifies all the IT services used by the company, determines the operational and financial impact a service interruption could have and, consequently, the disaster recovery requirements for each service. Many IT organizations maintain a service catalog and a configuration management database (CMDB), which simplify the task of compiling a comprehensive list of IT services. In the absence of such a catalog, the inventory must be established through a discovery process.
To determine the level of disaster recovery required, it is useful to consider two essential parameters: the recovery time objective (RTO) and the recovery point objective (RPO). The RTO represents the amount of downtime (usually measured in hours, days, or weeks) that the business can tolerate for a given IT service. The RPO, on the other hand, is the amount of data loss (usually ranging from near zero to a few hours) that the company can accept for each of these same services.
In practice, there is often a trade-off between these two goals: for example, IT services may recover quickly, but suffer greater data loss, and vice versa. Logically, demanding RTOs and RPOs generally require the implementation of more expensive technological solutions.
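The link between RTO/RPO targets and the cost of the supporting technology can be made concrete with a simple tiering rule. The sketch below is illustrative only: the service names, thresholds, and tier descriptions are assumptions for the example, since real tiers must come from the business impact assessment.

```python
from dataclasses import dataclass


@dataclass
class ServiceObjectives:
    """Recovery objectives for one IT service (illustrative)."""
    name: str
    rto_hours: float  # tolerable downtime
    rpo_hours: float  # tolerable data loss, expressed as time


def dr_tier(svc: ServiceObjectives) -> str:
    """Map a service's RTO/RPO to a coarse DR tier.

    Tighter objectives imply more expensive replication technology;
    the thresholds here are hypothetical, for illustration.
    """
    if svc.rto_hours <= 1 and svc.rpo_hours <= 0.25:
        return "tier-1: continuous replication, automated failover"
    if svc.rto_hours <= 24 and svc.rpo_hours <= 4:
        return "tier-2: frequent snapshots, warm standby"
    return "tier-3: daily backups, rebuild on demand"


payments = ServiceObjectives("payments", rto_hours=0.5, rpo_hours=0.1)
reporting = ServiceObjectives("reporting", rto_hours=72, rpo_hours=24)
print(dr_tier(payments))   # tier-1: continuous replication, automated failover
print(dr_tier(reporting))  # tier-3: daily backups, rebuild on demand
```

Classifying every cataloged service this way makes the cost trade-off explicit: only services that genuinely need tier-1 protection pay for it.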
Dependency mapping and technology assessment
After determining the RTOs, RPOs and the impact that a shutdown can have on the various IT services, the next step is to understand all the IT application components on which they depend. Creating a dependency map for each IT service will help ensure that the appropriate recovery measures are in place for all necessary application components, whether they are running in data centers or in the cloud.
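A dependency map can be represented as a simple adjacency structure and traversed to list every component a service needs before recovery can succeed. The sketch below uses hypothetical service and component names; in practice the map would be populated from the CMDB or a discovery tool.

```python
# Minimal dependency map: each service or component maps to the
# components it depends on directly. Names are illustrative.
deps = {
    "web-store": ["app-api", "cdn"],
    "app-api": ["orders-db", "auth-service"],
    "auth-service": ["users-db"],
}


def all_components(service: str) -> set:
    """Return every component the service transitively depends on,
    using an iterative depth-first traversal of the map."""
    seen = set()
    stack = [service]
    while stack:
        node = stack.pop()
        for dep in deps.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


print(sorted(all_components("web-store")))
# ['app-api', 'auth-service', 'cdn', 'orders-db', 'users-db']
```

The traversal makes indirect dependencies visible: recovering "web-store" alone is useless if "users-db", two hops away, has no recovery measure of its own.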
Next, companies need to assess their data protection and resiliency capabilities for each application, including whether those capabilities can actually meet the defined RTOs and RPOs. This assessment should be done holistically, considering the impact of the most serious failure. For example, the right technology may already be in place to recover a single application within the required recovery time, but can that technology recover tens, hundreds, or even thousands of applications in parallel? Can companies use the same technical solutions in data centers as in the cloud? The need for multiple tools will undeniably complicate recovery procedures. After assessing current technology capabilities, businesses can then identify additional technical solutions to fill the gaps.
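The scale question above lends itself to a back-of-the-envelope capacity check: if restore jobs run in a limited number of parallel streams, total recovery time grows in waves. The numbers below (restore duration, stream count, application count) are made up for illustration.

```python
import math


def mass_recovery_hours(app_count: int, restore_hours_per_app: float,
                        parallel_streams: int) -> float:
    """Rough estimate of wall-clock time to restore many applications
    when only a fixed number of restore streams run concurrently.

    This is a crude capacity check, not a guarantee: it assumes
    uniform restore times and ignores dependencies between apps.
    """
    waves = math.ceil(app_count / parallel_streams)
    return waves * restore_hours_per_app


# One application restores in 2 h, well within a 24 h RTO...
print(mass_recovery_hours(1, 2, 20))    # 2.0
# ...but 500 applications through 20 streams need 25 waves: 50 h.
print(mass_recovery_hours(500, 2, 20))  # 50.0
```

Even this simplistic model shows why a tool that passes a single-application test can still miss the RTO in a site-wide disaster.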
Document and test recovery steps
While deploying the right recovery tools is essential, technology alone is not enough to guarantee disaster recovery. A critical step is to create a hierarchical set of recovery plans that can be used to guide the business through the recovery process. Higher-level plans will document how recovery activities are coordinated, while lower-level plans will include step-by-step procedures to ensure recovery of each IT service. Developing and maintaining these plans is a significant investment, but essential to ensuring effective recovery from a major incident.
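The hierarchy of plans can be mirrored in how runbooks are stored: a top-level coordination plan plus per-service procedures that are combined when a service must be recovered. All plan contents and service names below are invented for the example.

```python
# Hierarchical recovery plans (illustrative content):
# the top level coordinates, lower levels hold per-service steps.
recovery_plans = {
    "coordination": [
        "declare disaster and assemble recovery team",
        "communicate status to stakeholders",
        "invoke service-level plans in priority order",
    ],
    "services": {
        "payments": [
            "fail over database to standby",
            "redirect traffic to recovery site",
            "validate in-flight transactions",
        ],
        "reporting": [
            "restore from last nightly backup",
            "rerun ETL jobs",
        ],
    },
}


def runbook(service: str) -> list:
    """Assemble the full ordered procedure for one service:
    coordination steps first, then the service's own steps."""
    return recovery_plans["coordination"] + recovery_plans["services"][service]


for step in runbook("payments"):
    print("-", step)
```

Keeping the levels separate means coordination steps are maintained once, while each service team owns only its own procedure.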
To ensure that the plans will work in practice, they should be tested regularly: at least once a year, and more frequently for critical applications. Tests can themselves pose an incident risk if they involve live data. Even so, testing is an essential part of disaster recovery planning that should not be skipped.
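A testing cadence like the one described can be tracked with a small schedule helper. The cadences and criticality labels below are assumptions chosen for the example; only the annual minimum comes from the text.

```python
from datetime import date, timedelta

# Illustrative cadences: critical apps are exercised more often
# than the annual minimum; 365 days matches the yearly baseline.
CADENCE_DAYS = {"critical": 90, "important": 180, "standard": 365}


def next_test_due(last_test: date, criticality: str) -> date:
    """Return the date by which the next DR test should run."""
    return last_test + timedelta(days=CADENCE_DAYS[criticality])


print(next_test_due(date(2024, 1, 15), "critical"))  # 2024-04-14
```

Wiring such due dates into a ticketing system helps ensure tests actually happen instead of slipping year after year.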
The public cloud offers enterprises a highly scalable and resilient platform for hosting workloads. Used correctly, it can strengthen the resilience of IT services. However, adopting the public cloud does not absolve the enterprise of its responsibility for service availability and disaster recovery. Although the cloud offers many building blocks to support a recovery strategy, companies should use them in combination with other technologies and procedures to build a cohesive plan.
Achieving multicloud resiliency requires a holistic approach to data assets, some elements of which overlap with the disaster recovery process. Disaster recovery in a multicloud environment raises further questions: where is the data stored, what dependencies exist, and how can data and workloads be recovered if an adverse situation arises with a cloud provider?
The goal of disaster recovery planning and testing is to ensure that recovery is possible in accordance with the RPO and RTO objectives. In particular, this reassures the company's customers, both internal and external, that they will not be affected in the event of downtime.