SafeHaven Disaster Recovery as a Service 4.0

Updated: March 13, 2018

The following service description applies to SafeHaven version 4.0. The service description for SafeHaven version 5.0 can be found in the Cloud Application Manager Service Guide.

SafeHaven provides a suite of IT disaster recovery and inter-site migration services. SafeHaven is deployed by CenturyLink Cloud for its customers to deliver DRaaS. SafeHaven system components follow a structural hierarchy in the following order:

  • Cluster Layer
  • Data Center Layer
  • SafeHaven Replication Node (SRN) Layer
  • Protection Group
  • Protected VM/Disk

Cluster Layer
Each SafeHaven cluster can service up to 64 data centers. The data centers may be any combination of dedicated data centers and Cloud virtual data centers. Each data center within the cluster can include both active Protection Groups and replica instances of remote Protection Groups. Each subscriber organization is provisioned with a distinct SafeHaven cluster.
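As an illustration only, the hierarchy and the 64 data center limit described above can be modeled with a few nested records. This is a minimal sketch and not part of SafeHaven itself; every class, field, and function name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Limit stated in this service description.
MAX_DATA_CENTERS_PER_CLUSTER = 64


@dataclass
class ProtectionGroup:
    name: str
    # Names of the protected VMs and disks in this group (hypothetical field).
    protected_items: List[str] = field(default_factory=list)


@dataclass
class ReplicationNode:
    """Stands in for a SafeHaven Replication Node (SRN)."""
    name: str
    protection_groups: List[ProtectionGroup] = field(default_factory=list)


@dataclass
class DataCenter:
    name: str
    srns: List[ReplicationNode] = field(default_factory=list)


@dataclass
class Cluster:
    name: str
    data_centers: List[DataCenter] = field(default_factory=list)

    def add_data_center(self, data_center: DataCenter) -> None:
        # Refuse to grow past the documented per-cluster limit.
        if len(self.data_centers) >= MAX_DATA_CENTERS_PER_CLUSTER:
            raise ValueError("a cluster can service at most 64 data centers")
        self.data_centers.append(data_center)
```

Each level owns a list of the level below it, which mirrors the order Cluster, Data Center, SRN, Protection Group, Protected VM/Disk.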

SafeHaven Console
The SafeHaven console is a rich Java client application that should be installed on every desktop or laptop computer used for SafeHaven administration. All communication between the SafeHaven console and the Central Management Server (CMS) is encrypted over SSL. Administrators can perform point-and-click recovery operations on individual virtual machines, groups of servers and data drives, or entire data centers. Recovery operations include:

  • Lossless migration
  • Failover
  • Failback
  • Rollback
  • Automatic detection and reporting of data center outages

Central Management Server (CMS)
Each SafeHaven cluster includes a single active CMS. The CMS is a SafeHaven virtual appliance that:

  • Receives commands from the SafeHaven console and relays them to the appropriate SRN in the appropriate data center
  • Monitors heartbeats from the SRNs
  • Receives state information from SRNs and relays it to the SafeHaven console
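The heartbeat monitoring role listed above can be pictured with a small sketch. This service description does not state how the CMS tracks heartbeats, so the timeout value, the class name, and the method names below are all assumptions chosen for illustration.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Hypothetical tracker of the last heartbeat time seen from each SRN."""

    def __init__(self, timeout_seconds: float) -> None:
        # The timeout is an assumed parameter; no value is documented here.
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record(self, srn_name: str, now: float) -> None:
        # Remember when this SRN last reported in.
        self._last_seen[srn_name] = now

    def silent_srns(self, now: float) -> List[str]:
        # An SRN counts as silent once its last heartbeat is older than the timeout.
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

For example, with a 30 second timeout, an SRN last heard from 60 seconds ago would be reported as silent, while one heard from 10 seconds ago would not.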

Data Center Layer
The data center layer includes the set of data centers provisioned within the SafeHaven cluster.

SafeHaven classifies data centers based on the API used for orchestration of recovery operations and recognizes three data center types:

  • Dedicated data center: Disaster Recovery (“DR”) orchestration is through VMware vSphere 4.0 (or later release) via API calls to VMware vCenter Server. This data center type can also be used to provide DR protection for physical servers, stand-alone data drives, and servers virtualized with non-VMware hypervisors. However, in these cases, no automated DR orchestration is available (i.e., manual shut-down and power-on is required).
  • vCloud Director cloud VDC: DR orchestration is through the API for VMware vCloud Director (release 1.5 or later).
  • CenturyLink Cloud virtual data center: DR orchestration is through the CenturyLink Cloud API.
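The classification above can be summarized as a lookup from data center type to orchestration path. The sketch below is illustrative only; the enum, the function, and the `vmware_virtualized` flag are hypothetical names and not a SafeHaven interface.

```python
from enum import Enum


class DataCenterType(Enum):
    # Each value names the API used for DR orchestration, as listed above.
    DEDICATED = "VMware vCenter Server API"
    VCLOUD_DIRECTOR = "VMware vCloud Director API"
    CENTURYLINK_CLOUD = "CenturyLink Cloud API"


def orchestration_mode(dc_type: DataCenterType, vmware_virtualized: bool = True) -> str:
    # In a dedicated data center, physical servers, stand-alone data drives,
    # and non-VMware hypervisors get no automated orchestration.
    if dc_type is DataCenterType.DEDICATED and not vmware_virtualized:
        return "manual shut-down and power-on"
    return f"automated via {dc_type.value}"
```

Calling `orchestration_mode(DataCenterType.DEDICATED, vmware_virtualized=False)` returns the manual path, which reflects the exception noted for the dedicated data center type.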

SRN Layer
This layer includes all SRNs provisioned within the SafeHaven cluster. Each SRN is associated with a parent data center as shown in the SafeHaven hierarchy. A given data center may include an arbitrary number of SRNs. The SRN virtual appliance performs the following functions:

  • Provision and delete Protection Groups
  • Generate and maintain a replica image of each Protection Group in a remote data center
  • Generate and maintain a scrolling log of up to 2048 checkpoints for each Protection Group
  • Relay SafeHaven commands from the CMS to the cloud management layer and/or IT infrastructure control plane
  • Transmit a heartbeat to other SRNs and the CMS
  • Relay state information to the CMS

SRNs replicate at the LUN level, transmitting updated blocks for each Protection Group to a peered SRN in a remote data center. Although each active Protection Group has a replica in only one other site, an SRN may support a set of Protection Groups that each have replica instances in distinct remote data centers.
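The scrolling checkpoint log listed among the SRN functions behaves like a bounded history: once it is full, adding a new checkpoint drops the oldest one. The following is a minimal sketch of that behavior using a bounded deque. It is not SafeHaven's implementation, and the class and method names are hypothetical.

```python
from collections import deque

# Upper bound stated in this service description.
MAX_CHECKPOINTS = 2048


class CheckpointLog:
    """Hypothetical scrolling log of checkpoint identifiers for one Protection Group."""

    def __init__(self, capacity: int = MAX_CHECKPOINTS) -> None:
        if not 1 <= capacity <= MAX_CHECKPOINTS:
            raise ValueError("capacity must be between 1 and 2048")
        # A deque with maxlen discards the oldest entry when a new one is appended.
        self._entries = deque(maxlen=capacity)

    def add(self, checkpoint_id: int) -> None:
        self._entries.append(checkpoint_id)

    def oldest(self) -> int:
        return self._entries[0]

    def newest(self) -> int:
        return self._entries[-1]

    def __len__(self) -> int:
        return len(self._entries)
```

The `capacity` argument is included because the number of checkpoints actually retained depends on the storage allocated, so a real history may hold fewer than 2048 entries.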

Additional Storage Requirements:

  • The production SRN must be provided with a storage pool of sufficient size to mirror the protected VMs.
  • The recovery SRN must be provided with a storage pool of sufficient size to host the protected VM disks inside the recovery site.
  • Both SRNs must also have enough storage for Protection Group checkpoints. The amount of storage allocated determines how many checkpoints will be retained in the checkpoint history. While each user’s needs will be different, CenturyLink requires that the storage for the checkpoint history of a given Protection Group be approximately 30% of the size of the aggregate data image for all protected VMs and hard drives within the group.
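The sizing guidance above reduces to simple arithmetic: each SRN needs room for the aggregate data image plus roughly 30% of that image for the checkpoint history. The helper below is a sketch under that reading; the function name, the whole-gigabyte units, and the round-up rule are assumptions made for illustration.

```python
from typing import Dict, List

# Checkpoint history as a percentage of the aggregate data image.
CHECKPOINT_PERCENT = 30


def srn_pool_size_gb(protected_disk_sizes_gb: List[int],
                     checkpoint_percent: int = CHECKPOINT_PERCENT) -> Dict[str, int]:
    # Aggregate data image for all protected VMs and hard drives in the group.
    data_image = sum(protected_disk_sizes_gb)
    # Integer ceiling division, so the reserve is never rounded down.
    checkpoint_reserve = (data_image * checkpoint_percent + 99) // 100
    return {
        "data_image_gb": data_image,
        "checkpoint_reserve_gb": checkpoint_reserve,
        "total_gb": data_image + checkpoint_reserve,
    }
```

For a group protecting disks of 100 GB, 250 GB, and 50 GB, the data image is 400 GB, the checkpoint reserve is 120 GB, and each SRN's pool would need about 520 GB.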

Protection Groups
A Protection Group is a set of servers and hard disks grouped by SafeHaven that fail over, fail back, and roll back together to the same instant in time and are shut down and brought up according to a prescribed recovery plan. Each Protection Group corresponds to a distinct set of servers and hard disks replicated to a remote site by a parent SRN. When protecting a multi-tiered application, administrators should provision a Protection Group that includes all servers and hard disks that participate in that application. SafeHaven is set up to allow the applicable systems to recover via a remote data center with mutually consistent data images as they were at specific instants in time.

Protected VM/Disk
Write traffic for each protected VM and hard disk is locally and synchronously mirrored within the production data center, so that it is written both to the primary data store and to a local SRN. On Windows Server 2003 and later, the SafeHaven local replication agent performs the mirroring; on Linux, Logical Volume Manager 2 is used.
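Synchronous mirroring means a write is treated as complete only after both copies have been updated. The toy model below illustrates that ordering with two in-memory block lists. It is not how the SafeHaven agent or Logical Volume Manager 2 is implemented, and all names are hypothetical.

```python
from typing import List


class MirroredVolume:
    """Toy model of a disk whose writes go to a primary store and a local SRN copy."""

    def __init__(self, block_count: int) -> None:
        self.primary: List[bytes] = [b""] * block_count
        self.srn_copy: List[bytes] = [b""] * block_count

    def write(self, index: int, data: bytes) -> bool:
        # Update both copies before acknowledging the write to the caller.
        self.primary[index] = data
        self.srn_copy[index] = data
        return True

    def in_sync(self) -> bool:
        # With synchronous mirroring the two copies match after every acknowledged write.
        return self.primary == self.srn_copy
```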

Checkpoints
SafeHaven checkpoints correspond to LUN-level copy-on-write snapshots and are block-consistent representations of a Protection Group at an instant in time. For many users, CenturyLink recommends that the storage allocated to the checkpoints be approximately thirty percent (30%) of the storage allocated to the Protection Group itself.
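To show why copy-on-write checkpoints need only a fraction of the volume's size, the sketch below keeps a copy of a block only the first time it is overwritten after a checkpoint. It is a simplified in-memory model, not SafeHaven's on-disk format, and the class and method names are hypothetical.

```python
from typing import Dict, List


class CowVolume:
    """Toy block volume with copy-on-write checkpoints."""

    def __init__(self, blocks: List[bytes]) -> None:
        self.blocks = list(blocks)
        # One dict per checkpoint: block index -> content at checkpoint time.
        self._checkpoints: List[Dict[int, bytes]] = []

    def checkpoint(self) -> int:
        # Taking a checkpoint copies no data; it only opens an empty record.
        self._checkpoints.append({})
        return len(self._checkpoints) - 1

    def write(self, index: int, data: bytes) -> None:
        # Preserve the old block once, in the newest checkpoint, before overwriting.
        if self._checkpoints:
            saved = self._checkpoints[-1]
            if index not in saved:
                saved[index] = self.blocks[index]
        self.blocks[index] = data

    def read_checkpoint(self, checkpoint_id: int) -> List[bytes]:
        # Start from the current image and undo changes, newest checkpoint first.
        image = list(self.blocks)
        for saved in reversed(self._checkpoints[checkpoint_id:]):
            for index, data in saved.items():
                image[index] = data
        return image
```

Only blocks that change after a checkpoint consume extra space, which is why the checkpoint storage can be sized as a percentage of the protected data rather than a full copy per checkpoint.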