Data Protection Overview
The HX Data Platform disaster recovery feature allows you to protect virtual machines from a disaster by setting up replication of running VMs between a pair of network-connected clusters. Protected virtual machines running on one cluster replicate to the other cluster in the pair, and vice versa. The two paired clusters are typically located at a distance from each other, with each cluster serving as the disaster recovery site for virtual machines running on the other cluster.
Once protection has been set up on a VM, HX Data Platform periodically takes a replication snapshot of the running VM on the local cluster and replicates (copies) the snapshot to the paired remote cluster. In the event of a disaster at the local cluster, you can use the most recently replicated snapshot of each protected VM to recover and run the VM at the remote cluster. Each cluster that serves as a disaster recovery site for another cluster must be sized with adequate spare resources so that, upon a disaster, it can run the newly recovered virtual machines in addition to its normal workload.
Note
Only one snapshot retention is supported for backup workflows.
Each virtual machine can be individually protected by assigning it protection attributes, chief among which is the replication interval (schedule). The shorter the replication interval, the fresher the replicated snapshot data is likely to be when it is time to recover the VM after a disaster. Replication intervals can range from 5 minutes to 24 hours.
A protection group is a group of VMs that share a common replication schedule and snapshot properties.
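The interval bounds above can be expressed as a small validation helper. This is an illustrative sketch only; the function and constants are hypothetical and not part of any HX API:

```python
# Illustrative sketch: validate a replication interval against the
# documented bounds (5 minutes to 24 hours). Hypothetical helper,
# not part of the HX Data Platform.

MIN_INTERVAL_MINUTES = 5          # shortest supported replication interval
MAX_INTERVAL_MINUTES = 24 * 60    # longest supported interval (24 hours)

def is_valid_interval(minutes: int) -> bool:
    """Return True if the interval falls within the supported range."""
    return MIN_INTERVAL_MINUTES <= minutes <= MAX_INTERVAL_MINUTES
```

A shorter interval that passes this check still has to be sustainable by the replication link; see the bandwidth considerations later in this section.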
Setting up replication requires two existing clusters running HX Data Platform version 2.5 or higher. Both clusters must be on the same HX Data Platform version. This setup can be completed online.
First, each cluster is set up for replication networking. Use HX Connect to provide a set of IP addresses to be used by local cluster nodes to replicate to the remote cluster. HX Connect creates VLANs through UCS Manager, for dedicated replication network use.
Note
When this option is chosen in HX Connect, UCSM is configured only when both UCS Manager and the fabric interconnect are associated with the HyperFlex cluster. When UCSM and the FI are not present, you must enter the VLAN ID and not select the UCSM configuration option in HX Connect.
The two clusters, and their corresponding relevant existing datastores, must be explicitly paired. The pairing setup can be completed using HX Connect from one of the two clusters. This requires administrative credentials of the other cluster.
Virtual machines can be protected (or have their existing protection attributes modified) by using HX Connect at the cluster where they are currently active.
HX Connect can be used to monitor the status of both incoming and outgoing replications at a cluster.
After a disaster, a protected VM can be recovered and run at the cluster that serves as the disaster recovery site for that VM.
Replication and Recovery Considerations
The following is a list of considerations when configuring virtual machine replication and performing disaster recovery of virtual machines.
-
Administrator―All replication and recovery tasks, except monitoring, can only be performed with administrator privileges on the local cluster. For tasks involving a remote cluster, both the local and remote users must have administrator privileges and should be configured with vCenter SSO on their respective clusters.
-
Minimum and Recommended Bandwidth―Beginning with HX 4.0(2a), the minimum bandwidth can be configured as 10 Mb for smaller deployments. The replication network link must also be reliable and sustain a minimum symmetric bandwidth equal to what is configured in the HyperFlex DR network. This bandwidth should not be shared with any other applications on the uplink or downlink.
-
Maximum Latency―Maximum latency supported is 75ms between two clusters.
If you schedule multiple replication jobs to run at the same time, for example 32 (the maximum supported by DR), and your bandwidth is low (50 Mbps) and latency is high (75 ms), it is possible that some jobs will error out until bandwidth becomes available. If this situation occurs, increase the bandwidth or reduce the concurrency by staggering the replications.
During this situation, unprotect operations can take longer than expected.
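As a rough feel for why staggering helps, the following sketch estimates how long one replication round takes when concurrent jobs share a link. The changed-data figure and the arithmetic are illustrative planning assumptions, not measured HX behavior:

```python
# Illustrative planning arithmetic: estimate the time one replication
# round takes when all concurrent jobs share the same link. The changed
# data per VM is a hypothetical planning input, not an HX-reported value.

def transfer_minutes(changed_gb_per_vm: float, concurrent_jobs: int,
                     link_mbps: float) -> float:
    """Minutes to replicate one round of changed data over a shared link."""
    total_bits = changed_gb_per_vm * concurrent_jobs * 8 * 1024 ** 3
    seconds = total_bits / (link_mbps * 1_000_000)
    return seconds / 60

# Example: 32 concurrent jobs, 1 GB of changed data each, on a 50 Mbps link.
round_minutes = transfer_minutes(1.0, 32, 50)
```

If the estimate exceeds the replication interval, either the link needs more bandwidth or the schedules should be staggered so fewer jobs run concurrently.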
-
Network Loss―When there is packet loss in data transmission between the two sites, protection and recovery operations can produce unexpected results. The transmission must be reliable for these features to work as expected.
-
Storage Space―Ensure that you have sufficient space on the remote cluster to support your replication schedule. The protected virtual machines are replicated (copied) to the remote cluster at every scheduled interval. Although storage capacity savings methods (deduplication and compression) are applied, each replicated virtual machine consumes some storage space.
Insufficient storage space on the remote cluster can cause it to reach its capacity usage maximums. If you see Out of Space errors, see Handling Out of Space Errors. Pause all replication schedules until you have appropriately adjusted the space available on the HX Cluster. Always ensure that your cluster capacity consumption is below the space utilization warning threshold.
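A back-of-the-envelope space estimate can help when sizing the remote cluster. The savings ratio and warning threshold below are hypothetical planning inputs, not values reported by the HX Data Platform:

```python
# Illustrative sizing sketch: estimate the space replicas consume on the
# remote cluster. savings_ratio models deduplication and compression and
# is a hypothetical planning input; tune it to your own environment.

def replica_space_gb(vm_sizes_gb, savings_ratio=0.5):
    """Estimated space (GB) the replicated VMs consume after savings."""
    return sum(vm_sizes_gb) * (1 - savings_ratio)

def below_warning_threshold(used_gb, replicas_gb, capacity_gb, warn_pct=0.70):
    """True if current usage plus replica space stays under the threshold.
    warn_pct is a hypothetical example threshold, not the HX default."""
    return (used_gb + replicas_gb) <= capacity_gb * warn_pct
```

If the check fails for your planned schedule, add capacity or reduce the number of protected VMs before enabling replication.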
-
Supported Clusters―Replication is supported between the following HyperFlex clusters:
-
1:1 replication between HX clusters running under fabric interconnects.
-
1:1 replication between an All Flash and a Hybrid HX cluster running under fabric interconnects.
-
1:1 replication between one 3-Node or 4-Node HX Edge cluster and another 3-Node or 4-Node HX Edge cluster.
-
1:1 replication between All Flash 3-Node or 4-Node HX Edge clusters and Hybrid 3-Node or 4-Node HX Edge clusters.
-
1:1 replication between a 3-Node or 4-Node HX Edge cluster and an HX cluster running under fabric interconnects.
Note
1:1 replication with 2-Node HX Edge is not supported.
-
Rebooting Nodes―Do not reboot any nodes in the HX Cluster during any restore, replication, or recovery operation.
-
Thin Provision―Protected virtual machines are recovered with thin provisioned disks irrespective of how disks were specified in the originally protected virtual machine.
-
Protection Group Limitations
-
The maximum number of VMs allowed in a protection group is 64.
-
Do not add VMs with ISOs or floppies to protection groups.
Protected Virtual Machine Scalability
-
A total of 1500 VMs across both clusters is supported. This can be 750 VMs per cluster in a bi-directional configuration (that is, replicating in both directions, Site A to Site B and vice versa), or any split between the two clusters that does not exceed the limit of 1500 VMs.
-
The sum of VMs on all nodes should not exceed the maximum limit of 1500 VMs per cluster in a single-direction configuration, or 750 VMs per cluster in a bi-directional configuration. The maximum limit of 1500 VMs is supported regardless of whether the clusters have 4 or 8 nodes.
For example, with 4-node clusters on both sides of DR, you can reach up to 800 VMs. However, with 8-node clusters on both sides, replication and recovery of 1600 VMs is not supported.
-
The maximum number of VMs allowed in a protection group is 64.
-
Do not add VMs with ISOs or floppies to protection groups.
-
A maximum of 100 protection groups are supported.
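One reading of the limits above can be captured in a small checker. The constants mirror the documented numbers; the helper functions are hypothetical, not part of the HX Data Platform:

```python
# Illustrative checker for the documented protection limits. The
# constants come from the limits listed above; the functions are
# hypothetical planning helpers, not an HX API.

MAX_VMS_TOTAL = 1500             # across both clusters
MAX_VMS_PER_CLUSTER_BIDIR = 750  # per cluster when replicating both ways
MAX_VMS_PER_GROUP = 64
MAX_PROTECTION_GROUPS = 100

def vm_plan_supported(site_a_vms: int, site_b_vms: int) -> bool:
    """True if the protected-VM counts stay within the documented limits."""
    if site_a_vms + site_b_vms > MAX_VMS_TOTAL:
        return False
    bidirectional = site_a_vms > 0 and site_b_vms > 0
    if bidirectional and max(site_a_vms, site_b_vms) > MAX_VMS_PER_CLUSTER_BIDIR:
        return False
    return True

def groups_supported(group_sizes) -> bool:
    """True if the group count and each group's size are within limits."""
    return (len(group_sizes) <= MAX_PROTECTION_GROUPS
            and all(s <= MAX_VMS_PER_GROUP for s in group_sizes))
```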
-
Non-HX Datastores―If you protect a VM whose storage is on a non-HX datastore, periodic replication fails for that VM. You can either unprotect the VM or remove its non-HX storage.
Do not move protected VMs from HX datastores to non-HX datastores. If a VM is moved to a non-HX datastore through storage vMotion, unprotect the VM, then reapply the protection.
-
Templates―Templates are not supported for Disaster Recovery.
-
Protection and Recovery of Virtual Machines with Snapshots
-
A VM with no snapshots—When replication is enabled, the entire content of the VM is replicated.
-
A VM with VMware redolog snapshots—When replication is enabled, the entire content, including the snapshot data, is replicated. When a VM with redolog snapshots is recovered, all previous snapshots are preserved.
-
A VM with HyperFlex snapshots—When replication is enabled, only the latest data is replicated, and the snapshot data is not replicated. When the VM is recovered, previous snapshots are not preserved.
-
Data Protection and Disaster Recovery (DR) snapshots are stored on the same datastore as the protected VMs. Manual deletion of these snapshots by an administrator is not supported. Deleting the snapshot directories would compromise HX data protection and disaster recovery.
Caution
As in any VMware environment (not restricted to HX on VMware), datastores can be accessed by the administrator through the vCenter browser or by logging into the ESX host. As a result, the snapshot directory and its contents are browsable and accessible to administrators. VMware does not restrict administrator operations on datastores. Be aware of this to avoid deleting snapshots manually.
Other points for consideration include:
-
Location of the VMware Virtual Center—If you delete a VM from VMware Virtual Center that is located on an “Other DRO” datastore pair, a recovery plan for this datastore pair fails during recovery. To avoid this failure, first unprotect the VM by using the following command on one of the controller VMs:
stcli dp vm delete --vmid <VM_ID>
-
Name of the VM—If you rename a VM from the Virtual Center, HyperFlex recovers the VM into the folder with the previous name but registers the VM with the new name on the recovery side. The following are some of the limitations of this situation:
-
VMware allows a VMDK located at any location to be attached to a VM. In such cases, HyperFlex recovers the VM inside the VM folder and not at a location mapped to the original location. Also, recovery can fail if the VMDK is explicitly referenced by its path in the <virtual machine name>.vmx file. The data is recovered accurately, but there could be problems with registering the VM to the Virtual Center. You can rectify this error by updating the <virtual machine name>.vmx file with the new path.
-
If a VM is renamed and a VMDK is added subsequently, the new VMDK is created at [sourceDs] newVm/newVm.vmdk. HyperFlex recovers this VMDK with the earlier name. In such cases, recovery can fail if the VMDK is explicitly referenced by its path in the <virtual machine name>.vmx file. The data is recovered accurately, but there could be problems with registering the VM to the Virtual Center. You can rectify this error by updating the <virtual machine name>.vmx file with the new path.
-
Replication Network and Pairing Considerations
A replication network must be established between clusters that are expected to use replication for data protection. This replication network isolates inter-cluster replication traffic from other traffic within each cluster and site.
The following is a list of considerations when configuring replication network and pairing:
-
To support efficient replication, all M nodes of cluster A have to communicate with all N nodes of cluster B, as illustrated in the M x N connectivity between clusters figure.
-
To enable replication traffic between clusters to cross the site-boundary and traverse the internet, each node on Cluster A should be able to communicate with each node on Cluster B across the site boundary and the internet.
-
The replication traffic must be isolated from other traffic within the cluster and the data center.
-
To create this isolated replication network for inter-cluster traffic, complete these steps:
-
Create a replication network on each cluster.
-
Pair clusters to associate the clusters and establish M x N connectivity between the clusters.
-
IP addresses, subnet, VLAN, and gateway are associated with the replication network of each cluster. You must configure the corporate firewall and routers on both sites to allow communication between the clusters and the sites on TCP ports 9338, 3049, 9098, 4049, and 4059.
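A quick way to verify the firewall rules is a generic TCP reachability probe against each remote node. The script below is an illustrative sketch, not an HX tool; any node IP addresses you pass in are your own environment's values:

```python
# Illustrative reachability probe for the replication TCP ports listed
# above. This is a generic TCP connect test, not an HX utility; run it
# from a host that can reach the replication network.
import socket

REPLICATION_TCP_PORTS = (9338, 3049, 9098, 4049, 4059)

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def unreachable_ports(host: str):
    """Ports from the required set that could not be reached on host."""
    return [p for p in REPLICATION_TCP_PORTS if not port_open(host, p)]

# Example usage (hypothetical remote node IP):
#   print(unreachable_ports("10.0.0.11"))
```

An empty list from `unreachable_ports` for every remote node suggests the firewall and router rules are in place; any listed port needs a rule review.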
M x N Connectivity Between Clusters
Storage Replication Adapter Overview
Storage Replication Adapter (SRA) for VMware vCenter Site Recovery Manager (SRM) is a storage vendor-specific plug-in for VMware vCenter server. The adapter enables communication between SRM and a storage controller at the Storage Virtual Machine (SVM) level as well as at the cluster level configuration. The adapter interacts with the SVM to discover replicated datastores.
For more information on the installation and configuration of SRM, refer to the following links for your SRM release version:
-
SRM 8.1 installation—https://docs.vmware.com/en/Site-Recovery-Manager/8.1/srm-install-config-8-1.pdf
-
SRM 6.5 installation—https://docs.vmware.com/en/Site-Recovery-Manager/6.5/srm-install-config-6-5.pdf
-
SRM 6.0 installation—https://docs.vmware.com/en/Site-Recovery-Manager/6.0/srm-install-config-6-0.pdf
You must install an appropriate SRA on the Site Recovery Manager Server hosts at the protected and recovery sites. If you use more than one type of storage array, you must install the SRA for each type of array on both of the Site Recovery Manager Server hosts.
Before installing an SRA, ensure that SRM and JDK 8 or a later version are installed on the Windows machines at the protected and recovery sites.
To install an SRA, do the following:
-
Download SRA from the VMware site.
On the https://my.vmware.com/web/vmware/downloads page, locate VMware Site Recovery Manager and click Download Product. Click Drivers & Tools, expand Storage Replication Adapters, and click Go to Downloads.
-
Copy the Windows installer of SRA to SRM Windows machines at the protected and recovery sites.
-
Double-click the installer.
-
Click Next on the Welcome page of the installer.
-
Accept the EULA and click Next.
-
Click Finish.
Note
The SRA is installed within the SRM program folder: C:\Program Files\VMware\VMware vCenter Site Recovery Manager\storage\sra
After installing the SRA, refer to the following guide for your SRM release version and set up the SRM environment:
-
SRM 8.1 configuration—https://docs.vmware.com/en/Site-Recovery-Manager/8.1/srm-admin-8-1.pdf
-
SRM 6.5 configuration—https://docs.vmware.com/en/Site-Recovery-Manager/6.5/srm-admin-6-5.pdf
-
SRM 6.0 configuration—https://docs.vmware.com/en/Site-Recovery-Manager/6.0/srm-admin-6-0.pdf
After configuration, SRM works with SRA to discover arrays and replicated and exported datastores, and to fail over or test failover datastores.
SRA enables SRM to execute the following workflows:
-
Discovery of replicated storage
-
Non-disruptive failover test recovery using a writable copy of replicated data
-
Emergency or planned failover recovery
-
Reverse replication after failover as part of failback
-
Restore replication after failover as part of a production test
Data Protection Terms
Interval―Part of the replication schedule configuration; specifies how often a replication snapshot of the protected VMs is taken and copied to the target cluster.
Local cluster―The cluster you are currently logged into through HX Connect, in a VM replication cluster pair. From the local cluster, you can configure replication protection for locally resident VMs. The VMs are then replicated to the paired remote cluster.
Migration―A routine system maintenance and management task where a recent replication snapshot copy of the VM becomes the working VM. The replication pair of source and target clusters does not change.
Primary cluster―An alternative name for the source cluster in VM disaster recovery.
Protected virtual machine―A VM that has replication configured. Protected VMs reside on a datastore in the local cluster of a replication pair and have a replication schedule configured either individually or through a protection group.
Protection group―A means to apply the same replication configuration on a group of VMs.
Recovery process―The manual process to recover protected VMs in the event the source cluster fails or a disaster occurs.
Recovery test―A maintenance task that ensures the recovery process is successful in the event of a disaster.
Remote cluster―One of a VM replication cluster pair. The remote cluster receives the replication snapshots from the Protected VMs in the local cluster.
Replication pair―Two clusters that together provide a remote cluster location for storing the replication snapshots of local cluster VMs.
Each cluster in a replication pair can act as both a remote and a local cluster, and both clusters can have resident VMs. Each cluster is local to its resident VMs and remote to the VMs that reside on the paired cluster.
Replication snapshot―Part of the replication protection mechanism. A type of snapshot taken of the protected VM, which is copied from the local cluster to the remote cluster.
Secondary cluster―An alternative name for the target cluster in VM disaster recovery.
Source cluster―One of a VM replication cluster pair. The source cluster is where the protected VMs reside.
Target cluster―One of a VM replication cluster pair. The target cluster receives the replication snapshots from the VMs of the source cluster. The target cluster is used to recover the VMs in the event of a disaster on the source cluster.
Best Practices for Data Protection and Disaster Recovery
As an administrator, you need to design and deploy an effective data protection and disaster recovery strategy for your environment. The solution that you design and subsequently deploy must meet or exceed the business requirements for both Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) of the production VMs. The following are some of the points that you must consider while designing this strategy:
-
The number of Service Level Agreements (SLA) necessary to comply with various categories of production workloads that may include mission critical, business critical, and important VMs.
-
Detailed constructs of each SLA that may include RPO, RTO, the number of recovery points retained, requirements for offsite copies of data, and any requirements for storing backup copies on different media types. There may be additional requirements, such as the ability to recover to a different environment: a different location, a different hypervisor, or a different private/public cloud.
-
An ongoing testing strategy for each SLA which serves to prove that the solution meets the business requirements it was designed for.
Note that backups and backup copies must be stored external to the HyperFlex cluster being protected. For example, backups performed to protect VMs on a HyperFlex cluster should not be saved to a backup repository or a disk library that is hosted on the same HyperFlex cluster.
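When mapping SLAs to replication intervals, a simple worst-case estimate can help. The formulas below are planning heuristics under stated assumptions, not guarantees from the HX Data Platform:

```python
# Planning heuristic (assumption): with asynchronous replication, the
# worst-case RPO is roughly one replication interval plus the time the
# replication transfer itself takes. Hypothetical helpers, not an HX API.

def worst_case_rpo_minutes(interval_min: float, transfer_min: float) -> float:
    """Approximate worst-case data-loss window for a protected VM."""
    return interval_min + transfer_min

def meets_rpo_sla(interval_min: float, transfer_min: float,
                  sla_rpo_min: float) -> bool:
    """True if the estimated worst-case RPO stays within the SLA's RPO."""
    return worst_case_rpo_minutes(interval_min, transfer_min) <= sla_rpo_min
```

For example, a 15-minute interval with a 5-minute transfer window comfortably meets a 30-minute RPO SLA, while a 60-minute interval does not.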
The built-in HyperFlex data protection capabilities are generalized into the following categories:
-
Data Replication Factor—Refers to the number of redundant copies of data within a HyperFlex cluster. A data replication factor of 2 or 3 can be configured during data platform installation and cannot be changed afterward. The benefit of the data replication factor is that it contributes to the number of failures the cluster can tolerate. See the section titled HX Data Platform Cluster Tolerated Failures for additional information about the data replication factor.
Note
Data Replication Factor alone may not fulfill requirements for recovery in the highly unlikely event of a cluster failure, or an extended site outage. Also, the data replication factor does not facilitate point-in-time recovery, retention of multiple recovery points, or creation of point-in-time copies of data external to the cluster.
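To see how the replication factor trades capacity for resilience, consider this illustrative sketch. The RF - 1 failure rule is a common characterization of replication-based storage; treat it as an assumption and consult the HX Data Platform Cluster Tolerated Failures section for the authoritative numbers:

```python
# Illustrative sketch of the capacity/resilience trade-off of a data
# replication factor (RF). Assumption: each write is stored RF times,
# and a cluster built this way can absorb RF - 1 simultaneous node
# failures. See the HX documentation for authoritative figures.

def usable_capacity_tb(raw_tb: float, replication_factor: int) -> float:
    """Usable space once every write is stored replication_factor times."""
    return raw_tb / replication_factor

def tolerated_node_failures(replication_factor: int) -> int:
    """Simultaneous node failures absorbed under the RF - 1 assumption."""
    return replication_factor - 1
```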
-
Data Platform Snapshots—Operates on an individual VM basis and enables saving versions of a VM over time. A maximum of 31 snapshots can be retained.
Note
Data Platform Snapshots alone may not fulfill requirements for recovery in the highly unlikely event of a cluster failure, or an extended site outage. Also, it does not facilitate the ability to create point-in-time copies of data external to the cluster. More importantly, unintentional deletion of a VM will also delete any data platform snapshots associated with the deleted VM.
-
Asynchronous Replication—Also known as the HX Data Platform disaster recovery feature, it enables protection of virtual machines by replicating virtual machine snapshots between a pair of network-connected HyperFlex clusters. Protected virtual machines running on one cluster replicate to the other cluster in the pair, and vice versa. The two paired clusters are typically located at a distance from each other, with each cluster serving as the disaster recovery site for virtual machines running on the other cluster.
Note
Asynchronous Replication alone may not fulfill requirements for recovery when multiple point-in-time copies need to be retained on the remote cluster. Only the most recent snapshot replica for a given VM is retained on the remote cluster. Also, asynchronous replication does not facilitate the ability to create point-in-time copies of data external to either cluster.
We recommend that you first understand the unique business requirements of your environment and then deploy a comprehensive data protection and disaster recovery solution to meet those requirements.