Design Zone for Branch/WAN - Cisco SD-WAN Administrator-Triggered Cluster Failover Deployment Guide

About the guide

This document provides design and deployment information for vManage Disaster Recovery (DR) using administrator-triggered failover. It describes the available disaster recovery methods and the steps to set up administrator-triggered disaster recovery.

Note: The administrator-triggered disaster recovery failover mechanism is currently supported, validated, and tested only for on-premises controller deployments.

The guide assumes that the SD-WAN controllers (vSmart and vBond) are already deployed, the vManage nodes are already created, and the vManage cluster is already configured. See the Cisco SD-WAN Design Guide for additional background information on Cisco SD-WAN.

Figure 1. Implementation Flow

This document contains four major sections:

● The Define section gives background on the SD-WAN solution, along with the details regarding the available disaster recovery options.

● The Design section discusses the solution components, design aspects, and any prerequisites.

● The Deploy section provides information about configurations and best practices.

● The Operate section shows how to manage different aspects of the solution.

Audience

This document is for network architects and operators, or anyone interested in deploying and using the vManage disaster recovery feature, either for production or lab purposes.

Define: About the Solution

Figure 2. Cisco SD-WAN solution roles and responsibilities

There are three distinct types of controllers within the Cisco SD-WAN solution, each responsible for either the orchestration plane, the management plane, or the control plane.

● Orchestration Plane: the vBond controller, or vBond orchestrator, is part of the orchestration plane. It authenticates and authorizes devices onto the network and distributes the list of vSmart controllers and vManage to all the WAN Edge routers.

● Management Plane: the vManage Network Management System (NMS) server is the controller that makes up the management plane. It is a single pane of glass for Day 0, Day 1, and Day 2 operations. It provides centralized provisioning, troubleshooting, and monitoring for the solution.

● Control Plane: the vSmart controller is part of the control plane. It disseminates control plane information between routers, implements control plane policies, and distributes data plane policies to the routers.

This guide focuses on the vManage server, which makes up the management plane.

vManage NMS redundancy and high availability

● vManage can be deployed in two basic ways: standalone or clustered. All vManage instances inside a primary cluster operate in active mode. The purpose of a vManage cluster is scale. A cluster does provide a level of redundancy against a single vManage failure, but it does not protect against a cluster-level failure. Clustering across geographical locations is not recommended, because database replication between cluster members requires a latency of 4 ms or less between them. Therefore, members of a cluster should reside at the same site. Redundancy is achieved with a backup vManage or backup vManage cluster in standby mode.

● If you are running vManage in standalone mode, deploy a vManage in active mode as primary, and a vManage in standby mode as backup. It is recommended to deploy these at two different geographical locations to achieve redundancy.

● If you are running vManage in a cluster, deploy a vManage cluster in active mode as primary, and a vManage cluster in standby mode as backup. A cluster needs a minimum of three vManage instances, each active and running independently. It is recommended to deploy the two clusters at different geographical locations to achieve redundancy.

Figure 3. vManage redundancy

Disaster Recovery

The vBond and vSmart controllers are stateless. Snapshots of their virtual machines can be made before any maintenance or configuration changes, or their configurations can be copied and saved if running in CLI mode. In addition, if feature or CLI templates are configured on vManage (required for vSmart if centralized policies are built and applied from vManage), their configurations will be saved with the vManage snapshots and database. Snapshots can be restored, or the device can be re-deployed and configuration templates pushed from the vManage in a disaster recovery scenario.

vManage is the only stateful SD-WAN controller and its backup cannot be deployed in active mode. For the vManage server, snapshots should be taken, and the database backed up regularly.

There are different disaster recovery methods available. In common disaster recovery scenarios, an active vManage or vManage cluster resides at one data center site, along with at least one active vSmart controller and vBond orchestrator. In a second data center, a standby (inactive) vManage or vManage cluster is deployed, along with at least one active vSmart controller and vBond orchestrator. On the active vManage or vManage cluster, each vManage instance establishes control connections to vSmart controllers and vBond orchestrators in both data centers. When the standby vManage or vManage cluster becomes active, it then establishes control connections to the vSmart controllers and vBond orchestrators in both data centers.

The following disaster recovery methods are available:

● Manual (vManage standalone or cluster) – The backup vManage server or vManage cluster is kept shut down in a cold standby state. Regular backups of the active database are taken. When the primary vManage or vManage cluster goes down, the standby vManage or vManage cluster is brought up manually and the backup database is restored on it.

● Administrator-triggered failover (vManage cluster) (recommended) – Data is replicated automatically between the primary and secondary vManage clusters, but you must manually trigger the Failover to the secondary cluster. This method is supported starting with the 19.2 version of vManage code.

Design: Administrator-Triggered Disaster Recovery Method

Starting in the 19.2 version of vManage code, the administrator-triggered disaster recovery failover option can be configured. This disaster recovery method applies only to vManage clusters which are primary and backup to each other. This method does not apply to standalone primary and secondary vManage servers.

For this method, a vManage cluster is configured at one datacenter, and a second vManage cluster is configured at a second datacenter. The two clusters communicate with each other over a data center interconnect (DCI) link between the datacenters, using their cluster links, which are part of VPN 0. Data is replicated automatically between the primary and secondary vManage clusters, and a Failover to the secondary cluster is performed manually.

All controllers (vSmart controllers and vBond orchestrators) are deployed across both primary and secondary data centers and are reachable from vManage servers from both data centers through the transport interfaces. At any given time, however, the controllers are connected only to the primary vManage cluster.

The following diagram depicts a typical admin-triggered disaster recovery topology. Note that the vManage cluster link for each vManage is extended across the datacenters and all vManage servers are reachable through this out-of-band interface. Control connections are established from each vManage to other vManage servers in the same cluster through the VPN 0 transport interface and to the vSmart controllers and the vBond orchestrators in either datacenter through the same interface.

Figure 4. Administrator-triggered disaster recovery topology

Prerequisites, Best Practices and Recommendations

Best practices and recommendations for vManage clustering and administrator-triggered failover deployment include the following:

● Deploy each vManage VM on a separate physical server within a datacenter so that a single physical server outage will not impact the vManage cluster for a given datacenter.

● All vManage servers should be running the same software version.

● For each vManage instance within a cluster, a third interface (cluster link) is required in addition to the interfaces used for VPN 0 (transport) and VPN 512 (management). This interface is used for communication and syncing between the vManage servers within the cluster. It should be at least 1 Gbps and have a latency of 4 ms or less. A 10 Gbps interface is recommended.

● In ESXi, it is recommended to use VMXNET3 adapters for interfaces. VMXNET3 supports 10 Gbps speeds. To make VMXNET3 NICs available, under ESXi 5.0 and later (VM version 8) compatibility settings, under Edit Settings>VM Options>General Options, choose a Guest OS version that supports VMXNET3 (such as Ubuntu Linux (64-bit) or Red Hat Linux 5 (64-bit) or greater).

● Within a cluster, all vManage nodes should reside on the same LAN segment and be able to reach each other on the out-of-band interface (cluster link). Between datacenters, all vManage servers should also be able to reach each other through the out-of-band interface (cluster link), either through an extended layer 2 segment or through layer 3 routing.

● Across data centers, the following ports need to be enabled on the firewalls for the vManage clusters to communicate with each other:

◦ TCP ports: 8443, 830

● The currently supported vManage cluster administrator-triggered failover topology requires all services (application-server, configuration, messaging server, coordination server, and statistics) to be enabled on all the vManage servers in the cluster.

● The configuration and statistics services should each run on at least three vManage devices. Each service must run on an odd number of devices because, to ensure data consistency during write operations, a quorum (simple majority) of vManage devices must be running and in sync.

● Ensure that you use a user with netadmin privileges for Disaster Recovery (DR) registration. We recommend that you change the factory-default password (admin) before you start the registration process.

To change user credentials, we recommend that you use the Cisco vManage GUI rather than the CLI of a Cisco SD-WAN device.

● If the Cisco vManage nodes that are part of a cluster are configured using feature templates, ensure that you create separate feature templates for the primary data center and the secondary data center.

When the primary cluster is switched over to the secondary cluster, Cisco vManage detaches the Cisco SD-WAN devices from the feature templates. Therefore, ensure that you reattach the devices to the appropriate feature templates.

● For an on-premises deployment, ensure that you regularly back up the configuration database from the active Cisco vManage instance.
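
The firewall requirement in the list above (TCP ports 8443 and 830 between data centers) can be spot-checked before registration. The following is a minimal Python sketch, not a Cisco tool: it only attempts a TCP connection from the host where it runs to a peer address. The function name check_tcp_ports and the idea of running it from a host on the cluster-link network are illustrative assumptions.

```python
import socket

# TCP ports the guide lists for communication between the vManage clusters.
DR_TCP_PORTS = (8443, 830)


def check_tcp_ports(host, ports=DR_TCP_PORTS, timeout=3.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds.

    A successful connect only shows that the port is open end to end. It does
    not prove that the vManage service behind the port is healthy.
    """
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, calling check_tcp_ports with the cluster-link IP address of a vManage node in the other data center returns a dictionary keyed by port. A False entry suggests a firewall rule or routing problem on the path, which would need to be confirmed with the usual network tools.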

Deployment: vManage Cluster Disaster Recovery (DR)

This section explains the prerequisites and steps to configure administrator-triggered disaster recovery failover for a vManage primary/secondary cluster design. The primary and standby vManage clusters are configured with the same number of instances running the same services, and each cluster is in a separate data center.

The vManage cluster design followed in this guide helps achieve:

● vManage redundancy that provides high availability should one vManage in a cluster fail.

● Administrator-triggered failover Disaster Recovery (DR), providing redundancy for an entire cluster should the whole datacenter fail.

To deploy a vManage cluster, refer to the vManage cluster design whitepaper. The section Installing a vManage Cluster explains the steps required to install a standard deployment of three vManage instances. The rest of this deployment section focuses on administrator-triggered Disaster Recovery (DR) performed only when the entire active cluster goes down, therefore switching from primary to secondary clusters.

Prerequisites to enable Disaster Recovery:

1. Install and configure the required number of vManage instances in a virtual environment. Create a third interface on each vManage device within VPN 0. This is the out-of-band or cluster interface configured on VPN 0 of each vManage node involved in disaster recovery. This is the same interface that is used by vManage for communicating with its peers in a cluster.

Note: You need at least three nodes in each of the two vManage clusters to enable DR. Therefore, install and deploy at least six vManage instances.

2. Within each vManage instance, the organization name must be configured, certificates installed, and tunnel interfaces active. The bare minimum configuration on each vManage instance is as follows:

system
 host-name
 system-ip
 site-id
 organization-name
 vbond
vpn 0
 interface (tunnel interface)
  ip address
  tunnel-interface
  no shutdown
 interface (out-of-band/cluster interface)
  ip address
  no shutdown
 ip route 0.0.0.0/0 (next hop for tunnel interface)

Make sure all vManage servers can reach each other's out-of-band/cluster interface.

3. To install a vManage cluster, follow the steps listed in the vManage cluster design whitepaper. Note that the currently supported vManage cluster topology requires all services to be enabled on all the vManage servers in the cluster. This includes application-server, configuration-db, messaging server, coordination server, and statistics-db.

4. Before starting the DR registration procedure, ensure that no other procedures, such as software upgrades or template attachments, are running on either the primary or the secondary cluster. Note that DR must be registered on the primary Cisco vManage cluster.
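
The prerequisites above lend themselves to a quick sanity check of a planned deployment before registration. The sketch below is one hedged way to express them in Python, assuming the inventory is written down by hand: the Node class, the validate_dr_plan function, and the service labels are hypothetical names for illustration and do not correspond to any vManage API.

```python
from dataclasses import dataclass

# Labels for the five services the guide requires on every node (illustrative spelling).
REQUIRED_SERVICES = frozenset({
    "application-server",
    "configuration-db",
    "messaging-server",
    "coordination-server",
    "statistics-db",
})
MIN_NODES_PER_CLUSTER = 3  # minimum cluster size stated in the guide


@dataclass
class Node:
    name: str
    version: str
    services: frozenset = REQUIRED_SERVICES


def validate_dr_plan(primary, secondary):
    """Return a list of problems; an empty list means the plan passes these checks."""
    problems = []
    for label, cluster in (("primary", primary), ("secondary", secondary)):
        if len(cluster) < MIN_NODES_PER_CLUSTER:
            problems.append(
                f"{label} cluster has {len(cluster)} node(s); at least 3 are required"
            )
        for node in cluster:
            missing = REQUIRED_SERVICES - node.services
            if missing:
                problems.append(
                    f"{node.name} is missing services: {', '.join(sorted(missing))}"
                )
    # Both clusters are expected to have the same number of instances.
    if len(primary) != len(secondary):
        problems.append("primary and secondary clusters have different node counts")
    # All vManage servers should run the same software version.
    versions = {node.version for node in primary + secondary}
    if len(versions) > 1:
        problems.append(f"mixed software versions: {', '.join(sorted(versions))}")
    return problems
```

A plan with three nodes per cluster, all on the same version and with all services enabled, returns an empty list. The check covers only what is written in the inventory; reachability, certificates, and tunnel state still have to be verified on the devices themselves.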

Process 1: Disaster Recovery Registration

Step 1. Log in to any vManage instance that is part of the primary cluster to begin the Disaster Recovery registration process.

Step 2. Navigate to Administration > Disaster Recovery.

Step 3. Click Manage Disaster Recovery to enter the out-of-band/cluster IP address of a vManage instance from the primary cluster, along with the admin username and password for the cluster.

Step 4. Under Connectivity Info:

● Enter the out-of-band/cluster IP address of one of the vManage instances in the active cluster and the credentials for the active cluster.

● Then, enter the out-of-band/cluster IP address of any of the vManage instances in the standby cluster, followed by the username and password for the standby cluster.

Finally, click Next to continue.

Technical Tip

Once you have entered the password here and configured DR, do not change that password.

Step 5. Under vBond Info, enter the IP address and the admin Username and Password for the first vBond in the primary cluster.

Note: If you have more than one vBond in the deployment, click the + button to enter the details for the second vBond. Repeat this step until you have listed all of the vBonds in your deployment. Finally, click Next to continue.

Technical Tip

Once you have entered the password within vBond Info and configured DR, do not change that password.

Step 6. In this deployment, the Recovery Mode is set to Manual.

Step 7. Under Replication Schedule, enter the Start Time and Replication Interval. In a production environment, the Replication Interval must be set to at least 30 minutes. The controller and edge lists are uploaded to the standby cluster during the first replication.

Finally, click Save.

Step 8. You are redirected to the Disaster Recovery Registration page. Registration restarts the Application Server on each of the vManage devices, including the vManage whose GUI you are currently using. The registration may take up to 20 to 30 minutes to complete. Refresh the browser to confirm that DR is successfully configured.

Technical Tip

When the Application Server restarts on the vManage that you are using, an error may be displayed at the top of the page. This is normal because the browser is trying to refresh the page while the Application Server is restarting.

Step 9. Navigate back to the Administration > Disaster Recovery to view the status of the active and standby clusters.

The status can be checked from either side: from the primary vManage cluster, where the DR registration was initiated, or from the secondary vManage cluster.
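
The replication schedule entered in Step 7 can be reasoned about with simple date arithmetic. The Python sketch below lists the first few replication start times for a given Start Time and Replication Interval and rejects intervals below the 30-minute production minimum. It assumes replications start at fixed intervals from the Start Time, which the guide does not state explicitly; the function name replication_times is hypothetical.

```python
from datetime import datetime, timedelta

MIN_INTERVAL_MINUTES = 30  # production minimum stated in Step 7


def replication_times(start, interval_minutes, count):
    """Return the first `count` replication start times.

    Assumption: replications run at a fixed interval measured from `start`.
    """
    if interval_minutes < MIN_INTERVAL_MINUTES:
        raise ValueError(
            f"replication interval must be at least {MIN_INTERVAL_MINUTES} minutes"
        )
    step = timedelta(minutes=interval_minutes)
    return [start + i * step for i in range(count)]
```

With a Start Time of 22:00 and an interval of 30 minutes, the first three entries are 22:00, 22:30, and 23:00. Knowing the next replication time is useful before a planned Failover, because the standby cluster only holds data up to the last completed replication.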

Process 2: Manual Failover

A manually scheduled failover helps test the operation of disaster recovery.

The overall steps to perform a manual Failover are as follows:

● Before you perform a Failover, detach the device templates that are attached to the Cisco vManage nodes in the primary cluster.

● Shut down the tunnel interfaces on the primary Cisco vManage cluster to prevent devices from toggling during the Failover.

● From a Cisco vManage system on the secondary cluster, wait for data replication to complete and then click Make Primary to start the Failover.
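
The three preconditions above can be treated as a gate in front of the Make Primary click. The following Python sketch models that gate as a checklist; the FailoverState class and its field names are hypothetical, and the values are assumed to be filled in by the operator after checking each item by hand.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    templates_detached: bool    # device templates detached from primary vManage nodes
    primary_tunnels_shut: bool  # tunnel interfaces shut on every primary cluster node
    replication_complete: bool  # latest data replication has finished


def blockers(state):
    """Return the unmet preconditions, in the order the guide lists them."""
    checks = [
        (state.templates_detached,
         "detach device templates from the primary vManage nodes"),
        (state.primary_tunnels_shut,
         "shut down tunnel interfaces on the primary cluster"),
        (state.replication_complete,
         "wait for data replication to complete"),
    ]
    return [todo for done, todo in checks if not done]


def ready_for_make_primary(state):
    """True only when every precondition is met."""
    return not blockers(state)
```

Returning the list of outstanding items, rather than a bare True or False, keeps the order of operations visible: templates first, tunnel interfaces second, replication last.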

Step 1. Navigate to Administration > Disaster Recovery on one of the standby cluster nodes. Within Standby Cluster, look for the Make Primary option located in the middle of the screen. Click this option after the tunnel interfaces on the active vManage cluster are shut down.

Step 2. A confirmation pop-up appears. Again, make sure to shut down the tunnel interface on all nodes of the current active cluster before clicking OK.

Step 3. The Failover process starts.

Step 4. After the Failover process completes successfully, the vManage cluster that was previously in standby status is now active.

Technical Tip

The green check marks that indicate the status of each vManage instance are based on a simple ping test to the cluster out-of-band interface on each node, which can take up to a minute to recognize that the interface is unreachable. The status icon does not indicate the status of the tunnel interface itself. This means the check mark can still show green when the tunnel interfaces are down and the cluster is impacted. Verify that the devices are offline and/or the tunnel interfaces are shut down before attempting a Failover.

Step 5. Lastly, bring up the tunnel interfaces on all the nodes of the new active cluster. This brings up the control connections from the other controllers and devices.

Devices and controllers converge to the secondary cluster and that cluster assumes the role of the primary cluster. When this process completes, the original primary cluster assumes the role of the secondary cluster. Then data replicates from the new primary cluster to the new secondary cluster.

Process 3: Disaster Recovery Operations

This section describes how to handle disaster recovery in a variety of situations.

Procedure 1. Failover due to Loss of Primary vManage Cluster

This procedure explains the steps to fail over from the primary to the secondary (standby) vManage cluster after the loss of all the vManage instances in the primary cluster.

Technical Tip

Do not manually switch over unless all of the active cluster nodes show red and tunnel interfaces are shut down. If one or more of the active cluster nodes has control connections, the Failover may fail.

Step 1. Log in to vManage GUI on the secondary cluster and select Administration > Disaster Recovery.

Step 2. Click Make Primary.

The devices and controllers converge to the standby cluster, which assumes the role of primary.

When the original primary vManage cluster is recovered and back online, it assumes the role of secondary cluster and starts receiving data from the primary cluster.

Technical Tip

● If a partial loss of the primary Cisco vManage cluster is seen, we recommend that you try to recover the primary cluster instead of switching over to the secondary cluster.

A cluster with N nodes is operational if (N/2) + 1 nodes are operational, where N/2 is rounded down.

A cluster with N nodes becomes read-only if (N/2) + 1 or more nodes are lost.

● The operator also needs to validate that the tunnel interface is shut on the original primary cluster once it is back online. This operation cannot be performed until the cluster is up.
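
The quorum rule in the tip above is plain majority arithmetic. The short Python sketch below works it through, assuming that N/2 means integer (floor) division, which is the usual reading of a simple-majority quorum but is not spelled out in the guide. The function names are illustrative.

```python
def quorum(n):
    """Smallest number of nodes that forms a simple majority of n nodes."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1  # assumption: N/2 is rounded down


def is_operational(n, nodes_up):
    """True when enough nodes are up to form a quorum."""
    return nodes_up >= quorum(n)


def max_tolerated_losses(n):
    """Number of nodes that can be lost while the cluster keeps quorum."""
    return n - quorum(n)
```

For the three-node cluster used in this guide, the quorum is two nodes, so the cluster tolerates the loss of one node and loses quorum when a second node goes down. A five-node cluster has a quorum of three and tolerates two losses.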

Procedure 2. Failover due to Loss of Primary Data Center

This procedure covers the steps to fail over from the primary to the secondary (standby) data center if the cluster in the primary data center goes down.

Step 1. Log in to vManage GUI on the secondary cluster and select Administration > Disaster Recovery.

Step 2. Click Make Primary to begin the Failover process.

During the Failover, only the Cisco vBond Orchestrators in the secondary data center are updated with a new valid Cisco vManage list. As a result, the edge devices and controllers that are online converge to the standby cluster, and it assumes the role of the primary cluster.

After the original primary data center recovers and all VMs, including the controllers, are back online, these controllers are updated with a new valid Cisco vManage list and converge to the new primary Cisco vManage cluster. The original primary cluster assumes the role of the secondary cluster and begins to receive data from the new primary cluster.

Technical Tip

● The operator needs to validate that the tunnel interface is shut on the original primary cluster once it is back online. This operation cannot be performed until the site is up.

● If a link failure occurs between your data centers while the WAN in the primary data center remains operational, data replication fails. In such situations, attempt to recover the link so that data replication can resume.
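
The guidance in Procedures 1 and 2 and their tips can be summarized as a small decision rule: fail over only on a complete loss of the primary cluster, and otherwise repair what is broken. The Python sketch below encodes that reading. The Action enum and the recommend function are hypothetical names, and the inputs are assumed to come from the operator's own assessment.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action needed"
    RECOVER_PRIMARY = "recover the primary cluster"
    RECOVER_LINK = "recover the link between the data centers"
    FAILOVER = "fail over to the secondary cluster (Make Primary)"


def recommend(primary_nodes_total, primary_nodes_up, dci_link_up):
    """Map an observed failure to the action the guide recommends.

    This does not check that tunnel interfaces on the old primary are shut;
    that still has to be validated once the cluster is reachable again.
    """
    if primary_nodes_up == 0:
        # Complete loss of the primary cluster or data center.
        return Action.FAILOVER
    if primary_nodes_up < primary_nodes_total:
        # Partial loss: recover the primary instead of switching over.
        return Action.RECOVER_PRIMARY
    if not dci_link_up:
        # Primary is healthy but replication fails: fix the link.
        return Action.RECOVER_LINK
    return Action.NONE
```

For a three-node primary cluster, recommend(3, 2, True) returns Action.RECOVER_PRIMARY, while recommend(3, 0, True) returns Action.FAILOVER.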

Operate: Administrator-Triggered Disaster Recovery Method

This section covers common errors seen when enabling Disaster Recovery (DR) and some troubleshooting tips.

Common Issues seen during registration:

Device Registration Fails

Solution: Verify reachability to the vBond orchestrator from all cluster members on the secondary cluster, verify reachability between the secondary cluster and the primary cluster on the transport interface (VPN 0), and check that you have the correct username and password.

Failed to find vBond IP/UUID in the registration task details page

Solution: Make sure the vBonds are connected to the registered data center before retrying registration. If the error occurs while the vBonds are already connected, go to the rediscover page in the primary vManage GUI and rediscover the vBonds. This action ensures that a corresponding entry for each vBond is present in the configuration-db.

Timeout while clicking next on any registration step

Solution: Make sure the IP address is reachable and the credentials provided are correct.

Some Common Troubleshooting Tips:

● Verify the replication details status on the Disaster Recovery page to ensure all data is being transferred from the primary vManage cluster to the secondary cluster.

In case of replication failure, verify IP reachability from the primary vManage cluster.

● After a Failover, verify the Failover timestamp on the primary and secondary clusters.

Appendix A: Product List

The following products and software versions are included as part of validation in this deployment guide. This validated set is not inclusive of all possibilities.

Table 1. Cisco SD-WAN Hardware

Product	Software version
vManage	20.1.1
vBond	20.1.1
vSmart	20.1.1

Feedback

For comments and suggestions about this guide and related guides, join the discussion on Cisco Community at https://cs.co/en-cvds.
