Catalyst Center high availability

Catalyst Center’s HA framework reduces downtime from failures and makes your network more resilient. The HA framework provides near real-time synchronization of changes across your cluster nodes, giving your network redundancy to deal with issues.

Supported synchronization types include

  • database changes (such as updates related to configuration, performance, and monitoring data), and

  • file changes (such as report configurations, configuration templates, TFTP root directory, administration settings, licensing files, and the key store).

This guide describes

  • the requirements to use HA

  • the deployment process

  • administration best practices, and

  • Catalyst Center's response to failure scenarios.

 Note

Catalyst Center provides HA support for both Automation and Assurance functionalities.

High availability requirements

To enable HA, your production environment must meet these requirements:

  • Your cluster consists of three Catalyst Center appliances with the same machine profile (for example, three third-generation large appliances).

    See Supported appliances.

  • Your secondary appliances run the same version of Catalyst Center as the primary appliance.

  • Your cluster's appliances belong to the same network and reside in the same site.

     Note

    This requirement applies to all multinode cluster deployments. The Catalyst Center appliance does not support the distribution of nodes across multiple networks or sites.

  • Your cluster's round-trip time (RTT) is 10 milliseconds or less.
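
The requirements above lend themselves to a simple pre-deployment check. The following is a minimal sketch, not a Catalyst Center tool: the `Node` record, the `check_ha_requirements` function, the version labels, and the example addresses are all hypothetical, and "same network" is approximated here as "same IP subnet". The same-site requirement cannot be checked this way.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

MAX_RTT_MS = 10.0  # RTT limit stated in the requirements list


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one appliance (not a Catalyst Center structure)."""
    ip: str
    machine_profile: str
    version: str


def check_ha_requirements(nodes, cluster_subnet, rtt_ms):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(nodes) != 3:
        problems.append(f"expected 3 appliances, found {len(nodes)}")
    if len({n.machine_profile for n in nodes}) > 1:
        problems.append("appliances do not share one machine profile")
    if len({n.version for n in nodes}) > 1:
        problems.append("appliances run different Catalyst Center versions")
    # Assumption: "same network" is modeled as membership in one subnet.
    subnet = ip_network(cluster_subnet)
    outside = [n.ip for n in nodes if ip_address(n.ip) not in subnet]
    if outside:
        problems.append(f"addresses outside {subnet}: {', '.join(outside)}")
    if rtt_ms > MAX_RTT_MS:
        problems.append(f"RTT {rtt_ms} ms exceeds {MAX_RTT_MS} ms")
    return problems


if __name__ == "__main__":
    cluster = [
        Node("192.0.2.11", "t2_large", "release-A"),
        Node("192.0.2.12", "t2_large", "release-A"),
        Node("192.0.2.13", "t2_large", "release-A"),
    ]
    print(check_ha_requirements(cluster, "192.0.2.0/24", rtt_ms=4.0))
```

Returning a list of problems rather than a single pass/fail value makes it easier to report every unmet requirement at once.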

Supported appliances

This table lists the Catalyst Center appliances that support HA.

Catalyst Center appliances that support HA

| Machine profile | Machine profile alias | Cisco part number | Number of cores |
| --- | --- | --- | --- |
| medium | medium | Second-generation: DN2-HW-APL, DN2-HW-APL-U (promotional) | 44 |
| medium | medium | Third-generation: DN3-HW-APL, DN3-HW-APL-U (promotional) | 32 |
| t2_large | large | Second-generation: DN2-HW-APL-L, DN2-HW-APL-L-U (promotional) | 56 |
| t2_large | large | Third-generation: DN3-HW-APL-L, DN3-HW-APL-L-U (promotional) | |
| t2_2xlarge | extra large | Second-generation: DN2-HW-APL-XL, DN2-HW-APL-XL-U (promotional) | 112 |
| t2_2xlarge | extra large | Third-generation: DN3-HW-APL-XL, DN3-HW-APL-XL-U (promotional) | 80 |

 Important

Catalyst Center supports mixed three-node clusters that have HA enabled. A valid mixed cluster meets these requirements:

  • It consists of second- and third-generation Catalyst Center appliances. First-generation appliances are not supported.

  • Its three appliances have the same machine profile. For example, a cluster with two second-generation large appliances and one third-generation large appliance is a valid mixed cluster.
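
The mixed-cluster rules can be expressed as a short predicate. This is an illustrative sketch only; the function name and the `(generation, machine_profile)` tuple layout are assumptions made for the example.

```python
# Generations that may appear in a mixed cluster; first-generation is excluded.
MIXED_CLUSTER_GENERATIONS = {2, 3}


def is_valid_mixed_cluster(appliances):
    """Check the mixed-cluster rules for a list of (generation, machine_profile) tuples.

    A cluster whose appliances all share one generation is not "mixed",
    so this function returns False for it even though such a cluster can
    still be a valid HA cluster.
    """
    if len(appliances) != 3:
        return False
    generations = {generation for generation, _ in appliances}
    profiles = {profile for _, profile in appliances}
    return generations == MIXED_CLUSTER_GENERATIONS and len(profiles) == 1


if __name__ == "__main__":
    # Two second-generation large appliances plus one third-generation large appliance.
    print(is_valid_mixed_cluster([(2, "large"), (2, "large"), (3, "large")]))
```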

High availability functionality

Catalyst Center supports a three-node cluster configuration that provides both software and hardware HA:

  • Software HA: Restarts any services on a node that have failed. If a service fails, Catalyst Center restarts that service on the same node or a different cluster node.

  • Hardware HA: Uses multiple appliances, disk drives (within each appliance's RAID configuration), and power supplies to tolerate a failure by these components until the faulty component is restored or replaced.

 Important
  • Catalyst Center does not support a cluster with more than three nodes (such as a multinode cluster with five or seven nodes).

  • A three-node cluster's fault tolerance handles single-node failures. Catalyst Center provides HA across specific services even if a single node fails. If two nodes fail, Catalyst Center loses the quorum that is necessary to perform HA operations and the cluster breaks.
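
The quorum arithmetic behind this limit can be sketched as follows. The sketch assumes a simple majority quorum, which is the usual rule in distributed systems; this guide does not spell out the exact rule that Catalyst Center applies internally.

```python
def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, healthy_nodes: int) -> bool:
    """True while enough nodes are up to form a majority."""
    return healthy_nodes >= quorum_size(total_nodes)


def tolerated_failures(total_nodes: int) -> int:
    """Number of nodes that can fail before quorum is lost."""
    return total_nodes - quorum_size(total_nodes)


if __name__ == "__main__":
    for healthy in (3, 2, 1):
        print(healthy, "healthy nodes, quorum held:", has_quorum(3, healthy))
```

Under this assumption, a three-node cluster needs two healthy nodes, which is why it tolerates one node failure but not two.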

Clustering and database replication

Catalyst Center enables distributed processing and database replication among multiple nodes. Clustering enables HA by sharing resources and features.

Security replication

In a multinode environment, Catalyst Center replicates the security features of one node to two other nodes, including any X.509 certificates or trustpools. After joining these nodes to form a three-node cluster, Catalyst Center shares the GUI user credentials across the nodes. Catalyst Center does not share the CLI user credentials because each node has different credentials.

System upgrade

In a multinode cluster, you can trigger an upgrade of the whole cluster in the Catalyst Center GUI. The GUI represents the entire cluster, not just a single node. An upgrade that is triggered in the GUI automatically updates all cluster nodes.

 Note

After initiating a system upgrade to update Catalyst Center's core infrastructure, Catalyst Center enters maintenance mode. In maintenance mode, Catalyst Center is unavailable until the upgrade process completes. Take this into account when scheduling a system upgrade.

Confirm a successful system upgrade

You can confirm that your system upgrade was successful by completing these steps in the GUI.


Step 1

From the main menu, choose System > Software Updates > Updates.

Step 2

In the System Update area, verify that the latest system package is installed.


High availability deployment

This section provides best practices for deploying and administering an HA-enabled cluster in your production environment.

Deployment recommendations

Catalyst Center supports three-node clusters. The odd number of nodes provides the quorum that is necessary to perform any operation in a distributed system. Catalyst Center treats the three nodes as one logical entity, accessed through a virtual IP address, rather than as three separate nodes.

Follow these guidelines when deploying HA:

  • When setting up a three-node cluster, avoid spanning a LAN across slow links to prevent network failures and prolonged service recovery times. When configuring the cluster interface on a three-node cluster, ensure that all the cluster nodes reside in the same subnet.

  • Avoid overloading a single interface with management, data, and HA responsibilities, which can negatively impact HA operation. At a minimum, use the Cluster and Enterprise interfaces to keep cluster and enterprise traffic separate.

  • In the appliance configuration wizards, Catalyst Center prepopulates the Services Subnet and Cluster Services Subnet fields with link-local (169.254.x.x) subnets. We recommend that you use the default subnets. If you choose to specify different subnets, they must conform to the IETF RFC 1918 and 6598 specifications for private networks.

    For details, see RFC 1918 (Address Allocation for Private Internets) and RFC 6598 (IANA-Reserved IPv4 Prefix for Shared Address Space).

  • Enable HA during off-hours, because Catalyst Center will enter maintenance mode and be unavailable until it finishes redistributing services.
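
For the subnet guideline above, a candidate Services Subnet or Cluster Services Subnet can be checked against the RFC 1918 and RFC 6598 ranges before you enter it in the wizard. This is a minimal sketch using Python's standard `ipaddress` module; the function name and return labels are made up for the example, and the link-local range is taken to be 169.254.0.0/16.

```python
from ipaddress import ip_network

# RFC 1918 private ranges plus the RFC 6598 shared address space.
ALLOWED_RANGES = [
    ip_network("10.0.0.0/8"),
    ip_network("172.16.0.0/12"),
    ip_network("192.168.0.0/16"),
    ip_network("100.64.0.0/10"),
]
LINK_LOCAL = ip_network("169.254.0.0/16")  # assumed range of the default subnets


def classify_services_subnet(cidr: str) -> str:
    """Return 'link-local', 'private', or 'not allowed' for a CIDR string.

    Raises ValueError if the string is not a valid network (for example,
    if host bits are set).
    """
    subnet = ip_network(cidr, strict=True)
    if subnet.subnet_of(LINK_LOCAL):
        return "link-local"
    if any(subnet.subnet_of(allowed) for allowed in ALLOWED_RANGES):
        return "private"
    return "not allowed"


if __name__ == "__main__":
    for candidate in ("169.254.32.0/20", "10.60.0.0/21", "172.32.0.0/16"):
        print(candidate, "->", classify_services_subnet(candidate))
```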

Deploy a cluster

To deploy Catalyst Center on a three-node cluster with HA enabled, complete this procedure:


Step 1

Configure Catalyst Center on the first node in your cluster.

Refer to the Installation Guide topic that is specific to the configuration wizard you want to use and your appliance type:

  • If you are configuring an appliance using the Maglev configuration wizard, view the "Configure the primary node using the Maglev wizard" topic.

  • If you are configuring an appliance using the browser-based configuration wizard, view the "Configure the primary node using the Advanced Install Configuration wizard" topic.

Step 2

Configure Catalyst Center on the second node in your cluster.

Refer to the Installation Guide topic that is specific to the configuration wizard you want to use and your appliance type:

  • If you are configuring an appliance using the Maglev configuration wizard, view the "Configure a secondary node using the Maglev wizard" topic.

  • If you are configuring an appliance using the browser-based configuration wizard, view the "Configure a secondary node using the Advanced Install Configuration wizard" topic.

Step 3

Configure Catalyst Center on the third node in your cluster.

Refer to the secondary appliance configuration topic that you viewed in the previous step.

Step 4

Activate HA on your cluster:

  1. From the main menu, choose System > Settings > System Configuration > High Availability.

  2. Click Activate High Availability.

    After you click Activate High Availability in the GUI, Catalyst Center enters maintenance mode. In this mode, Catalyst Center is not available until the process completes, which can take several hours. Take this into account when scheduling an HA deployment.
 Note
  • Catalyst Center also goes into maintenance mode when you restore the database or perform a system upgrade (not a package upgrade).

  • To enable external authentication with an Authentication, Authorization, and Accounting (AAA) server in a three-node cluster environment, configure all the individual Catalyst Center node IP addresses and the virtual IP address for the three-node cluster on the AAA server.


Administer a cluster

The topics in this section cover the administrative tasks that you must complete when HA is enabled in your production environment.

Run maglev commands

To change the IP address, static route, DNS server, or maglev user password that is currently configured for a Catalyst Center appliance, run the sudo maglev-config update CLI command.

Common cluster node operations

This table describes the most common operations that you will complete when managing the nodes in your cluster.

 Important
  • Rebooting or shutting down two nodes simultaneously in a three-node cluster will break the cluster's quorum requirement.

  • Do not drain two nodes simultaneously.

Common cluster node operations
Task Action

From the CLI, shut down all nodes in a three-node cluster.

Run the sudo shutdown -h now command on all of the nodes at the same time.

When powering nodes back on, be sure to power on all nodes at the same time through Cisco IMC.

Shut down or disconnect one node for maintenance (in situations where you are not just rebooting the node).

Run these commands:

  1. maglev node drain node's-IP-address

  2. maglev node drain_history (to confirm that the node drained successfully)

  3. sudo shutdown -h now (run on the node you are shutting down)

After performing maintenance on the node, complete these steps:

  1. Log in to the Cisco IMC GUI as the Cisco IMC user.

  2. From the hyperlinked menu, choose Host Power > Power On to power on the node. It should take 30–45 minutes for the node to come back up.

  3. Run the magctl node display command and wait for the node's status to display as Ready.

  4. Run the maglev node allow node's-IP-address command.

  5. Run the magctl workflow status command and wait until its output indicates that the task you initiated in the previous step completed successfully before you proceed.

  6. Run the maglev service nodescale refresh command, which puts the node in maintenance mode.

     Note

    Instead of running the command, you can also perform these actions:

    1. From the Catalyst Center GUI, click the menu icon and choose System > Settings > System Configuration > High Availability.

    2. Click Activate High Availability.

Reboot one or more nodes after making changes that may require a reboot.

Run the sudo shutdown -r now command on the relevant nodes.

Prepare a node for Return Merchandise Authorization (RMA).

  1. Drain the node: maglev node drain node-IP-address

    To confirm that the node drained successfully, run the maglev node drain_history command.

  2. Shut down the node: sudo shutdown -h now

  3. Confirm that the node's status is listed as NotReady,SchedulingDisabled: magctl node display

  4. Remove the node from the cluster: maglev node remove node-IP-address

  5. Install the same Catalyst Center version that's already installed on the cluster's other two nodes.

  6. Add the node back to the cluster by configuring it as a secondary node (see the Installation Guide for your second- or third-generation appliance).

  7. Enable service distribution, which puts the node in maintenance mode: maglev service nodescale refresh

     Note

    Instead of running the command, you can also perform these actions:

    1. From the Catalyst Center GUI, click the menu icon and choose System > Settings > System Configuration > High Availability.

    2. Click Activate High Availability.

Update an appliance's Cisco IMC firmware.

Perform these actions:

  1. See the release notes for the Catalyst Center release that's installed on an appliance. In the release notes, the “Supported Firmware” section shows the Cisco IMC firmware version for your Catalyst Center release.

  2. See the Cisco Host Upgrade Utility User Guide for instructions on updating the firmware.

 Note

In a three-node cluster configuration, we recommend that you shut down all three nodes in the cluster before updating the Cisco IMC firmware. However, you can upgrade the cluster nodes individually if that's what you prefer. Follow the steps provided in this table to shut down one node or all nodes for maintenance.
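
The single-node maintenance sequence from the table can be captured as an ordered plan. The sketch below only builds the command strings listed in the table and does not run anything; the function names, the dictionary layout, and the example addresses are assumptions made for illustration.

```python
from ipaddress import ip_address


def can_drain(node_ip, drained_nodes):
    """Return True only if no other node is currently drained.

    Draining two nodes at once would break quorum in a three-node cluster.
    """
    return all(ip_address(other) == ip_address(node_ip) for other in drained_nodes)


def single_node_maintenance_plan(node_ip):
    """Return the CLI commands from the table, grouped by phase."""
    ip = ip_address(node_ip)  # raises ValueError on a malformed address
    return {
        "before_shutdown": [
            f"maglev node drain {ip}",
            "maglev node drain_history",  # confirm that the drain succeeded
            "sudo shutdown -h now",       # run on the node being shut down
        ],
        "after_power_on": [
            "magctl node display",        # wait until the node shows Ready
            f"maglev node allow {ip}",
            "magctl workflow status",     # wait for the allow task to succeed
            "maglev service nodescale refresh",
        ],
    }


if __name__ == "__main__":
    plan = single_node_maintenance_plan("192.0.2.11")
    for phase, commands in plan.items():
        print(phase)
        for command in commands:
            print("  ", command)
```

Keeping the plan as data makes it easy to review the order of operations before anyone types a command on a production node.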

Replace a failed node

To replace a node that has failed, complete these tasks:

  1. Remove the failed node from your cluster.

    See Remove the failed node.

  2. Replace the failed node with another node.

    See Add a replacement node.

Remove the failed node

When a node fails because of a hardware failure, remove it from the cluster. For assistance with this task, contact the Cisco TAC.

 Warning

A two-node cluster (a transient configuration that's not supported for normal use) results when one of these situations occurs:

  • During the initial formation of a three-node cluster, only two of the cluster nodes are available.

  • In an existing three-node cluster, one of the nodes has failed or is currently down.

While a two-node cluster is active, you cannot remove either of its nodes.

Add a replacement node

After removing the failed node, add a replacement node to the cluster. Set aside at least 30 minutes for this task.


Step 1

On the replacement node, install the same software version that the other nodes in the cluster are running.

  • If you are configuring an appliance using the Maglev Configuration wizard, use the wizard's Join a Catalyst Center Cluster option. See the "Configure a Secondary Node Using the Maglev Wizard" topic in the Installation Guide for your second- or third-generation appliance.

  • If you are configuring an appliance using the browser-based configuration wizard, use the wizard's Join an existing cluster option. See the "Configure a Secondary Node Using the Advanced Install Configuration Wizard" topic that's specific to your appliance in the Installation Guide.

 Important

In the Maglev Cluster Details screen (Maglev Configuration wizard) or the Primary Cluster Details screen (Advanced Install configuration wizard), enter the IP address configured for the Cluster port on either of the nodes that are still active.

Step 2

After the installation is complete, enter the magctl node display command.

The replacement node should show the Ready status.

Step 3

Redistribute services to the replacement node by activating HA on your cluster:

  1. From the main menu, choose System > Settings > System Configuration > High Availability.

  2. Click Activate High Availability.

Step 4

Verify that the services have been redistributed: magctl appstack status

The replacement node should show a Running status.


Minimize failure and outage impact

In a typical three-node Catalyst Center cluster, each node connects to a single cluster switch through the node’s cluster port interface. Connectivity with the cluster switch requires two transceivers and a fiber optic cable—any of which can fail. The cluster switch can also fail due to a power loss or a manual restart, causing an outage of your Catalyst Center cluster and a loss of all controller functionality.

To minimize the impact of a failure or outage on your cluster, follow at least one of these guidelines:

  • Perform management operations (such as software upgrades, configuration reloads, and power cycling) during noncritical periods, as these operations can cause a cluster outage.

  • Connect your cluster nodes to a switch that supports the in-service software upgrade (ISSU) feature. This feature allows you to upgrade the system software while the system continues to forward traffic, using nonstop forwarding (NSF) with stateful switchover (SSO) to perform software upgrades with no system downtime.

  • Connect your cluster nodes to a switch stack, which allows you to connect each cluster node to a different member of the switch stack joined using Cisco StackWise. Because the cluster is connected to multiple switches, the impact of one switch going down is mitigated.

High availability failure scenarios

Nodes can fail due to issues related to

  • software

  • network access, and

  • hardware.

When a failure occurs, Catalyst Center normally detects it within two minutes and attempts to resolve the failure automatically. Failures that persist for longer than two minutes can require user intervention.

This table describes failure scenarios that your cluster can encounter and how Catalyst Center responds in each scenario. Pay attention to the table's first column, which indicates the scenarios that require action from you to restore your cluster's operation.

 Note

For a cluster to operate, Catalyst Center's HA implementation requires at least two cluster nodes to be up at any given time.

High availability failure scenarios

Requires User Action

Failure Scenario

HA Behavior

Yes

Any node in the cluster goes down.

Immediately perform an Automation backup. See the "Backup and Restore" chapter in the Cisco Digital Network Architecture Center Administrator Guide.

No

A node fails, becomes unreachable, or experiences a service failure for less than two minutes.

  • The GUI is not accessible for two minutes after a node fails.

  • Services that were running on the failed node are not migrated to other nodes.

  • The northbound interface (NBI) remains usable on the remaining two nodes when using the virtual IP (VIP).

  • VIP connectivity is restored after failover, and the API calls recover after the services are up and running.

After the node is restored:

  • Data on the restored node is synced with other cluster members.

     Note

    Historical Assurance data is restored, but data that was modified or updated during the failover process is not.

  • Pending GUI and NBI calls that have not timed out are completed.

No

A node fails, becomes unreachable, or experiences a service failure for longer than two minutes.

  • Catalyst Center displays a status message indicating that connectivity with a node has been lost.

  • The GUI is accessible on the two remaining nodes using the VIP.

  • Services that were running on the failed node are migrated to other nodes.

  • The status of services running on the failed node may be set to either NodeLost or Unknown.

  • The NBI on the failed node is not accessible, while the NBI on the remaining two nodes remains operational.

After the node is restored, the following actions take place:

  • Catalyst Center displays a status message indicating that cluster operation has resumed.

  • Pending GUI calls that have not timed out are completed.

  • Service requests that were pending on the failed node are completed on the node that the service was migrated to.

  • Data on the restored node is synced with other cluster members.

  • Services that were running on the failed node are stopped.

  • All the service requests that were pending on the failed node are stopped.

  • Assurance GUI selections operate as expected.

Yes

Two nodes fail or are unreachable.

The cluster is broken, and the GUI is not accessible until connectivity is restored.

  • If the nodes recover, operations resume and the data shared by cluster members is synced.

  • If the nodes do not recover, contact the Cisco TAC for assistance.

Yes

A node fails and needs to be removed from a cluster.

Contact the Cisco TAC for assistance.

No

All the nodes lose connectivity with one another.

The GUI is not accessible until connectivity is restored. After connectivity is restored, operations resume and the data shared by cluster members is synced.

Yes

A backup is scheduled and a node goes down because of a hardware failure.

Contact the Cisco TAC for a replacement node, as well as assistance with joining the new node to the cluster and restoring services on the two remaining nodes.

Yes

A red banner in the GUI indicates that a node is down: "Assurance services are currently down. Connectivity with host <IP-address> has been lost."

The banner indicates that the node is down. As a result, Assurance data collection and processing stop, and data is not available. If the node comes back up, your Assurance functionality is restored. If the node went down because of a hardware failure, do the following:

  1. Remove the node that failed. Contact the Cisco TAC for assistance.

  2. Add a new node to replace the one that failed.

    See Add a replacement node.

Yes

A red banner in the GUI indicates that a node is down, but eventually changes to yellow, with this message: "This IP address is down."

The system is still usable. Investigate why the node is down, and bring it back up.

Yes

A failure occurs while upgrading a cluster.

Contact the Cisco TAC for assistance.

No

An appliance port fails.

  • Cluster port: Catalyst Center detects the failure within 5 minutes and times out the user. After 5 minutes, you should be able to log back in. A banner then appears, indicating the services that are currently unavailable. Service failover is completed within 10 minutes. The areas of the GUI that you can access depend on which services are restored. After the services that were unavailable are fully restored, the banner disappears.

  • Enterprise port: Catalyst Center might not be able to reach and manage your network.

  • Management port: Any upgrades and image downloads that are currently in progress fail, and northbound interface operations are affected.

Yes

Appliance hardware fails.

Replace the hardware component (such as a fan, power supply, or disk drive) that failed. Because multiple instances of these components are found in an appliance, the failure of one component can be tolerated temporarily.

While the RAID controller syncs a newly added disk drive with the other drives on the appliance, I/O performance might be degraded.
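
The node-failure rows of this table reduce to two inputs: how many nodes are down and how long the outage has lasted. The sketch below is a simplified model of those rows, not Catalyst Center logic. The `Outcome` names are invented for the example, and treating an outage of exactly two minutes as "long" is an assumption, because the table only covers "less than" and "longer than" two minutes.

```python
from enum import Enum

DETECTION_WINDOW_MIN = 2  # failures are normally detected within two minutes


class Outcome(Enum):
    NO_IMPACT = "all nodes are up"
    WAIT = "short outage: services stay on the failed node"
    MIGRATE = "long outage: services migrate to the remaining nodes"
    QUORUM_LOST = "cluster broken: restore connectivity or contact the Cisco TAC"


def classify_failure(nodes_down: int, outage_minutes: float) -> Outcome:
    """Map a failure in a three-node cluster to the behavior the table describes."""
    if nodes_down <= 0:
        return Outcome.NO_IMPACT
    if nodes_down >= 2:
        return Outcome.QUORUM_LOST
    # Assumption: an outage of exactly two minutes counts as a long outage.
    if outage_minutes < DETECTION_WINDOW_MIN:
        return Outcome.WAIT
    return Outcome.MIGRATE


if __name__ == "__main__":
    print(classify_failure(nodes_down=1, outage_minutes=5).value)
```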

Service redistribution times

This table shows the time required to redistribute services to two nodes in an HA environment.

Service redistribution times
| Event | Duration of node unavailability |
| --- | --- |
| Catalyst Center detects a node is down | Two minutes |
| Start of service redistribution | One minute |
| Completion of service redistribution | 25 minutes |

The time required to redistribute services varies. It depends on the number of pods in the node that went down, as well as the time required to relocate those pods to other nodes. The time required for services to reach the Ready state after redistribution depends on their respective liveness and readiness probes.

Service distribution is not affected by a cluster’s VIP, as no pods are mapped to it, and Kubernetes is unaware of this VIP. However, if the node that went down was the leader for the controller manager or kube-scheduler service, this can slightly delay redistribution. Before a new leader can take over and perform the switchover, either

  • the lease held by the leader must expire, or

  • the renewal deadline must pass.
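
To illustrate why a lost leader delays redistribution, the sketch below models lease expiry only, from the point of view of a standby instance. It is a simplification with hypothetical values: the function names are invented, the 15-second lease is an example figure rather than a Catalyst Center setting, and the renewal-deadline path is not modeled.

```python
def lease_expired(now: float, last_renewal: float, lease_duration: float) -> bool:
    """True once the time since the last renewal reaches the lease duration."""
    return now - last_renewal >= lease_duration


def seconds_until_takeover(now: float, last_renewal: float, lease_duration: float) -> float:
    """Remaining wait before a standby could claim leadership (never negative)."""
    return max(0.0, lease_duration - (now - last_renewal))


if __name__ == "__main__":
    # Hypothetical timeline: the leader last renewed at t=100 s, then its node went down.
    lease = 15.0
    for now in (105.0, 115.0, 120.0):
        print(now, lease_expired(now, 100.0, lease), seconds_until_takeover(now, 100.0, lease))
```

In this model, a standby that starts waiting right after the leader's last renewal has to sit out the full lease before it can act, which is the source of the delay.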

Pod behavior during a node failure

This topic describes how pods move if a node fails and becomes unreachable or experiences a service failure that lasts longer than five minutes:

  • StatefulSet: The pod provides data storage. This type of pod is node-bound, using a local persistent volume (LPV). When the node is down, all the StatefulSets on that node move to the Pending state.

    Examples of StatefulSet pods include

    • MongoDB

    • Elasticsearch, and

    • PostgreSQL.

  • DaemonSet: By design, the pod is strictly node-bound.

    Examples of DaemonSet pods include

    • agent

    • broker-agent, and

    • keepalived.

  • Stateless deployment:

    • The pod, which does not have its own datastore, uses a StatefulSet for data storage and retrieval.

    • Deployment scale varies. Some deployments have one pod instance, such as spf-service-manager-service; some have two pod instances, such as apic-em-inventory-manager-service; and some have three pod instances, such as kong, platform-ui, and collector-snmp.

    • The single-instance stateless pods are free to move across nodes based on the current state of the cluster.

    • The two-instance stateless pods can move across nodes, but both instances cannot run on the same node.

    • The three-instance stateless pods have node anti-affinity, meaning that no two instances can run on the same node.
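
The placement rules for stateless deployments can be sketched as a small rescheduling function. This is an illustrative model, not the Kubernetes scheduler: the function name and node names are hypothetical, and the result for three-instance deployments (the displaced instance stays unplaced while only two nodes are healthy) is an inference from the anti-affinity rule rather than a statement from this guide.

```python
def reschedule_after_failure(placements, nodes, failed_node):
    """Re-place the instances of one stateless deployment after a node failure.

    placements: the node name hosting each instance.
    Returns the new node per instance; None marks an instance that cannot
    be placed without putting two instances on the same node.
    """
    healthy = [node for node in nodes if node != failed_node]
    occupied = {node for node in placements if node != failed_node}
    new_placements = []
    for current in placements:
        if current != failed_node:
            new_placements.append(current)  # instance is unaffected
            continue
        free = [node for node in healthy if node not in occupied]
        target = free[0] if free else None
        if target is not None:
            occupied.add(target)
        new_placements.append(target)
    return new_placements


if __name__ == "__main__":
    nodes = ["node-1", "node-2", "node-3"]
    print(reschedule_after_failure(["node-1"], nodes, "node-1"))
    print(reschedule_after_failure(["node-1", "node-2"], nodes, "node-1"))
    print(reschedule_after_failure(["node-1", "node-2", "node-3"], nodes, "node-1"))
```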