This chapter describes high availability (HA) concepts and features for Cisco NX-OS software and includes the following sections:
•Information About High Availability
•Service-Level High Availability
•System-Level High Availability
•Network-Level High Availability
•VSM to VSM Heartbeat
Information About High Availability
The purpose of High Availability (HA) is to limit the impact of failures—both hardware and software— within a system. The Cisco NX-OS operating system is designed for high availability at the network, system, and service levels.
The following Cisco NX-OS features minimize or prevent traffic disruption in the event of a failure:
•Redundancy— redundancy at every aspect of the software architecture.
•Isolation of processes— isolation between software components to prevent a failure within one process disrupting other processes.
•Restartability—Most system functions and services are isolated so that they can be restarted independently after a failure while other services continue to run. In addition, most system services can perform stateful restarts, which allow the service to resume operations transparently to other services.
•Supervisor stateful switchover— Active/standby dual supervisor configuration. The state and configuration remain constantly synchronized between two Virtual Supervisor Modules (VSMs) to provide a seamless and stateful switchover in the event of a VSM failure.
The Cisco Nexus 1000V system is made up of the following:
•One or two VSMs that run within Virtual Machines (VMs).
•Virtual Ethernet Modules (VEMs) that run within virtualization servers. VEMs are represented as modules within the VSM.
•A remote management component. For example, the VMware vCenter Server is a remote management component.
Figure 1-1 shows the HA components and the communication links between them.
Figure 1-1 HA Components and Communication Links
Service-Level High Availability
The Cisco NX-OS software compartmentalizes processes for fault isolation, redundancy, and efficiency.
This section includes the following topics:
•Isolation of Processes
For additional details about service-level HA, see Chapter 2 "Understanding Service-Level High Availability."
Isolation of Processes
The Cisco NX-OS software has independent processes, known as services, that perform a function or set of functions for a subsystem or feature set. Each service and service instance runs as an independent, protected process. This way of operating provides a highly fault-tolerant software infrastructure and fault isolation between services. A failure in a service instance will not affect any other services running at that time. Additionally, each instance of a service can run as an independent process, which means that two instances of a routing protocol can run as separate processes.
Cisco NX-OS processes run in a protected memory space independently of each other and the kernel. This process isolation provides fault containment and enables rapid restarts. Process restartability ensures that process-level failures do not cause system-level failures. In addition, most services can perform stateful restarts, which allows a service that experiences a failure to be restarted and to resume operations transparently to other services within the platform and to neighboring devices within the network.
System-Level High Availability
The Cisco Nexus 1000V supports redundant VSM virtual machines—a primary and a secondary—running as an HA pair. Dual VSMs operate in an active/standby capacity in which only one of the VSMs is active at any given time, while the other acts as a standby backup. The VSMs are configured as either primary or secondary as a part of the Cisco Nexus 1000V installation. The state and configuration remain constantly synchronized between the two VSMs to provide a stateful switchover if the active VSM fails.
For more information about system-level high availability, see the "Configuring System-Level High Availability" section.
Network-Level High Availability
The Cisco Nexus 1000V high availability at the network level includes port channels and the Link Aggregation Control Protocol (LACP). A port channel bundles physical links into a channel group to create a single logical link that provides the aggregate bandwidth of up to eight physical links. If a member port within a port channel fails, the traffic previously carried over the failed link switches to the remaining member ports within the port channel.
Additionally, LACP lets you configure up to 16 interfaces into a port channel. A maximum of eight interfaces can be active, and a maximum of eight interfaces can be placed in a standby state.
For additional information about port channels and LACP, see the Cisco Nexus 1000V Layer 2 Switching Configuration Guide, Release 4.2(1)SV1(5.1).
VSM to VSM Heartbeat
This section includes the following topics:
•Control and Management Interface Redundancy
•Loss of Communication
The primary and secondary VSM use a VSM to VSM heartbeat to do the following within their domain:
•Broadcast their presence.
•Detect the presence of another VSM.
•Negotiate active and standby redundancy states.
When a VSM first boots up, it broadcasts discovery frames to the domain to detect the presence of another VSM. If no other VSM is found, the booting VSM becomes active. If another VSM is found to be active, the booting VSM becomes standby. If another VSM is found to be initializing (for example, during a system reload), the primary VSM has priority over the secondary to become active.
After the initial contact and role negotiation, the active and standby VSMs unicast the following in heartbeat messages at one-second intervals:
•Control flags requesting action by the other VSM
The remainder of this section describe how the VSM to VSM heartbeat mechanism works including observance of the following intervals.
Interval at which heartbeat requests are sent.
Interval after which missed heartbeats indicate degraded communication on the control interface so that heartbeats are also sent on the management interface.
Interval of communication loss after which the active VSM instructs the standby to reload.
Control and Management Interface Redundancy
If the active VSM does not receive a heartbeat response over the control interface for a period of 3 seconds, then communication is seen as degraded and the VSM also begins sending requests over the management interface. In this case, the management interface provides redundancy only in the sense that it acts to prevent both VSMs from becoming active, also called an active-active or split-brain situation. The communication is not fully redundant, however, because the management interface only handles heartbeat requests and responses. AIPC and the synchronization of data between VSMs is done through the control interface only.
If communication over the control interface degrades sufficiently, the two VSMs lose synchronization which can only be restored by reloading the standby VSM. When this communication between VSMs is interrupted for over six seconds, but the VSMs can still communicate over the management interface, then the active VSM instructs the standby to reload. If communication remains degraded, the standby VSM may reload multiple times before control communication is restored.
Loss of Communication
When there is no communication between redundant VSMs or Cisco Nexus 1010s, neither can detect the presence of the other. The active drops the standby. The standby interprets the lack of response as a sign that the active has failed and it also becomes active. This is what is referred to as active-active or split-brain, as both are trying to control the system by connecting to Virtual Center and communicating with the VEMs.
Since redundant VSMs or Cisco Nexus 1010s use the same IP address for their management interface, remote SSH/Telnet connections may fail, as a result of the path to this IP address changing in the network. For this reason, it is recommended that you use the consoles during a split-brain conflict.
During a split-brain conflict, avoid making changes to the configuration. When control and management communication is restored between the redundant devices, the split-brain conflict is resolved by reloading the primary so that it comes back in standby mode and then synchronizes with the secondary. Changes made to a primary configuration during a split-brain conflict are lost after the reload.
If you need to make changes to the primary during a split-brain conflict, then shut down the secondary. This prevents the two devices from resuming contact which eventually results in a reload. In this case, changes you make to the primary during the split-brain conflict are retained.
Note After a split-brain condition is resolved and the primary and secondary are communicating again, the configuration at the secondary takes precedence. Use care if changes are made when both are active.
Note A transition from active to standby always requires a reload in both the Cisco Nexus 1000V and the Cisco Nexus 1010.
VSM-VEM Communication Loss
Depending on the specific network failure that caused it, each VSM may reach a different, possibly overlapping, subset of VEMs. When the VSM that was in standby state becomes a new active, it broadcasts a request to all VEMs to switch to it as the current active device. Whether a VEM switches to the new active VSM or not, depends on the following:
•The connectivity between each VEM and the two VSMs.
•Whether the VEM receives the request to switch.
A VEM remains attached to the original active VSM even if it receives heartbeats from the new active. However, if it also receives a request to switch from the new active, it detaches from the original active and attaches to the new one.
If a VEM loses connectivity to the original active device and only receives heartbeats from the new one, it ignores those heartbeats until it goes into headless mode. This occurs approximately 15 seconds after it stops receiving heartbeats from the original active. At that point, the VEM attaches to the new active if it has connectivity to it.
If there is a network communication failure where the standby VSM recieves heartbeat requests but the active does not receive a response, the following occurs:
•The active VSM declares that the standby VSM is not present.
•The standby VSM remains in standby state and continues receiving heartbeats from the active VSM.
The redundancy state is inconsistent (show system redundancy state) and the two VSMs lose synchronization.
When two-way communication is resumed, the standby VSM detects the inconsistency and reloads.
Note If a one-way communication failure occurs in the active to standby direction, it is equivalent to a total loss of communication. This is because a standby VSM only sends heartbeats in response to active VSM requests.
Before configuring the Cisco Nexus 1000V, Cisco recommends that you read and become familiar with the following documentation:
•Cisco Nexus 1000V Installation and Upgrade Guide, Release 4.2(1)SV1(5.1)
•Cisco Nexus 1000V Port Profile Configuration Guide, Release 4.2(1)SV1(5.1)
•Cisco VN-Link: Virtualization-Aware Networking white paper