The documentation set for this product strives to use bias-free language. For the purposes of this documentation set, bias-free is defined as language that does not imply discrimination based on age, disability, gender, racial identity, ethnic identity, sexual orientation, socioeconomic status, and intersectionality. Exceptions may be present in the documentation due to language that is hardcoded in the user interfaces of the product software, language used based on RFP documentation, or language that is used by a referenced third-party product.
Virtual Ethernet Module (VEM)—A software component that is deployed on each kernel-based virtual machine (VM) host. Each VM on the host is connected to the VEM through virtual Ethernet (vEth) ports.
Virtual Supervisor Module (VSM)—The management component that controls multiple VEMs and helps in the definition of VM-focused network policies. It is deployed either as a virtual appliance on any KVM host or on the Cisco Cloud Services Platform appliance.
Using the Cisco Nexus 1000V for KVM VSM, you create port profiles (called policy profiles in OpenStack), which define the port policy information, including security settings.
When a VM is deployed, a port profile is dynamically created on the Cisco Nexus 1000V for KVM for each unique combination of policy port profile and network segment. All other VMs deployed with the same policy to this network reuse this dynamic port profile.
Note | You must use OpenStack consistently for all VM network and subnet configuration. If you use both OpenStack and the VSM to configure VM networks and subnets, the OpenStack and VSM configurations can fall out of sync, which can result in faulty or inoperable network deployments. |
The purpose of high availability (HA) is to limit the impact of failures—both hardware and software— within a system. The Cisco NX-OS operating system is designed for high availability at the network, system, and service levels.
The following Cisco NX-OS features minimize or prevent traffic disruption in the event of a failure:
Redundancy—Redundancy at every level of the software architecture.
Isolation of processes—Isolation between software components that prevents a failure within one process from disrupting other processes.
Restartability—Most system functions and services are isolated so that they can be restarted independently after a failure while other services continue to run. In addition, most system services can perform stateful restarts, which allow the service to resume operations transparently to other services.
Service-Level High Availability
The Cisco NX-OS software has independent processes, known as services, that perform a function or set of functions for a subsystem or feature set. Each service and service instance runs as an independent, protected process. This way of operating provides a highly fault-tolerant software infrastructure and fault isolation between services. A failure in a service instance does not affect any other services that are running at that time. Additionally, each instance of a service can run as an independent process, which means that two instances of a routing protocol can run as separate processes.
Cisco NX-OS processes run in a protected memory space independently of each other and the kernel. This process isolation provides fault containment and enables rapid restarts. Process restartability ensures that process-level failures do not cause system-level failures. In addition, most services can perform stateful restarts. These stateful restarts allow a service that experiences a failure to be restarted and to resume operations transparently to other services within the platform and to neighboring devices within the network.
The Cisco Nexus 1000V supports redundant VSM virtual machines—a primary and a secondary—running as an HA pair. Dual VSMs operate in an active/standby capacity in which only one of the VSMs is active at any given time, while the other acts as a standby backup. The VSMs are configured as either primary or secondary as a part of the Cisco Nexus 1000V installation.
The state and configuration remain constantly synchronized between the two VSMs to provide a stateful switchover if the active VSM fails.
The Cisco Nexus 1000V high availability at the network level includes port channels and the Link Aggregation Control Protocol (LACP). A port channel bundles physical links into a channel group to create a single logical link that provides the aggregate bandwidth of up to eight physical links. If a member port within a port channel fails, the traffic previously carried over the failed link switches to the remaining member ports within the port channel.
Additionally, LACP allows you to configure up to 16 interfaces into a port channel. A maximum of eight interfaces can be active, and a maximum of eight interfaces can be placed in a standby state.
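The member-count limits described above can be sketched as a small partitioning function. This is an illustrative model only, not product code; the function and interface names are assumptions:

```python
# Sketch of the LACP member limits described above: a port channel
# accepts up to 16 interfaces, of which at most 8 are active and the
# remainder (up to 8) are held in a standby state.

MAX_MEMBERS = 16
MAX_ACTIVE = 8

def select_lacp_members(interfaces):
    """Partition candidate member ports into active and standby sets."""
    if len(interfaces) > MAX_MEMBERS:
        raise ValueError("a port channel accepts at most 16 interfaces")
    active = interfaces[:MAX_ACTIVE]    # carry traffic
    standby = interfaces[MAX_ACTIVE:]   # take over if an active member fails
    return active, standby

active, standby = select_lacp_members([f"Eth{n}" for n in range(1, 13)])
```

With 12 member interfaces, the first 8 become active and the remaining 4 stay in standby, which is how the aggregate bandwidth of up to eight physical links is maintained through a member failure.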
The primary and secondary VSM use a VSM to VSM heartbeat to do the following within their domain:
Broadcast their presence
Detect the presence of another VSM
Negotiate active and standby redundancy states
When a VSM first boots up, it broadcasts discovery frames to the domain to detect the presence of another VSM. If no other VSM is found, the booting VSM becomes active. If another VSM is found to be active, the booting VSM becomes the standby VSM. If another VSM is found to be initializing (for example, during a system reload), the primary VSM has priority over the secondary to become the active VSM.
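The boot-time discovery and role negotiation above can be summarized as a small decision function. This is a simplified model of the stated behavior; the state and parameter names are illustrative, not part of the product:

```python
def negotiate_role(peer_state, local_role):
    """Return the redundancy state a booting VSM takes.

    peer_state: None (no other VSM found), "active", or "initializing".
    local_role: "primary" or "secondary", assigned at installation.
    """
    if peer_state is None:
        return "active"          # no other VSM in the domain
    if peer_state == "active":
        return "standby"         # defer to the already-active VSM
    # Both VSMs initializing (for example, during a system reload):
    # the primary has priority to become active.
    return "active" if local_role == "primary" else "standby"
```

For example, a secondary VSM that boots into a domain where the primary is still initializing comes up in the standby state.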
Note | From SV3(1.1) onwards, the VSM validates the source MAC address of the High Availability (HA) packets that it receives on the control and management interfaces. During initial contact, the VSM learns the peer VSM MAC addresses and stores them in a permanent location; only HA packets from a learned VSM are accepted. |
After the initial contact and role negotiation, the active and standby VSMs unicast the following in heartbeat messages:
The following intervals apply when sending heartbeat messages:

Interval | Description |
---|---|
1 second | Interval at which heartbeat requests are sent. |
6-30 seconds (default is 15 seconds) | Interval after which missed heartbeats indicate degraded communication on the control interface, so that heartbeats are also sent on the management interface. The default is 15 seconds; however, you can configure the interval to be between 6 and 30 seconds. |
Varies | The standby VSM resets automatically when the active VSM is no longer able to synchronize with it. The interval varies depending on how long the active VSM can buffer the data to be synchronized while communication is interrupted. |
The VSM communicates with the peer VSM over Layer 2 only, on the control and management interfaces. If the active VSM does not receive a heartbeat response over the control interface for half of the inter-VSM maximum heartbeat loss interval (eight heartbeats by default), communication is considered degraded, and the VSM begins sending requests over the management interface in addition to the control interface. In this case, the management interface provides redundancy by preventing both VSMs from becoming active, a condition known as an active-active or split-brain situation.
Note | The communication is not fully redundant, however, because the management interface only handles heartbeat requests and responses. |
AIPC and the synchronization of data between VSMs is done through the control interface only.
The secondary VSM is not immediately rebooted when communication over the control interface is interrupted because the HA mechanism tolerates brief interruptions in communication. When communication is first interrupted on the control interface, the heartbeat messages are sent over the management interface. If communication over the management interface is successful, the VSMs enter into a degraded mode, as displayed in the show system internal redundancy trace command output. If communication is interrupted on both interfaces for too long, the two VSMs get out of synchronization and the standby VSM is forced to reboot.
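The escalation from healthy communication to degraded mode to a forced standby reboot can be sketched as a classification of missed heartbeats. This is an illustrative timing model under the default values described above; the function name, state names, and threshold arithmetic are assumptions, not product internals:

```python
def heartbeat_state(missed_secs_control, missed_secs_mgmt,
                    max_loss_interval=15):
    """Classify inter-VSM communication from seconds of missed heartbeats.

    max_loss_interval is the configurable 6-30 second interval
    (default 15 seconds); degraded mode begins at roughly half of it
    (about eight 1-second heartbeats by default).
    """
    degraded_threshold = max_loss_interval / 2
    if missed_secs_control < degraded_threshold:
        return "healthy"         # control interface alone is sufficient
    if missed_secs_mgmt < degraded_threshold:
        return "degraded"        # heartbeats also sent over management
    return "standby-reboot"      # both paths lost: standby VSM is reset
```

With the defaults, eight seconds of silence on the control interface alone yields degraded mode, while silence on both interfaces leads to the standby VSM being forced to reboot.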
Note | A transition from active to standby always requires a reload in both the Cisco Nexus 1000V and the Cisco Nexus Cloud Services Platform. |
When there is no communication between redundant VSMs or Cisco Nexus Cloud Services Platforms, they cannot detect the presence of each other. The standby VSM is removed from the list of inserted modules on the active VSM. The standby VSM interprets the lack of heartbeats as a sign that the active VSM has failed, so it also becomes active. This condition is referred to as active-active or split-brain, because both VSMs try to control the system by connecting to and communicating with the VEMs.
Because redundant VSMs or Cisco Nexus Cloud Services Platforms use the same IP address for their management interface, remote Secure Shell (SSH)/Telnet connections might fail because the path to this IP address changes in the network. For this reason, we recommend that you use the consoles during a split-brain conflict.
The following parameters are used to select the VSM to be rebooted during the split-brain resolution: the module count, the last configuration time, and the last active time.
Depending on the specific network failure that caused it, each VSM might reach a different, possibly overlapping, subset of Virtual Ethernet Modules (VEMs). When the VSM that was in the standby state becomes a new active VSM, it broadcasts a request to all VEMs to switch to it as the current active device. Whether a VEM switches to the new active VSM depends on the following:
A VEM remains attached to the original active VSM even if it receives heartbeats from the new active VSM. However, if the VEM also receives a request to switch from the new active VSM, it detaches from the original active VSM and attaches to the new VSM.
If a VEM loses connectivity to the original active device and only receives heartbeats from the new one, it ignores those heartbeats until it goes into headless mode, which occurs approximately 15 seconds after it stops receiving heartbeats from the original active VSM. At that point, the VEM attaches to the new active VSM if it has connectivity to it.
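The VEM attachment rules above reduce to a small decision function. This is a simplified sketch of the described behavior; the parameter names and the exact timeout handling are assumptions:

```python
HEADLESS_TIMEOUT = 15   # approximate seconds before a VEM goes headless

def vem_next_attachment(hears_original, hears_new_active,
                        switch_request_from_new, secs_without_original):
    """Decide which active VSM a VEM attaches to during split-brain."""
    if hears_original:
        # Heartbeats from the new active VSM alone are not enough;
        # the VEM switches only on an explicit request to switch.
        return "new" if switch_request_from_new else "original"
    if secs_without_original >= HEADLESS_TIMEOUT and hears_new_active:
        return "new"        # headless VEM adopts the reachable active VSM
    return "original"       # still waiting for the original active VSM
```

For example, a VEM that can still hear the original active VSM stays attached to it even while receiving heartbeats from the new one, unless the new VSM explicitly requests the switch.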
If a network communication failure occurs where the standby VSM receives heartbeat requests but the active VSM does not receive a response, the following occurs:
The active VSM declares that the standby VSM is not present.
The standby VSM remains in a standby state and continues receiving heartbeats from the active VSM.
In this scenario, the redundancy state is inconsistent (show system redundancy status), and the two VSMs lose synchronization. When two-way communication is resumed, the standby VSM replies to the active VSM and asks to be reset.
Note | If a one-way communication failure occurs in the active to standby direction, it is equivalent to a total loss of communication because a standby VSM sends heartbeats only in response to active VSM requests. |
When the connectivity between two Virtual Supervisor Modules (VSMs) is broken, this loss of communication can cause both VSMs to take the active role. This condition is called an active-active or split-brain condition. When communication is restored, the VSMs exchange information to decide which one, if rebooted, would have the lesser impact on the system.
Both primary and secondary VSMs process the same data to select the VSM (primary or secondary) that needs to be rebooted. When the selected VSM is rebooted and attaches itself back to the system, high availability is back to normal. The VSM uses the following parameters in order of precedence to select the VSM to be rebooted during the split-brain resolution:
Module count—The number of modules that are attached to the VSM.
Last configuration time—The time when the last configuration was made on the VSM.
Last standby-active switch—The time when the VSM last switched from the standby state to the active state. (The VSM with a longer active time gets higher priority.)
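The precedence order above can be modeled as a lexicographic comparison. This is an illustrative sketch, not the product algorithm: the field names are invented, and the assumption that a higher module count and a more recent configuration favor keeping a VSM running (while a longer active time, that is, an earlier standby-to-active switch, also favors it) follows from the description above but is not spelled out in full by the product documentation:

```python
def select_vsm_to_reboot(primary, secondary):
    """Pick which VSM is rebooted during split-brain resolution.

    Each argument is a dict with the three tie-break parameters,
    compared in order of precedence.
    """
    def keep_score(vsm):
        return (
            vsm["module_count"],              # more attached modules wins
            vsm["last_config_time"],          # more recent configuration wins
            -vsm["last_standby_active_switch"],  # earlier switch = longer active
        )
    # The VSM with the lower score is the one rebooted.
    if keep_score(primary) >= keep_score(secondary):
        return "secondary"
    return "primary"
```

With this model, a primary VSM holding five modules survives against a secondary holding three, regardless of the lower-precedence timestamps.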
During the split-brain resolution, when a VSM reboots, the accounting logs that are stored on the VSM are lost. You can display the accounting logs that were backed up during the split-brain resolution. You can also check the redundancy traces that are stored on the local and remote VSMs.
Command | Purpose |
---|---|
show system internal active-active accounting logs |
Displays the accounting logs that are stored on a local VSM during the last split-brain resolution. |
show system internal active-active redundancy traces |
Displays the redundancy traces that are stored on a local VSM during the last split-brain resolution. |
show system internal active-active remote accounting logs |
Displays the remote accounting logs that are stored on a remote VSM during the last split-brain resolution. |
show system internal active-active remote redundancy traces |
Displays the remote redundancy traces that are stored on a remote VSM during the last split-brain resolution. |
clear active-active accounting logs |
Clears the accounting logs that are stored on a local VSM during the split-brain resolution. |
clear active-active redundancy traces |
Clears the redundancy traces that are stored on a local VSM during the split-brain resolution. |
clear active-active remote accounting logs |
Clears the remote accounting logs that are stored on a remote VSM during the split-brain resolution. |
clear active-active remote redundancy traces |
Clears the remote redundancy traces that are stored on a remote VSM during the split-brain resolution. |
In the Cisco Nexus 1000V, if a secondary VSM is configured or installed with the same role as the primary VSM and with the same domain ID, the secondary VSM and the primary VSM exchange heartbeats to discover each other. Both VSMs detect and report a role collision when they exchange heartbeats. When a collision is detected, the VSMs report the MAC address of the VSM with which the local VSM is colliding.
Due to this issue, the HA-paired VSM cannot communicate with the correct VSM. This problem can occur on a primary VSM or a secondary VSM depending on whether the newly configured or the installed VSM has the primary or the secondary role assigned to it.
The collisions are detected on the control and the management interfaces. The maximum number of colliding VSMs reported is eight.
Note | After the eighth role collision, the problem is still logged and the MAC address entry is overwritten. The show system redundancy status command displays the overwrite details. |
Note | The colliding VSMs might also report a collision detection from the original VSM. If the colliding VSMs use the same IP address for their management interfaces, the remote SSH/Telnet connections might fail. Therefore, we recommend that you use the consoles during a role collision detection. |
Enter the show system redundancy status command on both the primary and secondary VSM consoles to display the MAC addresses of any detected VSMs with the same role and domain ID. When a VSM stops communicating in the domain, its collision time is no longer updated. One hour after the last collision, the collision MAC entries are removed.
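The collision-table behavior described above (at most eight reported entries, overwriting after the eighth, and removal one hour after the last collision) can be sketched as a small aging table. This is a toy model only; which existing entry is overwritten when the table is full is an assumption, as are the class and method names:

```python
MAX_ENTRIES = 8          # maximum number of colliding VSMs reported
ENTRY_TTL = 3600         # entries removed one hour after the last collision

class CollisionTable:
    """Toy model of the colliding-VSM MAC table described above."""

    def __init__(self):
        self.entries = {}    # MAC address -> latest collision timestamp

    def record(self, mac, now):
        if mac not in self.entries and len(self.entries) >= MAX_ENTRIES:
            # After the eighth collision, an existing entry is
            # overwritten (here: the stalest one, as an assumption).
            oldest = min(self.entries, key=self.entries.get)
            del self.entries[oldest]
        self.entries[mac] = now

    def expire(self, now):
        # Drop entries whose last collision is more than an hour old.
        self.entries = {m: t for m, t in self.entries.items()
                        if now - t < ENTRY_TTL}
```

Recording a ninth distinct MAC address overwrites an existing entry, and expiring the table an hour after the last collision empties it, matching the aging behavior described above.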
Use the show system redundancy status CLI command to display the VSM role collision:
Command | Purpose |
---|---|
show system redundancy status |
Displays a detected role collision. A warning is highlighted in the CLI output. Along with the MAC addresses, the latest collision time is also displayed in the output. If no collisions are detected, the highlighted output does not appear. |
This example shows how to display a detected role collision:
switch# show system redundancy status
Redundancy role
---------------
      administrative:   secondary
         operational:   secondary
Redundancy mode
---------------
      administrative:   HA
         operational:   HA
This supervisor (sup-2)
-----------------------
    Redundancy state:   Active
    Supervisor state:   Active
      Internal state:   Active with HA standby
Other supervisor (sup-1)
------------------------
    Redundancy state:   Standby
    Supervisor state:   HA standby
      Internal state:   HA standby

WARNING! Conflicting sup-2(s) detected in same domain
-----------------------------------------------------
MAC                  Latest Collision Time
00:50:56:97:02:3b    2012-Sep-11 18:59:17
00:50:56:97:02:3c    2012-Sep-11 18:59:17
00:50:56:97:02:2f    2012-Sep-11 18:57:42
00:50:56:97:02:35    2012-Sep-11 18:57:46
00:50:56:97:02:29    2012-Sep-11 18:57:36
00:50:56:97:02:30    2012-Sep-11 18:57:42
00:50:56:97:02:36    2012-Sep-11 18:57:46
00:50:56:97:02:2a    2012-Sep-11 18:57:36

NOTE: Please run the same command on sup-1 to check for conflicting (if any) sup-1(s) in the same domain.