The Virtualized Multiservice Data Center (VMDC) 2.3 design provides a number of High Availability (HA) features and is a highly resilient network. The following sections provide an overview of network resiliency and summarize the validation results for convergence around service-impacting failures, as tested in the lab configuration.
This section presents the following topics:
• Resiliency Against Link and Node Failure
HA has different aspects that are implemented at different layers in the network. The VMDC 2.3 design has no single point of failure, and service-impacting failures are minimized by ensuring quick convergence around the failing link or node. Converging around a failing link or node may be required as part of planned maintenance or as the result of an unplanned failure event. Planned events are most commonly performed to upgrade software on various nodes in the Data Center (DC), to perform maintenance on power plants, or to address facilities issues.
In VMDC 2.3, the network portion has dual paths, with two nodes supporting each path, in an active/active configuration; load balancing of traffic is achieved by using Border Gateway Protocol (BGP). During maintenance events in which one node is taken down, traffic and services continue to be provided over the other path. There can, however, be local congestion during such events, because one node going down causes all traffic to use the other path. For example, when a Provider Edge (PE) node is down, all traffic uses the surviving PE and WAN link, which reduces the bandwidth available for the entire DC by half. This can be avoided by using dual-redundant Route Processors (RPs) and Embedded Services Processors (ESPs) on the ASR 1006 DC-PE, and by using dual supervisors on the Nexus 7004 DC-Agg nodes, which is our recommendation. In addition to enabling In Service Software Upgrade (ISSU), this redundancy ensures that any unexpected failure of an RP or supervisor causes an automatic switchover to the redundant RP/supervisor, so forwarding is minimally impacted. Similarly, it is highly recommended to deploy the other services appliances and compute infrastructure, as well as the Nexus 1000V Virtual Supervisor Module (VSM) and Virtual Security Gateway (VSG), in an HA configuration with a pair of devices to support failover. Additionally, for link-level redundancy on the Nexus 7004, two modules are used, and port-channels with members from both modules provide continuous service through planned or unplanned events on either module.
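The following is a minimal configuration sketch of these recommendations. All interface numbers, port-channel numbers, and AS numbers are hypothetical placeholders, not the validated VMDC 2.3 configuration.

! Nexus 7004 DC-Agg (NX-OS): port-channel with one member link taken from
! each of the two modules, so the bundle survives a single-module failure
! or maintenance event.
feature lacp
feature bgp

interface Ethernet3/1
  channel-group 10 mode active
  no shutdown

interface Ethernet4/1
  channel-group 10 mode active
  no shutdown

interface port-channel10
  switchport
  switchport mode trunk

! BGP multipath keeps both DC-PE paths active/active.
router bgp 64501
  address-family ipv4 unicast
    maximum-paths 2

! ASR 1006 DC-PE (IOS XE): dual RPs/ESPs run stateful switchover (SSO),
! which also enables ISSU.
redundancy
 mode sso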
Table 7-1 lists the redundancy model for the ASR 1006 and Nexus 7004.
Note The ASR 1000 is impacted by CSCuc51879. This issue causes packet drops during Route Processor Switchover (RPSO) or during ISSU on an ASR 1000 PE with a highly scaled-up configuration, and is still under investigation as of this publication.
For the other nodes used in the VMDC 2.3-based DC, Table 7-2 lists the redundancy model used to avoid a single point of failure.
Table 7-3 and Table 7-4 detail convergence results for ASR 1006 DC-PE and Nexus 7004 aggregation switch convergence events.
1. For the network test scenarios, traffic was sent using traffic generation tools (IXIA, Spirent TestCenter) for all tenants, north to south.
2. A convergence event was triggered for all tenants, north to south.
3. MAC scale was set to 13,000-14,000 MAC addresses on the Nexus 7004 devices.
4. Additional east-west traffic was sent between tenants for the first 50 tenants.
5. The worst-case impacted flow among all flows is reported (see the note after this list on how loss-derived convergence is typically computed). Not all flows are impacted, because alternate paths remain available during the tests.
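Note As a point of reference, loss-derived convergence is commonly computed from the traffic generator's frame counts. Assuming a constant transmit rate $R$ in frames per second, with $N_{tx}$ frames transmitted and $N_{rx}$ frames received on an impacted flow, the reported convergence time is approximately

$$ t_{conv} \approx \frac{N_{tx} - N_{rx}}{R} $$

This is the standard approximation for this class of tools; the exact measurement method used in this validation may differ.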
Table 7-4 Nexus 7004 Aggregation Switch Convergence Events

| #  | Convergence Event                                   | N-S Traffic Loss | E-W Traffic Loss | Comments                                                                                                                                               | Caveats    |
|----|-----------------------------------------------------|------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|------------|
| 1  | Nexus 7004 AGG module fail                          | 3.6 sec          | 3.12 sec         |                                                                                                                                                          | 2, 3       |
| 2  | Nexus 7004 AGG module restore                       | 9.9 sec          | 10.9 sec         |                                                                                                                                                          | 2, 6       |
| 3  | Nexus 7004 AGG node fail                            | 1-2 sec          | 1-2 sec          |                                                                                                                                                          |            |
| 4  | Nexus 7004 AGG node recovery                        | 5 sec            | 8 sec            | See Layer 3 Best Practices and Caveats for more information. Additional steps are needed to work around the issue.                                      | 1, 2, 4, 5 |
| 5  | Nexus 7004 AGG vPC peer link fail                   | 13 sec           | 2 sec            | See Layer 3 Best Practices and Caveats for more information. BGP convergence moves traffic off the Nexus 7004 path. Future fixes will help convergence. | 1, 2, 5    |
| 6  | Nexus 7004 AGG vPC peer link restore                | 3.5 sec          | 6.3 sec          |                                                                                                                                                          | 2          |
| 7  | Nexus 7004 AGG link to ICS Nexus 5548 SW fail       | 0.2 sec          | 0.2 sec          | Fiber pull                                                                                                                                               |            |
| 8  | Nexus 7004 AGG link to ICS Nexus 5548 SW restore    | 0.2 sec          | 0.2 sec          | Fiber restore                                                                                                                                            |            |
| 9  | Nexus 7004 AGG supervisor fail - module pull        | zpl              | zpl              |                                                                                                                                                          |            |
| 10 | Nexus 7004 AGG supervisor switchover - CLI          | zpl              | zpl              |                                                                                                                                                          |            |
| 11 | Nexus 7004 ISSU                                     | zpl              | zpl              |                                                                                                                                                          |            |

zpl = zero packet loss. Numbers in the Caveats column refer to the numbered caveats listed after the table.
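Row 11 refers to a standard NX-OS ISSU on the Nexus 7004 with dual supervisors. A minimal sketch of the procedure is shown below; the image file names are placeholders, not the validated release.

! Check the upgrade impact first; "non-disruptive" output indicates ISSU is possible.
show install all impact kickstart bootflash:n7000-s2-kickstart.bin system bootflash:n7000-s2-dk9.bin
! Perform the hitless upgrade across both supervisors.
install all kickstart bootflash:n7000-s2-kickstart.bin system bootflash:n7000-s2-dk9.bin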
Note The following issues are being fixed in the Nexus 7000 but are not available in the release tested. These fixes are currently planned for the 6.2-based release of NX-OS for the Nexus 7000.
• 1. CSCtn37522: Delay in L2 port-channels going down
• 2. CSCud82316: vPC convergence optimization
• 3. CSCuc50888: High convergence after F2 module OIR

The following issue is under investigation by the engineering team:
• 4. CSCue59878: Layer 3 convergence delays with the F2 module. With the scale tested, the additional delay is 10-17 seconds in the vPC shutdown case. The workaround used is to divert traffic away from the Nexus 7000 Agg, as the control plane (BGP) converges more quickly and traffic then bypasses the Nexus 7000 Agg. Also, use the fewest port groups possible to reduce the number of programming events. Alternatively, consider using M1/M2 modules for a higher scale of prefixes and better convergence.
The following issues are closed without a fix, as this is the best convergence time achievable with the F2 module after the workarounds are applied:
• 5. CSCue67104: Layer 3 convergence delays during node recovery (reload) of the Nexus 7000 Agg. The workaround is to use L3 ports in every port group so that the FIB is downloaded to each port group (see the sketch after this list).
• 6. CSCue82194: High unicast convergence seen with F2E module restore
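The following is a minimal sketch of the L3-port-per-port-group workaround referenced in caveats 4 and 5. Interface numbers and addresses are hypothetical placeholders; the idea is that on F2 modules each port group's forwarding engine programs its own FIB, so configuring at least one routed port in each port group in use forces the FIB download to that port group.

! Hypothetical example only; interface numbers and addresses are placeholders.
! Configure one routed (L3) port in each F2 port group in use so the FIB is
! downloaded to that port group's forwarding engine.
interface Ethernet3/1
  no switchport
  ip address 192.0.2.1/30
  no shutdown

interface Ethernet3/5
  no switchport
  ip address 192.0.2.5/30
  no shutdown
! ...repeat for the remaining port groups that carry L3 traffic.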
Table 7-5, Table 7-6, Table 7-7, and Table 7-8 detail convergence results for the Nexus 5500 Series ICS switch, the ACE 4710 and Nexus 7004, the ASA 5585, and other convergence events.
•Sunil Cherukuri
•Krishnan Thirukonda
•Chigozie Asiabaka
•Qingyan Cui
•Boo Kheng Khoo
•Padmanaba Kesav Babu Rajendran