This chapter includes design recommendations for the use of the Cisco Nexus® 5500 switch platform as a routed access layer.
At the time of this writing, the Cisco Nexus 5000 Series Switches include the following products:
• Cisco Nexus 5548P and 5548UP Switches: This one-rack-unit (1RU) 10 Gigabit Ethernet and Fibre Channel over Ethernet (FCoE) switch offers up to 960-Gbps throughput. It has up to 48 ports: 32 fixed 1/10-Gbps Enhanced Small Form-Factor Pluggable (SFP+) Ethernet and FCoE ports and one expansion slot. Expansion modules include 16 ports of 10-Gbps SFP+ Ethernet and FCoE, and 8 ports of 10-Gbps SFP+ Ethernet and FCoE plus 8 ports of 1/2/4/8-Gbps native Fibre Channel. The Cisco Nexus 5548 switches can be configured with a Layer 3 daughter card to provide Layer 3 switching functions.
• Cisco Nexus 5596UP Switch: This line-rate switch offers 96 ports of 10 Gigabit Ethernet and FCoE. It has 48 fixed 1/10-Gbps SFP+ Ethernet and FCoE ports and three expansion slots. Expansion modules include 16 ports of 10-Gbps SFP+ Ethernet and FCoE, 8 ports of 10-Gbps SFP+ Ethernet and FCoE plus 8 ports of 1/2/4/8-Gbps native Fibre Channel, and a Layer 3 module.
• Cisco Nexus 5020 Switch: This 2RU 10 Gigabit Ethernet and FCoE switch offers throughput of 1.04 terabits per second (Tbps). It has up to 56 ports, with 40 fixed 10 Gigabit Ethernet ports with SFP+ connectors and two expansion slots.
• Cisco Nexus 5010 Switch: This 1RU 10 Gigabit Ethernet and FCoE switch offers up to 28 ports: 20 fixed 10 Gigabit Ethernet ports with SFP+ connectors and one expansion slot.
The design best practices provided in this document apply to designs that use the Cisco Nexus 5548 and 5596 products.
At the time of this writing, the Cisco Nexus 2000 Series Fabric Extenders include the following products:
• Cisco Nexus 2148T Fabric Extender: This product has 48 1-Gbps RJ-45 copper ports and four 10 Gigabit Ethernet SFP+ uplink ports.
• Cisco Nexus 2224TP GE Fabric Extender: This product provides 24 Fast Ethernet and Gigabit Ethernet (100/1000BASE-T) server ports and two 10 Gigabit Ethernet uplink ports in a compact 1RU form factor.
• Cisco Nexus 2248TP GE Fabric Extender: This product has 48 100-Mbps and 1-Gbps copper ports and four 10 Gigabit Ethernet SFP+ uplink ports (requires Cisco® NX-OS Software Release 4.2(1)N1(1) or later).
• Cisco Nexus 2232PP 10GE Fabric Extender: This product has 1/10-Gbps SFP and SFP+ Ethernet ports and eight 10 Gigabit Ethernet SFP+ uplink ports (requires Cisco NX-OS Release 4.2(1)N1(1) or later). The Cisco Nexus 2232PP is also suitable for carrying FCoE traffic. Servers can attach to the Cisco Nexus 2232PP with Twinax cables or optical connectivity in the same way as to a Cisco Nexus 5000 Series Switch.
Note: All ports of the Cisco Nexus 5500 switch platform have the capability to operate as 1 or 10 Gigabit Ethernet ports. With Cisco NX-OS 5.0(3)N1(1) and later, the Cisco Nexus 5500 platform ports can be configured for 1 and 10 Gigabit Ethernet speeds.
The exact number of fabric extenders that can be connected to a single Cisco Nexus 5000 Series Switch is hardware and software dependent. For the latest information, please check the Cisco Documentation page.
With Cisco NX-OS 5.0(2)N1(1), the Cisco Nexus 5000 Series supports 12 fabric extenders, with the Cisco Nexus 5500 platform supporting up to 16 fabric extenders.
With Cisco NX-OS 5.0(3)N1(1) and later, the Cisco Nexus 5500 platform supports up to 24 fabric extenders when exclusively the Layer 2 functions are used, and 8 fabric extenders when the Cisco Nexus 5500 platform is used in Layer 3 mode.
This document divides the design and configuration of the Layer 3 access layer into these categories:
• Classic Layer 3 access design with spanning tree running on the access switches and Layer 3 Interior Gateway Protocol (IGP) running between the access and aggregation layers: The design consists of pairs of Cisco Nexus 5500 switches, which provide redundant Layer 2 connectivity to support network interface card (NIC) teaming, virtual machine mobility within a pod (intra-pod mobility), and high-availability clustering.
• Deployment with vPC on the Cisco Nexus 5500 platform to support vPC-connected servers (that is, servers connected with the PortChannel NIC teaming option) and dual-connected fabric extenders: The Cisco Nexus 5500 switches are deployed in pairs and connected to the aggregation layer with Layer 3 links.
Layer 3 Capabilities of the Cisco Nexus 5500 Platform
The Layer 3 engine card on the Cisco Nexus 5500 platform provides the following functions:
• Layer 3 interfaces: Routed ports on Cisco Nexus 5500 platform ports, switch virtual interfaces (SVIs) that can route traffic coming from Cisco Nexus 5500 platform ports or from fabric extender ports; PortChannels can be configured as Layer 3 interfaces as well.
• Routing protocol support: Static, Routing Information Protocol Versions 1 and 2 (RIPv1 and v2), Open Shortest Path First Version 2 (OSPFv2), Enhanced Interior Gateway Routing Protocol (EIGRP), and Border Gateway Protocol (BGP).
• Support for gateway redundancy protocols: Hot Standby Router Protocol (HSRP) and Virtual Router Redundancy Protocol (VRRP).
• Multicast support: Protocol-Independent Multicast (PIM), Any-Source Multicast (ASM), Source-Specific Multicast (SSM), Anycast Rendezvous Point (RP), Auto-RP, Multicast Source Discovery Protocol (MSDP), and Internet Group Management Protocol Versions 1, 2, and 3 (IGMPv1, v2, and v3).
• Virtual Route Forwarding lite (VRF-lite) support: Includes multicast routing per VRF instance.
• Functions such as Dynamic Host Configuration Protocol (DHCP) relay and IP helper.
Licensing controls how many fabric extenders can be used with a Layer 3 configuration and the number of routes that can be used:
• With the Base license, you can configure EIGRP stub or use OSPF for up to a maximum of 256 nondirectly connected routes.
• With the Enterprise license, you can configure full EIGRP, OSPF with up to 8000 routes, BGP, and VRF-lite.
Without a Base License installed, it is possible to configure SVIs but traffic won't be routed.
With the Layer 3 license installed, the configuration supports up to 8 fabric extenders.
The Layer 3 card can forward up to 160 Gbps of traffic, or 240 million packets per second (mpps).
The First In, First Out (FIFO) port-to-port latency for the Cisco Nexus 5548P is approximately 2 microseconds, and for the Cisco Nexus 5548UP and 5596UP it is approximately 1.8 microseconds. With the Layer 3 card, the FIFO port-to-port latency for routed traffic is approximately 4.7 microseconds.
In campus designs, a Layer 3 access layer would consist of individual Layer 3 switches dual-homed to the aggregation layer in Layer 3 mode, where VLANs would not be shared among access switches. For instance, in Figure 1, in topology (a), you can see that subnet 10.50.1.x is not shared between the two access switches.
The (a) design would not be suitable for a data center in which servers connect in a redundant fashion with NIC teaming or are configured for high-availability clustering, and in which virtual machine mobility is used. These data center technologies require Layer 2 adjacency between the access switches.
Topology (b) illustrates a Layer 3 access design for the data center. Access-layer devices are deployed in pairs. The Layer 2 domain exists exclusively within the pair of access switches. The scope of the Layer 2 domain can be configured by adding fabric extenders, which allows ports to be added to the same pair of access-layer devices without adding any spanning-tree load to the topology.
Figure 1. Comparing Layer 3 Access Designs in the Campus (a) and in the Data Center (b)
Layer 3 Access Layer
A Layer 3 access-layer design consists of a pair of Cisco Nexus 5500 switches, with or without fabric extenders, connected with Layer 3 links to an aggregation layer of Cisco Nexus 7000 Series Switches. For example, a pair of Cisco Nexus 5500 switches with fabric extenders may provide connectivity for a pod consisting of four to eight racks with two fabric extenders per rack (numbers vary based on the rack density, level of redundancy required, desired oversubscription ratio, density of MAC addresses, etc.). These Cisco Nexus 5500 switches would then be aggregated by, for instance, a pair of Cisco Nexus 7000 Series Switches, providing aggregation of multiple pods or even data center rows. Figure 2 illustrates the building block for this type of design.
The Cisco Nexus 5500 switches at the access layer may also provide connectivity for blade server enclosures. In some cases, blade server enclosures may contain a switch such as a Cisco blade switch or virtual blade switch (VBS), and this switch would run spanning tree with the Cisco Nexus 5500 switch as shown in the figure. In the figure, the fabric extender provides a Layer 2 domain for intra-pod mobility, NIC teaming, and high-availability clusters.
Figure 2. Layer 3 Access Design Without vPC
Layer 3 and vPC Access Layer
In addition to using the topology previously described, the Cisco Nexus 5500 platform can be configured for vPC and connected to an aggregation layer in Layer 3 mode, as shown in Figure 3.
With this design, the Layer 2 domain can be built with fabric extenders dual-connected (active-active fabric extenders). It can have servers connected in PortChannel mode, and it can accommodate connectivity for blade switching elements in vPC mode instead of regular spanning-tree mode.
Figure 3. Layer 3 Access Design with vPC and Active-Active Fabric Extenders
The following list summarizes important scalability factors for the Cisco Nexus 5500 platform used in Layer 3 mode:
• The Layer 3 engine on the Cisco Nexus 5500 platform supports up to 8000 adjacencies; this means that in the Layer 2 domain pod, you cannot have more than 8000 individual MAC addresses. With a maximum of 2 x 384 ports (maximum of 8 fabric extenders with Layer 3 mode) per pod, this limit does not constitute a constraint. With virtualized servers in a high-density server farm, the MAC address utilization is higher than the raw count of ports (768 ports). For example, with 10 virtual machines per physical server, assuming that each server is using a fabric extender port and that each server has routed connectivity, the number of required adjacencies would be 7680.
• The Layer 2 forwarding table can accommodate up to 24,000 MAC addresses (this limit already takes into account the hash collision space).
• The routing table can accommodate a maximum of 8000 longest-prefix match entries. Most data center designs consist of an OSPF stub or a totally stubby area and an EIGRP stub area, so the routing table would normally be far below the limit of 8000 routes.
• The solution supports a maximum of 2000 Layer 3 multicast groups in non-vPC mode and a maximum of 1000 multicast groups in vPC mode. This calculation already accounts for the presence of a (*,G) and a (S,G) entry for each group.
• The solution supports a maximum of 4000 Layer 2 multicast groups. The Layer 2 table uses only one entry for each group. The same group in a different VLAN would count as an additional entry.
• For Cisco NX-OS Release 5.0(2)N1(1) and later, the Cisco Nexus 5500 platform hardware supports 4013 concurrent VLANs. The total number of available VLANs depends on the use of VSANs (each VSAN consumes a VLAN) and on the number of Layer 3 interfaces (a Layer 3 interface consumes an internal VLAN). SVIs, conversely, do not consume any additional VLAN. You can check VLAN utilization with the command show resource vlan.
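The adjacency arithmetic in the first item of the list above can be checked with a short calculation. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes, as the example in the text does, that every fabric extender port hosts one server and that each virtual machine consumes one adjacency entry.

```python
# Limits quoted in the scalability list above.
MAX_ADJACENCIES = 8000
PORTS_PER_POD = 2 * 384  # 8 fabric extenders in Layer 3 mode, 768 ports per pod


def required_adjacencies(servers: int, vms_per_server: int) -> int:
    """Adjacency entries needed when every virtual machine has routed connectivity.

    Simplified model (an assumption): one entry per virtual machine, and the
    physical host's own address is not counted.
    """
    return servers * vms_per_server


needed = required_adjacencies(PORTS_PER_POD, 10)
headroom = MAX_ADJACENCIES - needed
print(f"required adjacencies: {needed}, remaining headroom: {headroom}")
```

With 10 virtual machines per server the pod stays under the limit, but only narrowly, which is why virtual machine density deserves attention when sizing a pod.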
Layer 2 Best Practices for Layer 3 Access
The Layer 3 design recommendations included in this document build on top of the existing best practices for Cisco Nexus 5500 platform connectivity to servers, fabric extenders and blade switches.
The document "Data Center Access Design with Cisco Nexus 5000 Platform and 2000 Series Fabric Extenders and Virtual PortChannels" at
http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/C07-572829-01_Design_N5K_N2K_vPC_DG.pdf describes the Layer 2 design best practices for designs in which the Cisco Nexus 5500 platform at the access layer connects to the Cisco Nexus 7000 Series Switches at the aggregation layer in Layer 2 mode. The recommendations that refer to the connectivity from the Cisco Nexus 5500 platform to servers, fabric extenders, and blade switches are equally applicable to a Layer 3 design.
The following list provides a summary:
• Choose the spanning-tree algorithm. Keep in mind that Multiple Spanning Tree (MST) Protocol scales better; Rapid Per-VLAN Spanning Tree Plus (PVST+) Protocol may be easier to deploy, but deploying it with vPC brings no real additional benefit over deploying a single spanning tree for all VLANs. In a routed access design, spanning tree runs exclusively between the two access switches; it helps mostly with blade switch connectivity and in preventing misconfiguration.
• With a Layer 3 access design, you may want to make one Cisco Nexus 5500 switch the root and the peer Cisco Nexus 5500 switch the secondary root. In topologies in which only the fabric extender connects to the Cisco Nexus 5500 platform, this approach does not provide any particular benefit given that the Layer 2 topology is limited to two switches. In topologies in which blade switches connect to the Cisco Nexus 5500 platform, it is useful to have a root and a secondary root to control the topology. In either case, assigning a root and a secondary root helps ensure a deterministic configuration.
• If you are using vPC in the Layer 3 access layer, you should match spanning-tree root priorities with vPC priorities.
• Enable pathcost method long to account for the use of 10 Gigabit Ethernet links. vPC links have a predefined cost of 200. This cost is hard-coded and does not depend on the number of links in the PortChannel. If you need to modify the spanning-tree cost of links in the topology to obtain a certain forwarding topology, you can use the command spanning-tree cost <cost>.
• Host ports should be configured for PortFast (spanning-tree port type edge) or TrunkFast (spanning-tree port type edge trunk) to remove the delay in the transition from blocking to forwarding on server ports.
• The Cisco Nexus 5548 and 5596 both support as many PortChannels as the number of ports with or without a fabric extender. A single PortChannel can have up to 16 active ports. A vPC PortChannel can have up to 32 active ports (16 active ports for each vPC peer).
• In most deployments, you should use the Link Aggregation Control Protocol (LACP) to negotiate the creation of PortChannels for both vPC topologies and non-vPC topologies.
• Bridge Assurance should not be enabled on vPC ports. Bridge Assurance does not add much benefit in a PortChannel-based configuration, and it may intervene in certain vPC failure scenarios in which it is actually less disruptive not to error-disable any port. Also, if you want to take advantage of In Service Software Upgrade (ISSU) on the Cisco Nexus 5000 Series, you should not enable Bridge Assurance on any link except the peer link (referred to as the multichassis EtherChannel trunk [MCT] in the command-line interface [CLI]), where it is automatically enabled.
Layer 3 Access Design
Most validated best practices that Cisco has published for Layer 3 designs apply equally to routed access designs with the Cisco Nexus 5500 platform. This section discusses these best practices and provides information about aspects that are specific to the Layer 3 engine of the Cisco Nexus 5500 platform.
Routing at the Access Layer
Autostate refers to the capability of a Layer 3 SVI to sense the presence of active forwarding ports on the VLAN for which it provides routing. If a given VLAN does not have any forwarding ports, the Layer 3 SVI goes down. An SVI goes up if the associated VLAN has at least one forwarding port. This mechanism prevents traffic from being black-holed by attracting traffic to a given subnet only if that subnet has active ports.
The autostate mechanism on the Cisco Nexus 5500 platform extends to the fabric extender ports. The SVI status reflects the presence or absence of active forwarding ports on fabric extenders. Figure 4 illustrates this concept.
Figure 4. Autostate with Fabric Extenders
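The autostate rule can be expressed as a simple predicate. The following Python sketch is a conceptual model only, not how the switch implements the feature; the `Port` type, the function name, and the interface names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str         # hypothetical interface name
    vlan: int         # VLAN carried by the port
    forwarding: bool  # True if the port is up and forwarding


def svi_is_up(vlan: int, ports: list[Port]) -> bool:
    """The SVI is up only if at least one port in its VLAN is forwarding.

    Fabric extender ports count in the same way as ports on the switch itself.
    """
    return any(p.vlan == vlan and p.forwarding for p in ports)


ports = [
    Port("Ethernet1/1", vlan=50, forwarding=False),      # port on the switch, not forwarding
    Port("Ethernet101/1/1", vlan=50, forwarding=True),   # fabric extender host port, forwarding
    Port("Ethernet101/1/2", vlan=60, forwarding=False),  # fabric extender host port, down
]

print(svi_is_up(50, ports))  # VLAN 50 has a forwarding fabric extender port
print(svi_is_up(60, ports))  # VLAN 60 has no forwarding port
```

In this model the SVI for VLAN 50 stays up because of the fabric extender port alone, while the SVI for VLAN 60 goes down and stops attracting traffic for its subnet.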
Layer 3 Peering at the Access Layer
The Layer 3 access layer runs OSPF or EIGRP. Adding an SVI to the OSPF or EIGRP process both enables the announcement of the reachability of this subnet and allows the Cisco Nexus 5500 switches to peer over this SVI.
A single Layer 3 access device may have hundreds of SVIs, which would cause the Cisco Nexus 5500 switches to peer over hundreds of Layer 3 links without any added benefit. One Layer 3 link established over an SVI between the access-layer Cisco Nexus 5500 switches is sufficient to provide a Layer 3 escape route should the links to the core fail. In addition, having hundreds of peering links and VLANs consumes CPU resources unnecessarily.
To correct this situation, you need to make all SVIs passive, with the exception of the Layer 3 VLAN SVI, which is carried only on the PortChannel that links the two Cisco Nexus 5500 switches.
The configuration for OSPF is as follows:
(config)# interface vlan <number>
(config-if)# ip ospf passive-interface
The configuration for EIGRP is as follows:
(config)# interface vlan <number>
(config-if)# ip passive-interface eigrp <instance-tag>
HSRP and VRRP
For Cisco NX-OS 5.0(3)N1(1) and later, the Cisco Nexus 5500 platform with the Layer 3 engine can forward traffic for up to 4000 HSRP groups and 255 VRRP groups. Given that each additional HSRP group requires control-plane resources, the number of groups that you can effectively use depends on timer settings and on the way that the control plane is used. At the time of this writing, up to 255 HSRP groups have been tested with aggressive timers.
Note: As a general recommendation, if you plan to use aggressive timers you should keep the total number of groups within 100 or so. If you observe HSRP groups flapping, this is a clear indication that the control plane is oversubscribed, and you should use default timers. Also, if you are using vPC, there is no need to tune the HSRP timers because both vPC peers forward traffic.
In the Layer 3 access design, the HSRP gateway is configured on the Cisco Nexus 5500 switch in the access layer. The access layer for each pod is built with two Cisco Nexus 5500 switches, of which one is the HSRP primary switch and the other is the HSRP secondary switch.
According to established best practices, the HSRP priority should match the spanning-tree root and secondary root placement.
The restoration of traffic after a failed Layer 3 access switch comes back online depends on the following factors:
• The time needed for the routes to be learned again by the routing protocol: If traffic from the servers goes to the newly rebooted Layer 3 switch and this switch does not have routes to the core, traffic would be dropped.
• The time needed for the Address Resolution Protocol (ARP) table to be repopulated and, as a result, for the adjacency table to be rebuilt: If flows from the core reach the newly rebooted Layer 3 switch before the ARP table is fully rebuilt, the flows whose adjacency is missing are dropped.
• The time needed for HSRP to take over the primary role (if preemption is configured): If HSRP takes over the primary role on the newly rebooted Layer 3 switch too quickly, it may attract traffic from the servers before the routing table has been fully rebuilt.
To reduce traffic drop in the server-to-core direction, you should not use HSRP preemption. However, preemption helps keep the HSRP topology matched to the spanning-tree topology, which also is desirable.
To address both needs, you can configure preemption with the delay minimum option. This configuration helps ensure that a newly rebooted Layer 3 switch, such as the Cisco Nexus 5500 platform, in the access layer will not preempt for a configurable time, allowing the routing table to be rebuilt. The preempt delay value needs to be tuned according to the characteristics of the individual environment because the time needed for routing to converge depends on factors such as the number of routes.
In a simple topology example with an OSPF totally stubby area, approximately 4 minutes elapsed between the time when the control plane came up and the moment when the routes were installed in hardware.
Based on this value, the configuration is as follows:
preempt delay minimum 240
In vPC designs, preempt delay does not change the behavior of HSRP. In vPC designs, as soon as a vPC member port goes up, traffic destined for the HSRP address is handled by the receiving switch regardless of the preempt delay configuration.
To control this behavior with vPC, you should use the command delay restore under the vPC domain configuration. This command delays the restoration of vPC member ports to provide more time for the Layer 3 switch to rebuild the Layer 3 forwarding table.
To achieve convergence in less than a second, you can lower the default HSRP timers and use 250 milliseconds (ms) of hello time and 750 ms of hold time as in the following configuration:
timers msec 250 msec 750
Aggressive timers control the convergence time if one Layer 3 engine fails or if the entire Cisco Nexus 5500 switch fails. The lower the timer value, the faster the recovery from the peer Cisco Nexus 5500 switch.
Be careful when using aggressive timers because the lower the timer values, the higher the load on the CPU, so if you are using hundreds of SVIs, do not change the timers from the default values.
If you observe HSRP groups flapping, this is a clear indication that the control plane is oversubscribed, and you should use the default timers.
Aggressive timers do not help in vPC designs because in vPC both HSRP peers forward traffic, even before a failure.
At the time of this writing, ISSU is not supported with Layer 3 features, but if and when ISSU is supported with Layer 3 features, aggressive timers may be incompatible with the ISSU upgrade procedure.
Connectivity Between Aggregation and Access Layers
Layer 3 connectivity between the aggregation and access layers uses Layer 3 equal-cost multipathing (ECMP).
In the ECMP configuration, each Layer 3 switch has two routes and two associated hardware forwarding adjacency entries. Before a failure, traffic is forwarded using both these forwarding entries. When an adjacent link or neighbor fails, the switch hardware and software immediately remove the forwarding entry associated with the lost neighbor. After the removal of the route and forwarding entries associated with the lost path, the switch still has a remaining valid route and associated Layer 3 forwarding entry. Because the switch still has an active and valid route, it does not need to trigger or wait for routing protocol convergence and is immediately able to continue forwarding all traffic using the remaining Layer 3 entry.
Figure 5 illustrates this concept.
Figure 5. Rerouting Traffic with ECMP
In a typical design, the links between the aggregation and access layers may be Layer 3 PortChannels, as shown in Figure 6. If a single PortChannel member fails, the routing cost of the link is recalculated, and the topology forwards traffic accordingly.
Figure 6 illustrates the failure of a link in a Layer 3 PortChannel. The link may be configured as follows:
interface port-channel 21
ip address 10.50.3.1/31
ip router ospf 1 area <area-id>
ip ospf network point-to-point
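If the OSPF reference bandwidth is 1000000 Mbps and the PortChannel consists of two 10 Gigabit Ethernet links, the cost of the link before the link failure is 50 (1,000,000 / 20,000), and after the failure of one member the cost becomes 100 (1,000,000 / 10,000).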
Figure 6. PortChannel Cost Calculation with Layer 3 Links
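The cost values quoted for the PortChannel follow from dividing the reference bandwidth by the aggregate link bandwidth. The following Python sketch reproduces that arithmetic; it assumes a PortChannel with two 10 Gigabit Ethernet members (which is what a cost of 50 implies) and assumes that the result is truncated to an integer with a minimum of 1. The function name is hypothetical.

```python
def ospf_cost(reference_mbps: int, link_mbps: int) -> int:
    """OSPF interface cost: reference bandwidth / link bandwidth.

    Assumption: the quotient is truncated and never drops below 1.
    """
    return max(1, reference_mbps // link_mbps)


REFERENCE_MBPS = 1_000_000  # auto-cost reference-bandwidth 1000000
MEMBER_MBPS = 10_000        # one 10 Gigabit Ethernet member link

cost_before = ospf_cost(REFERENCE_MBPS, 2 * MEMBER_MBPS)  # both members up
cost_after = ospf_cost(REFERENCE_MBPS, 1 * MEMBER_MBPS)   # one member failed

print(f"cost with two members: {cost_before}")
print(f"cost with one member:  {cost_after}")
```

Because the cost doubles when a member fails, a previously equal-cost path through this PortChannel is no longer preferred, and the topology forwards traffic accordingly.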
Subnets for Layer 3 Links
A routed access design requires the allocation of a subnet for each Layer 3 link. The aggregation layer will then summarize these subnets together with the subnets that are used for server connectivity. You should use /30 or /31 subnets to reduce the consumption of IP addresses.
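To illustrate the address savings, the following Python sketch uses the standard ipaddress module to count how many point-to-point links fit into one block with each prefix length. The 10.50.3.0/24 block is a hypothetical allocation chosen to match the addressing used elsewhere in this document.

```python
import ipaddress

# Hypothetical block reserved for Layer 3 links between access and aggregation.
block = ipaddress.ip_network("10.50.3.0/24")

links_per_prefix = {}
for prefix in (30, 31):
    link_subnets = list(block.subnets(new_prefix=prefix))
    links_per_prefix[prefix] = len(link_subnets)
    print(
        f"/{prefix}: {len(link_subnets)} links, "
        f"{link_subnets[0].num_addresses} addresses consumed per link"
    )
```

A /31 consumes only the two addresses that a point-to-point link needs, so the same block accommodates twice as many links as with /30 subnets.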
In OSPF, the reference bandwidth by default is 40 Gbps. As a result, if you have PortChannels with more than four 10 Gigabit Ethernet links, you may want to adjust the reference bandwidth so that it reflects the real bandwidth of the link. In the case of EIGRP, the cost is calculated as a function of several parameters, including bandwidth and delay.
For OSPF you can change the reference bandwidth globally as follows:
<1-4000000> Rate in Mbps (bandwidth) (Default) *Default value is 40000
<1-4000> Rate in Gbps (bandwidth) *Default value is 40
Considering that the maximum bandwidth of an individual link on the Cisco Nexus 5500 platform is 160 Gbps (16 x 10 Gigabit Ethernet), if you set the reference bandwidth at 1 Tbps instead of 40 Gbps, OSPF will be able to perform routing decisions that can distinguish between a 160-Gbps PortChannel and a 40-Gbps PortChannel. After you change the reference bandwidth, the cost of a 160-Gbps PortChannel is 6, and the cost of a 40-Gbps PortChannel is 25.
router ospf 1
auto-cost reference-bandwidth 1000000
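The effect of the reference bandwidth can be seen by comparing the costs that result from the default value and from 1 Tbps. The following Python sketch is a simplified model that assumes the cost is the truncated quotient of reference bandwidth and link bandwidth, with a minimum of 1; the function and variable names are hypothetical.

```python
def ospf_cost(reference_mbps: int, link_mbps: int) -> int:
    """Truncated reference/link quotient, never below 1 (assumed rounding)."""
    return max(1, reference_mbps // link_mbps)


portchannels_mbps = {
    "4 x 10GE PortChannel": 40_000,
    "16 x 10GE PortChannel": 160_000,
}

costs = {}
for reference_mbps in (40_000, 1_000_000):  # default value, then 1 Tbps
    costs[reference_mbps] = {
        name: ospf_cost(reference_mbps, bandwidth)
        for name, bandwidth in portchannels_mbps.items()
    }
    print(reference_mbps, costs[reference_mbps])
```

With the default reference bandwidth both PortChannels end up with the same cost, so OSPF cannot prefer the larger one; with 1 Tbps the costs differ, as described above.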
The SVI cost does not reflect the available bandwidth, because an SVI does not have a one-to-one mapping with a finite number of links. Sometimes SVIs are used to build Layer 3 connectivity: for instance, in a routed access topology, the Layer 3 link between two Layer 3 access-layer devices is just a VLAN on a PortChannel. In this case, routing adjacency is built from an SVI as illustrated in Figure 7.
Figure 7 illustrates the case of a Layer 3 link built with SVIs over an existing PortChannel (which could be the peer link in vPC topologies). The default cost on the SVIs does not reflect the available bandwidth on this link.
Figure 7. Layer 3 Peering Between two Cisco Nexus 5500 Switches
You can change the cost of the Layer 3 route built over an SVI to a meaningful value (which takes into account the reference bandwidth) by using the command ip ospf cost. For instance, if there are two 10 Gigabit Ethernet links between the Cisco Nexus 5500 switches, you can set the cost to 50; if there are four 10 Gigabit Ethernet links, you can set the cost to 25.
In the case of EIGRP, you can also tune the SVI cost to reflect the actual bandwidth, by using the command ip bandwidth eigrp.
Traffic polarization is a well-known phenomenon that manifests itself when using topologies that distribute traffic by using hashing of the same type. In this kind of topology, the combined effect of cascading several layers of ECMP (or PortChannel) hashing is unequal traffic distribution.
Figure 8 illustrates this concept.
In the figure, the traffic that goes from the Cisco Nexus 5500 switch in the access layer to the Cisco Nexus 7000 Series Switch in the aggregation is hashed by ECMP on the left link. Traffic that arrives on the Cisco Nexus 7000 Series Switch will be hashed again to choose the next-hop ECMP route. If we assume that the hashing polynomial used by the Cisco Nexus 7000 Series line card and by the Cisco Nexus 5500 switch in the access layer is the same, the traffic may always take the same link because what the Cisco Nexus 7000 Series Switch receives is already the result of a hashing operation of the same kind as the one that this switch is going to perform; hence, it will yield the same result.
To reduce the possibility of traffic polarization, you should add different universal ID keys to the Layer 3 load-balancing configuration on the Cisco Nexus 5500 switch in the access layer compared to those used on the Cisco Nexus 7000 Series Switch in the aggregation layer:
L3-N5548-1(config)# ip load-sharing address source-destination universal-id
Figure 8. When the Universal ID Is Used, Layer 3 Hashing Polarization Does Not Occur
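The polarization effect, and the way a per-switch key removes it, can be reproduced with a small simulation. The following Python sketch is a conceptual model only: SHA-256 stands in for the hardware hashing function, the destination addresses and ID values are hypothetical, and the integer passed as `universal_id` merely plays the role of the universal ID key.

```python
import hashlib


def pick_link(src: str, dst: str, universal_id: int, n_links: int = 2) -> int:
    """Choose an ECMP link from the flow addresses and a per-switch key."""
    key = f"{universal_id}|{src}|{dst}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links


# 1000 hypothetical flows from servers in 10.50.1.x to remote hosts.
flows = [
    (f"10.50.1.{s}", f"10.60.1.{d}")
    for s in range(1, 41)
    for d in range(1, 26)
]


def aggregation_links_used(access_id: int, aggregation_id: int) -> set[int]:
    """Links chosen by the left aggregation switch for the flows it receives."""
    # Flows that the access switch hashes to link 0 reach the left aggregation switch.
    arriving = [f for f in flows if pick_link(*f, access_id) == 0]
    # The aggregation switch hashes the same flows again to pick its next hop.
    return {pick_link(*f, aggregation_id) for f in arriving}


print(aggregation_links_used(access_id=1, aggregation_id=1))  # same key: one link only
print(aggregation_links_used(access_id=1, aggregation_id=2))  # different keys: both links
```

When both tiers use the same key, every flow that arrives at the aggregation switch has already been selected by an identical hash, so the second hash returns the same result for all of them. A different key at each tier makes the two decisions independent and restores the traffic distribution.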
To help ensure the optimal recovery time for data traffic flows in the data center, you need to optimize the routing design to provide a minimal and deterministic convergence time for failure cases.
The length of time needed for EIGRP, OSPF, or any routing protocol to restore traffic flows is limited by the following three main factors:
• The time required to detect the loss of a valid forwarding path.
• The time required to determine a new best path (which is partially determined by the number of routers involved in determining the new path, or the number of routers that must be informed of the new path before the network can be considered converged).
• The time required to update software and associated Cisco Express Forwarding hardware forwarding tables with the new routing information.
If the switch has redundant equal-cost paths, all three of these processes are performed locally within the switch and controlled by the internal interaction of software and hardware. If there is no second equal-cost path, EIGRP or OSPF must determine a new route, and this process plays a large role in determining network convergence time.
In the case of EIGRP, the time primarily depends on the number of EIGRP queries that the switch needs to generate and the amount of time needed for the response to each of those queries to return to calculate a feasible successor (path). The time required for each of these queries to be completed depends on the distance they have to propagate in the network before a definite response can be returned. To reduce the time required to restore traffic flows when a full EIGRP routing convergence is required, the design must provide strict limits on the number and range of queries generated.
In the case of OSPF, the time required to flood and receive link-state advertisements (LSAs) in combination with the time needed to run the Dijkstra shortest path first (SPF) algorithm to determine the shortest-path tree (SPT) provides a limit on the time required to restore traffic flows. Optimizing network recovery involves tuning the design of the network to reduce the time and resources required to complete these two processes.
By limiting the number of peers that an EIGRP router must query or the number of LSAs that an OSPF peer must process, you can optimize the rerouting of the traffic when a link fails.
These objectives can be achieved with a combination of:
• Route summarization
• Stub routing
For summarization to be possible, the IP addressing allocation should map to a route summarization scheme, which should be based on the building blocks of the network. When a link or a node in a building block fails, this failure should not result in routing updates in other building blocks.
With EIGRP, you can implement a tiered summarization scheme by summarizing directly at the access layer for a Layer 3 access topology and performing additional summarizing at the aggregation layer.
With OSPF, summarization is performed at the area border router (ABR), so in a Layer 3 access design summarization is performed at the aggregation layer. With a Layer 2 access layer, summarization is also performed at the aggregation layer.
The OSPF area for the data center should be a stub or totally stubby area.
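Whether an addressing plan maps to a summarization scheme can be verified before deployment. The following Python sketch uses the standard ipaddress module to confirm that the subnets of a building block fall inside the summary advertised for that block. The 10.50.0.0/16 summary and the 10.50.x.x subnets follow the examples used in this document; the /24 server mask and the 10.51.1.0/24 subnet are hypothetical.

```python
import ipaddress

# Summary advertised for the building block (data center area).
summary = ipaddress.ip_network("10.50.0.0/16")

# Subnets allocated inside the building block.
block_subnets = [
    ipaddress.ip_network("10.50.1.0/24"),  # server subnet (mask is an assumption)
    ipaddress.ip_network("10.50.3.0/31"),  # Layer 3 link to the aggregation layer
]

# Hypothetical subnet that belongs to a different building block.
other_block = ipaddress.ip_network("10.51.1.0/24")

covered = all(net.subnet_of(summary) for net in block_subnets)
print(f"all block subnets covered by {summary}: {covered}")
print(f"{other_block} covered by {summary}: {other_block.subnet_of(summary)}")
```

If every subnet in a block is covered by the block summary, a link or node failure inside the block changes only routes that are hidden behind the summary, so no routing updates need to reach other building blocks.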
Design with OSPF
OSPF implements a two-tier hierarchical routing model that uses a core or backbone tier known as area zero (0). Attached to that backbone through ABRs are a number of secondary tier areas. In a typical data center design, the data center constitutes a separate area, with the aggregation-layer switches configured as ABRs. The ABRs control summarization and route propagation from the core to the data center area and from the data center to the core.
ABRs forward the following LSAs:
• ABRs for regular areas forward type-3 summary LSAs (summaries of type-1 and type-2 LSAs from an adjacent area), type-4 ASBR summary LSAs (information about routes to the autonomous system boundary router [ASBR]), and type-5 external LSAs (routes outside this autonomous system).
• Stub area ABRs forward summary LSAs of type 3 and the summary default (0.0.0.0).
• Totally stubby area ABRs forward the summary default (0.0.0.0).
The convergence time for network failures is proportional to the number of routes that are present in a given area; therefore, it is good practice to limit the propagation of routes by configuring a stub or a totally stubby area. The use of a stub area configuration for the aggregation block area prevents the propagation of external routes to the aggregation block. However, configuring the area as totally stubby also prevents the propagation of inter-area routes. In this configuration, each of the distribution switch ABRs creates a default route that provides the forwarding from the routed access layer to the aggregation layer.
To create a totally stubby area, the configuration is as follows:
router ospf 1
area 220.127.116.11 stub no-summary
area 18.104.22.168 range 10.50.0.0/16
auto-cost reference-bandwidth 1000000
The stub portion of the configuration blocks external LSAs from entering the area through the ABR. The no-summary option of the stub parameter blocks inter-area summary LSAs from entering the area. The ABRs also inject a default route (0.0.0.0) into the stub area to provide access to the routes not propagated into the area.
Figure 9 illustrates the OSPF area design with a Layer 3 access layer.
Figure 9. Scope of Totally Stubby Area
Other OSPF best practices include:
• Tuning the reference bandwidth.
• Establishing Layer 3 peering between access devices.
• Disabling peering (passive interfaces) on all SVIs except for the Layer 3 VLAN interface.
For faster convergence of the Layer 3 topology, you can also consider tuning the SPF timers.
OSPF configuration provides the capability to customize the default timers that control the LSA pacing and the pacing of SPF algorithm execution. In Cisco IOS® Software, you should reduce these values from their default settings to improve network convergence while helping ensure that multiple iterations have a dampening effect applied to reduce processor utilization if intermittent or flapping connectivity occurs in the network.
The three parameters interoperate to determine how long the switch waits before running an SPF calculation after notification of a topology change event (arrival of an LSA).
At the arrival of the first topology notification, the spf-start or initial hold timer controls the wait time before the SPF calculation starts.
If no subsequent topology change notification arrives (no new LSA) during the hold interval, the SPF is free to run again as soon as the next topology change event is received.
However, if a second topology change event is received during the hold interval, the SPF calculation is delayed until the hold interval expires. Additionally, in this second case, the hold interval is temporarily doubled.
After the expiration of any hold interval, the timer is reset, and any future topology changes trigger an SPF again based on the initial timer. Typical values used when tuning Cisco Catalyst® Family products running Cisco IOS Software are as follows:
• spf-start: 10 ms
• spf-hold: 100 to 500 ms
• spf-max-wait: 5000 ms (5 seconds)
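To make the back-off behavior concrete, the following Python sketch computes the wait applied before each SPF run when topology changes keep arriving inside the current hold interval. It is a simplified model, not the actual scheduler: the function name is hypothetical, and it assumes that the hold interval doubles on every consecutive event and is capped at spf-max-wait.

```python
def spf_wait_schedule(start_ms, hold_ms, max_wait_ms, events):
    """Waits (in ms) before successive SPF runs under continuous flapping.

    Assumption for this sketch: every event after the first arrives inside
    the current hold interval, so the hold interval doubles each time until
    it reaches max_wait_ms.
    """
    waits = []
    hold = hold_ms
    for i in range(events):
        if i == 0:
            # The first event only waits for the initial (spf-start) timer.
            waits.append(start_ms)
        else:
            waits.append(min(hold, max_wait_ms))
            hold = min(hold * 2, max_wait_ms)
    return waits


# Values taken from the typical settings listed in the text.
print(spf_wait_schedule(10, 100, 5000, 8))
# → [10, 100, 200, 400, 800, 1600, 3200, 5000]
```

With these values, a single isolated failure is processed after 10 ms, while a persistently flapping link is throttled to one SPF run every 5 seconds.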
LSA throttle timers control the generation of type-1 and type-2 LSAs in the event of a network interface change.
Three configuration values are used: an initial delay timer, a hold timer, and a maximum hold timer. Typical values used with Cisco Catalyst Family and Cisco IOS Software are as follows:
• lsa-start: 10 ms
• lsa-hold: 100 to 500 ms
• lsa-max-wait: 5000 ms (5 seconds)
To optimize OSPF for fast convergence in the data center and to be consistent with Cisco IOS Software, the default throttle timers in Cisco NX-OS can be updated using the command syntax timers throttle spf 10 100 5000.
Design with EIGRP
If you configure the EIGRP process to run in the stub-connected state, the access switch advertises all connected subnets. The switch also advertises to its neighbor routers that it is a stub or nontransit router and thus should never be sent queries to learn of a path to any subnet other than the advertised connected routes. Use of the EIGRP stub approach isolates the EIGRP queries in case of failure.
In addition, the aggregation-layer switch summarizes the data center subnets to the core.
Figure 10 illustrates the EIGRP topology.
Figure 10. Hierarchical Summarization with EIGRP
Figure 11 illustrates the configuration of EIGRP stub and summarization.
Proven multicast best practices apply to the routed access design. It is particularly important to remember that multicast cannot function any better or converge any faster than the IGP infrastructure. A solid multicast design starts with a solid IGP design and tuning.
The rendezvous point will typically be a device at either the aggregation layer or the core. The Cisco Nexus 5500 platform can be used as a PIM Rendezvous Point, but normally the Cisco Nexus 5500 platform at the access layer will not be used as the Rendezvous Point.
Note: The Cisco Nexus 5500 platform supports Auto-RP, BSR, PIM Anycast-RP, and MSDP.
In non-vPC designs, you can use the Cisco Nexus 5500 platform for PIM-ASM or PIM-SSM designs. Bidirectional PIM (Bidir-PIM) is not supported. The Cisco Nexus 5500 platform in Layer 3 mode supports up to 2000 multicast groups.
Note: In vPC designs, the Cisco Nexus 5500 platform used in Layer 3 mode supports up to 1000 multicast groups. In vPC designs, the Cisco Nexus 5500 platform does not support PIM-SSM.
The preferred mechanism for distributing the Rendezvous Point information is Anycast-RP combined with Auto-RP, which offers the advantages of faster convergence and the benefits of not having to manually configure the Anycast-RP address on every router. If the device that provides the Rendezvous Point function supports PIM Anycast-RP, you can use PIM Anycast-RP instead of MSDP.
Table 1 summarizes the options for synchronizing the source information for the Rendezvous Point and the way that the Rendezvous Point information can be propagated.
Table 1. Rendezvous Point Information
Source Information Synchronization Between Rendezvous Points | Distribution of Rendezvous Point Information
MSDP (RFC 3618) | Bootstrap router (BSR; RFC 5059)
To configure the Cisco Nexus 5500 platform to learn about the Rendezvous Point, you can configure the Rendezvous Point manually (static Rendezvous Point approach):
ip pim rp-address <IP>
Alternatively, you can configure the switch to learn the Rendezvous Point through Auto-RP (equivalent to the Cisco IOS Software ip pim autorp listener command):
ip pim auto-rp forward listen
As usual, all interfaces (Layer 3 interfaces, Layer 3 VLAN used for peering, and the SVIs that provide the gateway to the server) need to be configured with this command:
ip pim sparse-mode
Regarding IGMP, if you do not configure the version explicitly, IGMPv2 will run; otherwise, for PIM-SSM designs or for faster removal of a group, you can configure IGMPv3 with explicit tracking:
ip pim sparse-mode
ip igmp version 3
ip igmp snooping explicit-tracking
The access layer always has two Cisco Nexus 5500 switches for redundancy. One is the PIM designated router (PIM-DR), and one is the IGMP querier.
The Cisco Nexus 5500 platform may provide routing for multicast sources. In this case, the PIM-DR is in charge of registering local sources with the Rendezvous Point. The Rendezvous Point will then send PIM joins to the source (which is behind the PIM-DR) to build an SPT. This process creates a PIM (S,G) state along the path between the source and the Rendezvous Point. In the presence of equal-cost paths to the source, any router in the path will choose the highest next-hop IP address.
The Cisco Nexus 5500 platform may provide routing for multicast receivers. In this case, upon receipt of IGMP reports for a given group, the PIM-DR installs the (*,G) entry for that group and sends a PIM join to the Rendezvous Point. This process creates a PIM (*,G) state along the path between the Rendezvous Point and the receiver, which is behind a Cisco Nexus 5500 switch.
The PIM-DR device is chosen based on the highest IP address, but you can have a more deterministic configuration by defining a priority:
ip pim dr-priority <priority>
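The election rule can be sketched in a few lines of Python: the highest DR priority wins, and the highest IP address breaks a tie. This is an illustrative model only; the function name, the addresses, and the priority values are hypothetical examples.

```python
import ipaddress


def elect_pim_dr(neighbors):
    """Pick the PIM-DR from a list of (ip_string, dr_priority) tuples.

    The highest priority wins; equal priorities fall back to the highest
    IP address, compared numerically rather than as text.
    """
    winner = max(neighbors, key=lambda n: (n[1], ipaddress.ip_address(n[0])))
    return winner[0]


# Equal priorities: the higher IP address becomes the DR.
print(elect_pim_dr([("10.50.0.251", 1), ("10.50.0.252", 1)]))   # → 10.50.0.252
# A higher priority makes the choice deterministic regardless of addressing.
print(elect_pim_dr([("10.50.0.251", 10), ("10.50.0.252", 1)]))  # → 10.50.0.251
```

Setting the priority explicitly on the intended designated router avoids depending on the addressing plan for the outcome.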
The timer for PIM-DR health monitoring can be tuned to reduce the failover time associated with a failure of the designated router, with the following command:
ip pim hello-interval <period> [msec]
The IGMP querier can also be tuned to accelerate the recovery of the multicast entries in the Layer 2 forwarding table upon failover.
The access layer typically is a multicast stub network, and having two routing devices on the same segment implies that one of the routing devices will see the multicast stream destined for local receivers on an interface that is not its reverse-path forwarding (RPF) interface for that stream. As is the case in all multicast deployments, this non-RPF traffic must be rate limited to avoid saturating the CPU.
On the Cisco Nexus 5500 platform, the non-RPF traffic is rate limited to allow the PIM control plane still to function (PIM needs to see a copy of the traffic) and at the same time to prevent the CPU from being overwhelmed.
Layer 3 Access Design with vPC
Designing the access layer with Layer 3 on the Cisco Nexus 5500 platform and vPC combines the benefits of a Layer 3 access layer with the benefits of being able to connect fabric extenders redundantly and to connect servers redundantly with a PortChannel.
Most of the design best practices for vPC designs apply, plus specific tuning that is related to the use of the Layer 3 engine on the Cisco Nexus 5500 platform.
The Cisco Nexus 5500 platform with a Layer 3 engine can forward traffic regardless of whether it is vPC primary or secondary traffic and regardless of whether the HSRP group is in active or standby mode on a given SVI. As long as one SVI is active for HSRP, the vPC peer can also route traffic for that SVI.
The Cisco Nexus 5500 platform supports the peer-gateway option, which allows the vPC peer to route traffic whose destination MAC address is the burned-in MAC address (BIA) of the vPC peer.
Using the Cisco Nexus 5500 platform as a vPC Layer 3 access switch allows the use of PIM-ASM but not the use of PIM-SSM. In this mode, the Cisco Nexus 5500 platform can support up to 1000 multicast groups.
In PIM-ASM mode, vPC peers will switch to the SPT as soon as they find the information about a source. Changing this behavior to "use shared-tree" is not supported in vPC mode.
Cisco Nexus 5000 Series vPC Baseline Configuration
The Layer 3 design recommendations included in this document build on the existing best practices for connectivity from the Cisco Nexus 5500 platform to servers, fabric extenders, and blade switches in the presence of vPC.
Figure 12 shows the components of a Cisco Nexus 5500 platform vPC deployment.
Figure 12. vPC Components
The following list provides a summary of vPC configuration best practices:
• Connect the two Cisco Nexus 5500 switches through redundant 10 Gigabit Ethernet links to form the peer link between vPC peers. This link carries both vPC VLANs and non-vPC VLANs; there is no specific benefit in separating them. When the peer link is configured, Bridge Assurance is automatically enabled on this link.
• The peer keepalive is an out-of-band monitoring mechanism that is used for vPC peers to arbitrate roles and to resolve peer-link failures. You should configure the peer-keepalive connectivity either through the mgmt0 interface or through an SVI and a separate port. When using a Cisco Nexus 5500 platform with a Layer 3 card, you can allocate a VRF instance to carry the peer keepalive.
• Unlike on the Cisco Nexus 7000 Series, on the Cisco Nexus 5000 Series direct connectivity of the peer keepalive through mgmt0 from one vPC peer to the other is an acceptable practice, although routing over the management network (or any out-of-band network) is still preferred for deployment.
• The peer-keepalive traffic should never be carried on a VLAN over the peer link; such a configuration would make the peer keepalive useless.
• You should use Link Aggregation Control Protocol (LACP) on vPCs that connect switches and, when supported by NIC team software, also on the servers.
Because of its importance, the peer link should always be configured in a redundant manner. The loss of the peer link is recovered in a way that prevents split subnets and continuous flooding.
The peer link carries the control traffic used to synchronize MAC address tables and IGMP entries. When connectivity is lost on this link, MAC addresses and IGMP entries can no longer be synchronized. In a dual-active scenario, traffic for existing MAC entries and existing multicast group members may continue to flow correctly, but if a new unicast MAC address was learned after the loss of the peer link, traffic destined for this MAC address would cause flooding. Also, generation of a new IGMP report after the loss of the peer link would trigger proper IGMP snooping processing on one vPC peer only. As a result, multicast traffic arriving on the other vPC peer would be dropped.
For these reasons, when the peer link is lost, vPC shuts down vPC member ports on the operational secondary switch to avoid a dual-active scenario, as illustrated in Figure 13.
In the figure, Host1, Host2, and Host3 are connected to the Cisco Nexus 5000 Series and 2000 Series Switches, respectively, with a PortChannel in a vPC configuration.
Upon failure of the peer link, the vPC secondary device verifies that the vPC primary device is still alive (by using the heartbeat verification mechanism of the peer keepalive link) and correctly shuts down the vPC member port on the secondary Cisco Nexus 5000 Series Switch and associated Cisco Nexus 2000 Series Fabric Extenders.
Because of this behavior, unicast and multicast traffic continues flowing correctly through the vPC primary device.
Figure 13 illustrates failure of the peer link.
Figure 13. Behavior of vPC with Peer-Link Failure
Behavior of vPC with Routing
vPC peers run routing protocols independently, so most design best practices that have been analyzed for non-vPC environments apply:
• The spanning-tree root and secondary root should be configured the same way as in non-vPC designs.
• You should make all interface VLANs (SVIs) passive except for the one used as a backup route when links to the core are down.
• Summarization schemes are configured on the vPC peers exactly the same way as in non-vPC designs.
The following list summarizes some differences between vPC and non-vPC designs:
• In vPC designs, you do not need to tune the HSRP timers, because both HSRP interfaces are forwarding.
• In vPC designs, you do not need to configure an HSRP preempt delay because both HSRP interfaces are forwarding regardless of whether their status is active or standby.
• You should configure the vPC restore delay with a value similar to the one you would use for the HSRP preempt-delay minimum. Based on current testing (and subject to change with additional data from further testing), you should configure a value of 4 minutes or more.
Interaction of vPC with Gateway Redundancy Protocols
The use of HSRP in the context of vPC does not require any special configuration. The active HSRP interface answers ARP requests just as regular HSRP deployments do, but with vPC both HSRP interfaces (active and standby) can forward traffic.
The configuration on the HSRP primary device looks like this:
ip address 10.50.0.251/24
preempt delay minimum 240
timers 1 3
The configuration on the HSRP secondary device looks like this:
ip address 10.50.0.252/24
preempt delay minimum 240
timers 1 3
The most significant difference between the HSRP implementation of a non-vPC configuration and a vPC configuration is that both vPC peers program the HSRP MAC addresses with the routed flag.
Given this fact, routable traffic can be forwarded by both the vPC primary device (with HSRP) and the vPC secondary device (with HSRP), with no need to send this traffic to the HSRP primary device.
The same behavior applies to VRRP configurations with vPC.
If a host or a switch forwards a frame to the Layer 3 gateway and this Layer 3 gateway is present on a vPC switch pair, everything works as expected as long as the destination MAC address of the frame is the HSRP MAC address.
If the frame that is sent to the Layer 3 gateway uses the MAC BIA instead of the HSRP MAC address, the PortChannel hashing of the frame may forward it to the wrong vPC peer, which would then just bridge the frame to the other vPC peer.
On the Cisco Nexus 5500 platform, the vPC peer would then just route the frame out to the vPC instead of dropping it as it would do with a nonrouted frame.
Figure 14 shows the case in which a server sends traffic to a remote MAC address (RMAC-B), but the traffic gets hashed to switch RMAC-A. As a result, the frame is bridged to the vPC peer (switch RMAC-B), which then routes the frame to vPC2, thus reaching the server.
Figure 14. If Servers Use the BIA MAC Instead of the HSRP MAC, Traffic May Be Routed Across the Peer Link
Although this approach is fine from a data forwarding correctness perspective, it is suboptimal because traffic is crossing the peer link while a direct path to vPC2 exists. To optimize the forwarding path, you should configure the peer-gateway command under the vPC domain. This command enables the vPC peers to exchange information about their respective BIA MAC addresses so that they can route the traffic locally without having to send it over the peer link.
The use of the peer gateway is recommended.
Interactions of vPC with SVIs
When the peer link fails, the vPC shuts down vPC member ports on the vPC secondary switch. When Layer 3 is used, shutting down the vPC member ports and leaving the associated SVIs up would cause traffic to be dropped, because the vPC secondary peer would still advertise reachability for the subnets associated with the vPC VLANs that have been shut down.
To avoid this problem, the vPC shuts down the SVIs on the vPC secondary device to help ensure that routing forwards traffic exclusively to the vPC primary device, as shown in Figure 15.
Figure 15. SVIs Go Down When the vPC Peer Link Goes Down
What Is a Layer 3 vPC?
Using Layer 3 interfaces on Cisco Nexus 5500 platforms that are operating in vPC mode is fully supported to provide the gateway function for servers and to forward traffic from clients to servers. However, Layer 3 peering over vPCs (often referred to as Layer 3 vPC) is rarely necessary, and it requires an understanding of the interoperation of vPC and routing protocols and is subject to restrictions.
Because of this, Layer 3 peering over vPC is not recommended.
A Layer 3 vPC is a vPC between a Layer 3 switch or router and two vPC peers whereby there are three routing entities peering over the Layer 2 segment provided by the vPC itself.
Figure 16 clarifies this concept. In the figure, router R has a PortChannel that is connected to both N5k1 and N5k2. This PortChannel carries a VLAN, for which router R has an IP address. Both N5k1 and N5k2 also have IP addresses on this VLAN.
From a Layer 3 perspective, this topology is logically equivalent to the topology at the far right of the figure in which three routing entities are peering over a Layer 2 segment.
The challenge with these topologies is that they include two load-balancing mechanisms that are acting independently:
• ECMP, which load-balances traffic from R to N5k1 and N5k2 based on hashing of the IP addresses and rewrites the destination MAC address of the packets accordingly
• Layer 2 PortChannel hashing, which may choose to send a packet to N5k1 even if ECMP chose to send the packet to N5k2
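The effect of two independent hashing decisions can be illustrated with a short simulation. The Python sketch below is not the hashing algorithm used by any Cisco platform: it uses SHA-256 purely as a stand-in for two unrelated hash functions, and the flow addresses and the helper name are hypothetical.

```python
import hashlib


def pick(label, flow):
    """Hash a flow to one of two peers (0 = N5k1, 1 = N5k2).

    The label makes the two mechanisms behave like unrelated hash functions.
    """
    digest = hashlib.sha256((label + repr(flow)).encode()).digest()
    return digest[0] % 2


# 1000 hypothetical (source, destination) flows sent by router R.
flows = [(f"10.1.0.{i}", f"10.50.0.{j}") for i in range(1, 51) for j in range(1, 21)]

# A flow is "crossed" when the PortChannel delivers it to the peer that
# ECMP did not select as the next hop.
crossed = sum(1 for f in flows if pick("ecmp", f) != pick("portchannel", f))
print(f"{crossed} of {len(flows)} flows reach the peer that ECMP did not choose")
```

Because the two decisions are uncorrelated in this model, roughly half of the flows arrive at a switch whose MAC address is not the destination of the frame, which is the situation that the peer gateway is meant to handle.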
A Layer 3 vPC can function correctly from a data forwarding perspective when a peer gateway is configured, but this approach may have repercussions for the IGP peering operations that are outside the scope of this document.
Figure 16. Layer 3 and vPC Interaction
For this reason, the topology that uses Layer 3 peering over vPC (the topology in Figure 16) is not recommended.
These topologies do not provide additional benefits compared to Layer 3 links with ECMP, and they can be modified to use Layer 3 links instead of Layer 3 peering over a vPC. Layer 3 connectivity between vPC peers and the aggregation and core layers is the recommended topology.
ECMP Between Aggregation and Access Layers
At the time of this writing, the recommended practice is the use of Layer 3 links to connect the Layer 3 access layer to the Layer 3 aggregation layer instead of the use of a Layer 3 vPC.
Figure 17 illustrates the topology of a vPC access layer connected with Layer 3 links to the aggregation layer. This topology uses ECMP for traffic distribution on all links between the access and aggregation layers just as in the non-vPC designs.
Figure 17. Layer 3 Access Topology with vPC
With ECMP, as previously stated, each Layer 3 switch has two routes and two associated hardware forwarding adjacency entries. Before a failure, traffic is being forwarded using both of these forwarding entries. If an adjacent link or neighbor fails, the switch hardware and software immediately remove the forwarding entry associated with the lost neighbor. After the removal of the route and forwarding entries associated with the lost path, the switch still has a remaining valid route and associated Layer 3 forwarding entry. Because of this, the switch does not need to trigger or wait for routing protocol convergence and is immediately able to continue forwarding all traffic using the remaining Layer 3 entry.
Establishing IGP Peering over the Peer Link
In vPC designs, you should make sure to include a Layer 3 link or VLAN between the Layer 3 switching vPC peers so that the routing areas are adjacent. With this link in place, if the Layer 3 links to the aggregation layer fail, a backup route exists to route traffic to the vPC peer so that the traffic is forwarded to the core through the routing peer.
Figure 18 illustrates this concept.
Figure 18. Layer 3 Peering over the vPC Peer Link
You can change the cost of the Layer 3 route built over an SVI to a meaningful value (which takes into account the reference bandwidth) by using the command ip ospf cost. For instance, if there are two 10 Gigabit Ethernet links between the Cisco Nexus 5500 switches, you can set the cost to 50; if there are four 10 Gigabit Ethernet links, you can set the cost to 25.
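The cost values quoted here follow from dividing the reference bandwidth by the aggregate bandwidth of the peer link. The Python sketch below reproduces that arithmetic, assuming the reference bandwidth of 1000000 configured earlier in this document is expressed in Mbps; the function name is hypothetical.

```python
def ospf_cost(reference_bw_mbps, link_bw_mbps):
    """OSPF interface cost: reference bandwidth divided by link bandwidth.

    The result is an integer and is never lower than 1.
    """
    return max(1, reference_bw_mbps // link_bw_mbps)


REFERENCE_BW_MBPS = 1_000_000  # assumed to match auto-cost reference-bandwidth 1000000

print(ospf_cost(REFERENCE_BW_MBPS, 2 * 10_000))  # two 10 Gigabit Ethernet links → 50
print(ospf_cost(REFERENCE_BW_MBPS, 4 * 10_000))  # four 10 Gigabit Ethernet links → 25
```

An SVI does not know how many physical links make up the peer link, which is why the cost is set manually rather than derived automatically.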
The following code shows how to create a Layer 3 link to connect the access-layer switches at Layer 3. Just as in the non-vPC design, make sure to change the cost of the SVI to reflect the bandwidth available on the peer link.
ip address 10.50.3.8/31
ip router ospf 1 area 22.214.171.124
ip ospf cost 50
ip ospf network point-to-point
The vPC multicast implementation is designed to address the following requirements:
• Avoiding duplicate frames.
• Helping ensure that multicast traffic reaches orphan ports on non-vPC VLANs and Layer 3 ports.
• Optimizing traffic forwarding between sources and receivers without crossing the vPC peer link when possible.
To understand how the Cisco Nexus 5500 platform operates with multicast in a vPC configuration, consider these two distinct forwarding scenarios:
• Forwarding with directly connected sources and receivers
• Forwarding with remote sources and receivers
You should also understand that vPC synchronizes IGMP join information between the vPC peers, but it does not synchronize PIM state information.
The general rules of routing for multicast traffic with vPC are as follows:
• If a vPC peer receives multicast packets from a directly connected source, it performs a local multicast route (mroute) lookup and replicates packets for each outgoing interface (OIF). In a non-vPC design, only one of the routing peers would have installed mroutes for source-to-receiver traffic, and traffic would cross the peer link. With vPC, the multicast implementation is referred to as dual-DR because traffic from directly connected sources to directly connected receivers can be routed by either vPC peer.
• If the OIF is a VLAN trunked over a vPC peer link, one copy is sent over the peer link for each VLAN present in the OIF list. By default, the vPC peer link is considered to be an mrouter port. Therefore, the multicast packet is sent over the peer link for each receiver VLAN. This action helps ensure that orphan ports on a given VLAN can receive this traffic.
• Multicast traffic received through the peer link cannot go out to a vPC member port at Layer 2, nor can it be routed. This behavior prevents duplicates.
• A special VLAN must be used for traffic that may require routing on the vPC peer.
• The traffic received by a vPC peer over this "special" VLAN can be routed only to non-vPC ports, to avoid introducing duplicates.
Each vPC peer sends one copy of the multicast traffic to the vPC peer through a reserved VLAN on the peer link. This reserved VLAN is configured with the following CLI command:
vpc bind-vrf <vrf_name> vlan <VLAN number>
One reserved VLAN is required for each VRF instance. Without this CLI command, the receivers in non-VPC VLANs and the receivers connected to Layer 3 interfaces on the vPC peer may not receive multicast traffic.
Note: Non-vPC VLANs are the VLANs that are not trunked over the peer link.
Note: Unlike regular VLANs used for server connectivity, the VLAN used in the bind-vrf command must not already be created. If you have created this VLAN, remove it with no vlan <number>; the bind-vrf <vrf_name> vlan <VLAN number> command will then work.
PIM Dual-DR Implementation for Directly Connected Sources and Receivers
Just as in non-vPC designs, one vPC peer is the PIM-DR. Unlike non-vPC designs, both vPC peers install multicast routes in the event of directly connected sources and receivers. If this were not the case, all multicast traffic would have to be sent to the PIM-DR for it to route the traffic and possibly send it back again on the peer link to reach the final destination. With the use of PIM dual-DR, the traffic forwarded between directly connected sources and receivers does not need to cross the peer link.
In the case of traffic flowing from a directly connected source to remote receivers, the (S,G) state and outgoing interface list (OIL) are programmed on only one of the vPC peers, depending on how PIM joins were routed to the source. In this case, traffic may have to cross the peer link before being routed.
PIM-DR and Source Registration
The elected PIM-DR is responsible for sending source registration to the Rendezvous Point (RP). When multicast traffic from a directly connected source is received by a vPC peer that is not a designated router, the designated router is informed through a Cisco Fabric Services message about the source and group addresses. The PIM-DR generates source registration packets to the RP. The RP then joins the SPT to the source through either vPC peer. The vPC peer then installs the (S,G) state with the appropriate OIL for the source to send traffic to the RP. No synchronization of state occurs between vPC peers for this (S,G) state.
At some point, the last-hop router next to the receiver also switches from the shared tree to the (S,G) state, which creates the (S,G) state on either the same vPC peer or the other vPC peer.
To help ensure that the traffic coming from the source behind a vPC is routed both to the RP and along the SPT to the receiver, vpc bind-vrf must be configured. This configuration is required because the vPC may hash the traffic from the source to either vPC peer, not both. The vpc bind-vrf command helps ensure that both vPC peers get a copy of the traffic so that they can route it as needed based on the configured (S,G) state and associated OIL.
PIM Prebuilt SPT
The implementation of vPC on the Cisco Nexus 5500 platform supports the concept of PIM prebuilt SPT by adding the configuration "ip pim pre-build-spt". With PIM prebuilt SPT, both vPC peers send PIM joins to the Rendezvous Point to receive a copy of the multicast traffic.
The IGMP reports from a receiver in the vPC domain are synchronized between vPC peers. Both the PIM-DR and the non-PIM-DR install the (*,G) state and the OIL information, and both vPC peers send a PIM join to the Rendezvous Point.
After the multicast traffic is received, both vPC peers join the SPT to the source, but only one of the two vPC peers actively forwards traffic. The vPC peer that performs this task is selected on the basis of which vPC peer has the shortest path to the source, and in case of identical distance, on the basis of the vPC role, with the vPC primary device having greater priority in forwarding traffic than the vPC secondary device.
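The forwarder selection rule can be expressed as a two-level comparison. The following Python sketch models only the rule as described in this section; the function name, dictionary keys, and metric values are hypothetical, and the real selection is performed internally by the vPC peers.

```python
def select_forwarder(peers):
    """Choose which vPC peer actively forwards traffic from a remote source.

    peers: list of dicts with 'name', 'metric_to_source', and 'role'
    ('primary' or 'secondary'). The lowest metric wins; with equal
    metrics, the vPC primary is preferred.
    """
    best = min(peers, key=lambda p: (p["metric_to_source"], p["role"] != "primary"))
    return best["name"]


# Identical distance to the source: the vPC primary forwards.
print(select_forwarder([
    {"name": "N5k1", "metric_to_source": 20, "role": "primary"},
    {"name": "N5k2", "metric_to_source": 20, "role": "secondary"},
]))  # → N5k1

# A shorter path on the secondary overrides the role.
print(select_forwarder([
    {"name": "N5k1", "metric_to_source": 30, "role": "primary"},
    {"name": "N5k2", "metric_to_source": 20, "role": "secondary"},
]))  # → N5k2
```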
During the transition from shared tree to SPT, duplicate frames temporarily exist.
Sizing the Peer Link for Multicast Traffic
Under normal conditions, the peer link is not used because traffic enters and exits the Cisco Nexus 5000 Series Switch from one of the vPC member ports. However, certain types of traffic do use the peer link:
• Traffic destined for orphan ports
• Multicast traffic sent
• Broadcast traffic
When a server joins a multicast group, it sends an IGMP report, which is synchronized between the vPC peers (N5k1 and N5k2) to associate the multicast MAC addresses of that given multicast group with the vPC member port on the peer Cisco Nexus 5000 Series Switch (N5k2).
When a multicast stream arrives, it may be hashed to either Cisco Nexus 5000 Series Switch (N5k1 or N5k2). This multicast stream is sent not only to the receivers but also over the vPC peer link, in case there are other single-attached subscribers on N5k1.
In the presence of routing, the vPC peer may also route traffic to other VLANs, in which case this additional traffic is also sent over the vPC peer link.
In addition to these copies, a copy of the multicast traffic is sent to the vPC peer through the bind-VRF VLAN for the purpose of routing to Layer 3 ports connected to the vPC peer.
To optimize the utilization of the peer link, you can use the following CLI commands to avoid sending multicast traffic over the peer link for each receiver VLAN when there are no orphan ports on the vPC peer:
N5596-L3-1(config)# no ip igmp snooping mrouter vpc-peer-link
Warning: IGMP Snooping mrouter vpc-peer-link should be globally disabled on peer VPC switch as well.
With this CLI command configured, the multicast packet is sent over the peer link only for those VLANs that have orphan ports.
Even with this optimization in place, in the presence of sustained multicast traffic you may need to properly size the vPC peer link to prevent the peer link from becoming a bottleneck.
This section includes some example configurations of a Layer 3 access layer with and without vPC. Figure 19 illustrates the topology that is used in the examples.
Figure 19. Topology Used in Configuration Examples