Cisco ACI fabric connectivity to DCIG
OpFlex policy framework with DCI
Cisco ACI GOLF design best practices
The Cisco® Application Centric Infrastructure (Cisco ACI™) GOLF (giant overlay fabric) feature provides an efficient, scalable way for the Cisco ACI fabric to hand off WAN connectivity to Locator/ID Separation Protocol (LISP), Multiprotocol Label Switching (MPLS), or Virtual Routing and Forwarding–lite (VRF-lite) domains. Customers can transform their business with Cisco ACI while continuing to use existing infrastructure, such as Cisco Nexus® 7000 or 7700 Series Switches, to provide WAN/Data Center Interconnect (DCI) border router functionality along with Cisco ACI Multi-Pod connectivity within a site or across sites. Cisco ACI fabric connectivity to the external world has traditionally been achieved with VRF-lite–based Layer 3 Outside (L3Out) configurations on Cisco ACI border leaf switches, which have scale limitations.
The GOLF feature uses a Multiprotocol BGP Ethernet VPN (MP-BGP EVPN) control plane and a Virtual Extensible LAN (VXLAN) data plane to carry all tenant prefixes over a single, simpler L3Out configuration, shared by all tenants, on the Cisco ACI spine switches. In addition, the OpFlex protocol pushes, or extends, Cisco ACI policy information about the fabric-facing tenant configuration onto the WAN/DCI routers. This model offers enhanced scalability, operational simplicity, and automation.
The Cisco ACI GOLF feature provides the following benefits:
● Highly scalable multiprotocol BGP EVPN for multitenancy support
● Optimized inbound and outbound traffic to and from the Cisco ACI fabric using LISP and EVPN type-2 host routes
● Redundancy and availability via BGP Additional Paths and equal-cost multipathing (ECMP) support
● Automation of configuration on WAN/DCI router with OpFlex policy
This document assumes knowledge of the Cisco ACI infrastructure and the Multi-Pod fabric architecture. Please refer to the documentation links in the “For More Information” section at the end of the document.
LISP integration with the Cisco ACI fabric is not covered in this document.
VXLAN tunnel endpoint (VTEP): A VTEP encapsulates Ethernet (MAC) traffic into IP packets and routes it to other VTEPs.
VXLAN Network Identifier (VNID): Identifier within the VXLAN header that identifies the network and can be mapped to a VLAN. From a forwarding perspective, a VNID is a broadcast domain.
Data Center Interconnect Gateway (DCIG):
● From the Cisco ACI fabric perspective, DCIGs act as VXLAN VTEPs, running MP-BGP with the border nodes to exchange Network Layer Reachability Information (NLRI) for the EVPN address family.
● From the MPLS WAN perspective, DCIGs act as MPLS Layer 3 VPN provider-edge devices, peering with provider-edge devices in the remote data center and exchanging NLRI for the VPNv4/VPNv6 address families.
● From an IP WAN perspective, DCIGs act as IP VRF-lite peers or LISP gateways with remote sites and exchange NLRI for the IPv4 and IPv6 address families.
Inter-Pod Network (IPN): The IPN is a Layer 3 transport connecting the different Cisco ACI pods and the DCIG border devices, allowing pod-to-pod communication (also known as east-west traffic) and pod-to-Internet/remote-site communication (also known as north-south traffic).
Multipod: A Multi-Pod fabric consists of multiple Cisco ACI pods under the administration of a single APIC cluster, which simplifies operation of all the interconnected pods. This allows a single policy domain to be created across all the pods, ensuring consistent end-to-end security policies.
Cisco ACI fabric connectivity to DCIG
Direct and indirect connections (through a transport IP network) are supported to establish control and data plane communication between the Cisco ACI fabric and the DCIG devices.
A dedicated IPN or a collapsed IPN model can be used for Layer 3 IP connectivity between the Cisco ACI spines and the DCIGs. Four types of deployment models are supported using ACI standalone or Multi-Pod fabrics:
● Standalone fabric with separate IPN and DCI edge devices
● Standalone fabric with collapsed IPN and DCI edge devices
● Multipod fabric with separate IPN and distributed DCI edge devices
● Multipod fabric with collapsed IPN and distributed DCI edge devices
BGP EVPN control-plane sessions are established between the Cisco ACI spine switches and the DCIGs (Cisco Nexus 7000 Series Switches) to carry routes in and out of the fabric toward the WAN, and vice versa, for every tenant VRF created in the ACI fabric.
Tenant routes advertised from Cisco ACI fabric to DCIG
Cisco ACI Bridge Domain (BD) subnets and external-router transit routes are the two types of per-tenant network advertisements from the Cisco ACI fabric, as shown in Figure 1. The Cisco ACI spine switches advertise the tenant prefixes located behind the Cisco ACI leaf nodes: BD subnets marked as public are advertised with the Cisco ACI spine anycast VTEP as the next hop, and transit routes are advertised with the leaf VTEP as the next hop.
Remote routes advertised from DCIG toward Cisco ACI fabric
DCIG routes advertised to the Cisco ACI spine switches are simply reflected to the Cisco ACI leaf switches inside the fabric with the DCIG VTEP as the next hop, as shown in Figure 1.
BGP additional path and ECMP support
Transit prefixes learned from an external router that is redundantly connected to ACI leaf switches are reachable over multiple paths. The ACI spine switches advertise those redundant paths to the DCIGs using BGP additional paths configured on the Cisco ACI spine. The DCIG should be configured with additional-path receive and an appropriate maximum-paths value so that it installs ECMP routes toward the Cisco ACI leaf switches.
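A minimal DCIG-side sketch of this configuration is shown below, assuming an illustrative autonomous-system number and tenant VRF name; the exact placement of the maximum-paths command (global versus per-VRF address family) can vary by NX-OS release.

router bgp 65000
  address-family l2vpn evpn
    additional-paths receive            ! accept the additional paths advertised by the ACI spines
  vrf Tenant1-VRF
    address-family ipv4 unicast
      maximum-paths 4                   ! install ECMP routes toward the redundant ACI leaf VTEPs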
Host route advertisement
From APIC Release 2.1(1) onward, the Cisco ACI spine switches can advertise host routes as EVPN type-2 (MAC-IP) routes, in addition to the EVPN type-5 (IP prefix) routes for public BD subnets, to the DCIGs to avoid suboptimal north-to-south traffic forwarding in Multi-Pod setups.
Downstream VNID management
The Cisco ACI fabric uses the APIC controller for tenant management and dynamic VNID assignment for all tenants. The DCIGs are not controlled by the APIC and use a different VNID space, which is manually configured or sequentially assigned per tenant. These asymmetric per-VRF VNIDs are exchanged as labels in the BGP EVPN control plane and treated as downstream assigned between the DCIG and the ACI spine switches. This ensures that packets exchanged between the DCIG and the Cisco ACI fabric are sent with the correct VNID.
As soon as a DCIG VTEP receives BGP EVPN route updates from the ACI spine or leaf switches, it adds the corresponding VTEP address to its peer list. For data-plane forwarding, a BGP EVPN VTEP accepts VXLAN-encapsulated packets only from VTEP peers that are on this peer list.
North-south traffic flows from DCIG to BD Endpoints
Traffic flows from the DCIG to public BD endpoints attached directly to a Cisco ACI leaf switch are tunneled to the Cisco ACI spine anycast VTEP; hashing determines which spine switch receives the traffic. The spine switch terminates the tunnel by de-encapsulating the VXLAN packet and then re-encapsulates it into a new VXLAN tunnel toward the leaf switch where the BD endpoint resides.
North-south traffic flows from DCIG to External router
Traffic flows from the DCIG to prefixes learned through an external router connected to Cisco ACI leaf switches are ECMP tunneled toward the Cisco ACI leaf VTEPs. The leaf switches terminate the tunnel by de-encapsulating the VXLAN packets and forward the traffic to the external routers.
South-to-north traffic flow from Cisco ACI fabric to DCIG
Traffic flows from BD endpoints and external routers attached directly to Cisco ACI leaf switches toward the DCIGs are ECMP tunneled to the DCIG VTEP. The DCIG terminates the tunnel by de-encapsulating the VXLAN packets and forwards the traffic to the destination on the WAN using VRF-lite or MPLS.
Traffic to and from the Cisco ACI fabric uses VXLAN User Datagram Protocol (UDP) port 48879, which differs from the port used by the default DCIG implementation. Configure the UDP port globally on the Cisco Nexus 7000 DCIG router to match, so that it can VXLAN encapsulate and de-encapsulate the traffic exchanged with the Cisco ACI fabric.
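A hedged sketch of this global setting on the DCIG is shown below; the command name and default value should be verified against the NX-OS release in use.

! Match the VXLAN UDP destination port used by the Cisco ACI fabric
vxlan udp port 48879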
OpFlex policy framework with DCI
An OpFlex control plane is established between the DCIG devices and the Cisco ACI spine switches to automate fabric-facing tenant provisioning on the DCIG edge devices. The DCIG interfaces with the ACI fabric as an external Policy Element (PE) that talks to the ACI spine, which acts as a proxy Policy Repository (PR) for the DCI-specific policy information required for fabric automation, as shown in Figure 2. The Data Management Engine (DME) is a service running on the APIC that manages data for this policy model using message entities such as requests and responses.
The network administrator simply configures a new external Layer 3 Outside (L3Out) policy for a tenant on the Cisco Application Policy Infrastructure Controller (APIC) and port-profile templates on the DCIG. The controller then programs all the related information associated with that tenant, such as the VRF instance name and BGP extended-community route-target attributes, onto the Cisco ACI spine switches.
The OpFlex proxy server running on the spine switches reads the L3Out managed object, converts it into OpFlex model events, and pushes them to the DCIG. The DCIG converts these OpFlex model events into configuration through port-profile templates, which are part of the Day-0 configuration pre-provisioned on the Nexus 7000 Series Switches. This automates the fabric-facing tenant configuration on the DCIG.
Currently, Cisco Nexus 7000 and 7700 Series Switches support three types of Layer 3 handoffs, which allow the Cisco ACI fabric to connect externally using VRF-lite, MPLS, and LISP, as shown in Figure 3.
The Cisco Nexus line of data center hardware and software products must pass Cisco's comprehensive quality-assurance process, a multistage approach comprising extensive unit, feature, and system-level testing. Each successive stage in the process adds increasing levels of complexity in a multidimensional mix of features and topologies.
This section provides an overview of the Cisco Validated Profile for the Cisco ACI Multi-Pod deployment model. The validated solution consists of two Cisco ACI pods, one per site, interconnected back-to-back over dark fiber. It provides an operationally simple way to interconnect Cisco ACI fabric networks that may be either physically collocated or geographically dispersed. Each pod is composed of Cisco Nexus 9000 Series spine and leaf switches, as shown in Figure 4. APIC controllers are deployed in data center site 1 (two nodes) and data center site 2 (one node), acting as a distributed APIC cluster with a single policy domain.
Multi-Pod also has the capability of auto-provisioning configuration for all the Cisco ACI devices deployed in remote pods with zero-touch configuration. The IPN devices connected to the spine switches of the remote pod must be able to relay Dynamic Host Configuration Protocol (DHCP) requests generated by a newly provisioned Cisco ACI spine in the second pod toward the APIC node(s) active in the first pod.
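The following is a hedged sketch of a DHCP relay configuration on an IPN interface facing a remote-pod spine; the interface, subnet, and APIC address are placeholders used for illustration only.

feature dhcp
ip dhcp relay
!
interface Ethernet1/1
  description Link to Pod2 spine
  ip address 192.168.2.1/30
  ip dhcp relay address 10.0.0.1        ! APIC node address reachable in Pod1 (placeholder)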
The IPN essentially represents an extension of the Cisco ACI fabric underlay infrastructure, helping ensure that VXLAN tunnels can be established across pods and to the DCI devices, allowing endpoint communication. The IPN devices are mainly responsible for inter-pod VXLAN traffic exchange, whereas the DCI edge devices are responsible for VXLAN encapsulation and de-encapsulation of traffic between the Cisco ACI fabric and the remote-site routers using IP VRF-lite, MPLS, or LISP.
A single Cisco Nexus 7000 or 7700 Series Switch can function as both IPN and DCIG in a collapsed mode, providing Layer 3 routing for inter-pod VXLAN traffic as well as VXLAN encapsulation and de-encapsulation of traffic toward MPLS or IP VRF-lite.
Data Center Interconnect Gateway
In this Cisco-validated deployment profile with multitenancy using VRF contexts, external Layer 3 connectivity and Data Center Interconnect (DCI) are provided through Cisco Nexus 7000 and 7700 Series Switches with F3 or M3 I/O modules. The main advantage of these switches is the versatility of their DCIG function: they can provide multiple Layer 3 handoffs to the ACI fabric simultaneously.
The configuration flow has two main steps: Day-0 and Day-1. Day-0 consists of one-time manual pre-configuration required to establish underlay routing between the ACI spines and the DCIG routers; it builds the transport network as well as the overlay MP-BGP EVPN and NVE peerings, and it also instantiates the OpFlex policy framework connection between the DCIG and the ACI spine switches. Day-1 is the recurring addition of new tenant configuration to the DCIG running configuration through OpFlex, using the port-profile templates created as part of Day-0.
The DCIG, acting as a GOLF router VDC on the Nexus 7000 or 7700 Series, enables the features and feature sets listed below. The fabric forwarding feature enables the fabric forwarding protocol and HMM. The VNI feature enables the ability to configure and allocate VNIs. The interface VLAN feature allows Layer 3 SVIs/BDs to be configured. The NVE interface must also be configured to specify which loopback interface is used as the TEP (tunnel endpoint) address, must use BGP EVPN as the control-plane protocol for host reachability, must have unknown-peer forwarding enabled, and must permit downstream VNI assignment per VRF.
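A condensed Day-0 sketch of this feature and NVE enablement is shown below; keyword details (for example, the per-VRF downstream VNI option) differ slightly between NX-OS releases, so treat this as illustrative rather than a complete configuration.

install feature-set fabric
feature-set fabric
feature fabric forwarding             ! fabric forwarding protocol and HMM
feature vni                           ! allow VNIs to be configured and allocated
feature interface-vlan                ! Layer 3 SVIs/BDs
feature nv overlay                    ! VXLAN NVE interface support
!
interface nve1
  source-interface loopback1          ! loopback used as the DCIG VTEP address
  host-reachability protocol bgp      ! BGP EVPN as the host-reachability control plane
  unknown-peer-forwarding enable      ! unknown-peer forwarding, as required for the GOLF handoff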
For MP-BGP EVPN functionality, first enable the BGP and nv overlay features, configure nv overlay evpn, and configure the MP-BGP neighbors with the l2vpn evpn address family. In addition, because VXLAN EVPN uses extended communities, sending of extended communities must be configured under each neighbor; alternatively, the ‘both’ keyword sends both standard and extended communities.
The Multi-Pod ACI fabric is a single BGP autonomous system, and the DCIGs typically peer with the ACI spine switches using eBGP over loopback interfaces, so eBGP multihop is used. One additional command, ‘allow-vni-in-ethertag’, is required under the EVPN address family globally to allow the DCIG to receive type-2 routes in addition to type-5 routes.
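A hedged configuration sketch of this eBGP EVPN peering on the DCIG follows; the AS numbers, loopback, and spine address are placeholders.

feature bgp
nv overlay evpn                         ! enable the EVPN control plane
!
router bgp 65000
  address-family l2vpn evpn
    allow-vni-in-ethertag               ! accept EVPN type-2 routes in addition to type-5 routes
  neighbor 10.1.1.1 remote-as 65001     ! ACI spine EVPN loopback (placeholder)
    update-source loopback0
    ebgp-multihop 5                     ! eBGP over loopbacks requires multihop
    address-family l2vpn evpn
      send-community both               ! send standard and extended communities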
The MP-BGP VPNv4/VPNv6 address families allow the MPLS core to interconnect the different sites for both IPv4 and IPv6 services. The provider (P) routers are redundantly connected to the DCIGs, which also act as provider-edge (PE) routers with fully meshed MP-BGP sessions to the remote peers for both the IPv4 and IPv6 address families. Open Shortest Path First (OSPF) is used as the provider Interior Gateway Protocol (IGP) to guarantee underlay Layer 3 reachability among all the interfaces used by BGP.
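The WAN-facing side of the DCIG acting as an MPLS PE can be sketched as follows; the OSPF process tag, AS number, and remote-PE address are placeholders, and license-dependent MPLS features are assumed to be available.

feature ospf
feature mpls l3vpn
feature mpls ldp
!
router ospf UNDERLAY
  router-id 10.2.2.2                    ! provider IGP for loopback reachability
!
router bgp 65000
  neighbor 10.3.3.3 remote-as 65000     ! remote PE in the full iBGP mesh (placeholder)
    update-source loopback0
    address-family vpnv4 unicast
      send-community extended
    address-family vpnv6 unicast
      send-community extended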
Non-Stop Forwarding (NSF) is a high-availability feature on modular switches running Cisco NX-OS Software with a redundant supervisor. On the Cisco Nexus 7000 and 7700 series, data packets are forwarded by the hardware-forwarding engines on the modules. These engines are programmed with information learned from the routing control plane running on the supervisors. If the active supervisor were to fail, the forwarding tables on the modules would be preserved. All interface states are also preserved, and the standby supervisor takes over active control of the system. This high-availability system prevents any drop of traffic during the failure of the active control plane.
BGP Graceful Restart is a BGP feature that prevents disruption to the control and data plane. It allows for the recovery of BGP sessions after a peer has failed. When combined with the NSF feature, any Graceful Restart–capable peers connected to a switch going through supervisor switchover will continue to forward traffic seamlessly. NSF and Graceful Restart for BGP are enabled by default on NX-OS software.
All the connections are on F3 and M3 modules: N7K-F324FQ-25L and N7K-M348XP-32L. To increase the number of possible connections on the N77-M324FQ-25L modules, the interface breakout command can be used to split one 40-Gbps port into four separate 10-Gbps ports.
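A hedged example of the breakout configuration is shown below; the module and port numbers are placeholders, and breakout support should be confirmed for the specific module and release.

! Split 40-Gbps ports into four 10-Gbps interfaces each
interface breakout module 2 port 1-2 map 10g-4x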
Cisco ACI spine node
The spine is the backbone switch that connects to every leaf switch in a pod. Only leaf switches can connect to the spine switches; no other equipment can connect to them. The one exception is the fabric underlay extension to the IPN or DCIGs using 10-Gbps or 40-Gbps connections.
The Cisco ACI spine switches in each pod can function as BGP route reflectors, and the spines re-originate the EVPN prefixes received from the DCIG into the Cisco ACI fabric using the VPNv4 address family. The leaf switches receive these re-originated prefixes with a next hop pointing directly to the DCI router.
Because the ACI spine switches directly advertise locally originated internal BD subnets into EVPN (rather than importing them from VPNv4), the DCIG sees a next hop of the ACI spine switches for the BD subnets. For transit routes learned through an external router behind the Cisco ACI leaf switches, the DCIG router encapsulates the packet and routes it directly to the leaf switch.
Manual RT and Auto RT are the two ways to configure EVPN route targets (RTs) for the GOLF-enabled VRFs on the ACI fabric. The route targets are synchronized between the ACI spines and the DCIGs through the OpFlex proxy server running on the ACI spine switches.
The VNID associated with each VRF is carried in the BGP EVPN update as a downstream “received label.” This VNID identifies VRF membership on the DCIG device and is used by the ACI spine and leaf switches as the rewrite entry when sending traffic to the DCIG.
Cisco ACI leaf node
Devices such as servers/hosts, Layer 2 switches, external routers, and service appliances can be dual-attached to vPC leaf switch pairs using IEEE standard port channels. The Cisco ACI leaf switches receive the remote-originated BGP EVPN prefixes via VPNv4 from the spine switches. The next hop is set to the DCI device, and BGP dynamically creates a tunnel whose next hop is the GOLF device; this tunnel is used for programming the DCI route adjacencies.
The ACI leaf switch does not directly advertise the BD subnet to the spine (the ACI spines locally originate the BD subnet into EVPN rather than re-originating it from VPNv4). An ACI leaf switch acting as border leaf advertises the external transit prefixes to the ACI spine switches using BGP VPNv4; the ACI spine switches are responsible for re-originating these into EVPN and advertising them to the GOLF router.
Consistent security policies can be defined for communication with the external Layer 3 domain and are always enforced at the Cisco ACI leaf nodes for both inbound and outbound traffic flows.
Tables 1–5 list the hardware and scaling for the Cisco ACI Multi-Pod solution.
Table 1. Hardware profile summary
Tier/layer | Hardware detail (quantity) | Software release
DCIG | Chassis: N77-C7706 (4); Supervisor: N77-SUP2E (2); I/O modules: N77-F324FQ-25 (2), N77-F348XP-23 (2), N77-M324FQ-25L (2), N77-M348XP-32L (2) | Cisco NX-OS 8.1(2), NX-OS 8.2(1)
Cisco ACI leaf switch | Chassis: N9K-C9332PQ, N9K-C9372PX, N9K-C9372TX | Cisco ACI 12.2(2k)
Cisco ACI spine switch | Chassis: N9K-9504, N9K-C9336PQ; Supervisor: N9K-SUP-A; I/O module: N9K-X9736PQ | Cisco ACI 12.2(2k)
IPN | Chassis: N77-C7010 (4); Supervisor: N77-SUP2E (2); I/O module: N77-F324FQ-25 (1) | NX-OS 7.3(2)
Cisco APIC controller | Controller: APIC-M2 (3) | 2.2(2k)
Note: Ixia is used for generation of IPv4- and IPv6-based unicast traffic streams.
Table 2. Profile scale
Feature | Scale
Total VRFs | 900
VRF-lite VRFs | 50
MPLS VRFs | 800
LISP VRFs | 50
Link Aggregation Control Protocol (LACP) port channels | 4
Bidirectional Forwarding Detection (BFD) neighbors | 4
Link Layer Discovery Protocol (LLDP) | Enabled
Authentication, Authorization, and Accounting (AAA)/RADIUS | Enabled
BFD | Enabled
Simple Network Management Protocol (SNMP) | Enabled
Switched Port Analyzer (SPAN) | 2
Encapsulated Remote SPAN (ERSPAN) sessions | 4
Table 3. DCIG (Cisco Nexus 7000 and 7700 Series Switches)
Feature | Scale
Layer 3 IPv4 routes | 16,000
Layer 3 IPv6 routes | 8,000
NVE peers | 12
BD VLANs | 1,800
Layer 3 VNIs | 1,800
eBGP peers | 5
Layer 3 ECMP using BGP add-path | Enabled
OpFlex automation | Enabled
Table 4. Cisco ACI spine scale
Feature | Scale
VRFs | 900
External BGP (eBGP) peers per DCIG | 4
Adjacency table | 10,500
OpFlex peers per multipod | 2
Maximum hosts per subnet | 25
Table 5. Cisco ACI leaf scale
Feature | Scale
VRFs per leaf | 150
Bridge domains | 1,500
Layer 3 IPv4 routes | 4,000
Layer 3 IPv6 routes | 1,000
Use case testing
The use cases in this section are executed using the topology shown in Figure 4, along with the test scale and hardware listed in Tables 1–5.
Cisco Nexus 7000 and 7700 Series Switches system upgrade
Software upgrade through system reload (cold boot)
The network administrator should be able to perform upgrades and downgrades between releases seamlessly, using the feature incompatibility check, with no configuration loss. To execute the disruptive procedure, copy the kickstart and system images to bootflash, change the boot variables to reference the replacement images, and then reload the switch. Please refer to the Cisco NX-OS Software Upgrade and Downgrade Guide on Cisco.com.
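A hedged outline of the disruptive upgrade steps is shown below; the server, VRF, and image file names are placeholders.

copy scp://user@server/n7700-s3-kickstart.8.2.1.bin bootflash: vrf management
copy scp://user@server/n7700-s3-dk9.8.2.1.bin bootflash: vrf management
show incompatibility system bootflash:n7700-s3-dk9.8.2.1.bin    ! feature incompatibility check
boot kickstart bootflash:n7700-s3-kickstart.8.2.1.bin
boot system bootflash:n7700-s3-dk9.8.2.1.bin
copy running-config startup-config
reload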
Power On Auto Provisioning (POAP)
By default, Cisco Nexus switches enter POAP mode, locate a DHCP server, and bootstrap themselves with an interface IP address, gateway, and DNS server IP addresses. They can also obtain the IP address of a Trivial File Transfer Protocol (TFTP) server, or the URL of an HTTP server, and download a configuration script that runs on the switch to download and install the appropriate software image and Day-0 configuration file.
ASCII replay
Use the reload ascii command to copy an ASCII version of the configuration to the startup configuration when reloading the device.
In-service software upgrade
Software upgrades can be performed nondisruptively on the Cisco Nexus 7000 and 7700 Series Switches if a redundant supervisor is present. This upgrade procedure, called In-Service Software Upgrade (ISSU), is nondisruptive for connected endpoints.
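A hedged sketch of the ISSU workflow follows; the image file names are placeholders.

! Preview the upgrade impact, then perform the nondisruptive install
show install all impact kickstart bootflash:n7700-s3-kickstart.8.2.1.bin system bootflash:n7700-s3-dk9.8.2.1.bin
install all kickstart bootflash:n7700-s3-kickstart.8.2.1.bin system bootflash:n7700-s3-dk9.8.2.1.bin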
Software Maintenance Upgrade (SMU)
Software can be patched for known defects using an SMU. Please see the Software Maintenance Upgrade for Cisco NX-OS documentation on Cisco.com.
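A hedged sketch of the SMU workflow follows; the package file name is a placeholder.

install add bootflash:n7700-s3-dk9.8.2.1.CSCxx12345.bin    ! stage the patch package
install activate n7700-s3-dk9.8.2.1.CSCxx12345.bin         ! activate the patch
install commit                                             ! make the patch persistent across reloads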
Graceful Insertion and Removal (GIR)
The Graceful Insertion and Removal functionality available on the Cisco Nexus switches may be used as an alternative to ISSU. Please see Graceful Insertion and Removal Mode on Cisco Nexus Switches.
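A minimal sketch of entering and leaving maintenance mode is shown below.

system mode maintenance        ! gracefully isolate the switch from the network
no system mode maintenance     ! gracefully reinsert the switch into the network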
Operational triggers
In this section, we will cover network events applicable to the production network.
Configuration changes
Verify that the system remains stable and recovers to working condition after the following events are triggered:
● Link flaps, NVE flaps, port-channel flaps
● Add/delete VRF, VLAN, and port-channel members
● Add ports to Virtual Device Context (VDC) in Cisco Nexus 7000 Series Switches
● Clear counters, clear Address Resolution Protocol (ARP), clear routes
Network events
Validate that the system remains stable and recovers to working condition after the following events are triggered:
● Module reload
● Reload of chassis
● Process crash
System switchover
Verify that the system switchover has no impact on any functionality or data traffic.
Resiliency and robustness
In this section, we will cover daily network events that occur in the production network.
Longevity
CPU and memory usage are monitored overnight and over weekends, along with checks for memory leaks. The robustness of the software is tested by triggering negative events during the use-case testing.
Management and monitoring
SPAN and ERSPAN
The Switched Port Analyzer (SPAN) and Encapsulated Remote SPAN (ERSPAN) features allow traffic to be mirrored within a switch from a specified source to a specified destination. These features are typically used when detailed packet information is required for troubleshooting, traffic analysis, and security-threat prevention. Using Ethanalyzer, the captured packets can be decoded for VXLAN headers; both the outer packet with the VNID and the inner packet details can be examined for further analysis.
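A hedged example of an ERSPAN source session and an Ethanalyzer control-plane capture follows; the session number, interfaces, and IP addresses are placeholders, and the exact syntax can vary by release.

monitor erspan origin ip-address 10.50.50.1 global
monitor session 10 type erspan-source
  erspan-id 10
  vrf default
  destination ip 10.50.50.50            ! traffic analyzer address (placeholder)
  source interface Ethernet1/1 both
  no shut
!
ethanalyzer local interface inband detail limit-captured-frames 100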
Cisco ACI GOLF design best practices
● Deploy a 3-node APIC cluster, independent of the total number of pods (plus one backup node in a 2-pod scenario).
● Use 10-Gbps or 40-Gbps connections between the ACI spine switches and the IPN network.
● OSPF is the only protocol supported between the ACI spine switches and the IPN devices for underlay extension.
● Increase MTU to 9150 or 9216 bytes on all the Layer 3 interfaces of the IPN and DCIG devices.
● If the data centers are directly connected over a Layer 3 network, the pods in the Cisco ACI Multi-Pod solution should have a Round-Trip Time (RTT) of no more than 10 msec between them.
● Deploy BGP authentication to increase the security of the network.
● Use MP-BGP EVPN fully-meshed sessions between spine switches in each pod and on all DCI devices.
● Use port-aggregation protocols to optimize bandwidth utilization and increase connection resiliency. In this profile, Link Aggregation Control Protocol (LACP) is deployed.
● Enable iBGP Equal-Cost Multipath (ECMP) routing to optimize bandwidth utilization on all the routed connections between the aggregation PEs and the core nodes. To fully implement iBGP ECMP, enable "additional-paths send/receive/selection" on the RR routers as well as "additional-paths receive/install backup" on all the RR clients for both the IPv4 and IPv6 address families.
● Optimize outbound traffic flows and eliminate hairpinning (tromboning) across the IPN. Tune the BGP configuration to optimize flows, and always prefer the local GOLF devices.
● Optimize inbound traffic flows and eliminate hairpinning of ingress traffic across the IPN by using host-route advertisement from the Cisco ACI fabric to the GOLF devices to ensure that traffic is always steered to the right data center:
◦ Option 1: Advertise host routes into the WAN.
◦ Option 2: Deploy LISP and leverage host route advertisements from the ACI spine switches to register endpoints in the LISP map system.
The Cisco ACI GOLF handoff mechanism for integrating the Cisco ACI fabric with Cisco Nexus 7000 Series Switches is based on open protocols: MP-BGP EVPN for the control plane, VXLAN for the data plane, and OpFlex for policy exchange. This model provides a scalable and operationally simpler mechanism for extending Layer 3 communication between the data center fabric and the external WAN domain, solving a challenge often seen in today's data centers. It eliminates the need for per-tenant, per-VRF L3Out sessions and instead uses a common GOLF L3Out, shared by all tenants, to communicate with the outside world.
ACI Whitepapers