VMDC DCI 1.0 DG
System Level Design Considerations

Table of Contents

System Level Design Considerations

System Scale Considerations

System Availability

Security

Manageability

Service Assurance and Monitoring

Traffic Engineering

MAC Pinning

QoS Framework

Classification and Marking

Queuing, Scheduling, and Dropping

Shaping and Policing

System Level Design Considerations

The following system level design considerations are defined:

System Scale Considerations

Since this VMDC DCI release is based on the FabricPath design validated in previous VMDC releases, most of the intra-data center scale considerations remain the same as in the VMDC 3.x releases. Intra-data center scaling is summarized below; DCI-specific scaling requirements are presented later in this section.

  • VLAN Scale—In NX-OS releases 5.2.5 through 6.1, a maximum of 2000 FabricPath-encapsulated VLANs is supported; NX-OS 6.2 increases this to 4000 VLANs. However, it is important to note that this is by definition a one-dimensional figure, which does not factor in inter-related (Layer 2 to Layer 3) end-to-end traffic flow considerations such as FHRP constraints per module or per node. In practice, overall system VLAN scaling will be constrained by the effect of ARP learning rates on system convergence, and by FHRP (HSRP or GLBP) groups per module or interface and per node. Regarding the latter, HSRP support is currently 500 groups per module and 1000 per system with aggressive timers, or 2000 per system with default timers; GLBP support is 200 groups per module and 500 per system with aggressive timers, or 1000 per system with default timers.
  • Switches per FabricPath Domain—NX-OS 5.2 supports a maximum of 64 switch IDs; NX-OS 6.0, a maximum of 128; NX-OS 6.2, a maximum of 256.
  • Port Density per FabricPath Node—At 48 ports per module, the F2 line cards provide up to 768 10GE or 1GE ports per switch (N7018), while the F1 cards provide up to 512 10GE ports (N7018). Again, these are one-dimensional figures, but they serve to give a theoretical maximum in terms of one measure of capacity. Currently the Nexus 7000 FabricPath limitation is 256 core ports or 384 edge ports.
  • MAC Address (Host) Scale—All FabricPath VLANs use conversational MAC address learning. Conversational MAC learning consists of a three-way handshake. This means that each interface learns only those MAC addresses for interested hosts, rather than all MAC addresses in the VLAN. This selective learning allows the network to scale beyond the limits of individual switch MAC address tables. Classical Ethernet VLANs use traditional MAC address learning by default, but the CE VLANs can be configured to use conversational MAC learning.
  • ARP Learning Rate—As noted, ARP learning rates on layer 3 edge nodes affect system convergence for specific failure types. ARP learning rates of 100/second were observed on the Nexus 7000 aggregation-edge nodes during system validation. With tuning, this was improved to 250-300/second.
  • Tenancy—The validated tenancy in the 3.0.1 release was 32 tenants. However, this does not represent the maximum scale of the architecture models. Within the models addressed in this release, several factors constrain overall tenancy scale: 1) VRFs per system—currently, up to 1000 VRFs are supported per Nexus 7000 aggregation-edge node; 2) end-to-end VLAN support, which is affected by FHRP (HSRP or GLBP) groups per card and per system; and 3) security contexts per ASA firewall appliance—currently 250, though this can be incremented by adding appliances if needed.
  • N7k Spine Nodes—Sup2/F2E cards may be utilized (16k MACs supported); for N5k leaf nodes, 24k MACs are supported; for the N6k spine/leaf option, 64k MACs are currently supported (increasing to 128k+ in future software releases).
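To illustrate the MAC-scale point above, the following NX-OS sketch shows FabricPath VLANs (which use conversational learning automatically) alongside a Classical Ethernet VLAN range explicitly configured for conversational MAC learning on F-series modules. VLAN ranges here are hypothetical, not the validated values:

```
! FabricPath VLANs learn MAC addresses conversationally by default
vlan 2000-2099
  mode fabricpath

! Classical Ethernet VLANs use traditional learning unless
! explicitly switched to conversational MAC learning
mac address-table learning-mode conversational vlan 100-199
```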

Note MAC address capacity is a consideration for support of inter-UCS, inter-leaf node logical segments, which must traverse the FabricPath fabric; otherwise, conversational learning ensures that host MACs do not need to be maintained within the FabricPath fabric.


  • Nexus 2200 FEX Systems—As in VMDC 3.0/3.0.1, Nexus 2200 FEX systems are required within the architecture to provide support for 1GE bare metal server connectivity. These may be N5k-, N7k-, or N6k-attached; in the system under test, the N6k-attached method was selected, as the first two attachment types were validated in previous VMDC 3.x releases.

Additional scale parameters for this VMDC DCI release include support for metro/geo LAN extensions using OTV.

  • OTV Scale—The NX-OS 6.2 release increases scaling for OTV as specified below. Most of the scale testing to support this increased capacity will be validated by product teams. For this release, OTV scaling is limited to the number of VLANs/MACs required by the applications under test. These include a single-tier application and a multi-tier application, replicated across multiple tenants (three or more). Background VLAN traffic can be added to the OTV links to emulate peak workloads.
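A minimal OTV configuration sketch for one aggregation-edge node follows; interface names, VLAN ranges, and multicast groups are illustrative only, and a multicast-enabled DCI transport is assumed (adjacency-server mode would be used instead over a unicast-only core):

```
feature otv
otv site-vlan 999                 ! internal site VLAN used for AED election
otv site-identifier 0x1           ! must match on all edge devices at this site

interface Overlay1
  otv join-interface Ethernet1/1  ! uplink toward the DCI transport
  otv control-group 239.1.1.1     ! multicast group carrying the OTV control plane
  otv data-group 232.1.1.0/28     ! SSM range for extended multicast traffic
  otv extend-vlan 2000-2099       ! LAN segments stretched between sites
  no shutdown
```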

Figure 4-1 OTV Scale

 

  • Workload and Tenant Scaling—Test workloads emulate a single-tier application and a multi-tier application. These applications are replicated across multiple tenants to emulate a realistic customer environment. Live and cold workload migrations were performed across these tenants to validate tenant, traffic, and service isolation across DCI components.
  • New Data Center Capacity for Business Continuity—New data center capacity to accommodate the recovery environment (VMs, servers, storage, network, services) must be planned for at recovery data center sites. The total scale and capacity of any one physical site includes both "normal application capacity" and "recovery/backup capacity"; the resultant scale of this design must fall within the standard scaling limitations described previously. No additional validation is required in this area. An important business requirement, however, is to utilize the extra recovery capacity during normal operations for other business functions. To that end, VMDC DCI demonstrated how VMware SRM can "reclaim" server capacity within an ESX cluster on demand for the cold workload mobility use case. This can be accomplished by executing any test application on servers within the SRM recovery cluster; SRM has the ability to shut down and purge those recovery servers of loaded applications (reclaim) prior to the actual cold migration of the application under test.

System Availability

The following methods are used to achieve High Availability within the VMDC architecture:

  • Routing and Layer 3 redundancy at the core and aggregation/leaf nodes of the infrastructure. This includes path and link redundancy, non-stop forwarding and route optimization.
  • In the “Typical Data Center” (2-node spine topology) VPC+ is configured on inter-spine peer-links and utilized in conjunction with HSRP to provide dual-active paths from access edge switches across the fabric.
  • Layer 2 redundancy technologies are implemented through the FabricPath domain and access tiers of the infrastructure. This includes ARP synchronization in VPC/VPC+-enabled topologies to minimize flooding of unknown unicast and re-convergence; ECMP; utilization of port-channels between FabricPath edge/leaf and spine nodes to minimize Layer 2 IS-IS adjacency recalculations; and IS-IS SPF tuning, CoPP, GLBP and HSRP timer tuning on aggregation edge nodes, again to minimize system re-convergence.
  • Active/Active (active/standby of alternating contexts) on services utilized in the architecture.
  • Clustered HA and ECLB (equal cost load balancing) for appliance-based firewall services.
  • Hardware and Fabric redundancy throughout.
  • (VEM) MCEC uplink redundancy and VSM redundancy within the virtual access tier of the infrastructure.
  • Within the compute tier of the infrastructure, port-channeling, NIC teaming, and intra-cluster HA through utilization of VMware vMotion.
  • NetApp Fabric MetroCluster with SyncMirror is configured to provide full-site data storage resiliency.
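In the "Typical Data Center" two-node spine topology, the vPC+ and FHRP elements above might be sketched as follows. Domain IDs, addresses, and timer values are illustrative, not the validated configuration:

```
feature vpc
feature hsrp

vpc domain 10
  fabricpath switch-id 100          ! emulated switch-id makes this vPC+
  peer-keepalive destination 10.0.0.2
  ip arp synchronize                ! sync ARP tables to minimize re-convergence

interface port-channel1
  switchport mode fabricpath        ! vPC+ peer-link runs as a FabricPath core port
  vpc peer-link

interface Vlan2001
  ip address 10.1.1.2/24
  hsrp 1
    ip 10.1.1.1
    timers 1 3                      ! aggressive hello/hold timers
```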

All service appliance resiliency implementations are contained within a single data center, since neither ASA nor Citrix SLB currently supports clustering over metro distances.

Figure 4-2 Service Appliance Contained to Single Site

 


Note It is important to note that LISP will be added in a future VMDC DCI release to support automated tracking of moving services and applications, and the redirection of external flows to the correct data center.


Security

The proven security framework from the previous VMDC systems is leveraged for tenancy separation and isolation. Security related considerations include:

  • Aggregation Layer (Layer 3) Separation—VRF-lite implemented on aggregation-edge nodes at the aggregation layer provides per tenant isolation at Layer 3, with separate dedicated per-tenant routing and forwarding tables on the inside interfaces of firewall contexts. All inter-tenant traffic has to be routed at the outside interfaces on the Firewall that resides in the global VRF. Policies can be applied on the firewall to restrict inter-tenant communication. Layer 3 separation and tenant isolation has been verified across DCI extensions in multi-site topologies.
  • Access and Virtual Access Layer (Layer 2) Separation—VLAN IDs and the 802.1q tag provide isolation and identification of tenant traffic across the Layer 2 domain, and more generally, across shared links throughout the infrastructure. Layer 2 separation of tenant traffic has been verified across DCI extensions in multi-site topologies.
  • Network Services Separation (Services Core, Compute)—On physical appliance or service module form factors, dedicated contexts or zones provide the means for virtualized security, load balancing, NAT, and SSL offload services, and the application of unique per-tenant policies at the VLAN level of granularity. Similarly, dedicated virtual appliances (i.e., in vApp form) provide for unique per-tenant services within the compute layer of the infrastructure at the virtual machine level of granularity. Secure network services separation on physical and virtual appliances has been verified across DCI extensions in multi-site topologies.
  • Storage—This VMDC design revision uses NetApp for NFS storage, which enables virtualized storage space such that each tenant (application or user) can be separated with use of IP spaces and VLANs mapped to network layer separation. In terms of SANs, this design uses the Cisco MDS 9500 for block storage. This allows for Fibre Channel (FC) access separation at the switch port level (VSAN), logical path access separation at the path level (WWN/device hard zoning), and separation at the virtual media level inside the storage array (LUN masking and mapping).

Manageability

This architecture leverages Cisco Intelligent Automation for Cloud (CIAC) and BMC Cloud Lifecycle Management (CLM) for automated service orchestration and service provisioning. Information about CIAC can be found in Intelligent Automation for Cloud. CLM was addressed in previous system releases (VMDC 2.0, updated in the VMDC 2.2 release). Additional documentation can be found on Design Zone at Cloud Orchestration with BMC CLM.

Service Assurance and Monitoring

Service assurance is generally defined as the application of policies and processes to ensure that network services meet predefined service quality levels for optimal subscriber experiences. Service assurance enables SPs to control traffic flows, identify faults, and resolve issues in a timely manner to minimize service downtime. Service assurance also includes policies and processes to proactively diagnose and resolve service quality degradations or device malfunctions before subscribers are impacted.

In VMDC DCI, network service assurance encompasses the following concepts:

Traffic Engineering

Traffic engineering is a method of optimizing network performance by dynamically analyzing, predicting and regulating the behavior of transmitted data.

Port-channels are frequently deployed for redundancy and load sharing. Because the Nexus 1000V is an end-host switch, network administrators can use a different approach than on physical switches, implementing a port-channel mechanism in one of the following modes:

  • Standard Port-Channel—The port-channel is configured on both the Nexus 1000V and the upstream switches.
  • Special Port-Channel—The port-channel is configured only on the Nexus 1000V; no upstream configuration is needed. Two options are available: MAC pinning and vPC host mode.

Regardless of mode, port-channels are managed using standard port-channel CLI, but each mode behaves differently. Refer to Nexus 1000V Port-Channel Configurations for details.

The VMDC virtual access layer design uses vPC host mode and then uses MAC pinning to select specific links from the port channel. As discussed in previous system releases, multiple port-channels can be used for a more granular approach for uplink traffic management on the Nexus 1000V. These options are shown in Figure 4-3 and Figure 4-4.
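On the Nexus 1000V, vPC host mode with MAC pinning is enabled on the Ethernet (uplink) port-profile; a sketch follows, with hypothetical profile names and VLAN ranges:

```
port-profile type ethernet system-uplink
  vmware port-group
  switchport mode trunk
  switchport trunk allowed vlan 2000-2099
  channel-group auto mode on mac-pinning  ! vPC host mode; no upstream config needed
  no shutdown
  system vlan 2000                        ! carries critical control/management VLANs
  state enabled
```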

Figure 4-3 Nexus 1000v single Uplink PortChannel Model

 

Figure 4-4 Nexus 1000v 5 Uplink PortChannel Model

 

Traffic engineering can be performed selectively by configuring the Nexus 1000V to select the target uplink using a manual configuration (static pinning) instead of the default. For example, front-end traffic that contains many diversified flows can use both members (fabrics) of the port-channel. On the other hand, backend traffic, which has more diverse bandwidth and response-time requirements (VM-to-VM inter-fabric traffic flows, vMotion, backup, and so on), can benefit from selecting a path that keeps VM-to-VM traffic in the same fabric, so that the Fabric Interconnect switches the traffic locally. Table 4-1 lists a traffic classification example for MAC pinning.
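Static pinning is applied on the vEthernet port-profile with the pinning id command; a sketch, with hypothetical names and assuming subgroup 1 maps to the Fabric B uplink:

```
port-profile type vethernet vmotion-profile
  vmware port-group
  switchport mode access
  switchport access vlan 2050
  pinning id 1            ! pin vMotion traffic to subgroup 1 (Fabric B uplink)
  no shutdown
  state enabled
```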

 

Table 4-1 Traffic Classification Example for MAC Pinning

Traffic Type      | Classification   | UCS Fabric   | MAC Pinning Option | Rationale
Front End Traffic | Tenant Data      | Fabric A & B | Automatic          | Load share on all available uplinks; most traffic should exit the pod through the Aggregation-Edge Nexus 7000
Back End Traffic  | Tenant Data      | Fabric A     | Manual             | Keep most back end traffic locally switched on one Fabric Interconnect
vMotion           | VMkernel/Control | Fabric B     | Manual             | Keep vMotion traffic locally switched on one Fabric Interconnect

MAC Pinning

MAC pinning defines all uplinks coming out of the server as standalone links and pins different MAC addresses to those links in a round-robin fashion. This approach helps to ensure that the MAC address of a virtual machine is never seen on multiple interfaces on the upstream switches. No upstream configuration is required to connect the Nexus 1000V VEM to upstream switches (Figure 4-5).

MAC pinning does not rely on any protocol to distinguish upstream switches, so the deployment is independent of any hardware or design. MAC pinning enables consistent, easy Nexus 1000V deployment because it does not depend on any physical hardware or any upstream configuration, and it is the preferred method for deploying the Nexus 1000V if the upstream switches cannot be clustered.

However, this approach does not prevent the Nexus 1000V from constructing a port-channel on its side, providing the required redundancy in the data center in case of a failure. If a failure occurs, the Nexus 1000V sends a gratuitous ARP packet to alert the upstream switch that the MAC address of the VEM learned on the previous link must now be learned on a different link, enabling subsecond failover.

Figure 4-5 MAC-Pinning Details

 

In the case of a fabric failure, the Nexus 1000V selects the remaining available fabric to recover the traffic. Figure 4-6 shows the fabric failover with subgroup MAC pinning.

Figure 4-6 MAC-Pinning Failover

 

QoS Framework

QoS is a key to service assurance because it enables differentiated treatment of specific traffic flows. Differentiated treatment ensures that critical traffic is provided sufficient bandwidth to meet throughput requirements during congestion or failure conditions.

Figure 4-7 shows the different traffic flow types defined in previous VMDC releases. These traffic types are organized in infrastructure, tenant, and storage traffic categories.

  • Infrastructure traffic comprises management and control traffic, including VMware service console and vMotion communication. This is typically set to the highest priority to maintain administrative communication during periods of instability or high CPU utilization.
  • Tenant traffic can be differentiated into front end and backend traffic, with service levels to accommodate various traffic requirements in each category.
  • The VMDC design incorporates Fibre Channel and IP-attached storage. As shown in Figure 4-7, storage requires two subcategories, because these traffic types are treated differently throughout the network. Fibre Channel traffic, by definition, requires a “no drop” policy, while Network File System (NFS) datastore traffic is sensitive to delay and loss.
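On a Nexus 5000 access switch, the "no drop" requirement for Fibre Channel traffic is expressed through a network-qos policy; the following sketch uses the system-defined class-fcoe class (policy name and default-class MTU are illustrative):

```
policy-map type network-qos vmdc-nq
  class type network-qos class-fcoe
    pause no-drop                ! lossless behavior via priority flow control
    mtu 2158                     ! accommodate FCoE frame size
  class type network-qos class-default
    mtu 1500

system qos
  service-policy type network-qos vmdc-nq
```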

Figure 4-7 Traffic Flow Types

 

To provide differentiated services, VMDC leverages the following QoS functionality:

  • Traffic (Classification and Marking)
  • Congestion Management and Avoidance (Queuing, Scheduling, and Dropping)
  • Traffic Conditioning (Shaping and Policing)

Classification and Marking

Classification and marking enable networks using QoS to identify traffic types based on source packet headers (Layer 2 802.1p CoS and Differentiated Services Code Point (DSCP) information) and assign specific markings to those traffic types for appropriate treatment as the packets traverse network nodes. Marking (coloring) is the process of setting the value of the DSCP, MPLS EXP, or Ethernet Layer 2 class of service (CoS) fields so that traffic can later be easily identified using simple classification techniques. Conditional marking is used to designate in-contract ("conform") or out-of-contract ("exceed") traffic.
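For illustration, a Nexus 1000V ingress marking policy might look like the following. Class names, DSCP values, and CoS mappings are hypothetical, not the validated eight-class values shown later in Figure 4-8:

```
class-map type qos match-any GOLD-DATA
  match dscp 18

policy-map type qos MARK-INGRESS
  class GOLD-DATA
    set cos 2                    ! map in-contract gold data to CoS 2
  class class-default
    set cos 0                    ! standard/bronze traffic

port-profile type vethernet gold-data
  service-policy type qos input MARK-INGRESS
```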

As in previous releases, the traffic service objectives translate to support for three broad traffic categories:

1. Infrastructure

2. Tenant service classes (three data; two multimedia priority)

3. Storage

Figure 4-8 provides a more granular description of the requisite traffic classes, characterized by their DSCP markings and per-hop behavior (PHB) designations. This represents a normalized view across validated VMDC and HCS reference architectures in the context of an eight-class IP/NGN-aligned model.

Figure 4-8 VMDC Traffic Classes (8-Class Reference)

 

Note that in newer data center QoS models, CoS 3 is reserved for lossless data (FCoE). However, in older WAN/campus QoS service models, CoS 3 is used for VoIP signaling. Figure 4-8 assumes that FCoE traffic will be localized to the UCS and Ethernet-attached storage systems, thus enabling the use of CoS 3 for VoIP signaling traffic within the DC QoS domain. Classification values may need to be adjusted per traffic characteristics; for example, CoS value 4 could potentially be used for VoIP call control if video streams are not deployed.

It is a general best practice to mark traffic at the source end system, or as close to the traffic source as possible, in order to simplify the network design. However, if the end system is not capable of marking or cannot be trusted, one may mark on ingress to the network. In the VMDC QoS framework, the cloud data center represents a single QoS domain, with the Nexus 1000V forming the "southern" access edge and the ASR 9000 or ASR 1000 forming the "northern" DC PE/WAN edge. These QoS domain edge devices mark traffic, and these markings are trusted at the nodes within the data center infrastructure; in other words, those nodes use simple classification based on the markings received from the edge devices. Note that where VM-FEX adapters are utilized, marking is implemented on the UCS Fabric Interconnects; in contrast to the Nexus 1000V implementation, there is no ability to conditionally mark down CoS in the event of congestion.

In VMDC DCI, the assumption is that DSCP values will not be altered. Intermediate nodes would ideally support QoS transparency, such that CoS values would not need to be re-marked. That said, if QoS transparency is not supported on a particular node within the QoS domain, it will be necessary to work around this gap by re-marking. VMDC DCI verified that all QoS packet markings are preserved across DCI extensions.

Queuing, Scheduling, and Dropping

In a router or switch, the packet scheduler applies policy to decide which packet to dequeue and send next, and when to do it. Schedulers service queues in different orders. The most frequently used are:

  • First in, first out (FIFO)
  • Priority scheduling (also called priority queuing)
  • Weighted bandwidth

We use a variant of weighted bandwidth queuing called class-based weighted fair queuing/low latency queuing (CBWFQ/LLQ) on the Nexus 1000V at the southern edge of the data center QoS domain. At the ASR 9000 or ASR 1000 northern data center WAN edge, we use priority queuing (PQ)/CBWFQ to bound delay and jitter for priority traffic while supporting weighted bandwidth allocation for the remaining data traffic classes.

Queuing mechanisms manage the front of a queue, while congestion avoidance mechanisms manage the back of a queue. Because queue depths are limited, dropping algorithms, which drop packets as queue depths build, are used to avoid congestion. Two dropping algorithms are commonly used: weighted tail drop (often for VoIP or video traffic) or weighted random early detection (WRED), typically for data traffic classes. As in previous releases, WRED is used to drop out-of-contract data traffic (CoS 1) before in-contract data traffic (Gold and CoS 2), and for Bronze/Standard traffic (CoS 0) in the event of congestion.
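A PQ/CBWFQ egress policy with WRED, in IOS-style MQC as it might appear on the ASR 1000 WAN edge, is sketched below. Class names and percentages are illustrative only, and ASR 9000 syntax differs somewhat:

```
policy-map WAN-EDGE-QUEUING
  class REALTIME
    priority percent 10               ! LLQ bounds delay and jitter for priority traffic
  class GOLD-DATA
    bandwidth remaining percent 40
    random-detect dscp-based          ! WRED drops out-of-contract traffic first
  class SILVER-DATA
    bandwidth remaining percent 20
    random-detect dscp-based
  class class-default
    bandwidth remaining percent 10
    random-detect                     ! Bronze/Standard (CoS 0) traffic
```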

Defining an end-to-end QoS architecture can be challenging because not all nodes in a QoS domain have consistent implementations. In the cloud data center QoS domain, we run the gamut from systems that support 16 queues per VEM (Nexus 1000V) to four internal fabric queues (Nexus 7000). This means that traffic classes must be merged on systems that support fewer than eight queues. Figure 4-9 shows the class-to-queue mapping that applies to the cloud data center QoS domain in the VMDC 2.2 reference architecture, in the context of alignment with either the HCS reference model or the more standard NGN reference.

Figure 4-9 VMDC Class-to-Queue Mapping

 

Note that the Nexus 2000 Fabric Extender provides only two user queues for QoS support: one for all no-drop classes and the other for all drop classes. The classes configured on its parent switch are mapped to one of these queues; traffic for no-drop classes is mapped to one queue and traffic for all drop classes is mapped to the other. Egress policies are also restricted to these classes. Further, at this writing, queuing is not supported on Nexus 2000 host interface ports when connected to an upstream Nexus 7000 switch. Traffic is sent to the default fabric queue on the Nexus 7000, and queuing must be applied on FEX trunk (network interface) ports. Future NX-OS releases will feature enhanced Nexus 7000 support for FEX QoS, adding network QoS and default queuing policy support on downstream Nexus 2000 host interfaces.

Before NX-OS release 6.1.3, only two ingress queues were supported on the F2/F2E Nexus 7000 line cards; release 6.1.3 adds support for four ingress queues. These line cards support four egress queues.

Shaping and Policing

Policing and shaping are used to enforce a maximum bandwidth rate (MBR) on a traffic stream; while policing effectively does this by dropping out-of-contract traffic, shaping does this by delaying out-of-contract traffic. VMDC uses policing in and at the edges of the cloud data center QoS domain to rate-limit data and priority traffic classes. At the data center WAN edge/PE, hierarchical QoS (HQoS) may be implemented on egress to the cloud data center; this uses a combination of shaping and policing in which L2 traffic is shaped at the aggregate (port) level per class, while policing is used to enforce per-tenant aggregates.
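An HQoS sketch for the DC WAN edge, combining an aggregate shaper with a child policy that polices priority traffic, follows; policy names and rates are illustrative only:

```
policy-map TENANT-CHILD
  class REALTIME
    priority
    police cir 100000000              ! police priority traffic to 100 Mbps
  class class-default
    bandwidth remaining percent 50

policy-map PE-EGRESS
  class class-default
    shape average 1000000000          ! shape the aggregate to 1 Gbps
    service-policy TENANT-CHILD       ! child policy enforces per-class treatment
```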

Sample bandwidth port reservation percentages are shown in Figure 4-10.

Figure 4-10 Sample Bandwidth Port Reservations