VMDC DCI 1.0 DG
System Overview

Table of Contents

System Overview

Mapping Applications to Business Criticality Levels

Active-Active Metro Design

Active-Backup Metro/Geo Design

VMDC DCI Supports Multiple Design Options

Top Level Use Cases

Design Parameters for Active-Active Metro Use Cases

Design Parameters for Active-Standby Metro/Geo Use Cases

Solution Architecture

Active-Active Metro Design

Active-Backup Metro/Geo Design

System Components

System Overview

Interconnecting Cloud Data Centers can be a complex undertaking for Enterprises and SPs. Enabling business-critical applications to operate across, or migrate between, metro/geo sites impacts each tier of the Cloud Data Center, as described in Figure 2-1. Customers require a validated end-to-end DCI solution that integrates Cisco's best-in-class products at each tier to address the most common Business Continuity and workload mobility functions. To support workloads that move between geographically diverse data centers, VMDC DCI provides:

  • Layer 2 extensions that preserve IP addressing
  • Extended tenancy and network containers
  • A range of stateful L4-L7 services
  • Extended hypervisor geo-clusters
  • Geo-distributed virtual switches
  • Distributed storage clusters
  • Synchronous and asynchronous storage replication
  • Geo-extensions to service orchestration tools
  • IP path optimization to redirect users to moved VMs and workloads
  • Support across multiple hypervisors

The cumulative impact of interconnecting data centers is significant and potentially costly for SPs and Enterprises. The lack of technical guidance and best practices for an end-to-end business continuity solution is a pain point for customers that are not staffed to sift through these technical issues on their own. In addition, multiple vendors and business disciplines are required to design and deploy a successful business continuity and workload mobility solution. VMDC DCI simplifies the design and deployment process by providing a validated reference design for each tier of the Cloud Data Center.

Figure 2-1 Extending Cloud Data Centers Across Infrastructure Tiers

 

The VMDC DCI design uses the following definitions to assess the overall cost of a recovery time resulting from workload mobility or a recovery plan:

  • Business Continuity—Processes to ensure that essential Business functions can continue during and after an outage. Business continuance seeks to prevent interruption of mission-critical services, and to reestablish full functioning as swiftly and smoothly as possible.
  • Recovery Point Objective (RPO)—Amount of data loss deemed acceptable, defined per application, in the event of an outage. RPO can range from zero (0) data loss to minutes or hours of data loss, depending on the criticality of the application or data.
  • Recovery Time Objective (RTO)—Amount of time required to restore critical business processes to users after the initial outage, ranging from zero time to many minutes or hours.
  • Recovery Capacity Objective (RCO)—Additional capacity at recovery sites required to achieve RPO/RTO targets across multi-site topologies. This may include many-to-one site recovery models and planned utilization of recovery capacity for other functions.
  • Metro Distance—Typically less than 200 km and less than 10 ms RTT
  • Geo Distance—Typically greater than 200 km and less than 100 ms RTT
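The metro and geo thresholds above can be captured in a small classification helper. The following Python sketch is purely illustrative (the function name and return labels are inventions of this example, not part of the VMDC DCI system):

```python
# Classify a site pair as "metro" or "geo" using the distance/RTT
# thresholds defined above. Illustrative sketch only.

def classify_site_pair(distance_km: float, rtt_ms: float) -> str:
    """Return 'metro', 'geo', or 'unsupported' per the DCI definitions."""
    if distance_km < 200 and rtt_ms < 10:
        return "metro"        # eligible for synchronous replication, live mobility
    if rtt_ms < 100:
        return "geo"          # asynchronous replication, cold mobility
    return "unsupported"      # beyond the validated design envelope

print(classify_site_pair(75, 2))     # metro (the validated Active-Active topology)
print(classify_site_pair(1000, 25))  # geo (the validated Active-Backup topology)
```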

The Business Criticality of an application defines an acceptable RPO and RTO target in the event of a planned or unplanned outage (Figure 2-2).

Figure 2-2 RPO and RTO Definitions

 

Achieving necessary recovery objectives involves diverse operations teams and an underlying Cloud infrastructure that has been built to provide business continuity and workload mobility. Each application and infrastructure component has unique mechanisms for dealing with mobility, outages, and recovery. The challenge of an end-to-end cloud data center solution is to combine these methods in a coherent way so as to optimize the recovery/mobility process across metro and geo sites, and reduce the overall complexity for operations teams. This is the ultimate goal of the VMDC DCI solution.

Mapping Applications to Business Criticality Levels

A critical component of a successful DCI strategy is to align the business criticality of an application with a commensurate infrastructure design that can meet those application requirements. Defining how an application or service outage will impact the business helps define an appropriate redundancy and mobility strategy. A critical first step in this process is to map each application to a specific Criticality Level, as described in Figure 2-3.

Figure 2-3 Application Criticality Levels

 

Industry-standard application criticality levels range from Mission Imperative (C1), in which any outage results in immediate cessation of a primary business function and therefore no downtime or data loss is acceptable, to Business Administrative (C5), in which a sustained outage has little to no impact on a primary business function. Applications representing more business-critical functions (C1-C3) typically have more stringent RTO/RPO targets than those toward the bottom of the spectrum (C4-C5). Most SP and Enterprise Cloud Providers have applications mapping to each Criticality Level. In a typical Enterprise distribution, roughly 20% of applications are Mission Imperative or Mission Critical (C1, C2), and the remainder fall into the lower categories of Business Critical, Business Operational, and Business Administrative (C3-C5). The VMDC Cloud Data Center must therefore accommodate all criticality levels and provide Business Continuity and workload mobility capabilities to support varied RPO/RTO targets.
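The C1-C5 levels named above can be modeled as a simple lookup from criticality level to recovery targets. In this sketch the level names come from Figure 2-3, but the specific RPO/RTO minute values are assumptions chosen for illustration, not validated figures from this design guide:

```python
# Illustrative mapping of criticality levels (names per Figure 2-3) to
# RPO/RTO targets in minutes. The numeric targets are assumptions for
# this example only.

CRITICALITY = {
    "C1": ("Mission Imperative",        0,    0),
    "C2": ("Mission Critical",          5,   15),
    "C3": ("Business Critical",        60,   60),
    "C4": ("Business Operational",    240,  480),
    "C5": ("Business Administrative", 1440, 2880),
}

def recovery_targets(level: str) -> str:
    """Render the recovery targets for a given criticality level."""
    name, rpo_min, rto_min = CRITICALITY[level]
    return f"{level} ({name}): RPO <= {rpo_min} min, RTO <= {rto_min} min"

print(recovery_targets("C2"))  # C2 (Mission Critical): RPO <= 5 min, RTO <= 15 min
```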

It is important to note that even a relatively short outage (less than one hour) can have a significant business impact on enterprises and service providers. Figure 2-4 describes the typical Recovery Point Objective (RPO) requirements for different enterprises. In this study, 53% of Enterprises will have significant revenue loss or business impact if they experience an outage of just one hour of Tier-1 data (Mission Critical data). In addition, 48% of these same enterprises will have significant revenue loss or business impact if they experience an outage of less than 3 hours of Tier-2 data (Business Critical data). Even tighter RPO requirements apply to SP Cloud Providers. Enterprise and SP Cloud Providers have a strong incentive to implement Business Continuity and workload mobility functions to protect critical workloads and support normal IT operations. VMDC DCI provides a validated framework to achieve these goals within Private Clouds, Public Clouds, and Virtual Private Clouds.

Figure 2-4 Typical Enterprise RPO Requirements1

 

VMDC DCI implements a reference architecture that meets two of the most common RPO/RTO targets identified across Enterprise Private Clouds and SP Private/Public Clouds. The two RPO/RTO target use cases are described in Figure 2-5. The first case covers an RTO/RPO target of 0 to 15 minutes, which addresses the C1 and C2 criticality levels. Achieving near-zero RTO/RPO requires significant infrastructure investment, including synchronous storage replication, live VM migrations with extended clusters, LAN extensions, and metro services optimizations. It also typically requires 100% duplicate resources at the recovery site, making it the most capital-intensive business continuity/workload mobility option. The second use case covers an RPO/RTO target of more than 15 minutes, which addresses Criticality Levels C3 and C4. Achieving a 15-minute target is less costly, less complex, and can utilize a many-to-one resource sharing model at the recovery site.
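The selection rule implied by these two use cases can be sketched as a one-line decision: targets at or under 15 minutes call for the Active-Active metro design, anything looser can use the Active-Backup metro/geo design. The function and label names below are illustrative, not part of the system:

```python
# Sketch of the design-selection rule: near-zero targets (C1/C2,
# 0-15 min) need the Active-Active metro design; looser targets
# (C3/C4, > 15 min) can use the cheaper Active-Backup design.

def select_design(rto_minutes: float, rpo_minutes: float) -> str:
    """Pick a validated VMDC DCI design option for a recovery target."""
    if max(rto_minutes, rpo_minutes) <= 15:
        return "active-active-metro"      # sync replication, live mobility
    return "active-backup-metro-geo"      # async replication, cold mobility

print(select_design(0, 0))    # active-active-metro
print(select_design(60, 30))  # active-backup-metro-geo
```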

Figure 2-5 Validated RPO/RTO Targets

 

 

To cover both of these recovery targets, the VMDC DCI design must support two operational models. The first operational model, Active-Active metro design, is derived from two physical sites spanning a metro distance, operating as a single Logical Data Center. The second operational model represents a more traditional Active-Backup metro/geo Design, where two independent data centers provide recovery and workload mobility functions across both metro and geo distances. A brief description of both VMDC DCI options is provided below.

Active-Active Metro Design

The active-active metro design is described in Figure 2-6. This model provides DCI extensions between two metro sites, operating together as a single Logical Data Center. This design accommodates the most stringent RTO/RPO targets for Business Continuity and Workload Mobility. This model supports applications that require live workload mobility, near zero RTO/RPO, stateful services, and a synchronous storage cluster across a metro distance.

Figure 2-6 Active-Active Metro Design

 

Applications mapped to this infrastructure may be distributed across metro sites and also support Live Workload mobility across metro sites. Distributed applications and Live Workload Mobility typically require stretched clusters, LAN extensions, and synchronous storage replication, as described in Figure 2-7. DCI extensions must also support Stateful L4-L7 Services during workload moves, preservation of network QoS and tenancy across sites, and virtual switching across sites. A single Operational domain with Service Orchestration is typically used to manage and orchestrate multiple data centers in this model.

Figure 2-7 Distributed Clusters and Live Workload Mobility

 

The key VMDC DCI design choices for the Active-Active metro design are described in Figure 2-8.

Figure 2-8 Active-Active Metro Design Choices

 

Active-Backup Metro/Geo Design

The second model, Active-Backup metro/geo Design represents a more traditional primary/backup redundancy design, where two independent data centers provide recovery and workload mobility functions across both metro and geo distances, as described in Figure 2-9. This model can address less stringent RTO/RPO targets, where applications require Cold workload mobility/recovery in which applications and corresponding network services are restarted at the recovery location.

Figure 2-9 Active-Backup Metro/Geo Design

 

This Business Continuity and Workload Mobility design is best suited for moving or migrating “stopped workloads” between different Cloud data centers, as described in Figure 2-10. These less stringent RPO/RTO requirements enable the participating data centers to span a geo distance of more than 200 km. In this model, LAN extensions between data centers are optional, but may be necessary for operators that need to preserve IP addressing for applications and services. In addition, asynchronous data replication is used to achieve these less stringent RPO/RTO targets.

Figure 2-10 Migrating Stopped Workloads

 

The key VMDC DCI design choices for the Active-Backup metro/geo design are described in Figure 2-11.

Figure 2-11 Active-Backup Metro/Geo Design Choices

 

VMDC DCI Supports Multiple Design Options

It is important to note that BOTH of these design options are typically required by Enterprises and SPs to address their wide range of applications in a cost-efficient way. Therefore, VMDC DCI integrates the Active-Active metro design and the Active-Backup metro/geo design into a single Cloud data center that can provide Business Continuity and Workload Mobility for a wide range of applications and RPO/RTO targets.

Based on the recent survey cited in Figure 2-12, almost half of all Enterprises have their primary backup facility within a 250-mile distance. Most Enterprises can therefore implement both metro and geo business continuity and workload mobility models across their current data center locations. Large Tier 1 Service Providers and Enterprises typically span longer distances and many regions.

Figure 2-12 Typical Enterprise Geo-Redundancy2

 

Top Level Use Cases

Top level use cases validated in VMDC DCI are mapped to one of the following design choices:

Design Parameters for Active-Active Metro Use Cases

VMDC DCI used the following design parameters in the Active-Active metro design.

Live Workload Mobility can Solve Specific Business Problems

  • Perform live (or cold) workload migrations between metro data centers
  • Perform operations re-balancing/maintenance/consolidation of live (or cold) workloads between metro data centers
  • Provide disaster avoidance of live (or cold) workloads between metro data centers
  • Implement application geo-clusters spanning metro DCs
  • Utilized for the most business critical applications (lowest RPO/RTO)
  • Maintain user connections for live workload moves
  • Implement load-balanced workloads between metro DCs

Hypervisor tools utilized to implement Live Workload Mobility

  • VMware live vMotion
  • Stretched HA/DRS clusters across metro data centers
  • Single vCenter across metro data centers
  • DRS host Affinity rules to manage compute resources
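The DRS host-affinity idea in the list above can be sketched as placement logic: a VM prefers hosts at its "home" site (a "should run on" rule) but may still land on a remote host during a cross-site move or failover. The data model below is invented for illustration and is not the vSphere or DRS API:

```python
# Hedged sketch of DRS-style "should" host affinity across a stretched
# metro cluster: prefer the VM's home site, but allow any host as a
# fallback. Hypothetical data model, not the vSphere API.

def place_vm(vm_home_site: str, hosts: dict, prefer_home: bool = True) -> str:
    """Pick a host for a VM; hosts maps host name -> site name."""
    home_hosts = [h for h, site in hosts.items() if site == vm_home_site]
    if prefer_home and home_hosts:
        return home_hosts[0]          # honor the affinity preference
    return next(iter(hosts))          # soft rule: fall back to any host

hosts = {"esx-dc1-01": "DC1", "esx-dc2-01": "DC2"}
print(place_vm("DC2", hosts))  # esx-dc2-01
```

Modeling affinity as a soft preference (rather than a hard constraint) is what allows live mobility and disaster avoidance to override normal placement.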

Metro Data Center Infrastructure to support Live Workload Mobility

  • Network—Data Center Interconnect extensions between metro data centers
    – Simplified LAN extensions using Overlay Transport Virtualization (OTV) preserve IP addressing of applications and support live migrations
    – Virtual switches distributed across metro data centers
    – Tenant containers spanning multiple sites
    – Traffic QoS and packet markings maintained across metro networks
  • Services—Maintain stateful services for active connections where possible
    – Support for services hosted on physical appliances, as well as virtual services hosted on the UCS
    – Minimize traffic tromboning between metro data centers
  • Compute—Support single-tier and multi-tier applications
    – Multiple UCS systems across metro DCs to support workload mobility
  • Storage—Storage extended across metro distances, with synchronous and asynchronous replication
    – Distributed storage clusters spanning metro data centers
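The restriction of synchronous replication to metro distances follows from physics: every acknowledged write incurs at least one inter-site round trip, and light in fiber covers roughly 5 microseconds per kilometer one way. This back-of-envelope helper (an assumption-based estimate, not a figure from the design guide) shows why the thresholds land where they do:

```python
# Back-of-envelope lower bound on the write latency added by
# synchronous replication over a metro/geo link. Assumes ~5 us/km
# one-way propagation in fiber (light at roughly 2/3 of c).

def min_replication_rtt_ms(distance_km: float) -> float:
    """Minimum added write latency (ms) from one replication round trip."""
    one_way_us = distance_km * 5.0      # ~5 microseconds per km in fiber
    return 2 * one_way_us / 1000.0      # round trip, converted to ms

print(min_replication_rtt_ms(75))    # 0.75 ms at the validated metro distance
print(min_replication_rtt_ms(1000))  # 10.0 ms -- impractical for sync writes
```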

Figure 2-13 shows a typical live migration of an active workload. Each tier of the data center is impacted by this use case.

Figure 2-13 Live Workload Mobility

 

Design Parameters for Active-Standby Metro/Geo Use Cases

VMDC DCI used the following design parameters in the Active-Standby metro/geo design.

Cold Workload Mobility can solve specific Business problems

  • Perform planned workload migrations of stopped VMs between metro/geo data centers
  • Operations rebalancing/maintenance/consolidation of stopped workloads between metro/geo data centers
  • Disaster avoidance or recovery of stopped workloads
  • User connections will be temporarily disrupted during the move process
  • Site migrations across metro/geo data centers of stopped workloads
  • Utilized for less business critical applications (Medium to High RPO/RTO)

Hypervisor tools utilized to implement Cold Workload Mobility

  • VMware Site Recovery Manager (SRM) and VMware High Availability
  • Resource pools mapped to Active/Active or Active/Standby metro/geo DCs
  • Host Affinity rules to manage compute resources
  • Many-to-One Site Recovery Scenarios

Metro/Geo Data Center Infrastructure to support Cold Workload Mobility

  • Network—Data Center Interconnect is optional
    – Simplified LAN extensions using Overlay Transport Virtualization (OTV) are used to preserve IP addressing of applications
    – Multiple UCS systems utilized to house moved workloads at the recovery site
    – New tenant containers created at the recovery site to support the moved workloads
  • Services—Service connections will be temporarily disrupted
    – New network containers and services created at the new site
    – Traffic tromboning between metro DCs can be avoided in many cases
  • Compute—Support single-tier and multi-tier applications
  • Storage—Asynchronous data replication to the remote site
    – Virtual volumes siloed to each DC

Figure 2-14 shows the different infrastructure components involved in the cold migration of a stopped workload. Each tier of the data center is impacted by this use case.

Figure 2-14 Components of Stopped Workload Cold Migration

 

Solution Architecture

Top-level components validated in VMDC DCI are mapped to one of the following design choices:

Active-Active Metro Design

The Active-Active metro design used in the VMDC DCI system is shown in Figure 2-15. The physical sites are separated by a metro distance of 75 km. Layer 2 LAN extensions are included to support multi-site hypervisor clusters, stretched network containers, and preservation of IP addressing for workloads. Storage is extended between sites to support active-active clusters and synchronous storage replication. Asynchronous storage replication between sites is also provided for less business-critical applications.

Figure 2-15 Active-Active Metro Design Topology

 

Active-Backup Metro/Geo Design

The Active-Backup metro/geo design validated in the VMDC DCI system is shown in Figure 2-16. The physical sites are separated by a geo distance of 1000 km. Layer 2 LAN extensions are optional. Storage is contained within each site. Asynchronous storage replication provides long-distance data replication between sites.

Figure 2-16 Active-Backup Metro/Geo Design Topology

 

System Components

Table 2-1 and Table 2-2 list product components for Cisco and Partners, respectively.

 

Table 2-1 Cisco Components

  Role                            Cisco Products
  WAN Edge / Core                 ASR-1004, Nexus 7010
  Aggregation (FabricPath Spine)  Nexus 7009
  Access-Edge (FabricPath Leaf)   Nexus 6004, Nexus 5548, Nexus 7004 w/ Sup2/F2
  FEX                             N2K-C2232PP, N2K-C2248TP-E
  Fabric Interconnect             UCS 6248UP
  Compute                         UCS B-200 M3/M2, UCS M81KR Virtual Interface Card,
                                  UCS P81E Virtual Interface Card,
                                  UCS Virtual Interface Card 1280/1240
  Virtual Access Switch           Nexus 1000V
  Virtual Firewall                VSG
  Physical Firewall               ASA 5585-X
  Storage Fabric                  MDS 9148

 

Table 2-2 Third Party and Partner Products

  Role                                  Partner Products
  SAN/NAS Storage                       NetApp MetroCluster, NetApp SnapMirror,
                                        FAS 6080/6040, FAS 3250
  Hypervisor                            VMware vSphere 5.1, Site Recovery Manager 5.1
  Server Load Balancers                 NetScaler SDX
  Applications used to demonstrate      Microsoft SharePoint & Visual Studio;
  migration use cases                   Oracle & Swingbench


1. Source: Enterprise Strategy Group, 2012
2. Source: Forrester, “State of Enterprise Disaster Recovery Preparedness,” May 2011