
Cisco on Cisco

Storage Networking Case Study: How Cisco IT Deployed a SAN for ERP Data Storage


Cisco Multilayer Director Switch improves storage usage, system management, and provisioning speed for ERP data storage.
BACKGROUND

Like many other companies during the mid-1990s, Cisco maintained a rapidly growing data storage infrastructure. At that time, data storage consisted of approximately 90 percent direct attached storage (DAS) and 10 percent network attached storage (NAS). The DAS was predominantly large storage subsystems (EMC Symmetrix) with a significant but smaller number of midrange storage subsystems such as Sun T3 and Baydel arrays. The DAS was also organized physically by business function or "stovepipe" (Figure 1). There were and still are very strong operational reasons for maintaining storage separated by these physically separate business functions. Some examples include:

  • Scheduling downtime for routine maintenance proved to be very difficult on storage subsystems that crossed business boundaries and supported several business applications.
  • Combining the business functions would have created a much larger, more complex environment to support, especially considering the lack of maturity and scalability of the storage management tools at the time.

Figure 1. Cisco Storage Architecture in the Mid-to-Late 1990s


This storage architecture served Cisco well during the mid- and late 1990s, but certain applications within one of the largest and most critical business functions (ERP) within Cisco were approaching the limits of the Small Computer System Interface (SCSI) DAS architecture. For example, there were hosts that needed storage but could not attach to specific storage resources due to port availability or SCSI distance limitations. These port constraints also negatively affected Cisco IT's ability to provide high-availability configurations for the mission-critical ERP applications, and limited the ability to add storage quickly enough to keep up with the growing business demands.

With these problems in mind, Cisco introduced 16-port Fibre Channel switches into the ERP storage environment in early 1998. These switches solved the port constraint problems and allowed for multiple paths, increasing both performance and availability. In addition, Fibre Channel relieved some of the distance limitations of SCSI DAS and made it much easier to add storage for the growing ERP environment. However, the Fibre Channel switches were used only to extend the distance between storage and server and to provide simple multiplexing, which allowed one storage unit to be shared among a few application servers. Even rudimentary SAN functions such as hardware or software zoning were not used. In essence, this model was very similar to the pre-existing DAS architecture with only minor enhancements.

Figure 2. Cisco Storage Growth with 2003 and 2004 Projections (in Terabytes)

Between 1998 and 2001, ERP was the only environment within Cisco to use Fibre Channel switches to solve performance, availability, and port constraint problems. All other Cisco IT clients continued to expand their traditional DAS environments to meet their business needs. However, as with the rest of the industry, storage growth within Cisco (Figure 2) far outpaced the capabilities of the storage management tools available on the market.

In late 2000, Cisco realized that storage management within the DAS architecture would not continue to scale because each storage subsystem, regardless of its size, was a "point of management." Ultimately, scalability was compromised as Cisco IT faced hundreds of points of storage management and an increasing total cost of ownership (TCO) for all the business units within Cisco, including ERP. As the economy began to slow in early 2001, the demand for storage continued to grow, creating another important concern with the DAS architecture: how could Cisco continue to support the growing storage environment without adding costly resources?

In addition to the management and TCO burden that was continuing to grow with each storage frame added, the DAS model, coupled with Cisco IT's separation of storage within business functions, created several small storage islands that in turn led to inefficient use of the storage. This generated significant opportunities to improve corporate return on investment (ROI).

Figure 3. Depiction of the Global Cisco Storage Vision Developed and Adopted in Early 2001


With storage spending outpacing server spending in many companies since circa 2000, storage has naturally become a focus for reducing TCO. With the cost of storage hardware dropping, the major portion of the TCO for storage continues to be the management of that storage. As a result, the industry effort to lower the storage TCO has been focused on storage management, and specifically on reducing the points of management through consolidation of both hardware and software. Software tools and technology solutions that support heterogeneous storage environments have also received much attention.

The overall goal of reducing storage TCO has led many companies (including Cisco) to invest in a storage strategy or vision that includes offering storage as a utility-like service to be shared across multiple hosts and applications. The first step in offering storage as a utility service (and ultimately reducing the TCO) typically includes moving away from the "server-centric" storage model (DAS) to a "network-centric" storage model, in which storage is pooled within the network. In early 2001, Cisco defined a storage vision to move away from the DAS architecture toward a network-centric storage architecture based on SANs and NAS (Figure 3). The Cisco IT storage team felt that by focusing on the hardware, software, and business processes, the vision could be achieved.

This vision, of a network-centric storage service capable of offering storage capabilities to any corporate application, would be achieved in three phases.

Phase I - Migrate DAS to SAN Islands where appropriate within current business function groups

Figure 4. Strategic Hardware Storage Architecture (Consolidated SAN)


Phase II - Consolidate SAN Islands and any remaining DAS to a single SAN within each business function per data center

Phase III - Consolidate all business function SANs to a single SAN per data center

Cisco IT is now (in mid-2003) implementing Phase II, consolidating DAS and SAN islands within a business function into a single SAN. This case study focuses on Cisco IT's two-stage migration of its ERP and Data Warehousing functions into a single SAN in the data center in Research Triangle Park, North Carolina.

In 2001, Cisco IT started to migrate toward the vision outlined above by defining and initiating storage projects that were directly related to achieving the vision. Cisco decided early in the migration that the ideal hardware solution would physically connect all servers to a single pool of storage, minimize overhead associated with SAN inter-switch links (ISLs), provide additional expansion capabilities for future growth, support the business unit philosophy described above, and provide a migration path to emerging technologies such as Fibre Channel over IP (FCIP) and SCSI over IP (iSCSI). The storage pool concept would also allow Cisco to carve the pool into various service levels (Bronze, Silver, and Gold, for example) to meet various application service level agreements (SLAs) related to metrics such as performance, availability, and disaster recoverability (Figure 4).

Figure 5. High-Level Cisco Storage Architecture After Phase I of Migration to Storage Vision (Late 2002)


Various storage projects were defined to begin migration toward the vision in several phases. For example, in the first phase in 2001, Cisco IT began to implement relatively small SAN islands within each of the business functions to begin to learn about SAN technology and assess how to use the technology to build the foundation for the architecture shown in Figure 4. Cisco IT also began to change the organization of the storage teams by replacing a part-time virtual storage team with a dedicated storage team composed of full-time storage managers. A new virtual storage team with stakeholders worldwide was formed to promote the vision and to help ensure consistency during deployment across the enterprise. The storage teams also initiated projects to identify storage management tools for the purposes of storage resource management (SRM) and SAN management.

By the end of the first phase in late 2002, the storage infrastructure had been migrated from 80 percent DAS and 20 percent NAS to a combination of SAN islands (20 percent), DAS (60 percent), and NAS (20 percent). Figure 5 provides a high-level illustration of the storage architecture after Phase I, as well as a graphical representation of the progress made toward the goal of network storage during that phase. Phase I created roughly 24 small SAN islands, but provided valuable experience with SANs for the storage administrators and convinced Cisco IT that the decision to move toward a network-centric storage architecture was sound. Some of the lessons learned during Phase I included:

  • Better storage usage and simpler, faster storage provisioning is possible because many servers now have access to all the storage frames within the SAN.
  • Pockets of consolidation result in reduced management points (using SAN management tools).
  • SAN reliability is equal to or greater than DAS and NAS.
  • Flexibility of fiber is greatly improved (storage can be physically located at a greater distance from hosts than with SCSI DAS).
  • Reduced TCO is possible. SAN storage has proven to be 12 percent less expensive than DAS (based on the results of an internal storage TCO study conducted in early 2003).
  • Migration is difficult and time consuming. This is especially true during transitions, when the new environment and the existing environments must both be maintained, leading to an initially higher overall TCO.

Figure 6. Proposed Storage Architecture after Completion of Phase II

The plan for Phase II of the migration, which began in early 2003, is to consolidate the remaining DAS and any SAN islands within each business function into a single large SAN per business function per data center (Figure 6). This phase is much more ambitious than Phase I. Achieving this architecture will provide a hardware storage infrastructure capable of supporting the storage vision. It will also make the final phase (Phase III) of further consolidation to a single SAN per data center much easier.



CHALLENGE

Figure 7. Depiction of SAN Architecture Using Industry-Standard, 64-Port, Director-Class Switches


There have been significant challenges in achieving even the level of consolidation required to complete Phase II. Until recently, Fibre Channel switches were characterized by low port density and low-availability designs. Director-class switches promised to change the landscape, but failed to deliver all the necessary features and enhancements needed to build the SAN infrastructure required to complete Phase II. Given the size of the business functions within Cisco, building a SAN large enough to support an entire business function environment (even within a single data center) proved challenging. For example, providing storage for the ERP business function within Cisco would require more than 400 ports. Using 64-port director-class switches and two storage subsystems for illustrative purposes, the SAN would have resembled the architecture depicted in Figure 7.

Typically, the more complex the solution, the higher the support costs. Cisco IT did not have the staff to build and support a SAN with this degree of complexity. Performance was also uncertain because of the excessive use of ISLs and switch backplane limitations. In addition, the architecture in Figure 7 offered only 64 percent port usage efficiency after accounting for all the ISLs (896 total ports; 320 used for ISLs). The end result of this technology gap for the ERP environment within Cisco was a somewhat fragmented, suboptimal SAN connectivity model in which the servers' storage requirements could not be easily assessed or projected, backup resources could not be easily shared, and growth could not be readily accommodated (Figure 8). The ERP/DW business functions were combined into a single business function in 2002, exacerbating the scalability constraints of the available Fibre Channel switches.
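The 64 percent figure follows directly from the two port counts quoted above. The arithmetic can be sketched as follows; the function and variable names are illustrative only, and the switch count is an inference that assumes every switch in the design is a 64-port director.

```python
def port_usage_efficiency(total_ports: int, isl_ports: int) -> float:
    """Fraction of switch ports left for hosts and storage after ISLs."""
    if total_ports <= 0 or not 0 <= isl_ports <= total_ports:
        raise ValueError("need total_ports > 0 and 0 <= isl_ports <= total_ports")
    return (total_ports - isl_ports) / total_ports


# Port counts quoted in the case study for the 64-port director design
total_ports = 896
isl_ports = 320

usable_ports = total_ports - isl_ports  # ports left for servers and storage
switch_count = total_ports // 64        # inferred, assuming 64-port switches only
efficiency = port_usage_efficiency(total_ports, isl_ports)

print(f"{usable_ports} usable ports on {switch_count} switches: {efficiency:.0%}")
```

Under these assumptions, more than a third of the purchased ports would carry only switch-to-switch traffic.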

Another challenge in achieving the level of consolidation required to complete Phase II has been the relative immaturity of storage and SAN management technologies such as SRM and virtualization. As noted earlier, storage management is a "vision enabler," as important as hardware and business processes, as defined in the Cisco IT storage vision.

SOLUTION

Figure 8. Cisco ERP/Data Warehousing Business Function Environment after Phase I (Late 2002)


The Cisco IT storage group chose the Cisco MDS 9509 Multilayer Director Switch. In doing so, Cisco was able to realize the scalability, availability, and performance benefits that had been missing from the SAN technology available until then. The Cisco MDS 9509 provides up to 224 Fibre Channel ports in a single chassis, which means that a relatively simple SAN large enough to support the ERP environment could be built as shown in Figure 9. This design is much simpler and achieves 100 percent port usage because it avoids the ISLs required by the SAN design shown in Figure 7.

The Cisco MDS 9509 supports all the high-availability features that define director-class operation. Redundant supervisor engines, fully stateful supervisor engine failover, redundant crossbars, hitless software upgrades, individual process restartability, and process isolation within virtual SAN (VSAN) instances are just a few of the high-availability and security features that help to ensure uninterrupted service and minimal scheduled downtime. Critical performance watermarks are met by a fully nonblocking architecture, intelligent traffic management, Fibre Channel congestion control, traffic isolation per VSAN, advanced virtual output queuing (VOQ), and line-rate frame forwarding across 112 ports simultaneously. By bundling up to 16 Fibre Channel ports into a logical port channel, the Cisco MDS 9509 achieves higher interswitch bandwidth while preserving a single interface instance within the fabric shortest path first (FSPF) routing process.

Figure 9. Depiction of a SAN Architecture Using Cisco MDS 9509 Multilayer Director Switches


Port channeling allows optimal SAN interconnection, and will be the technology catalyst for further SAN consolidation across business functions. The modular design of the Cisco MDS 9509 helps to minimize acquisition costs by allowing the customer to purchase only the number of ports immediately required. Additional ports can be introduced without downtime due to the Cisco MDS 9509's online insertion and removal (OIR) capabilities. Line cards are simply installed as needed. By providing a migration path to newer technologies like FCIP and iSCSI, the Cisco MDS 9509 demonstrates the extensibility that helps to ensure maximum ROI by extending its useful lifespan. The Cisco MDS platform also supports various degrees of storage "intelligence." This addresses the concern that storage management tools are now lagging behind hardware technology. Again, both are needed to achieve the level of consolidation defined in Phase II.

This technology advance enables further consolidation of storage within business functions and eventually across the business functions (Phase III) within Cisco by using VSANs. The Oracle 11i portion of the ERP development environment was chosen as the first to be migrated to the new Cisco MDS 9509 switches. This environment consisted of 11 multipathed HP-UX hosts and roughly 100 TB of storage in two independent SAN islands (Figure 10). The storage team planned a two-stage migration:

Figure 10. ERP Oracle 11i SAN Islands Prior to Cisco MDS Implementation


Stage I - Migrate one half of the current two SAN fabrics from their current 32-port SAN switch architecture to a single Cisco MDS 9509.

Stage II - After testing, migrate the second half of the two SAN fabrics to an added pair of Cisco MDS 9509s.

The migration project posed many challenges, because the applications hosted by this environment had to be available without interruption throughout the entire process. The Cisco IT storage group extensively tested the Cisco MDS 9509 while participating in both the alpha and beta programs, so the group was confident that the Cisco MDS 9509 would work in the Oracle 11i development environment and that the migration could be done with no impact to the hosted applications. Host bus adapters (HBAs), operating system versions, microcode levels in the storage subsystems, host-based multipathing, and load balancing features had all been tested and verified during the alpha and beta programs.

In these particular SAN island configurations, each ERP host had two or more paths on independent fabrics to its storage resources, which allowed the migration to proceed one path at a time.
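The reasoning behind a path-at-a-time migration can be sketched in a few lines: a fabric may be taken down for recabling only if every host still has another online path. This is a simplified model, not the procedure Cisco IT actually ran; the `Host` class, the fabric labels "A" and "B", and the host names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    # Maps a fabric label to True while the host's path through it is online.
    paths: dict = field(default_factory=dict)


def safe_to_move(hosts, fabric):
    """True if every host keeps at least one online path outside `fabric`."""
    return all(
        any(online for label, online in host.paths.items() if label != fabric)
        for host in hosts
    )


def move_fabric(hosts, fabric):
    """Take one fabric's paths offline, recable them, and bring them back."""
    if not safe_to_move(hosts, fabric):
        raise RuntimeError(f"moving fabric {fabric} would strand at least one host")
    for host in hosts:
        if fabric in host.paths:
            host.paths[fabric] = False  # path is down while it is recabled
    # ... recabling to the new switch would happen here ...
    for host in hosts:
        if fabric in host.paths:
            host.paths[fabric] = True  # path restored on the new switch


# 11 dual-pathed hosts, matching the host count in the text; labels are made up.
hosts = [Host(f"erp-dev-{n:02d}", {"A": True, "B": True}) for n in range(1, 12)]
move_fabric(hosts, "A")  # first stage: one path per host
move_fabric(hosts, "B")  # second stage: the remaining path

# A host with a single path could not be migrated this way.
single_path_host = Host("single-path", {"A": True})
print(safe_to_move(hosts, "A"), safe_to_move([single_path_host], "A"))
```

The check is the important part: host-based multipathing keeps I/O flowing over the surviving fabric only if that fabric exists and is healthy before the other one is touched.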

Figure 11. ERP Oracle 11i Development SAN Islands After Stage I of the Migration


In the first stage of this two-stage migration, one of the two redundant paths from each SAN island was migrated to a pair of Cisco MDS 9509 switches (Figure 11). Each Cisco MDS 9509 is configured with redundant supervisor engines, four 16-port line cards, and three 32-port line cards.

In the second stage of this migration, the second path was moved onto the same pair of Cisco MDS 9509 switches (with an additional pair for capacity and growth). The Cisco MDS 9509 switches were also connected by a single port channel composed of four ports, which is treated as a single ISL by the FSPF protocol. Figure 12 shows the environment after Stage II.
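As a rough illustration of why a port channel helps, the sketch below models it as a single logical link whose bandwidth is the sum of its member ports. The 16-member limit and the four-port channel come from the text; the per-port speed of 2 Gbps is an assumption that the case study does not state, and the function name is hypothetical.

```python
MAX_PORT_CHANNEL_MEMBERS = 16  # upper bound quoted in the text for the MDS 9509


def summarize_port_channel(member_count: int, gbps_per_port: float) -> dict:
    """Model a port channel as one logical ISL with summed member bandwidth."""
    if not 1 <= member_count <= MAX_PORT_CHANNEL_MEMBERS:
        raise ValueError(
            f"a port channel needs 1 to {MAX_PORT_CHANNEL_MEMBERS} member ports"
        )
    return {
        "links_seen_by_fspf": 1,  # FSPF treats the whole bundle as a single ISL
        "aggregate_gbps": member_count * gbps_per_port,
    }


ASSUMED_GBPS_PER_PORT = 2.0  # assumption, not stated in the case study

four_port_channel = summarize_port_channel(4, ASSUMED_GBPS_PER_PORT)
print(four_port_channel)
```

Adding member ports raises the aggregate bandwidth while the routing process still sees one link between the switches.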



RESULTS

Figure 12. Consolidated ERP Oracle 11i Development SAN (After Stage II)


Between the migration and consolidation of these specific ERP SAN islands onto the Cisco MDS 9509 in January 2003 and the time of this publication (May 2003), no issues or problems have been encountered. The ERP Oracle 11i environment at Cisco is now better positioned for growth than ever before. VSANs, which add to the security and availability within the SAN as a whole, are being used to completely isolate the former SAN islands from one another, as opposed to simple zoning within the SAN. Storage management has been consolidated with the SAN islands, and specific tasks such as storage provisioning within the new Cisco MDS 9509 SAN are accomplished with greater flexibility. Servers can access storage on any frame independent of physical cabling and switch interconnects, leading to greater storage usage potential. The shared backup media server can now easily access all storage frames with fewer HBAs. Servers and storage can be added as needed, without the additional expense and lead time associated with the installation of an additional Fibre Channel switch. Servers and HBAs can be upgraded without concern for switch performance constraints.

The cost of downtime for mission-critical applications can be threatening to a company's survival. The new Cisco ERP/DW SAN virtually eliminates the need for scheduled downtime associated with storage, and further savings are derived through operational efficiencies. By reducing the number of switches that must be supported and managed, fewer demands are placed upon the ERP storage administrators. Upon completion of Stage II of the ERP/DW migration, two Cisco MDS 9509 switches replaced six 32-port Fibre Channel switches, while providing 67 percent more port capacity than the previous SAN infrastructure, with no disruption to the hosted applications.
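The 67 percent figure is consistent with the line card configuration described earlier. The sketch below reproduces it, assuming that both replacement switches carry the mix given for the migration (four 16-port and three 32-port line cards) and that the six older switches were fully populated 32-port units. The helper name is illustrative.

```python
def total_ports(switch_count: int, cards_per_switch: list) -> int:
    """cards_per_switch is a list of (ports_per_card, card_count) tuples."""
    return switch_count * sum(ports * count for ports, count in cards_per_switch)


# Previous infrastructure: six 32-port Fibre Channel switches
old_ports = 6 * 32

# New infrastructure: two MDS 9509s, each with four 16-port and three 32-port cards
new_ports = total_ports(2, [(16, 4), (32, 3)])

increase = (new_ports - old_ports) / old_ports
print(f"{old_ports} -> {new_ports} ports (+{increase:.0%})")
```

Each chassis in this configuration holds 160 ports, below the 224-port maximum quoted for the Cisco MDS 9509.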

LESSONS LEARNED
Planning

Start small and collect data—Before performing a large-scale migration from DAS to SAN, be advised that by migrating a small application environment first, and collecting data on the success and advantages of this migration, corporate support for the inevitable large-scale migration will be easier to obtain. The TCO (including equipment, software, administrative, and maintenance costs) is a critical part of this data collection. Cisco IT found that TCO dropped by 12 percent in migrating from a DAS to a SAN environment.

Understand the physical and logical separation of DAS—Most DAS is organized physically by function or application, and the various business owners will probably be reluctant to merge their privately owned storage with the storage of other business owners. They will need to be convinced that a shared storage solution offers better availability and lower cost than a separate storage solution. In addition, there may be legal or regulatory reasons why data needs to be kept separate, and these requirements should be understood and respected when planning any storage migration. Cisco IT used VSAN technology to maintain SAN divisions without compromising the gains from consolidating storage.

Reorganize your storage teams—Before migrating from a separate DAS model to a shared SAN storage model, it may be necessary to alter the organizational structure to mirror this change. Cisco IT selected six system administrators, whose roles also include supporting storage across the various business functions, to form a single, dedicated storage group. Without their shared experience and insight, migration to a shared storage service would have presented even more significant challenges.

Investigate storage management tools—A significant portion of the cost savings from a SAN migration is achieved by managing multiple storage frames as a single entity. This cannot be done without SAN management tools. After extensive evaluation of products from numerous vendors, Cisco IT standardized on an enterprise-capable SRM platform that supports heterogeneous storage.

Plan for additional costs during the migration—During the migration process, hardware and administrative costs will increase temporarily due to the necessity of maintaining concurrent environments. Planning takes dedicated engineers away from their current work. It is necessary to plan for more resources (perhaps borrowing engineers from other areas of the company), or delay current or planned storage projects.

Plan for additional projects during the migration—During the migration, Cisco IT needed to build a second SAN architecture while supporting the existing storage architecture. This provided an opportunity to replace the entire cable infrastructure with structured cabling and a patch panel system. Although the new system offered greater flexibility, getting it in place required more planning and time than was originally anticipated. In smaller environments, structured cabling may not be needed. Cisco IT's storage vision, however, is predicated upon consolidation to reduce the total number of points of management. Such large-scale consolidation requires an efficient cabling infrastructure.

SAN switches with greater port counts result in a simpler architecture, lower costs, greater flexibility, and higher reliability—SAN switches with greater port counts mean that fewer SAN switches are required to get the job done, which translates into fewer points of management. Smaller switches require a more complex hierarchical architecture to connect numerous hosts and storage frames together. This multitiered architecture requires dedicated switch ports to interconnect with each other, reducing the number of ports available for servers and storage. In addition, increasing the number of switch ports and ISLs required to support a service adds to the number of points of failure, which reduces reliability.

Set expectations early and properly—All stakeholders must understand that the storage vision described in this case study requires a combination of all three vision enablers (hardware, software, and business processes and organization). If any of the three are ignored, the vision will not be attained.

SAN Advantages Achieved

Availability is improved by:

  • Providing more ports to support multiple paths between servers and storage; if one path fails, traffic flows over other paths
  • Being able to add new switch line cards online and to upgrade microcode with no need for downtime

Data center crowding is reduced by:

  • Migrating from SCSI to Fibre Channel (allows storage to be located in another part of the floor, or on separate floors)
  • Migrating from Fibre Channel to iSCSI (allows primary storage to be located in nearby data centers)
  • Introducing FCIP (allows primary storage to be located in nearby data centers)

Costs are reduced by:

  • Increasing storage frame usage by sharing each storage frame among many servers and applications
  • Managing multiple storage devices as if they were a single storage entity
  • Sharing backup resources easily among frames
  • Using fewer switches to connect to servers and storage
  • Using fewer ISLs, which improves port usage efficiency
  • Having a migration path that includes FCIP and iSCSI

Provisioning speed is improved by:

  • Allowing IT to upgrade frame firmware without affecting application availability, so upgrades can be scheduled at any time
  • Being able to connect new storage using a large pool of available switch ports, rather than searching for ports in small pools
  • Using port channels where ISLs are necessary, which interconnects SANs more reliably and permits interswitch bandwidth to be upgraded without disrupting traffic flow
  • Being able to add new switch line cards online, which eliminates downtime

Performance is improved by:

  • A fully nonblocking architecture
  • Intelligent traffic management
  • Fibre Channel congestion control
  • Traffic isolation per VSAN
  • Advanced VOQ
  • Line-rate frame forwarding across 112 ports simultaneously
NEXT STEPS

It is important to realize that this is a small step toward the utility service model shown in Figure 4. Consolidation of the entire Research Triangle Park data center onto a single SAN began with the ERP/DW business function migration to the Cisco MDS 9509, which permitted SAN island consolidation within that business function. As Phase II continues, other business functions in the data center will consolidate their SANs using Cisco MDS 9509 switches. Migration of all business functions in the data center onto a single Cisco MDS 9509 SAN infrastructure in the final phase of migration will further extend the ROI of the Cisco MDS 9509 switches. All these migrations will take place without disruption to the hosted applications, and VSANs will be used to maintain strict traffic separation and to insulate applications from one another where applicable.

As has been mentioned, storage management technologies are just as important to enabling Cisco IT's vision as the SAN hardware. Storage products such as SRM and SAN management tools are maturing, and future technologies such as Cisco MDS 9000 Series-based storage virtualization will play a critical role in the large-scale consolidation defined and required to meet the Cisco IT storage vision. As these technologies mature and become available, the Cisco IT storage group will continue to deploy them in parallel with the hardware consolidation now occurring.