Cisco Catalyst 6500 Series Switches

Building Next-Generation Multicast Networks with Supervisor 2T White Paper


Cisco Catalyst 6500 IP Multicast Technology

Figure 1. This network diagram shows the many network environments and platforms that support IP Multicast, with the Catalyst 6500 series serving as the core and distribution-layer foundation for these network segments. The Catalyst 6500 platform enables true end-to-end Medianet capabilities within a Cisco borderless network.

For additional examples of Catalyst 6500 + IP Multicast deployments, refer to:

http://cisco.biz/en/US/technologies/tk648/tk828/technologies_case_study0900aecd802e2ce2.html

http://www.redorbit.com/news/technology/243889/new_york_university_deploys_north_americas_first_native_ipv6_multicast/index.html

http://www.cisco.com/en/US/solutions/ns341/ns898/nbc_2010_olympic_winter_games.html


What You Will Learn

Why Should You Care?

Okay, Tell Us More

Supervisor 2T Hardware Overview

Unified IPv4/IPv6 MFIB Infrastructure

New Egress Replication (EDC Server and Client) Design

New Multicast LTL and MET Sharing Design

Up to 256 K Multicast Routes in the FIB-XL

PIM-SM Source Register Support in Hardware

PIM-SM Dual-RPF Support in Hardware

Simplified Global L2 IGMP Snooping Design

IP-Based (Compared to DMAC-Based) L2 Forwarding Lookups

IGMPv3 and MLDv2 Snooping in the Hardware

New Optimized Multicast Flood (OMF) Design

Multicast VPN (MVPN) Egress-Replication Support

Support for 8 PIM-BIDIR Hardware RPF entries

IPv6 Multicast (*,G) and (S,G) entries in FIB TCAM

Enhanced Multicast HA Using New Infrastructure

Hardware Integration with VPLS, H-VPLS and EoMPLS

CoPP Exception Cases and Granular Multicast Rate-Limits

NetFlow (v9) Special Fields and Processing for Multicast

Learn More

Conclusion

For More Information


What You Will Learn

Whether you are a seasoned multicast expert, or just now deploying multicast for the first time, the new Catalyst 6500 Supervisor 2T has something for you.

Whether you are building a gigantic enterprise-class network, or a small business-class network, the new Supervisor 2T has multicast solutions for you.

This white paper will introduce new and enhanced features (both software and hardware) specifically designed for IP Multicast, now available on the Supervisor 2T.

Summary of Supervisor 2T Features

Supervisor 2T Hardware Overview

Learn about the new hardware components available on the Supervisor 2T.

Unified IPv4/IPv6 MFIB Infrastructure

Optimized hardware infrastructure, designed for L2/L3 scalability.

New Egress Replication (EDC Server and Client) Design

Optimizes multicast frame distribution, between modules.

New Multicast LTL and MET “Sharing” Design

Saves internal forwarding resources, for commonly used paths.

Up to 256 K IPv4 Multicast Routes in the FIB-XL

Provides unprecedented hardware-based multicast scalability.

PIM-SM Source Register Support in Hardware

Saves CPU and memory usage and minimizes source register time.

PIM-SM Dual-RPF Support in Hardware

Saves CPU and memory usage and minimizes SPT switchover time.

Simplified Global L2 IGMP Snooping Design

Provides a simplified L2 snooping configuration and querier redundancy.

IP-Based (Compared to DMAC-Based) L2 Forwarding Lookups

Removes the IP-to-MAC address overlap, for L2 multicast.

IGMPv3 and MLDv2 Snooping Host Tracking in Hardware

Faster join and leave updates of IPv4/IPv6 PIM-SSM L2 host tables.

New L2 Optimized Multicast Flood (OMF) Design

Saves forwarding resources and bandwidth, for “source-only” VLANs.

Multicast VPN (MVPN) Egress-Replication Support

Saves switch fabric bandwidth when forwarding MVPN/eMVPN.

Support for 8 PIM-BIDIR Hardware RPDF Entries

Allows for eight simultaneous RPs to be defined, in hardware.

IPv6 Multicast (*,G) and (S,G) Entries in FIB TCAM

Improved IPv6 hardware-based forwarding decreases latency.

Enhanced Multicast HA Using the New Infrastructure

High availability, built on the new infrastructure, optimizes switchover.

Hardware Integration with VPLS, H-VPLS, and EoMPLS

Built-in multicast support for advanced L2 VPN network designs.

CoPP Exception Cases and More Granular Multicast Rate Limits

Improved control-plane protection for multicast traffic sent to the CPU.

NetFlow (v9 and FnF) Special Fields and Processing for Multicast

All new NFv9 + flexible NetFlow and egress NDE support for multicast flows.

Note: This white paper does not attempt to revisit all the existing IP multicast features already available, in earlier Catalyst 6500 generations.

Learn more about [IPv4 multicast with Supervisor 720 and 12.2SX IOS, http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/mcastv4.html]

and [IPv6 multicast with Supervisor 720 and 12.2SX IOS. http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/mcastv6.html]

Note: This white paper does not attempt to revisit IP Multicast as a technology.

Learn more about [IP Multicast in Cisco IOS. http://www.cisco.com/go/multicast]

Why Should You Care?

Do you need to distribute large amounts of data, to multiple hosts, simultaneously?

Are you a service or content provider?

Are you a financial organization?

Are you a retail distributor?

Are you a transportation authority?

Are you a security company?

Are you responsible for data backups?

Do you need to build a next-generation IP Multicast network? Perhaps you already have an IP Multicast network (built using legacy Cisco equipment or non-Cisco equipment), which is plagued with slow convergence, packet duplication or packet loss, or even high CPU utilization.

Unicast and broadcast forms of communication work well, but do not scale well. Unicast uses one-to-one flows, requiring a separate flow for each receiving node. Broadcast uses one-to-all flows, wasting network resources on uninterested nodes.

IP Multicast is a special forwarding paradigm, specifically designed for distributing data simultaneously (that is, with a single transmission) to multiple hosts within an IP network. It scales to a large receiver population, because the source does not need prior knowledge of which or how many hosts need to receive the data.

Multicast uses one-to-many and many-to-many distribution-tree forwarding models to deliver real-time communication to multiple receiver nodes over an IP network. This model is perfectly suited for any application that needs to distribute the same data to many hosts, without wasting network bandwidth (as broadcast or multiple unicast sessions would).
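The difference between the three delivery models can be sketched with simple arithmetic. The following is a minimal illustration only; the stream rate and host counts are invented for the example and do not come from any measurement.

```python
# Illustrative comparison of how many copies of a stream the source must send
# and how many hosts end up receiving it. All numbers are hypothetical.

def delivery_cost(mode, interested, total_hosts):
    """Return (copies sent by the source, hosts that receive the data)."""
    if mode == "unicast":
        # One separate flow per interested receiver.
        return interested, interested
    if mode == "broadcast":
        # One flow, delivered to every host whether interested or not.
        return 1, total_hosts
    if mode == "multicast":
        # One flow, delivered only to hosts that joined the group.
        return 1, interested
    raise ValueError(f"unknown mode: {mode}")

STREAM_MBPS = 5  # hypothetical stream rate

for mode in ("unicast", "broadcast", "multicast"):
    copies, reached = delivery_cost(mode, interested=200, total_hosts=1000)
    print(f"{mode:9s} source sends {copies * STREAM_MBPS:5d} Mbps, "
          f"{reached - 200:4d} uninterested hosts receive it")
```

In this sketch, only multicast keeps both the source load and the number of uninterested receivers at their minimum.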

However, multicast can also be quite complex and fraught with problems if the IP network infrastructure does not support specialized features to optimize and simplify its operation.

The new Supervisor 2T was built with your needs in mind. Building on top of the widely deployed Supervisor 720, Supervisor 2T offers many enhancements to existing features, and adds many new features to improve the scalability and convergence of your next-generation multicast network.

Okay, Tell Us More

IP Multicast as a technology has matured considerably over the years. For example, PIM Sparse-Mode (RFC-4601) is generally much more flexible and bandwidth-conserving than PIM Dense-Mode (RFC-3973). However, to provide this flexibility and conserve network bandwidth, its operation is quite complex, with two different distribution-tree models: source-based, or (S,G), trees and Rendezvous Point (RP)-based, or (*,G), trees.

Source-based trees provide the shortest-path forwarding at the cost of many mroute states, while RP-based trees require much less mroute state at the cost of potentially suboptimal forwarding. Hence, the inherent operational complexity of PIM-SM led to the development of specialized variants:

Source-Specific Multicast (RFC-4607), which is based on (S,G) Source trees

Bidirectional PIM (RFC-5015), which is based on (*,G) Shared or RP trees

This development evolution is represented visually below.

Figure 2. Evolution of PIM
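The state trade-off between source trees and shared trees described above can be sketched as a simple count. This is a deliberate simplification: a real PIM-SM router usually holds a mix of (*,G) and (S,G) entries, and the group addresses and source counts below are hypothetical.

```python
# Rough mroute-state comparison for a set of groups.
# Keys are group addresses, values are the number of active sources per group.

def source_tree_state(sources_per_group):
    """One (S,G) entry per active source in each group."""
    return sum(sources_per_group.values())

def shared_tree_state(sources_per_group):
    """One (*,G) entry per group, regardless of how many sources send to it."""
    return len(sources_per_group)

groups = {"239.1.1.1": 50, "239.1.1.2": 50, "239.1.1.3": 1}

print("(S,G) entries:", source_tree_state(groups))
print("(*,G) entries:", shared_tree_state(groups))
```

With many sources per group, the (S,G) count grows with the number of sources, while the (*,G) count grows only with the number of groups.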

Along with these different PIM forwarding modes, there are also a variety of application-specific uses for IP Multicast forwarding, and naturally, just as many unique networking platforms and designs.

Hence, IP Multicast has reached a technical maturity level such that many different networking platforms and vendors may appear to be the same (at first glance), with almost all of them claiming to be the best.

So, how does a multicast network engineer make sense of it all? First, ask these five simple questions of the vendor:

1. You support Feature X. Do you support X in software or hardware?

2. If you support X in hardware, what subfeatures and options do you support?

3. How big are the Feature X hardware tables (what is the scalability limit)?

4. How do all of these factors affect the performance and latency of Feature X?

5. How easy is it to monitor and debug all of the components of Feature X?

The following sections will explain the many different ways that the new Catalyst 6500 Supervisor 2T was designed specifically to address these points, allowing you to build a true next-generation IP Multicast network.

The document begins with a basic Supervisor 2T hardware overview, which will familiarize you with the important components that help enable hardware Layer 2 (L2) and Layer 3 (L3) IP Multicast forwarding. It will also identify several important areas (noted above) that significantly separate it from both its predecessors and competitors.

Subsequent sections will review both the new and enhanced IP Multicast features and capabilities available only on the Supervisor 2T. Each major section will address a single (new or enhanced) feature, and is divided into three subsections:

Yesterday’s Challenges: Explains similar behavior on Supervisor 720 and earlier

Today’s Solutions: Explains the new or enhanced behavior on Supervisor 2T

How Does That Help You? Quick examples of user benefits and use-cases

This document organization will allow the reader to quickly review any sections of interest, while ignoring the other (supportive) information.

Supervisor 2T Hardware Overview

Gain a whole new level of IP Multicast performance and scalability.

Figure 3. Supervisor 2T Important Elements

The new Supervisor 2T incorporates three main hardware elements:

Multilayer Switch Feature Card 5 (MSFC5)

Policy Feature Card 4 (PFC4)

2 Tbps (26 channel) Switch Fabric

Figure 4. Three Main Components of Supervisor 2T

Figure 5. Supervisor 2T Multicast Hardware

MSFC5 Overview

The MSFC5 is the control-plane component. It runs the IOS software responsible for learning L2 and L3 forwarding information. All IP Multicast control-plane processes and protocols (such as IP Multicast Routing, PIM, IGMP, MLD, MSDP, and more) operate on the MSFC. Once the forwarding tables are negotiated and populated into software tables, this information is programmed into the hardware-forwarding infrastructure.

Table 1. MSFC5 Overview

| Feature | MSFC3 (Supervisor 720) | MSFC5 (Supervisor 2T) |
|---|---|---|
| CPU Speed | SP CPU @ 600 MHz; RP CPU @ 600 MHz | Dual-core CPU, each @ 1.5 GHz (3 GHz) |
| DRAM | SP CPU: up to 1 GB; RP CPU: up to 1 GB | 2 GB (up to 4 GB) |
| Connectivity Management Processor (CMP) | N/A | Single CPU @ 266 MHz; 32 MB boot flash; 256 MB system memory |
| USB Console/Data Port | Inoperable | USB RP console or USB 2.0 data transfer |
| NVRAM | 2 MB | 4 MB |
| OBFL Flash | N/A | 4 MB |
| Bootflash/Bootdisk | SP CPU: 1 GB (CF); RP CPU: 64 MB (flash) | 1 GB (CF) |

PFC/DFC4 Overview

The PFC4 is the data-plane component, and is the hardware representation of all of the forwarding information learned and programmed by the control-plane. By programming the PFC4 ASIC tables, all subsequent L2 and L3 forwarding decisions can be made entirely in the hardware.

There are two types of PFC4 available (XL and non-XL), depending on the number of L2 and L3 forwarding entries required. The XL variant supports up to 1 M hardware entries, while the non-XL supports up to 256 K. Both PFC4 models are capable of forwarding lookup rates up to 60 Mpps for L2 and IPv4 L3 Multicast, and 30 Mpps for IPv6 L3 and encapsulated multicast, such as a multicast Virtual Private Network (MVPN).

The Catalyst 6500 platform is based on a modular design. All of the same capabilities and performance of the PFC4 can be extended to individual LAN modules with the Distributed Forwarding Card (DFC4). The DFC4 offloads forwarding decisions from the PFC4, and multiplies scalability by the number of DFC4s in the system (maximum 720 Mpps).

Table 2. PFC4 Overview

| Feature | PFC/DFC3 | PFC/DFC4 |
|---|---|---|
| L2 (v4/v6) and IPv4 L3 Performance | 30/48 Mpps | 60 Mpps |
| IPv6 L3 and Encapsulated Performance | 12/24 Mpps | 30 Mpps |
| FIB (non-XL) | 256 K Entries | 256 K Entries |
| FIB (XL) | 1 M Entries | 1 M Entries |
| L2 Bridge Domains | 4 K (# of VLANs) | 16 K (BD) |
| L3 Logical Interfaces | 4 K (Shared with VLANs) | 128 K (LIF) |
| L2 MAC Address Table | PFC3B: 64 K; PFC3C: 96 K | 128 K |
| Number of VPNs | 4 K | 16 K (IPv4); 8 K (IPv6) |
| RPF Interfaces | 2 | 16 |
| PIM-Bidir RPDF | 4 | 8 |
| NetFlow Entries | 256 K (Ingress Only) | 512 K Ingress; 512 K Egress (XL Default) |
| Native VPLS | No | Yes |
| Native CTS | No | Yes |
| Flexible NetFlow | No | Yes |
| ACL and QoS TCAM | 32 K (ACL) and 32 K (QoS) | Up to 256 K (Shared) |
| Security ACEs | Up to 32 K | Up to 192 K (XL Default) |
| QoS ACEs | Up to 32 K | Up to 64 K (XL Default) |
| Port ACLs | 2 K | 8 K |
| Aggregate Policers | 1 K | 6 K |
| Microflow Policers | 63 | 512 |
| Rate Limiters | L3: 8; L2: 4 | L3: 32; L2: 12 |

2T Switch Fabric Overview

The 2 Tbps Switch Fabric provides the physical backplane data path over which all multicast packet forwarding and replication occurs. It supports 26 dedicated fabric channels, capable of operating at either 20 or 40 Gbps.

The CEF720 and CEF2T-series LAN modules support dual fabric channels, providing 40 Gbps total for CEF720 and 80 Gbps total for CEF2T. In addition, the 2T Switch Fabric provides redundant channels for CEF2T modules, for faster Stateful Switch-Over (SSO) and ISSU fail-over.

Table 3. 2T Switch Fabric Overview

| Feature | Supervisor 720 | Supervisor 2T |
|---|---|---|
| Total Fabric Bandwidth | 720 Gigabits (per second) | 2 Terabits (per second) |
| Individual Fabric Channel Bandwidth | 8 Gbps (CEF256); 20 Gbps (CEF720) | 20 Gbps (CEF720); 40 Gbps (CEF2T) |
| Total Number of Fabric Channels | 18 (Sup 720-3B); 20 (Sup 720-3C) | 26 |
| Fabric Redundancy | Yes | Yes |
| Redundant Channels | No | Yes |
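The headline fabric figures can be related to the channel counts and channel speeds with a short calculation. This is a sketch only: counting each channel in both directions (full duplex) is an assumption made here, chosen because it reproduces the 720 Gbps figure exactly and comes close to 2 Tbps for the new fabric.

```python
# Back-of-the-envelope fabric capacity from channel count and channel speed.
# The full-duplex doubling is an assumption for illustration, not a stated fact.

def fabric_capacity_gbps(channels, channel_gbps, full_duplex=True):
    """Aggregate capacity of all fabric channels, in Gbps."""
    one_way = channels * channel_gbps
    return one_way * 2 if full_duplex else one_way

def module_capacity_gbps(channels_per_module, channel_gbps):
    """Per-module fabric connection, as quoted per direction in the text."""
    return channels_per_module * channel_gbps

print("Supervisor 720:", fabric_capacity_gbps(18, 20), "Gbps")
print("Supervisor 2T: ", fabric_capacity_gbps(26, 40), "Gbps")
print("CEF720 module: ", module_capacity_gbps(2, 20), "Gbps")
print("CEF2T module:  ", module_capacity_gbps(2, 40), "Gbps")
```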

Supported LAN Modules

Three module types will be supported on the new Supervisor 2T:

The new WS-X6900 (or CEF2T) series, with DFC4 pre-installed

Existing WS-X6700 and new WS-X6800 (or CEF720) series, with or without DFC4

Select WS-X6100 (or Classic) series modules

Note: While the general forwarding architecture and behavior of these various module types is outside the scope of this document, it is useful to briefly review the IP Multicast behavior of each of these.

Each generation of LAN modules features different levels of multicast-specific ASIC support and associated capabilities. The latest generation of modules support specialized multicast packet replication capabilities, as well as specialized packet scheduling and buffering capabilities.

Note: Special consideration should be taken beyond traditional unicast capacity planning, when selecting which LAN modules will be used for IP Multicast forwarding.

The new WS-X6900 (or CEF2T) generation supports dual 40 Gbps fabric channels (total of 80 Gbps, per module), using four Fabric Interface and Replication Engine (FIRE) ASIC complexes. Each FIRE ASIC is capable of ~20 Gbps of L2 and L3 multicast packet replication, or original packet * number of Outgoing Interfaces (OIFs). This class of LAN modules is preferred for medium to large-scale IP Multicast deployments.

Figure 6. WS-X6900 (CEF2T) + DFC4 Module
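The relationship between stream rate, OIF count, and replication capacity can be sketched as follows. The 20 Gbps budget is the approximate per-FIRE-ASIC figure quoted above; the stream rates are hypothetical, and the model ignores packet-size and scheduling effects.

```python
# Replication load = original stream rate x number of outgoing interfaces (OIFs).

FIRE_REPLICATION_GBPS = 20.0  # approximate per-ASIC replication figure from the text

def replicated_gbps(stream_gbps, oifs):
    """Total replicated output for one stream copied to `oifs` interfaces."""
    return stream_gbps * oifs

def max_oifs(stream_gbps, budget_gbps=FIRE_REPLICATION_GBPS):
    """Largest OIF count whose replicated load still fits within the budget."""
    return int(budget_gbps // stream_gbps)

print(replicated_gbps(0.5, 40))  # load of a 0.5 Gbps stream sent to 40 OIFs
print(max_oifs(0.5))             # OIFs that fit at 0.5 Gbps per stream
print(max_oifs(1.5))             # OIFs that fit at 1.5 Gbps per stream
```

This is why capacity planning for multicast depends on fan-out as well as on the input rate.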

The existing WS-X6700 (or CEF720) generation supports dual 20 Gbps fabric channels (total of 40 Gbps, per module), by using two FIRE ASIC complexes. Two variants of the CEF720 will be supported: WS-X6700 + CFC and WS-X6800 + DFC4.

The CEF720 modules that use the Distributed Forwarding Card (DFC4) support local L2 and L3 ingress or egress multicast packet replication, as well as egress-local optimization, and all forwarding decisions are performed locally. These modules will now be called WS-X6800 series modules, to separate the behavior differences. This class of LAN modules is preferred for medium-scale IP Multicast deployments.

Figure 7. WS-X6800 (CEF720) + DFC4 Module

The CEF720 modules that use the Centralized Forwarding Card (CFC) support local L2 and L3 ingress or egress multicast packet replication, but must rely on the central PFC4 for all forwarding decisions. These modules will continue to be called WS-X6700 series modules. This class of LAN modules is adequate for small-scale IP Multicast deployments, but should be upgraded for larger deployments.

Figure 8. WS-X6700 (CEF720) + CFC Module

The legacy WS-X6100 (or Classic) generation supports a single bus-based connection to the shared 32 Gbps data bus, and relies on the Supervisor FIRE ASIC complex for L2 and L3 multicast packet replication, as well as relying on the PFC4 for forwarding decisions. This class of LAN modules is not recommended for IP Multicast deployments, but may be used on a limited or low bandwidth basis.

Note: These modules are primarily intended for edge-connected Power-over-Ethernet (POE) devices, such as IP phones.

Figure 9. WS-X6100 (Classic) Module

Pro Tip: [Learn more about Supervisor 2T architecture. http://www.cisco.com/en/US/docs/ios/lanswitch/configuration/guide/lsw_ml_sw_over_ps6017_TSD_Products_Configuration_Guide_Chapter.html#wp1001111]

Unified IPv4/IPv6 MFIB Infrastructure

Gain optimized hardware infrastructure, designed for L2/L3 scalability.

Yesterday’s Challenges

IPv4 multicast L2/L3 Multi-Layer Switching was first developed on the Catalyst 6500 in the early 2000s. Because it was so unique and innovative at the time, many new IP Multicast hardware-specific functions were developed in isolation.

Developers created entirely new ways to translate the software-based functions of the standards-based PIM (Protocol Independent Multicast) and IGMP (Internet Group Management Protocol) protocols into hardware-based code functions. This became known as the Multicast Multi-Layer Switching (MMLS) infrastructure.

Pro Tip: [Learn more about MMLS http://www.cisco.com/en/US/docs/ios/lanswitch/configuration/guide/lsw_ml_sw_over_ps6017_TSD_Products_Configuration_Guide_Chapter.html#wp1001111]

As many other Cisco platforms began to implement hardware-based IP Multicast features, it became clear that a single, uniform IP Multicast hardware infrastructure code was necessary.

This led to the development of the Multicast Forwarding Information Base (MFIB) platform-independent infrastructure. The MFIB was designed to logically separate the control-plane from the data-plane, using a standardized infrastructure that does not rely on platform-specific information.

Pro Tip: [Learn more about MFIB http://www.cisco.com/en/US/docs/ios/ipmulti/configuration/guide/imc_mfib_overview.html]

Note: When IPv6 multicast was first introduced on the Catalyst 6500, the development decision to move all Cisco IOS platforms to MFIB had already been settled. As a result (even with Supervisor 720), IPv6 multicast uses the MFIB infrastructure.

However, as noted previously, the Catalyst 6500 IPv4 MMLS infrastructure was already well established by this point. This created a problem, because many Cisco customers did not want to change their network operations. Hence, IPv4 multicast continued to use MMLS, while IPv6 multicast uses MFIB.

Was this really bad? No, it was simply inconsistent. Network administrators responsible for both IPv4 and IPv6 needed to both understand and manage the differences. Over time, this also made the code development of Catalyst 6500 IPv4 multicast increasingly divergent from other Cisco IOS platforms.

Today’s Solutions

With the new Supervisor 2T and PFC/DFC4, the new IOS code will finally unify the IPv4 and IPv6 hardware infrastructure, using MFIB. This will allow network operations staff to operate and debug either (or both) IPv4 and IPv6 multicast with a single, consistent CLI.

It also brings all of the benefits of MFIB to IPv4 multicast, as it:

Simplifies overall IP Multicast operation through the fundamental separation of the control and data (forwarding) planes

Handles all interfaces equally, regardless of their PIM, IGMP, or MLD (Multicast Listener Discovery) mode

Eliminates the need for the route-cache maintenance associated with demand caching schemes, such as multicast fast-switching

Introduces (*, G/mask) entries to describe a group range present in a router's local Group-to-RP mapping cache

Introduces PIM RP Tunnel Interfaces for the PIM-SM (Sparse Mode) source registration process, to further separate control and forwarding plane
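The idea of a (*, G/mask) entry, which describes a whole group range from the group-to-RP mapping cache, can be illustrated with a small lookup. This is a minimal sketch, not the IOS implementation: the mapping table, the RP addresses, and the choice of the most specific matching range are all assumptions made for the example.

```python
import ipaddress

# Hypothetical group-to-RP mapping cache: group range -> RP address.
GROUP_TO_RP = {
    ipaddress.ip_network("224.0.0.0/4"): "10.0.0.1",
    ipaddress.ip_network("239.1.0.0/16"): "10.0.0.2",
}

def rp_for_group(group):
    """Return the RP of the most specific (*, G/mask) range covering `group`."""
    addr = ipaddress.ip_address(group)
    matches = [net for net in GROUP_TO_RP if addr in net]
    if not matches:
        return None  # not a multicast group covered by any range
    best = max(matches, key=lambda net: net.prefixlen)
    return GROUP_TO_RP[best]

print(rp_for_group("239.1.2.3"))  # falls in the /16 range
print(rp_for_group("232.1.1.1"))  # falls only in the /4 range
print(rp_for_group("10.1.1.1"))   # not a multicast address
```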

Note: This represents an operational change to configuration, monitoring, and debugging IPv4 multicast on Catalyst 6500.

All platform-independent software components (such as PIM and IGMP/MLD IOS processes) and their associated CLI configuration and monitoring commands remain unchanged. These include show ip mroute, show ip pim neighbors, show ip igmp groups, and more.

However, all platform-dependent IPv4 multicast CLI (previously MMLS based) commands are now changed to the MFIB equivalent.

Note: There are simply too many CLI commands to address within this white paper, but some basic examples are provided below.

Pro Tip: [Review IPv4 MFIB verification. http://www.cisco.com/en/US/docs/ios/ios_xe/ipmulti/configuration/guide/imc_verify_mfib_xe.html]

How Does That Help You?

The unified IPv4/IPv6 MFIB infrastructure provides users with a consistent, platform-independent foundation to translate multicast software routing into hardware programming.

This helps ensure a more consistent and reliable hardware forwarding behavior, as well as a single command line interface for IPv4 and IPv6 Multicast management.

It allows IP Multicast network administrators to learn and use a single, consistent approach to configuring, monitoring, and debugging their IPv4 and IPv6 multicast environments.

Now, the Catalyst 6500 IPv4 multicast (software and hardware) implementation can be consistent with other Cisco platforms. It provides network administrators with a single work environment across their entire IP Multicast network.

Figure 10. Unified MFIB Infrastructure

Note: All other interoperating components that comprise the new MFIB infrastructure, such as Egress Distribution Component (EDC), Multicast Expansion Table (MET), and Local Target Logic (LTL), are described in greater detail in the following sections.

New Egress Replication (EDC Server and Client) Design

Optimize internal multicast packet distribution between modules.

Yesterday’s Challenges

The recent proliferation of newer distributed (switch-based) internal packet-forwarding models, which use fabric-connected modules and DFCs, has increased forwarding scalability and capacity far beyond the legacy centralized (bus-based) model. It has also introduced an entirely new multicast requirement.

This new switch-based forwarding model required a new mechanism to distribute source frames to multiple destination modules. It uses a centralized switching complex, rather than simply forwarding frames on a common bus to which all modules are attached.

Note: The differences between the two internal packet forwarding models are roughly analogous to the well-known pros and cons of forwarding data packets over “switched or star-based” compared to “bus-based” Ethernet networks.

The performance and behavior differences between switch-based and bus-based forwarding models are briefly summarized below.

Centralized Bus-Based Forwarding

Pros: Fast and simple forwarding model, because all nodes will receive the same source data simultaneously

Cons: All nodes receive the same source data, whether they need it or not, resulting in wasted packet processing for uninterested receivers

Distributed Switch-Based Forwarding

Pros: Dedicated point-to-point connections guarantee bandwidth availability, allowing for much greater scalability and lower latency

Cons: Needs special intelligence (for example, CPU or ASIC) to direct the same source data to multiple nodes, which also requires packet replication

In Catalyst hardware-switching terms, these different nodes are the individual Ethernet switching modules, and the “connections” between the nodes are either DBUS (data bus-based) or switch fabric (switch-based).

With the introduction of Supervisor 2 and the 256 Gbps Switch Fabric Module (SFM), the WS-X6500 series switching modules were provisioned with a single (dedicated) 8 Gbps fabric channel. With this configuration, the preferred forwarding model became distributed (switch-based).

Note: The Supervisor 720 introduced an integrated 720 Gbps switch fabric ASIC complex, capable of supporting 18 dedicated fabric channels operating @ 20 Gbps.

As noted earlier, the new distributed switching model meant that multicast packets from the source (or ingress) LAN module were no longer simply flooded to every destination (or egress) module. Now there was a need to replicate, or copy, the source packets to each destination fabric channel. This led to development of the ingress replication mode.

This mode simply requires the incoming (or ingress) module to perform one packet replication, per OIF, per IP Multicast route (or mroute) for every outgoing (or egress) module. For each additional OIF (per mroute), additional packet replications are made.

Note: Ingress replication mode is quite effective, but all replicated packets cross the switch fabric, consuming bandwidth and buffers (especially on the ingress modules’ fabric channels).

Using the limitations of the ingress replication design as a basis, a new generation of fabric switching and replication ASICs was developed for the WS-X6700 series modules, which allowed for development of the newer egress replication model.

Egress replication mode only requires the ingress module to create replications for its own ports (if local receivers exist). Then, it creates one additional (internal) replication that is distributed to all egress modules (through an Internal Central Rewrite OIF (ICROIF), also known as the egress VLAN). Upon receipt, each egress module then performs any additional replications for its local OIFs.

Note: This model is much more efficient, because it only requires one packet (per mroute) to cross the switch fabric, conserving internal bandwidth and buffers.
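The fabric savings of egress replication over ingress replication can be sketched by counting, for one packet of one mroute, how many copies the ingress module sends into the switch fabric. This is a simplified model based on the descriptions above; the module numbers and OIF counts are hypothetical, and the fabric's own fan-out to the egress modules is not modeled.

```python
# Copies of one multicast packet (one mroute) that the ingress module
# must send into the switch fabric, under each replication mode.

def fabric_copies(oifs_by_module, ingress_module, mode):
    """`oifs_by_module` maps module number -> number of OIFs on that module."""
    remote = {m: n for m, n in oifs_by_module.items()
              if m != ingress_module and n > 0}
    if mode == "ingress":
        # The ingress module replicates once per OIF on every egress module.
        return sum(remote.values())
    if mode == "egress":
        # One internal copy crosses the fabric; egress modules replicate locally.
        return 1 if remote else 0
    raise ValueError(f"unknown mode: {mode}")

oifs = {1: 2, 2: 8, 3: 8, 4: 4}  # hypothetical OIF counts; module 1 is the ingress

print("ingress mode:", fabric_copies(oifs, 1, "ingress"))
print("egress mode: ", fabric_copies(oifs, 1, "egress"))
```

In both modes, the two local OIFs on module 1 are served without crossing the fabric, so they do not appear in either count.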

Pro Tip: [Learn more about ingress and egress replication modes. http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/mcastv4.html#wp1076728]

Now, as with all new and innovative technologies, the original egress replication mode implementation and newer generations of Catalyst 6500 hardware introduced a new set of technical challenges (summarized below). Also, each new challenge changed how the original implementation behaves.

In particular, the PFC3-based egress replication model faces the following challenges:

1. It uses a non session-oriented distribution list of all egress-capable modules, which are added to the ICROIF (or egress VLAN).

2. Where or when the source multicast packets are actually being rewritten, prior to egress replication and transmission.

3. The dual fabric channels (and Fabric Interface and Replication-Engine ASICs) of the newer CEF720 Ethernet modules.

4. The need to integrate egress replication mode with high availability (SSO).

Why are these cases challenging?

1. Egress replication mode uses a special internal egress VLAN to distribute the replicated multicast packets to all egress (destination) modules. Membership to this internal VLAN is determined simply by including all egress-capable modules in a distribution list. Furthermore, there is only a single egress VLAN (per VRF). Hence, there is no guarantee that all programming messages will arrive (once transmitted), and all groups and associated egress modules share the same context.

2. Multicast packets forwarded to a multicast group are rewritten first (by the FIRE ASIC) and are then replicated N number of times, once for each OIF. In edge deployments, this presents a problem, when only some receivers in the group need to perform egress policies (for example, QoS or encapsulation), and some do not.

3. The introduction of the newer dual fabric channel CEF720 modules (in a previously single fabric channel system) now required the ability to send packets to both fabric channels. Since the front-panel ports connect to only one of the two channels, the ASIC bandwidth was being unnecessarily consumed on the other channel. This challenge is partially resolved through egress local optimization.

Pro Tip: [Learn more about egress local optimization. http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/mcastv4.html#wp1093310]

4. High availability for IP Multicast is provided by the Redundancy Framework (RF) and Checkpoint Framework (CF) server and client model, used by SSO. Because the Active Supervisor programs egress replication entries, the forwarding of egress traffic is temporarily disrupted during an SSO, while the egress entries are reinstalled on the new Active Supervisor.

Is this really bad? Yes and no. The first two challenges can result in unnecessary usage of the Fabric-Interface and Replication-Engine (FIRE) ASICs and Switch Fabric bandwidth, which are responsible for forwarding IP Multicast traffic. If you are not already using a scaled IP Multicast network, you would probably never observe such problems.

The latter two challenges are special cases, and are really more limitations than problems. You may want to apply certain egress features (such as MVPN encapsulation), which cannot be supported with the current architecture. If you ever experience an SSO, some packet loss is expected anyway, but you want to reconverge as quickly as possible.

Today’s Solutions

With the new Supervisor 2T, an important part of the Unified MFIB hardware infrastructure is the new EDC server and client model. This new egress replication design supports all previous optimizations, and also adds several new optimizations to more efficiently handle hardware-specific (for example, link-state) changes and programming.

This new EDC-based egress replication model overcomes all of the challenges described previously:

1. Egress replication programming now uses a session-oriented server/client model to determine where to send the replicated packets.

2. Egress-replicated multicast packets are rewritten after egress replication, allowing egress policies and encapsulation.

3. The dual fabric channels (and associated port and fabric indices) are now handled separately and efficiently.

4. Egress replication mode is fully compatible with high availability (SSO).
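The effect of moving the rewrite step after replication (point 2 above) can be illustrated with a toy model. Packets are represented as plain strings, and the OIF names and the encapsulation policy are hypothetical; the sketch only shows why the order of the two steps matters.

```python
# Toy model: order of "rewrite" and "replicate" for one multicast packet.

def rewrite(packet, oif):
    """Hypothetical per-OIF egress policy: only the tunnel OIF is encapsulated."""
    if oif == "tunnel0":
        return "GRE(" + packet + ")"
    return packet

def rewrite_then_replicate(packet, oifs):
    """Earlier order: one rewrite, then identical copies for every OIF."""
    rewritten = rewrite(packet, None)  # no OIF is known at rewrite time
    return {oif: rewritten for oif in oifs}

def replicate_then_rewrite(packet, oifs):
    """Newer order: copy first, then rewrite each copy for its own OIF."""
    return {oif: rewrite(packet, oif) for oif in oifs}

oifs = ["vlan10", "tunnel0"]
print(rewrite_then_replicate("pkt", oifs))
print(replicate_then_rewrite("pkt", oifs))
```

When the rewrite happens first, every copy is identical, so a policy that applies to only some receivers cannot be expressed. When it happens last, each OIF can receive its own treatment.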

First (and perhaps most importantly) the EDC design uses a server and client model for all distributed egress replication ASIC programming. This helps ensure a consistent, session-based and HA-compatible infrastructure to program the various egress modules.

Next, the EDC uses a new concept to replace the original egress VLAN replication model. Instead of using a single internal VLAN for all groups and egress-capable modules, EDC instead uses a per-group Egress Distribution Table (EDT) model.

Finally, EDC uses a new way to program the actual hardware OIFs (in coordination with the new MFIB hardware infrastructure), called Egress Distribution Indexes (EDI). The EDI design optimizes individual port index (LTL POE) and corresponding fabric index (FPOE) usage by combining the (LTL, RBH) pair to uniquely identify an EDI.

Note: LTL is a port-addressing scheme developed for the Catalyst switching platform. More details are provided in later sections.

A single LTL POE port index can represent multiple EDIs, as the combination of LTL and RBH uniquely identifies an EDI. Also, the software component (LTL manager) provides context/session-based callback functions to guarantee EDI programming. This prevents packet loss (for example, a black hole), due to missed programming messages.
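As a rough illustration of the (LTL, RBH) keying described above, the following Python sketch models an EDI table as a dictionary. The class names, the way EDI numbers are handed out, and the sample index values are hypothetical and are not taken from the actual Supervisor 2T data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdiKey:
    """Hypothetical key: one EDI per (LTL index, RBH value) pair."""
    ltl: int
    rbh: int


class EdiTable:
    """Toy table that hands out one EDI number per unique (LTL, RBH) pair."""

    def __init__(self):
        self._by_key = {}  # EdiKey -> EDI number (numbering is invented)

    def get_or_allocate(self, ltl, rbh):
        key = EdiKey(ltl, rbh)
        if key not in self._by_key:
            self._by_key[key] = len(self._by_key)  # next unused number
        return self._by_key[key]

    def edis_for_ltl(self, ltl):
        """All EDIs that share the same LTL POE port index."""
        return sorted(e for k, e in self._by_key.items() if k.ltl == ltl)


table = EdiTable()
edi_a = table.get_or_allocate(ltl=0x140, rbh=0)
edi_b = table.get_or_allocate(ltl=0x140, rbh=1)  # same LTL, different RBH
edi_c = table.get_or_allocate(ltl=0x140, rbh=0)  # same pair as edi_a
```

Here `edi_a` and `edi_c` resolve to the same EDI, while `edi_b` is a second EDI behind the same LTL POE index, which mirrors the statement that a single port index can represent multiple EDIs.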

Combined together, the new EDC server and client design allows full synchronization of the hardware multicast infrastructure between the SSO active and standby supervisors. This allows full HA compliance, and minimizes packet-loss during a switchover.

How Does That Help You?

The new EDC server and client design (combined with the unified MFIB infrastructure) provides a highly reliable and efficient egress replication model, capable of unprecedented hardware multicast scalability.

This allows for IP network administrators to build and manage highly scaled IP Multicast networks:

Capable of scaling up to 256 K mroutes @ 2 Tbps to >500 ports in a single Catalyst 6500 system

Capable of 4 Tbps up to >1000 ports, with a single VSS

Distributed egress replication @ up to 60 Mpps on all 13 slots, with 6513-E and CEF2T modules

This EDC design also provides a reliable (session-oriented) egress replication model, which is fully compatible with multicast HA.

Figure 11. EDC and MFIB Infrastructure

The new EDC design conserves precious system resources, and optimizes the forwarding of IP Multicast data traffic within the system. This also ensures an extremely fast and scalable system, ready for the multicast traffic demands of today and tomorrow.

New Multicast LTL and MET Sharing Design

Save internal forwarding resources for commonly used paths.

Yesterday’s Challenges

There are two basic requirements for hardware-based IP Multicast forwarding:

1. A port-addressing scheme that defines all possible destination OIFs.

2. A method of replicating source frames to the interested “group” of OIFs.

First, all hardware-based switching platforms (Cisco or otherwise) require some internal port-addressing scheme, in order for the forwarding protocols (software) to determine the correct interface (or port) to forward the frames to.

The same is true for the Catalyst 6500 modular hardware platform. Catalyst switches use an internal addressing scheme called Local Target Logic (LTL) that maps all ports (internal and external) within the system, and then divides the available LTL address space into physical Port of Exit (POE) indices, logical Group indices, and broadcast flood indices.

With the introduction of the Switch Fabric ASICs, each fabric channel is directly connected to a set of physical ports (on the front-panel of the line card). Hence, for each unique port (or POE) index there is also a corresponding or complementary Fabric Port of Exit (or FPOE) index.

As with all addressing schemes, there is a careful balance between the length and design of the addresses themselves and the overall processing time (latency) necessary to actually search for and locate the correct address (within a table).

Note: A larger address scheme allows for more unique nodes (for example, IPv6 vs. IPv4), but the additional bits take much longer to locate within the address table (for example, IPv4 addresses generally take one-fourth the processing time of IPv6 addresses).

Figure 12. IPv4 Addressing Scheme

The Catalyst LTL addressing scheme is defined by the IOS firmware (and downloaded to each module during bootup). As mentioned previously, the LTL scheme is divided into various regions, which define the number of unique indices available for different uses.

The physical port (POE) indices are statically set for every module and port in the system, and used for both unicast and multicast packet forwarding. Both require sending packets to a specific port, compared with basic packet flooding (legacy switching behavior). Broadcast forwarding uses simple flood indices, which apply to all ports in a VLAN.

For IP Multicast, each mroute OIF is mapped to one or more unique POE/FPOE indices, and stored as a {MAC,VLAN} pair. This is where logical Group indices are necessary.

With the PFC3 architecture, the multicast (group) LTL region is set to 32 K indices. Each IP Multicast mroute has a group (or list) LTL index associated, and the purpose of this index is to hold a list of physical POE indices. This subsequently sets the multicast scalability number to 32 K.

Second, the addressing just determines which interfaces need to receive the multicast packets. Once that is established, we also need to make the necessary frame replications for each of the OIFs. This is the special job of the replication engine ASIC hardware, which then uses the Multicast Expansion Table (MET) to track which list or group of physical POE indices is associated with each IP mroute.

From an IP Multicast perspective, the MET is used for defining a set of OIFs which require hardware replication. This set of OIFs is called a MET Set. Each MET Set is referenced by an MET (index) pointer, and contains the physical ports (or LTL POEs) for each multicast group entry.

Similar to unicast-based hardware forwarding, the CEF FIB destination lookup process will provide an adjacency index. As the word suggests, this adjacency information identifies the actual next hop to which the packet should be sent.

Unlike unicast-based forwarding, where the adjacency index = LTL POE index, multicast-based adjacencies contain a pointer into the MET. This is simply an additional step, which provides the replication engine with a list of LTL POEs.
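The extra indirection can be sketched as a chain of three table lookups. This is only a conceptual model in Python: the table names, the (S,G) key format, and every index value below are invented for illustration and do not represent the real hardware table layout.

```python
# Hypothetical tables; all keys, indices, and port numbers are invented.
FIB = {("10.1.1.1", "239.1.1.1"): 7}        # (S,G) entry -> adjacency index
ADJACENCY = {7: {"met_pointer": 100}}       # adjacency -> pointer into the MET
MET = {100: [0x101, 0x102, 0x205]}          # MET set -> list of LTL POE indices


def replication_targets(source, group):
    """Follow FIB -> adjacency -> MET and return the LTL POEs to replicate to."""
    adjacency_index = FIB[(source, group)]
    met_pointer = ADJACENCY[adjacency_index]["met_pointer"]
    return list(MET[met_pointer])


targets = replication_targets("10.1.1.1", "239.1.1.1")
```

In the unicast case the chain would stop at the adjacency, which resolves directly to one LTL POE index. For multicast, the replication engine receives the whole list and makes one copy per entry.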

Figure 13. Multicast Expansion Table

The replication engine ASIC may exist on the Supervisor (for legacy bus-based), or directly on the switching module itself. In the case of the newer fabric-based modules, the previously separated Fabric-Interface and Replication-Engine (FIRE) hardware is combined into a single (FIRE) ASIC complex. Each unique FIRE ASIC has its own hardware MET.

Note: The combined FIRE ASIC hardware is what helps enable the Catalyst 6500 egress replication capability.

With the PFC architecture, the MET memory size is set to 64 K entries. Depending on which multicast replication mode is operational (ingress or egress mode), this will define how the MET is programmed.

Ingress Replication Mode requires that all METs are programmed symmetrically (that is, one single set of LTL group indices, for all replication engines). Hence, the total number of MET entries (for the entire system) is 64 K.

Egress Replication Mode allows the METs to be programmed asymmetrically (that is, with different contents, depending on which ports are considered local to the replication engine). So the total number of MET entries is 64 K * N number of replication engines.

Note: This subsequently sets the multicast scalability number to 64 K for Ingress Replication, or 64 K * N replication-engines for Egress Replication.
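The capacity arithmetic above can be written as a small helper. This is a sketch in Python that works in units of K entries, so it makes no assumption about whether K means 1000 or 1024. The function and constant names are hypothetical.

```python
MET_K_PER_ENGINE = 64  # MET size per replication engine, in K entries (from the text)


def total_met_k(mode, replication_engines):
    """Return the system-wide MET capacity in K entries."""
    if replication_engines < 1:
        raise ValueError("at least one replication engine is required")
    if mode == "ingress":
        # All METs are programmed identically, so extra engines add nothing.
        return MET_K_PER_ENGINE
    if mode == "egress":
        # Each engine holds only its local OIFs, so capacities add up.
        return MET_K_PER_ENGINE * replication_engines
    raise ValueError(f"unknown replication mode: {mode!r}")


print(total_met_k("ingress", 4))  # 64
print(total_met_k("egress", 4))   # 256
```

The engine count of 4 is an arbitrary example. The actual value of N depends on which modules are installed in the chassis.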

Is this really bad? It can be, in a scaled environment. Currently, the primary limiting factor is usually the number of unique group LTL indices, strictly setting your total available number of multicast group entries at 32 K. Conversely, you may only use a small number of groups (and few LTL indices), but if these have a large number of OIFs, or if you are using Ingress Replication Mode, you could potentially run out of MET entries.

Therefore, if the current IP Multicast network is relatively small to medium-scale you will probably never encounter a problem. However, if your network grows to a scale that exceeds these hardware limits, it may result in software or Flood-based forwarding.

Today’s Solutions

With the new Supervisor 2T and PFC/DFC4, the new IOS code introduces the concept of sharing these fundamentally-finite internal resources. Since most IP network systems are distribution or aggregation points, many of the same IP Multicast groups will traverse common network paths (for example, uplinks, downlinks, or inter-switch links).

In these cases, where multiple hardware MFIB entries use the exact same OIFs, the same LTL and MET entries can be used, or shared, between them. This allows the overall LTL and MET resource usage, and therefore the entire system, to scale far beyond the finite (address-based) limits of these tables.
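One way to picture this sharing is a reference-counted allocator keyed by the OIF set. The Python sketch below is a simplified model under that assumption. The class name, the interface names, and the tiny capacity are hypothetical, and the real LTL and MET managers are certainly more involved.

```python
class SharedIndexPool:
    """Toy allocator: entries with identical OIF sets share one index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._index_by_oifs = {}            # frozenset of OIF names -> index
        self._refcount = {}                 # index -> number of entries using it
        self._free = list(range(capacity))  # indices not yet handed out

    def acquire(self, oifs):
        key = frozenset(oifs)               # order of the OIFs does not matter
        if key in self._index_by_oifs:
            index = self._index_by_oifs[key]
        else:
            if not self._free:
                raise RuntimeError("index space exhausted")
            index = self._free.pop(0)
            self._index_by_oifs[key] = index
            self._refcount[index] = 0
        self._refcount[index] += 1
        return index

    def release(self, oifs):
        key = frozenset(oifs)
        index = self._index_by_oifs[key]
        self._refcount[index] -= 1
        if self._refcount[index] == 0:      # last user gone, recycle the index
            del self._index_by_oifs[key]
            del self._refcount[index]
            self._free.append(index)

    @property
    def in_use(self):
        return len(self._refcount)


pool = SharedIndexPool(capacity=2)
audio = pool.acquire(["Te1/1", "Te2/1"])
video = pool.acquire(["Te2/1", "Te1/1"])  # same OIF set, so the index is shared
other = pool.acquire(["Te3/1"])
```

With a capacity of only two indices, three entries still fit because `audio` and `video` resolve to the same index. Without sharing, the third `acquire` call would have exhausted the pool.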

Figure 14. LTL Sharing

Note: The Supervisor 2T and PFC/DFC4 architecture can support up to 16 K broadcast domains (BDs), and up to a maximum of 16 K physical ports. For this level of VLAN and port density, the idea of LTL sharing allows use of the same LTL indices for more than one L2 forwarding table entry.

With the current design, the IOS code allocates and uses different LTL indices for different (GMAC, BD) entries. In real life, there are cases in which the receivers of a given BD join the same set of groups. This is typical when using audio/video applications, where the audio and video streams are broadcast over two different multicast groups. In this case, we can optimize the usage of the LTL indices by sharing the same LTL index between the various L2 (GMAC, BD) entries for multicast groups with an identical set of receiver ports.

Note: The notion of sharing entries was first introduced for the MET, on PFC3. However, sharing was (and is) limited to L3 routed interfaces, and did not account for SVI (or VLAN) entries, and was not mentioned previously.

Theoretically, if multiple IP Multicast flows happen to have exactly the same list of OIFs (for example, common network paths), then the exact same MET set/pointer can be shared. With PFC/DFC4, the VLAN and destination LTL POE fields (in the MET) will be used together to represent an adjacency (or L2 rewrite) pointer.

With PFC/DFC3 architecture, the VLAN and destination LTL POE index fields (of a given MET set) reflect the actual VLAN and LTL index of the replicated packets. In other words, the packets are already completely rewritten before replication, and the post-replicated packets contain all of the outgoing information.

With PFC/DFC4 architecture, the VLAN and destination LTL POE index fields are not the actual VLAN and LTL POE index of the replicated packets. Instead, they are used to derive the adjacency pointers associated with the replicated packets.

This is necessary to allow both ingress and egress processing (for the new EDC-based egress replication model), but it also complements the notion of local significance of multicast forwarding (or replication), which allows the LTL and MET manager software to dynamically allocate indices more efficiently.

Note: With PFC/DFC3 architecture, there is a limitation regarding sharing a MET set that contains an SVI interface. If there is an SVI interface in the OIF list, the MET set can be shared only among multicast flows with the same MAC multicast group address (specifically, the 23 least significant bits of MAC/IP Multicast address).

Now (with PFC/DFC4 architecture), this limitation does not apply because the LIF and BD concepts decouple the L2 VLAN and L3 VLAN interface (SVI). The same SVI OIF should have exactly the same adjacency pointer, regardless of the different IP Multicast flows. Therefore, a MET set with both L3 routed and SVI OIFs should always be sharable among different multicast flows with the same OIF list.
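The 23-bit condition in the PFC/DFC3 note comes from the standard IPv4 multicast to Ethernet mapping, where only the low-order 23 bits of the group address are copied into a MAC address under the 01:00:5e prefix. The Python sketch below shows that mapping. The function name and the sample group addresses are chosen only for illustration.

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF       # keep the 23 least significant bits
    mac = 0x01005E000000 | low23       # fixed 01:00:5e prefix + low 23 bits
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("239.1.1.1"))    # 01:00:5e:01:01:01
print(multicast_mac("224.129.1.1"))  # 01:00:5e:01:01:01
```

Two different IP groups can therefore share one MAC group address, which is why the PFC/DFC3 rule is expressed in terms of the MAC address rather than the IP group.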

Figure 15. MET Sharing

How Does That Help You?

The new LTL and MET sharing design (combined with the unified MFIB infrastructure) provides a highly scalable port indexing and multicast replication scheme, designed for massive L2 or L3 IP Multicast networks.

This allows for IP network administrators to build and manage highly scaled IP Multicast networks that are:

Capable of scaling up to 256 K mroutes @ 2 Tbps to >500 ports in a single Catalyst 6500 system

Capable of 4 Tbps up to >1000 ports, with a single VSS

Distributed egress replication @ up to 60 Mpps on all 13 slots, with 6513-E and CEF2T modules

The software LTL and MET managers work in conjunction with the EDC and MFIB components to improve the reliability of index programming, and to synchronize indices for Multicast HA.

Figure 16. LTL, MET, and MFIB Infrastructure

The new LTL and MET sharing design conserves precious system resources, and optimizes the forwarding of IP Multicast data traffic. This supports an unprecedented new level of hardware-based multicast scalability, ready for the growth of today’s multicast networks.

Up to 256 K Multicast Routes in the FIB-XL

Gain unprecedented hardware-based multicast scalability.

Yesterday’s Challenges

Multicast L2 and L3 multi-layer switching forwards IP Multicast data flows between interfaces using specialized Application-Specific Integrated Circuit (ASIC) hardware, which offloads processor-intensive multicast forwarding and replication from the router.

Note: IP flows that cannot be hardware switched are still forwarded by the software.

The Policy Feature Card (PFC) provides L2/L3 hardware switching (and policies) for IP Multicast flows. This hardware-based switching is enabled by the use of a hardware replication table (MET), the forwarding information base (FIB), and adjacency table. The Cisco Express Forwarding (CEF) architecture is used to populate the FIB and adjacency table. Legacy Catalyst-based hardware scalability is limited by two factors:

The size (and allocation) of the FIB TCAM, for IP Multicast (shared with IP Unicast and MPLS)

The number of special software forwarding indexes (specifically, list of physical port indexes)

1. Multicast FIB Allocation: There are two variations of PFC/DFC3 FIB size:

Non-XL based PFC = 256 K total entries (32 K for IP Multicast)

FIB-XL based PFC = 1 M total entries (32 K for IP Multicast)

For the Supervisor 720 and PFC/DFC3, the default allocation for multicast is 32 K entries, configurable to a maximum of 120 K entries.

2. Number of Software Indexes: This is set at initialization, and is true system-wide, because all ASICs must have a consistent (system-wide) view of exactly where the frame is destined.

The Catalyst series uses an internal port-indexing scheme (true for all hardware-based systems), based on a well-known mapping table (LTL POE/FPOE). From this table, some portion is reserved for special software (or group) forwarding indexes, which are simply a list of physical port indexes.

Multicast and other functions use these special software (or group) forwarding indexes, to ensure that all receivers (OIFs) will actually receive the frames. Hence, the number of these reserved indexes becomes a fixed scalability limit.

For the Supervisor 720 and PFC/DFC3, the static allocation is 32 K entries. Hence, the overall hardware limit for IP Multicast is fixed at 32 K entries.

Note: The number of software multicast flows supported is only limited by the CPU and DRAM available, which is generally capable of ~10K packets per second.

Today’s Solutions

The same two scalability considerations apply for Supervisor 2T:

The size (allocation) of the FIB TCAM, for IP Multicast (shared with unicast and MPLS)

The number of special software forwarding indexes (specifically, list of physical port indexes)

The Supervisor 2T and PFC/DFC4 have the same basic FIB TCAM sizes (as PFC/DFC3), but the default allocation is different and larger (and still configurable).

1. Multicast FIB Allocation: There are two variations of PFC/DFC4 FIB size:

Non-XL= 256 K total (128 K for IP Multicast)

FIB-XL = 1 M total (256 K for IP Multicast)

For the Supervisor 2T, the default allocation for PFC/DFC4 is 128 K entries, up to a maximum allocation for PFC/DFC4-XL of 256 K entries.

2. Number of Software Indexes: The internal port-indexing scheme (and allocation for IP Multicast) has also been enhanced. In addition, there is also a new LTL sharing technique which will share commonly-used port indexes.

For the Supervisor 2T and PFC/DFC4, the software allocation is 32 K entries + LTL sharing, for maximum potential of 256 K entries.
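Since both factors must be satisfied at once, a simple way to reason about the resulting hardware limit is to take the smaller of the two. The Python sketch below applies that assumed model to the figures quoted in this section, in units of K entries. The function name is hypothetical, and the "minimum of both limits" rule is a simplification rather than a documented formula.

```python
def hardware_mroute_limit_k(fib_allocation_k, index_capacity_k):
    """Assumed model: the tighter of the two limits wins (values in K entries)."""
    return min(fib_allocation_k, index_capacity_k)


# Supervisor 720, PFC/DFC3: 32 K default FIB allocation, 32 K group indexes.
sup720 = hardware_mroute_limit_k(fib_allocation_k=32, index_capacity_k=32)
# Supervisor 2T, PFC/DFC4: 128 K FIB allocation, up to 256 K with LTL sharing.
sup2t = hardware_mroute_limit_k(fib_allocation_k=128, index_capacity_k=256)
# Supervisor 2T, PFC/DFC4-XL: 256 K FIB allocation, up to 256 K with LTL sharing.
sup2t_xl = hardware_mroute_limit_k(fib_allocation_k=256, index_capacity_k=256)

print(sup720, sup2t, sup2t_xl)              # 32 128 256
print(sup2t // sup720, sup2t_xl // sup720)  # 4 8
```

Note that the 256 K index figure is a best case that depends on how many entries actually share OIF lists.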

How Does That Help You?

You are now able to scale your next-generation hardware-based IP Multicast multi-layer switching capacity (per system) up to 256,000 mroutes, using the FIB-XL (an eight-fold increase), or up to 128,000 mroutes using the standard FIB (a four-fold increase).

This is an unprecedented and unmatched scalability number (up to an eight-fold increase over Sup 720 and PFC/DFC3) of hardware-based IP Multicast forwarding.

Figure 17. Basic MFIB Lookup Process

This also allows you to consolidate the overall number of otherwise separate systems necessary to provide this level of scale for today and tomorrow’s IP Multicast traffic load. That is because a single Supervisor 2T can handle up to eight times more than the Supervisor 720.

Also, when the Supervisor 2T is used with Virtual Switching System (VSS) mode, this single control-plane and active/active data-plane provides up to 256 K mroutes across 1100 physical ports @ 4 Tbps and 60 Mpps.

PIM-SM Source Register Support in Hardware

Save CPU and Memory usage and minimize source register time.

Yesterday’s Challenges

PIM Sparse-Mode (or PIM-SM) requires the notion of a Rendezvous Point (or RP). As the name implies, the basic objective is to provide all PIM routers with a pre-arranged meeting place for all multicast distribution.

Note: The term “rendezvous” comes from the French contraction of “rendez” and “vous”, meaning “return (to)” and “you”, or more commonly “meet you”.

The PIM RP is the place where all multicast traffic sources (with no direct connection to receivers) and the interested traffic receivers (with no direct knowledge of the sources) can meet together. In practical terms, the RP is how the initial PIM distribution tree is built, completing the connection between the sources and receivers.

The job of the Designated Router (or DR), which is directly connected to the source IP subnet, is to notify the RP that a new source has come online and begin the forwarding of multicast data. It does so through the PIM-SM source registration process, the basic steps of which are summarized below:

Step 1. DR (FHR) registers source IP

DR sends PIM register messages (for every source and group) to the PIM RP

DR encapsulates registers as unicast IP data and sends directly to RP address

Step 2. RP receives and processes registers

Since register messages are unicast to the RP address, the destination is the MSFC

PIM RP must de-encapsulate all register messages

Can cause high CPU (requires a rate-limit)

Step 3. RP sends Register Stop to DR (FHR)

RP processes the registers and sends PIM register-stop messages back to DR
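The three steps can be modeled as a short exchange between two objects. The Python sketch below is deliberately simplified: it assumes the RP answers the first register for each (S,G) with an immediate register-stop, and it counts every register as one software (CPU) event on the RP. The class names, method names, and addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RendezvousPoint:
    registered: set = field(default_factory=set)
    cpu_punts: int = 0  # registers handled in software in this model

    def receive_register(self, source, group):
        # Step 2: the unicast register is de-encapsulated and processed.
        self.cpu_punts += 1
        self.registered.add((source, group))
        # Step 3: tell the DR to stop sending registers for this (S,G).
        return "register-stop"


@dataclass
class DesignatedRouter:
    rp: RendezvousPoint
    stopped: set = field(default_factory=set)

    def on_multicast_packet(self, source, group):
        """Return True if this packet caused a register to be sent."""
        key = (source, group)
        if key in self.stopped:
            return False
        # Step 1: encapsulate the packet and unicast it to the RP.
        if self.rp.receive_register(source, group) == "register-stop":
            self.stopped.add(key)
        return True


rp = RendezvousPoint()
dr = DesignatedRouter(rp)
sent = [dr.on_multicast_packet("10.1.1.1", "239.1.1.1") for _ in range(3)]
```

Even in this best case, the RP handles at least one register per (S,G), so the software load grows linearly with the number of sources and groups that come online.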

This is standard PIM-SM behavior (RFC-2362), and if only a small number of IP mroutes are in use, then the source registration process poses no challenge at all. The PIM RP simply handles the source-register messages, sends register-stop, and continues.

Is this really bad? No, but it can be in a scaled environment. If (or when) the RP must handle source registration for several thousand or even hundreds of thousands of IP mroutes, then all of these unicast-based IP frames that are destined to the MSFC can result in high CPU usage, which in turn can trigger a variety of negative side effects.

Note: The high CPU usage problem can be effectively mitigated with the use of software or hardware rate limiters, which intentionally drop frames at a specified data rate. However, this can also cause failed or delayed source-registration.

Today’s Solutions

With the new Supervisor 2T, the entire PIM-SM source-registration process is now handled by the PFC/DFC4 hardware. The encapsulation and de-encapsulation of unicast register messages (and multicast data) and register-stop messages are processed in hardware.

Note: The software aspect of this capability also comes from the unified IPv4/IPv6 MFIB infrastructure, which also introduces dedicated encapsulation/de-encapsulation tunnel interfaces for PIM-SM source-registration.

On the FHR (DR), the initial multicast packets are leaked to the software for mroute state creation. The PIM encapsulation is then performed in hardware at the FHR, and the encapsulated packets are sent directly to the RP.

On the RP, the initial packets are also leaked to the software for state creation. The register packets are then de-encapsulated in hardware and forwarded down the shared tree to the receivers. Some packets are leaked to the router so that it can send the register-stop.

How Does That Help You?

If you use PIM sparse-mode, then you need to plan for PIM source registration capacity. This is especially true when lots of multicast sources come online (near) simultaneously, and/or you only use one PIM RP, or after a network reconvergence.

Performing source registration in the software works, but doesn’t scale well.

Performing source registration in the hardware is much faster and more predictable.

Having this hardware capability will provide your PIM-SM network with a highly resilient and reliable source-register setup process

It will protect your Catalyst 6500 MSFC CPU from over utilization, and preserve finite inband channel packet processing resources

It will also reduce the latency associated with source registration, reducing the overall time necessary to forward multicast packets

Figure 18. Source Register Process

In summary, this new hardware capability will significantly reduce CPU and memory usage, and simultaneously reduce the amount of time/latency (and associated caveats) for the PIM source-registration process to occur.

PIM-SM Dual-RPF Support in Hardware

Save CPU and memory usage and minimize SPT switchover time.

Yesterday’s Challenges

Once the PIM-SM source registration process has completed, a source-based distribution tree (S,G) is established between the source Designated Router (or first-hop router) and the RP. It is then up to the IGMP/MLD receivers to solicit the multicast data.

A PIM mrouter that receives IGMP/MLD joins on its directly connected interfaces is considered the receiver DR (or last-hop router). This DR is responsible for translating the IGMP or MLD joins into PIM (*,G) joins, and sending these to the PIM RP.

Note: PIM and IP Multicast routing uses the Reverse Path Forwarding (RPF) mechanism, to determine which interface has the best path (or metric) to a given IP address.

Once the PIM RP receives the (*,G) PIM joins, it will build an RP-based distribution tree along the links from where the joins were received. In this way, the multicast traffic has a complete link from source to receiver, through the centralized rendezvous point.

This RP-based distribution is functional, but the network may be using a suboptimal path for the traffic (causing additional latency, wasted bandwidth, etc.). Once the traffic arrives at the receiver DR, it can now determine the source IP address, and determine the best (or optimal) path.

The receiver DR uses RPF (derived from the IP routing table) to determine the best path, and then sends new (S,G) PIM joins towards the source DR. Once the source DR receives the (S,G) joins, it can add that interface to its OIF list. Now, the traffic can flow over the optimal path. This process is called the Shortest Path Tree (or SPT) switchover.

The basic steps of this process are summarized below:

Step 1. Source sends multicast to DR (FHR)

DR registers source and group with PIM RP

RP sends (S,G) join to FHR for source

Traffic flows from source to RP

Step 2. Receiver sends join to DR (LHR)

DR sends (*,G) join to RP for group

Traffic flows from source to RP to receiver

Step 3. DR (LHR) sends join to DR (FHR)

DR learns source IP and sends (S,G) join

Traffic flows from source to receiver

This is standard PIM-SM behavior (RFC-2362). If only a small number of IP mroutes are in use, then the SPT process poses no challenge at all. The PIM DRs and RP simply handle all of the PIM joins, perform SPT switch-over, and the process continues.

Is this really bad? Not really, but it can be, in a scaled environment. If or when the RP and DRs must handle SPT switchover for thousands or even tens or hundreds of thousands of IP mroutes, then all of these RPF calculations and PIM join frames destined to the MSFC can result in high CPU usage. Then, a variety of negative events can occur.

Note: The high CPU usage problem can be effectively mitigated with the use of software or hardware rate limiters, which intentionally drop frames at a specified data rate. However, this may also cause failed or delayed SPT switchover.

Today’s Solutions

With the new Supervisor 2T, the PFC/DFC4 hardware is capable of processing 16 different (equal cost) RPF entries for a given IP address.

Note: The 16 path RPF capability is also available for Unicast RPF and CEF-based ECMP.

For IP Multicast, this enables Dual-RPF support for hardware-based Shortest Path Tree (SPT) switchover. In its simplest form, the Source (IP) RPF is programmed as the first RPF interface, while the RP RPF is programmed as the second RPF interface of the (S,G) entry.

If a source packet comes in from the first RPF (source RPF) interface, it is forwarded and copied to the MSFC so that the SPT (T) bit can be set, and the switchover to the SPT can take place.

If a packet comes in from the second RPF (RP RPF) interface, it is only forwarded but not copied to the MSFC.
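The two cases above amount to a small decision function on the ingress interface. The Python sketch below models that decision. The type names and interface names are hypothetical, and the final "drop" branch for any other interface is an assumption based on ordinary RPF-failure behavior, since the text above only describes the two matching cases.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FORWARD_AND_COPY_TO_MSFC = "forward and copy to MSFC"
    FORWARD_ONLY = "forward only"
    DROP = "drop"


@dataclass(frozen=True)
class DualRpfEntry:
    """Hypothetical (S,G) entry holding both RPF interfaces."""
    source_rpf_if: str  # first RPF interface: best path toward the source
    rp_rpf_if: str      # second RPF interface: best path toward the RP


def rpf_check(entry, ingress_if):
    if ingress_if == entry.source_rpf_if:
        # Traffic arrived on the shortest path: let software set the T bit.
        return Action.FORWARD_AND_COPY_TO_MSFC
    if ingress_if == entry.rp_rpf_if:
        # Traffic is still arriving through the RP: forward in hardware only.
        return Action.FORWARD_ONLY
    # Assumption: anything else fails the RPF check and is not forwarded.
    return Action.DROP


entry = DualRpfEntry(source_rpf_if="Te1/1", rp_rpf_if="Te2/1")
from_source = rpf_check(entry, "Te1/1")
from_rp = rpf_check(entry, "Te2/1")
from_elsewhere = rpf_check(entry, "Te3/1")
```

Keeping both interfaces in one entry is what lets the hardware keep forwarding shared-tree traffic while the switchover to the source tree is still in progress.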

For IP Multicast, this allows the code to simultaneously track both the Rendezvous Point and Source IP address information. With this new capability, the entire RPF calculation and SPT process can be done in the hardware, while still maintaining the traditional PIM and multicast routing processing (software) and state management.

How Does That Help You?

If you use PIM sparse-mode, then you need to plan for SPT switchover capacity, especially when lots of multicast receivers or sources come online, or after a network reconvergence.

Performing the SPT switchover in the software is functional, but doesn’t scale well. Performing the SPT switchover in the hardware is much faster and more predictable.

Having this capability will provide your PIM-SM network with a highly resilient and reliable SPT switchover process

It will protect your Catalyst 6500 MSFC CPU from overutilization, and preserve inband channel packet resources

It will reduce the latency associated with SPT switchover, reducing the amount of time necessary to forward packets, over optimal paths

It will also reduce the packet duplication, or loss, that can occur when both RP and source multicast traffic are being sent, during SPT switchover

Figure 19. Basic RPF Lookup

To summarize, this new hardware capability will significantly reduce CPU and memory usage, and simultaneously reduce the amount of time (and associated caveats) for the PIM SPT switchover to occur.

Simplified Global L2 IGMP Snooping Design

Gain a simplified L2 snooping configuration and querier redundancy.

Yesterday’s Challenges

Similar to MMLS, the original Layer 2 IGMP Snooping Code was developed specifically for the Catalyst 6500, before it was implemented on any other Cisco switching platforms. This original design was supplementary to the existing L3 IGMP process, meant for special environments that handled only L2 IP Multicast.

With PFC/DFC3 and earlier IOS releases, the Catalyst 6500 IGMP Snooping Design faced two main challenges.

1. IGMP snooping (process) is enabled by default, when “IP Multicast Routing” is configured. However, any non-default configuration (for example, the IGMP snooping querier feature) must occur on the associated SVI (or VLAN interface).

2. IGMP snooping querier does not support querier election, and thus does not support redundancy. If another IGMP querier is present (for example, L3 mrouter), the snooping querier will defer and wait for a specified timeout, before resuming.

Pro Tip: [Learn more about IGMP Snooping (Sup 720 and 12.2SX IOS) http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/snooigmp.html]

SVI-Based IGMP Snooping Configurations

As noted earlier (and similar to MMLS), this design was developed before a standard (platform-independent) design was determined and developed.

This design requires the IGMP snooping configurations to be applied directly to the VLAN interface (or SVI), in order to associate its pure L2 functionality with the (otherwise unnecessary) L3 interface. It does this to use the IP address in its querier messages.

Furthermore, even if there is no intention of using L3 functions (for example, IP Multicast Routing), the current configuration model still requires the user to specify an IP address (used in the IGMP queries), after which the SVI can be shut down.

Is that really bad? No. It poses no technical problem. The only challenge is simply that it is inconsistent with some other Cisco switching platforms.

No IGMP Snooping Querier Election

As noted earlier (and similar to MMLS), this design was developed before a standard (platform-independent) design was determined and developed.

Since the purpose (intention) of the pure L2 IGMP snooping querier feature is simply to mimic the L3 IGMP querier functionality, it was originally designed to be supplementary (and subservient) to the L3 IGMP querier.

Note: The Layer 2 IGMP snooping querier feature should not be confused with the L3 IGMP querier (RFC-2236), which is automatically enabled on L3 PIM interfaces and does support querier election. The snooping querier was meant to be deployed in pure L2 multicast environments where no (zero) L3 IGMP queriers (specifically, PIM interfaces) existed, which (per RFC) is considered a special case.

As a result, it was designed to intentionally stop functioning (and wait for a timeout period, before resuming), if ever another IGMP querier was enabled on the same Layer 2 subnet. Also, because it is a special case, it had no requirement (at that time) for querier election.

6504E.S720.SA.2(config)# interface vlan 1001
6504E.S720.SA.2(config-if)# ip address 101.2.1.254 255.255.255.0
6504E.S720.SA.2(config-if)# ip igmp snooping querier
6504E.S720.SA.2(config-if)# shutdown

Is that really bad? Generally speaking, no. If multiple L3 PIM interfaces exist within the same IP subnet, these will automatically enable the L3 IGMP querier function and election, and assume the IGMP querier function.

If one Layer 3 IGMP querier fails, the election process will help ensure that another Layer 3 querier takes over. Meanwhile, any configured L2 IGMP snooping querier will remain silent. If all Layer 3 queriers fail, then the L2 snooping querier will assume the responsibility after a timeout period.

The L2 IGMP snooping querier timeout period is 3 * GQ + processing time, or approximately 5 minutes. Depending on when the last query was sent on a particular IP subnet, this can result in temporary packet loss from 1 second (min) up to 5 minutes (max).

Note: This limitation is tracked by the DDTS report CSCsk48795.

Also, there is a special case scenario in pure L2 multicast environments where zero Layer 3 PIM/IGMP interfaces are present. Because redundancy is not supported, a failure of the L2 IGMP snooping querier will result in packet loss (after the IGMP GQ timeout) on that subnet, until the querier functionality is restored.

Today’s Solutions

With the new Supervisor 2T and 12.2(50)SY (and later) IOS code, the Catalyst 6500 will implement a standardized global IGMP snooping model, consistent with other Cisco switching platforms.

First, as the name implies, the new model does not use SVI-based configuration. IGMP snooping configurations are now applied to each VLAN at the global configuration level. This logically separates the pure L2 configuration aspect, from an otherwise L3 specific VLAN interface (SVI).

Here is an example of the new global IGMP Snooping configuration:

6513E.SUP2T.SA.1(config)#vlan config 148
6513E.SUP2T.SA.1(config-vlan-config)#ip igmp snooping

Second, the new global model also supports IGMP snooping querier elections (making it consistent with RFC-2236). This allows for querier redundancy, within a pure L2 IP Multicast environment (where no IGMP routers exist). Here is an example of the new IGMP snooping querier configuration:

6513E.SUP2T.SA.1(config)#ip igmp snoop querier
6513E.SUP2T.SA.1(config)#ip igmp snoop querier address 3.3.3.3
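The election itself can be illustrated with a minimal sketch. It assumes the RFC 2236 rule that the candidate with the numerically lowest IP address becomes the querier. The function name and the candidate addresses other than 3.3.3.3 are hypothetical and are not taken from any Cisco implementation.

```python
import ipaddress


def elect_querier(candidate_addresses):
    """Return the candidate with the numerically lowest IPv4 address.

    Assumption: RFC 2236 style election, where the lowest address wins.
    """
    return str(min(ipaddress.IPv4Address(a) for a in candidate_addresses))


# Hypothetical snooping queriers in the same VLAN.
candidates = ["3.3.3.3", "10.1.148.2", "3.3.3.10"]
print(elect_querier(candidates))
```

Note that the comparison is numeric, not textual. A plain string sort would wrongly place 10.1.148.2 ahead of 3.3.3.3.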

How Does That Help You?

The new global IGMP snooping model provides three main benefits:

Decouples the L2 IGMP snooping configuration from L3 VLAN interfaces

Provides IGMP snooping querier election, for redundancy within pure L2 IP Multicast environments

Implements the same IGMP snooping configuration model as other Cisco switching platforms

Figure 20. IGMP Snooping Querier Election

The first benefit makes it easy to differentiate L2 functions (specifically, switching or bridging on a single IP subnet) from L3 functions (specifically, routing between IP subnets). This makes overall configuration and monitoring of these separate functions (often configured together for different uses on the same Catalyst 6500) much easier to understand and operate.

The second benefit allows for IGMP querier redundancy, within pure L2 subnets. This is fairly common within the data center access and distribution environments, but also within L2-based core networks. Another example would be in L2 environments separated by firewalls, which may not support L3 IGMP querier functionality (hence, requiring an L2 IGMP snooping querier).

Combined, these benefits mean that the new global IGMP snooping model provides a more intuitive, simplified, and redundant network design, which is consistent with other Cisco switching platforms. It will make overall setup and maintenance of an L2 IP Multicast environment easier and more reliable.

IP-Based (Compared to DMAC-Based) L2 Forwarding Lookups

Remove the IP-to-MAC address overlap for L2 multicast.

Yesterday’s Challenges

Whenever someone references Layer 2 (or L2) of the OSI model, they literally mean the “Data Link” or “Network Interface” (for example, NIC) layer. This layer relies on unambiguous MAC addresses, associated with a single entity, within a single LAN. This is frequently called a broadcast domain.

Pro Tip: [Learn more about L2 of the OSI model http://en.wikipedia.org/wiki/Data_Link_Layer]

As with Layer 3 (or L3) of the OSI model, this MAC-based addressing applies with or without the notions of unicast or multicast forwarding models (broadcast assumes a L2 environment). The unique addresses are simply necessary to determine where the frames need to go.

In the case of IP Multicast, we use a special destination address to represent a set or list of unique destination addresses. For L3, this is the destination (or group) IP address. For L2, this is the destination MAC address.

However, an IP address is either exactly 32 bits (IPv4) or 128 bits (IPv6), while a MAC address is exactly 48 bits. Of the 48 bits in a MAC address, 24 bits are reserved for the Organizationally Unique Identifier (OUI) or Vendor ID, leaving 24 bits for a unique ID.

Here, we encounter a new challenge. In order to achieve the translation between an L3 multicast IP address and an L2 multicast MAC address, the low-order 23 bits of the IP address (L3) are mapped into the low-order 23 bits of the MAC address (L2). The remaining high-order bits of the IP address are not carried into the MAC address, so multiple group IP addresses map to the same MAC address.

Note: For IPv4, the high-order 4 bits of the L3 IP address are fixed to “1110”, to indicate the “Class D” multicast address space between “224.0.0.0” and “239.255.255.255”. The special OUI multicast MAC addresses start with “01:00:5E”, allowing for a range from “01:00:5E:00:00:00” to “01:00:5E:7F:FF:FF”.

Figure 21. IPv4 to MAC Address Mapping

With the PFC/DFC3 architecture, the IGMP/MLD snooping process populates the L2 multicast forwarding tables, based on destination group MAC address (or DMAC). Hence, if two separate (and otherwise unique) Layer 3 IP Multicast groups share the same L2 MAC address (for example, 224.1.1.1 and 239.1.1.1 = 01:00:5E:01:01:01), then IGMP/MLD Snooping will treat them the same.
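The IPv4 mapping and the resulting overlap can be reproduced with a short sketch. The function name is hypothetical. The logic simply places the low-order 23 bits of the group address under the fixed 01:00:5E prefix. Because the 5 variable bits above those 23 are dropped, 32 group addresses share each MAC address.

```python
import ipaddress


def ipv4_group_to_mac(group):
    """Map an IPv4 multicast group address to its 01:00:5E multicast MAC."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF          # keep only the low-order 23 bits
    mac = 0x01005E000000 | low23          # fixed prefix, 24th bit stays 0
    return ":".join(f"{(mac >> shift) & 0xFF:02X}" for shift in range(40, -8, -8))


# Both groups from the example above collapse to the same DMAC.
print(ipv4_group_to_mac("224.1.1.1"))
print(ipv4_group_to_mac("239.1.1.1"))
```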

Note: For IPv6, there is a new OUI format for multicast. The leading two bytes are set to 33:33, while the following four bytes/32 bits are available for address mapping from the last 32 bits of the 128 bit multicast address (for example, 33:33:XX:XX:XX:XX where X is the last 32 bits of the address).
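As a minimal sketch of the IPv6 mapping described in this note, the following copies the last 32 bits of the group address behind the fixed 33:33 prefix. The function name is hypothetical.

```python
import ipaddress


def ipv6_group_to_mac(group):
    """Map an IPv6 multicast group address to its 33:33:XX:XX:XX:XX MAC."""
    addr = ipaddress.IPv6Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv6 multicast address")
    low32 = addr.packed[-4:]              # last 4 bytes of the 16-byte address
    return ":".join(f"{b:02X}" for b in b"\x33\x33" + low32)


print(ipv6_group_to_mac("ff02::1"))
```

Since only 32 of the 128 address bits are carried over, distinct IPv6 groups that share the same last 32 bits also share a MAC address.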

Figure 22. IPv6 to MAC Address Mapping

Since all hardware-based L2 forwarding decisions rely on the IGMP/MLD snooping table, and because the same MAC address is used (in the L2 switching table), this address overlap will result in unnecessary forwarding to uninterested hosts (within the same broadcast domain).

Pro Tip: [Learn more about L2 multicast addressing. http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a00800b0871.shtml - multi]

Is that really bad? It depends. If no IP:MAC address overlap exists, then no problem exists. So if you are careful to avoid certain multicast addresses (for example, 224.0.0.X and 239.0.0.X), then you will not encounter an overlap. Furthermore, the only real problem is unnecessary multicast packet forwarding to uninterested hosts.

However, if you have an environment with significant address overlap, or the overlapping multicast flows are all high rate (for example, video), then the unnecessary L2 forwarding can consume all available network bandwidth.

Today’s Solutions

With the new Supervisor 2T and PFC/DFC4, the basic L2 forwarding lookup can now be either IP-based or DMAC-based, with the default as IP-based. With group IP-based L2 lookups, the IP:MAC address overlap problem can be eliminated.

The L2 multicast lookup mode of each bridge domain (or BD) is configurable by the user. To maintain backward compatibility (for example, if a user has static multicast MAC addresses in a saved configuration), the lookup mode can be changed to be group MAC address based.

Note: In either lookup mode, the PFC/DFC4 supports consistency checking on MAC address and IP address in a given multicast packet, to prevent inconsistent L3 or L2 tables. It does this by using a special IP-based MAC entry in the L2 forwarding table.

You will recall that it was the preset (24 bit) OUI that created the address overlap. This was done to uniquely ID the frame as IP Multicast. However, if a different OUI is used (which uniquely identifies the frames as IP Multicast, and contains the IP address information) then we can overcome the address overlap. The PFC/DFC4 is the first forwarding-engine to support this special OUI, and hence, the first to support IP-based L2 lookups.

This new OUI field includes the previously lost 20 bits of IP address, as well as other information, such as the LIF and BD, IP version, and more. These values are stored together, and provide a unique OUI, per flow, which IGMP snooping can use to build the L2 forwarding table.

Figure 23. PIM-SM IP-Based L2 DMAC Entry

You probably noticed that the remaining (or low-order) 23 bits still remain the same as before, and the only real difference is the new OUI field. This is to allow for backwards-compatibility (specifically, DMAC-based) using the same L2 forwarding table format.

Otherwise (using the new default IP-based design), because each L2 multicast entry in the L2 forwarding table is now completely unique, no address overlap exists.

How Does That Help You?

The new IP-based L2 forwarding lookup capability eliminates the earlier IP:MAC (or L3:L2) address overlap problem.

This greatly simplifies network design and management, and will make your network more consistent and flexible. This also provides many more IP Multicast group addresses, which were previously unavailable, allowing for much greater scalability.

IGMPv3 and MLDv2 Snooping in the Hardware

Gain faster updates of IPv4/IPv6 PIM-SSM L2 host tables.

Yesterday’s Challenges

IGMPv3 (RFC-3376) and MLDv2 (RFC-3810) are the latest versions of IGMP (IPv4) and MLD (IPv6) host signaling protocols. These allow hosts (specifically, receivers) to define a list of specific sources they want to receive traffic from.

IGMPv3 and MLDv2 support is required for the operation of low-latency (S,G) source-based forwarding. This is based on PIM Source-Specific Multicast (SSM) mroutes, which do not rely on a PIM RP.

Figure 24. SSM and IGMPv3 (S,G) Behavior

This is possible because the host join/leave messages explicitly denote the desired source IP addresses (INCLUDE), and also help enable the network to drop multicast packets from any unwanted sources (EXCLUDE).

Pro Tip: [Learn more about PIM-SSM. http://www.cisco.com/en/US/partner/docs/ios/12_2/ip/configuration/guide/1cfssm.html]

PFC/DFC3 and earlier forwarding engines only supported IGMPv1/2 and MLDv1 snooping in hardware. This is because the L2 snooping tables are based on (DMAC, VLAN), and do not have any source-specific information.

To allow for IGMPv3 and MLDv2 (S,G) snooping, a hybrid approach was used, combining the existing IGMPv1/2 and MLDv1 (*,G) snooping hardware-based capabilities with a new software-based source IP tracking table.

Note: The default (L3) IGMP interface mode for PFC/DFC3 is IGMPv2 (default version). Users must intentionally enable IGMPv3 support, using the ip igmp version 3 interface command.

When you enable IGMPv3 snooping, the software maintains IGMPv3 states based on messages it receives for a particular group, in a particular VLAN, and uses the existing hardware to store the information.

Pro Tip: [Learn more about PFC/DFC3-based IGMPv3 snooping http://www.cisco.com/en/US/partner/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/snooigmp.html - wp1100551]

Is that really bad? Generally speaking, no. This hybrid approach works very well, and in small to medium-size PIM-SSM environments, the additional processing delay (latency) and load (CPU usage) will be minimal.

Furthermore, IGMP and MLD host reports are normally infrequent, and because these are network-edge (or subnet) technologies, a single IP Multicast system will usually not need to process very many IGMPv3/MLDv2 reports, at a given moment.

However, in a very large L2 environment or if a single system must process a very large number of IGMPv3/MLDv2 reports, then the hosts may experience increased join/leave latency, due to the increased CPU usage.

PIM-SSM and IGMPv3/MLDv2 are intended to provide a shortest-path, low-latency solution for IP Multicast. Hence, any unnecessary (or worse, unpredictable, due to load variations) latency is considered counterproductive.

Today’s Solutions

With the new Supervisor 2T and PFC/DFC4, the system is now able to store and track source-specific IP information in the hardware. This is actually another capability of the new IP-based L2 forwarding lookup design.

In the diagram below, host H1 joins channel (S1,G) and hosts H2 and H3 join channel (S2,G). For each incoming multicast frame, the forwarding engine examines its source IP address, as well as the group IP address, to determine which port set (represented by LTL index) the packet will be forwarded to.

Figure 25. SSM and IGMPv3 L2 Behavior

For IGMPv3, the PFC/DFC4 performs a longer lookup with both group and source IP addresses in two steps: one in the pre-L3 processing and the other in the post-L3 processing. Hence, two L2 entries are installed.

The pre-L3 entry is the same as IGMPv1/v2 group IP-based entry (as described earlier). The key of the post-L3 entry is encoded with source IP address, as well as the pre-L3 entry address (to avoid collision with other entries of the same source IP address).

For example, we will assume a PIM-SSM group address 232.1.1.1 and source address 192.168.10.10. Two separate entries are installed in the L2 forwarding table:

Figure 26. IP-Based L2 Entries for SSM and IGMPv3

Now (as described in the IP-based compared to DMAC-based L2 forwarding section), any L2 entries with the same Group IP address, but different Source IP addresses, can share the same pre-L3 lookup entry.

For example, for two channels (192.168.10.10, 232.1.1.1) and (172.16.1.1, 232.1.1.1), there are three L2 forwarding entries (one shared pre-L3 entry and one post-L3 entry per source):

Figure 27. L2 Entries for Same SSM Group IP, but Different Source IPs
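The entry sharing described above can be counted with a small sketch. It is a simplified model, not the actual table format: it assumes one pre-L3 entry per distinct group and one post-L3 entry per distinct (source, group) channel. The function name and tuple labels are hypothetical.

```python
def l2_entries(channels):
    """Return the modeled L2 entries for a list of (source, group) channels."""
    groups = sorted({group for _, group in channels})
    unique_channels = sorted(set(channels))
    pre_l3 = [("pre-L3", group) for group in groups]
    post_l3 = [("post-L3", source, group) for source, group in unique_channels]
    return pre_l3 + post_l3


channels = [("192.168.10.10", "232.1.1.1"), ("172.16.1.1", "232.1.1.1")]
for entry in l2_entries(channels):
    print(entry)
```

With the two channels from the example, the model yields one shared pre-L3 entry for 232.1.1.1 and two post-L3 entries, for three entries in total.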

IGMPv3 also allows hosts to send joins that specify source filtering, either including or excluding a list of sources. Here is an example of three hosts joining group 232.1.1.1 with different source filtering lists:

Host1 would like to receive the group traffic only sourced from 192.168.10.10

- INC(Host1)={192.168.10.10}.

Host2 would like to receive the group traffic from any source

- EXC(Host2)={}.

Host3 would like to receive from any source except 192.168.10.10 and 172.16.1.1

- EXC(Host3)={192.168.10.10, 172.16.1.1}

In this case, the same three L2 entries (shown above) are needed for channels (192.168.10.10, 232.1.1.1), (172.16.1.1, 232.1.1.1) and (*, 232.1.1.1). According to the reports of the three hosts above, the receiver list for each channel is listed below:

OIF_LIST(192.168.10.10, 232.1.1.1) = {Host1, Host2}

OIF_LIST(172.16.1.1, 232.1.1.1) = {Host2}

OIF_LIST(*, 232.1.1.1) = {Host2, Host3}
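The receiver lists above can be derived mechanically from the three host reports. The following is a minimal sketch of that reasoning, not the PFC/DFC4 implementation. The data structure and function name are hypothetical, and a source of None stands for the (*, G) entry, which covers any source not matched by a specific channel.

```python
GROUP = "232.1.1.1"

# Host reports for the group: (filter mode, source list).
reports = {
    "Host1": ("INCLUDE", {"192.168.10.10"}),
    "Host2": ("EXCLUDE", set()),
    "Host3": ("EXCLUDE", {"192.168.10.10", "172.16.1.1"}),
}


def oif_list(source, host_reports):
    """Return the hosts that should receive traffic from the given source."""
    receivers = []
    for host, (mode, sources) in host_reports.items():
        if mode == "INCLUDE":
            wanted = source in sources      # only explicitly listed sources
        else:
            wanted = source not in sources  # everything except listed sources
        if wanted:
            receivers.append(host)
    return receivers


for src in ("192.168.10.10", "172.16.1.1", None):
    label = src if src is not None else "*"
    print(f"OIF_LIST({label}, {GROUP}) = {oif_list(src, reports)}")
```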

Hence, the new IP-based L2 forwarding lookup design, coupled with the ability to store and add source IP address information, allows the PFC/DFC4 to perform full IGMPv3/MLDv2 snooping processing in the hardware.

How Does That Help You?

The new Supervisor 2T and PFC/DFC4 hardware-based processing support for IGMPv3 and MLDv2 Snooping will allow for much faster and more reliable updates of PIM-SSM multicast routes, supporting a consistent, low-latency solution.

The new design will also significantly minimize the host-specific join and leave latency, to guarantee that L2 behavior is consistent with the expectations of PIM Source Specific Multicast (SSM).

IGMPv3 and MLDv2 have many unique and complex processing requirements (for INCLUDE and EXCLUDE filtering) that only the Catalyst 6500 and Supervisor 2T hardware are capable of providing. Hence, this is a significant performance enhancement over previous generations, as well as all other (non-Cisco) IGMPv3 and MLDv2 designs, in L2 environments.

New Optimized Multicast Flood (OMF) Design

Save forwarding resources and bandwidth for “source-only” VLANs.

Yesterday’s Challenges

When IP Multicast traffic arrives from a directly connected source host, the L2 forwarding lookup will fail, if (or when, e.g. during initial setup) no IGMP/MLD receiver has been learned in the ingress bridge-domain (VLAN). This is because the L2 multicast snooping process is responsible for populating the L2 forwarding table.

Hence, this L2 lookup failure can result in unnecessary multicast frame flooding to all ports that are part of the same (ingress) bridge-domain. This type of multicast network design (specifically, 1+ sources and 0 receivers), and the resulting flood behavior, is commonly known as source-only forwarding.

Instead of just being flooded to all bridge-domain member ports, this type of source-only multicast traffic should be constrained to only the L3 multicast router ports (to be routed to remote receivers).

The PFC/DFC3 and previous forwarding-engines assisted source-only forwarding by periodically leaking traffic to the software. This allows the IOS software to program a special group DMAC entry in the L2 table, with multicast router ports as receiving ports.

This design works very well, but faces two main challenges:

One L2 “source-only” group DMAC entry, per one mroute (1:1)

Periodic L2 flooding, due to source-only aging (timer expiry) and relearning

The first challenge is related to L2 multicast scalability. Since each multicast route requires a separate L2 forwarding table entry, the aggregate number of entries can consume a large amount of the L2 forwarding table size.

The second challenge results in periodic waste of network bandwidth, and frames will be flooded to hosts that did not explicitly request the data (which is contrary to the purpose of IP Multicast).

Pro Tip: [Learn more about PFC/DFC3-based “Source-Only” entries http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a00800b0871.shtml#src_only]

Is that really bad? Generally speaking, no. As with many cases, it depends on the scale of the network.

If a large aging time is configured or source-only rate-limiting is disabled, the L2 forwarding table can become filled with unused (stale) entries that the switch learned by using source-only processing, or by using the IGMP join messages.

If the L2 forwarding table is full, and the switch receives traffic for new IP Multicast groups, it floods the packet to all ports in the same VLAN. This unnecessary flooding can impact switch performance.

Also, there is a period of time (default is every 5 minutes) when the L2 DMAC-based source-only entries are removed, to allow the traffic to be flooded to the multicast router port. This periodic flooding will stop, once the source-only entries are reprogrammed.

Today’s Solutions

With the new Supervisor 2T and PFC/DFC4, the new IOS code introduces the Optimized Multicast Flooding (OMF) design. OMF provides a more efficient approach, which only requires two L2 entries per bridge-domain (*, BD), one for IPv4 traffic and the other for IPv6 traffic, for source-only forwarding for all groups.

Compared to using one L2 entry per mroute (per VLAN), this saves L2 forwarding table space and avoids temporary flooding (before software installs the L2 entry, as with PFC/DFC3 and earlier).

Note: Optimized Multicast Flooding (OMF) is enabled as long as snooping is enabled.

Now (coupled with the new IP-based L2 forwarding lookup process), whenever a source IP host sends multicast traffic into a bridge-domain with zero receivers (resulting in FIB miss), the forwarding-engine will return the OMF index.

The result LTL index of an IPv4 OMF entry contains only the list of multicast router ports discovered by IGMP snooping. Similarly, when MLD snooping is enabled, an IPv6 OMF entry is inserted into the L2 forwarding table.

6513E.SUP2T.SA.1#show mac address-table multicast vlan 148
vlan mac/ip address LTL ports
+----+-----------------------------------------+------+----------------------
148 ( *,239.1.124.1) 0x912 Router Gi1/48
148 IPv4 OMF 0x910 Router
6513E.SUP2T.SA.1#
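The fallback behavior can be sketched as a table lookup with a per-VLAN default. This is a simplified model under stated assumptions: the dictionary layout, key names, and function name are hypothetical, and only the VLAN, group, and LTL values are taken from the output above.

```python
OMF_IPV4 = "IPv4 OMF"

# Modeled L2 multicast table: (vlan, group or OMF key) -> LTL index.
l2_table = {
    (148, "239.1.124.1"): 0x912,  # group with a learned receiver
    (148, OMF_IPV4): 0x910,       # OMF entry: multicast router ports only
}


def lookup_ltl(table, vlan, group):
    """Return the LTL index for a group, falling back to the VLAN's OMF entry."""
    hit = table.get((vlan, group))
    if hit is not None:
        return hit
    # Lookup miss: use the OMF entry. None means no OMF entry exists,
    # which this sketch does not model further.
    return table.get((vlan, OMF_IPV4))


print(hex(lookup_ltl(l2_table, 148, "239.1.124.1")))
print(hex(lookup_ltl(l2_table, 148, "239.9.9.9")))
```

The second lookup uses a group with no receivers, so the result is the OMF index instead of a flood to every port in the VLAN.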

How Does That Help You?

The new L2-specific Optimized Multicast Flooding (OMF) source-only design provides two main enhancements:

Only one (IPv4/IPv6) OMF source-only entry is necessary (per bridge-domain), which dramatically reduces the total number of L2 forwarding table entries.

It does not require special hardware and software interaction (for periodic source-only learning), and thus eliminates unnecessary multicast flooding in bridge domains with no local receivers.

Figure 28. Basic Source-Only Process

This will conserve precious L2 forwarding resources, allowing a significant increase in multicast scalability. It also simultaneously protects your multicast control-plane from increased utilization or other related problems.

Multicast VPN (MVPN) Egress-Replication Support

Save overall switch fabric bandwidth when forwarding MVPN/eMVPN.

Yesterday’s Challenges

With the proliferation of virtualized, MPLS-based IP networking, users also want to transmit their multicast data traffic within their VPNs.

The original implementations used multicast over unicast (point-to-point) GRE IP tunnels. However, this did not scale well, as it requires a full-mesh design between remote locations. Also, it was counterintuitive to the purpose of IP Multicast, specifically, distribution trees.
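The scaling problem follows from simple arithmetic: a full mesh of point-to-point tunnels between n sites needs n(n-1)/2 tunnels. The sketch below shows how quickly that grows. The function name and the site counts are illustrative only.

```python
def full_mesh_tunnels(sites):
    """Number of point-to-point tunnels needed to fully mesh the given sites."""
    if sites < 0:
        raise ValueError("sites must be non-negative")
    return sites * (sites - 1) // 2


for n in (5, 10, 50, 100):
    print(f"{n} sites -> {full_mesh_tunnels(n)} tunnels")
```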

Figure 29. P2P Tunnel Scale Problem

This required a new multicast-specific VPN solution, which combined all of the benefits of Virtualized IP networking with the fundamental tenets of multicast forwarding.

What was needed was an innovative forwarding infrastructure that would use the existing VPN technology, but then also build receiver-solicited multicast distribution-trees. That new technology would become known as Multicast VPN (or MVPN).

MVPN Intranet support was introduced on the PFC/DFC3B. It provides a way to build a PIM-based multicast routing topology between remote locations within the same (isolated) Virtual Routing and Forwarding (VRF) VPNs.

MVPN uses special GRE tunnels, called Multicast Distribution Trees (or MDTs). These are built between participating Provider Edge (PE) routers, to set up the PIM control-plane and distribute multicast traffic to interested receivers.

Pro Tip: [Learn more about MVPN Intranet http://www.cisco.com/en/US/docs/ios/ipmulti/configuration/guide/imc_cfg_mc_vpn_ps6017_TSD_Products_Configuration_Guide_Chapter.html]

As MVPN popularity grew, users also wanted to include receivers within different VPNs. Since VPNs are technically isolated from one another, this new variant was called Extranet MVPN (or EMVPN).

EMVPN uses special static mroute entries to represent the link between the “Intranet” VRF and the “Extranet” VRF(s). This allows the RPF checks to succeed between otherwise separate VPNs.

Figure 30. Basic MVPN/EMVPN Overview

Pro Tip: [Learn more about MVPN Extranet http://www.cisco.com/en/US/docs/ios/ipmulti/configuration/guide/imc_mc_vpn_extranet_ps6017_TSD_Products_Configuration_Guide_Chapter.html]

Both of these multicast VPN technologies are widely deployed on Catalyst 6500s in both enterprise and service provider networks today. However, the technology has one notable limitation that affects scalability and performance.

With the PFC/DFC3 architecture, because the MVPN/EMVPN process requires encapsulation and de-encapsulation (encap/decap) of GRE headers, it is necessary to perform all of the replications on the ingress replication engine ASIC. This meant that configuring MVPN forced the entire system to operate in ingress replication-mode.

MVPN requires two PFC/DFC3 recirculations (to perform encap/decap), since multicast frames must first be completely rewritten (using FIB lookup). Only after recirculation and rewrite can the frames be replicated (to the native VPN, and then once for each MVPN). This is not possible in PFC/DFC3-based egress replication mode.

Pro Tip: [Learn more about MVPN and Ingress Replication mode http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/mvpn.html - wp1090931]

Is that really bad? No. The challenge is not really with MVPN itself, but rather (due to the PFC/DFC3 architecture) that MVPN requires the use of Ingress Replication Mode.

There are plenty of valid uses for ingress replication mode, and if overall switch fabric bandwidth and MET scalability are not a concern, then this model works very well.

Hence, the limitations are really the same as those known for ingress replication mode. The full burden of multicast (packet) replication is performed by the ingress replication engine ASIC. Thus, multiple packets cross the switch fabric, and the MET must be symmetrical across the system.

In a small to medium-scaled multicast network, this presents no problem. However, in a highly scaled system, it can lead to oversubscription of various fabric channels.

Note: These ingress mode limitations are the reasons why the egress replication mode and egress-local features were created.

Figure 31. Ingress Replication

Furthermore, because the replication mode is a system-wide setting, the use of MVPN forces non-MVPN multicast traffic flows to also operate in ingress replication mode, even if those flows have nothing to do with MVPN.

Today’s Solutions

The new Supervisor 2T and PFC/DFC4 implement a new MFIB-based (forwarding) and EDC-based (replication) infrastructure. This new hardware infrastructure allows the MVPN and EMVPN architecture to be performed in egress replication mode.

Egress replication mode distributes the burden of packet replication. The ingress replication engine ASIC only needs to replicate frames for any local OIFs. Then, it makes a single (additional) packet replication to be sent to an internal bridge domain that all egress-capable modules are attached to. The switch fabric then replicates this single multicast frame to each of the egress replication engine ASICs, which will finally perform replications for any local OIFs.

This dramatically reduces traffic across the switch fabric, and allows the MET to be different (or asymmetric) on each module. This makes the system more scalable than using ingress replication where the number of OIFs is limited by the smallest MET table in the system.
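The difference in fabric load can be illustrated with a rough counting model. It is a sketch under simplifying assumptions, not a description of the ASIC behavior: in ingress mode the ingress engine is assumed to send one copy across the fabric for every OIF on a remote module, while in egress mode it sends a single copy into the fabric, which the fabric then fans out to the egress engines. The function name, module numbers, and OIF counts are hypothetical.

```python
def copies_sent_into_fabric(oifs_per_module, ingress_module, mode):
    """Count copies the ingress replication engine sends into the fabric."""
    remote = {
        module: count
        for module, count in oifs_per_module.items()
        if module != ingress_module and count > 0
    }
    if mode == "ingress":
        return sum(remote.values())   # one copy per remote OIF
    if mode == "egress":
        return 1 if remote else 0     # one copy, replicated by the fabric
    raise ValueError(f"unknown replication mode: {mode}")


# Hypothetical flow: source on module 1, OIFs spread over three modules.
oifs = {1: 4, 2: 10, 3: 6}
print(copies_sent_into_fabric(oifs, 1, "ingress"))
print(copies_sent_into_fabric(oifs, 1, "egress"))
```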

Note: In order to have fully asymmetric MET tables, each module must have a DFC. If a module does not have a DFC, its MET is synchronized with the MET on the Supervisor.

Pro Tip: [Learn more about egress replication mode http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/mcastv4.html - wp1076728]

There are important innovations, which now allow MVPN to work in egress replication mode:

Unlike the earlier MVPN implementation, the EDC Egress Replication Model performs replication before the packet is rewritten.

The PFC/DFC4 IFE/OFE processing, along with the new MFIB infrastructure, no longer requires multiple recirculations for packet encapsulation/de-encapsulation.

Pro Tip: [Learn more about PFC/DFC4-based IFE/OFE processing. URL to Sup 2T Architecture white paper]

Figure 32. Egress Replication

Note: This new capability will be enabled automatically, and will no longer force the replication mode to be ingress.

How Does That Help You?

The ability to use MVPN and EMVPN in egress replication mode provides the following scalability and performance benefits:

Less processing burden on the ingress replication engine ASIC hardware

Less multicast traffic must cross the switch fabric, increasing scalability and reducing congestion on individual fabric ASIC hardware

Allows the MET programming to be asymmetric, so the total number of MET entries in the system becomes the sum of all of the METs in the system (64 K * N modules)

Support for 8 PIM-BIDIR Hardware RPF entries

Define eight simultaneous RPs in the hardware.

Yesterday’s Challenges

Bidirectional PIM (or PIM-BIDIR) is a special PIM forwarding paradigm, which is very different from other PIM modes. PIM-BIDIR is an entirely RP (or shared path)-based (*,G) distribution model, with absolutely no reliance or knowledge of individual sources. Hence, this is ideal for many-to-many applications, with a very large number of sources.

As with all PIM modes, many of the same basic multicast-routing principles apply (for example, IP mroutes with Incoming Interface (IIF) and Outgoing Interfaces (OIFs), RPF checks, etc). However, Bidirectional PIM is appropriately named, because it uses an entirely unique forwarding model that allows traffic to flow bidirectionally against the RPF.

PIM-BIDIR is also unique because it is not data-driven (meaning that IP mroute state maintenance is not based on the presence or absence of IP Multicast sources). Instead, it uses a pre-built forwarding topology rooted at the RP. Hence, all forwarding occurs along the path to and from the PIM-BIDIR RP.

PIM-BIDIR accomplishes this by establishing a predefined distribution tree to and from the RP, based on the best unicast routing metrics. During initialization, each PIM-enabled interface (along the best path to the RP) undergoes a designated forwarder (DF) election, and the interface with the best metrics becomes the DF.

Figure 33. Basic PIM-BIDIR RP-Based Distribution

A single DF exists (for each RP) on every link within a PIM-BIDIR domain. A DF for a multicast group is in charge of forwarding downstream traffic onto the link, as well as forwarding upstream traffic from the link towards the RP. It performs these tasks for all the bidirectional groups served by the RP. The DF on a link is also responsible for processing IGMP joins from local receivers, and originating PIM join messages towards the RP.

Note: A PIM-BIDIR DF knows if it is in the return, or bidirectional, path towards the RP. It can directly forward to receivers (known as proxy forwarding), while also forwarding the traffic upstream to the RP (for other remote receivers).

This eliminates the need for PIM-SM source registration, and per-subnet DR elections, which allows for a highly predictable, scalable forwarding model. As long as all PIM routers know the IP address of the RP (for each range of group IP addresses), and how best to reach that RP, the PIM-BIDIR design is perfect for networks with a large number of sources.

Note: An RP mapping is either a static or dynamic association between a PIM interface’s IP address, and a subset of class D multicast group IP addresses, usually defined using an ACL.

Pro Tip: [Learn more about Bidirectional PIM http://www.cisco.com/en/US/docs/ios/12_2/ip/configuration/guide/1cfbipim.html]

For a hardware-based multi-layer switching platform, this requires prior knowledge and storage of the PIM-BIDIR RP mapping configuration in the hardware. The PFC/DFC3 hardware is capable of storing 4 different PIM-BIDIR RP entries (called RPDF entries because they represent the DF path to the relevant RP). Thus, each system can only support 4 different PIM-BIDIR RP mappings.

Figure 34. PFC/DFC-Based PIM-BIDIR RPDF Entries

Is that really bad? No. The challenge is simply that you can only define these four RP mappings, which restricts the possible combinations that your IP Multicast network can use. This effectively limits the number of PIM-BIDIR distribution trees that each system is capable of handling to four.

Note: More than four RP mappings can actually be configured in the IOS software, but only four will be active in the PFC/DFC3 hardware. The fifth or subsequent RP mappings are called “zombie” entries, and will be installed if one of the existing four mappings in the hardware becomes disabled.

Today’s Solutions

With the new Supervisor 2T and PFC/DFC4, the hardware can now store eight unique PIM-BIDIR “RPDF” entries, allowing for eight different RP mappings.

Note: There are two types of hardware forwarding entries relevant to PIM-BIDIR: (*,G) and (*,G/m).

The PIM-BIDIR (*,G) hardware entry is very similar to the common PIM sparse (*,G) entry. The notable difference between the two modes is the forwarding check: RPF (sparse) compared to DF (Bidir).

The (*, G/m) entry is used in PIM-BIDIR to forward traffic to the RP, in source-only networks. In such networks, the (*,G) entry does not even exist. This is the true mask-based (through RP mapping) entry that maps all forwarding to the RP.

Therefore, the (*, G/m) entry is created in hardware simply to forward source traffic to the RP, and to avoid punting the traffic to the CPU (for software forwarding).

Note: On the RP itself, the (*, G/m) entry is actually a “drop” entry.

The packet flow and table programming of the (*,G/m) entry is very similar to the (*,G) entry. Following are the differences between the two entries.

The MFIB TCAM lookup for the (*,G/m) will use the mask “m” for IP_DA, compared to the exact match for (*,G) entry.

The (*,G/m) entry has only the RP’s RPF interface as an OIF, while the (*,G) entry can also have other DF interfaces in the OIF list (in addition to the RP’s RPF).

Therefore, the OIFs in the MET will also reflect this difference.
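The lookup difference between the two entry types can be sketched in software terms. This only illustrates exact match compared to masked match on the destination (group) address. The function names and the 239.1.0.0/16 group range are hypothetical and do not represent the TCAM format.

```python
import ipaddress


def matches_star_g(ip_da, group):
    """(*,G): the destination address must match the group exactly."""
    return ipaddress.IPv4Address(ip_da) == ipaddress.IPv4Address(group)


def matches_star_g_m(ip_da, group_range):
    """(*,G/m): the destination address must fall inside the masked range."""
    return ipaddress.IPv4Address(ip_da) in ipaddress.IPv4Network(group_range)


# Hypothetical RP mapping that covers 239.1.0.0/16.
print(matches_star_g("239.1.5.5", "239.1.1.1"))
print(matches_star_g_m("239.1.5.5", "239.1.0.0/16"))
```

A packet for 239.1.5.5 misses the exact (*,G) entry for 239.1.1.1 but still hits the masked entry, which is what keeps source-only traffic in hardware on its way to the RP.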

The differences between the (*,G/m) and (*,G), in terms of hardware packet flow and table programming, are very minimal.

For the (*,G/m) entry, a special “s_star_priority” field is set in the FIB DRAM, giving higher priority to the (*, G/m) entry than the (S/m, *) entry.

Note: The new limit of eight RP mappings comes from the fact that the DF_MASK field (of the PFC/DFC4 FIB DRAM) is eight bits wide, and each bit represents one RP.
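A short sketch shows why an eight-bit field caps the number of RPs at eight. The interpretation used here, where bit i is set when an interface is the DF for the RP with index i, is an assumption made for illustration. The constant and function names are hypothetical.

```python
DF_MASK_BITS = 8  # width of the DF_MASK field stated in the note above


def set_df(df_mask, rp_index):
    """Return df_mask with the bit for rp_index set."""
    if not 0 <= rp_index < DF_MASK_BITS:
        raise ValueError(f"RP index {rp_index} does not fit in {DF_MASK_BITS} bits")
    return df_mask | (1 << rp_index)


def is_df(df_mask, rp_index):
    """Return True if the bit for rp_index is set in df_mask."""
    return bool(df_mask & (1 << rp_index))


mask = 0
for index in (0, 3, 7):
    mask = set_df(mask, index)
print(f"{mask:08b}", is_df(mask, 3), is_df(mask, 4))
```

An index of 8 or higher has no bit to occupy, so a ninth RP cannot be represented in the field.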

In a VPN environment, since the VLANs are locally significant, the same RP index will be shared between PIM-BIDIR RPs in the different VPNs. This means that we can actually use eight RPs per VPN (the global table is VPN 0).

How Does That Help You?

The new PFC/DFC4 hardware, combined with the new MFIB hardware infrastructure, allows you to configure eight separate RP mappings, increasing the redundancy and scalability of your PIM-BIDIR multicast network.

PIM-BIDIR enables you to create a highly scaled, many-to-many IP Multicast network, capable of supporting a virtually infinite number of source IP hosts, with minimal administrative impact on the system.

This also allows you to improve your network redundancy and scalability, by creating up to eight separate PIM-BIDIR scopes. Each different scope can service a different network area, or can create special forwarding paths to optimize the (*,G) RP-based distribution trees.

IPv6 Multicast (*,G) and (S,G) entries in FIB TCAM

Improve IPv6 hardware-based forwarding, and decrease latency.

Yesterday’s Challenges

IPv6 multicast L2/L3 switching was first introduced on the Catalyst 6500, with the PFC/DFC3 hardware and 12.2(18)SXE IOS software. This allowed IPv6 multicast to operate in hardware, but the larger IP addressing scheme introduces many challenges for hardware-based forwarding.

An IPv6 multicast address is 128 bits long, with a special prefix of FF00::/8 (or 11111111 in binary). This is the equivalent of 224.0.0.0/4 (or “Class D” address space) in IPv4. The second octet, immediately following the multicast prefix, defines the lifetime and scope of the multicast address.

A “permanent” multicast address has a lifetime parameter equal to 0, while a “temporary” multicast address has a lifetime parameter equal to 1. An IPv6 multicast address with a “Node”, “Link”, “Site”, “Organization” or “Global” scope has a scope parameter of 1, 2, 5, 8, or E, which yields (for permanent addresses) the prefixes FF01, FF02, FF05, FF08, or FF0E, respectively.
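The lifetime and scope fields can be read directly from the second octet of the address. The following sketch uses the Python standard-library ipaddress module; the describe_multicast function name is illustrative, and only the five scopes named above are given labels.

```python
import ipaddress

# Scope values named in the text; other values are reported numerically.
SCOPES = {0x1: "Node", 0x2: "Link", 0x5: "Site", 0x8: "Organization", 0xE: "Global"}


def describe_multicast(address):
    """Return (lifetime, scope) for an IPv6 multicast address."""
    addr = ipaddress.IPv6Address(address)
    if not addr.is_multicast:
        raise ValueError(f"{address} is not in FF00::/8")
    second_octet = addr.packed[1]
    flags = second_octet >> 4    # high nibble: flags (lifetime is the low flag bit)
    scope = second_octet & 0x0F  # low nibble: scope
    lifetime = "temporary" if flags & 0x1 else "permanent"
    return lifetime, SCOPES.get(scope, f"scope {scope:#x}")


all_nodes = describe_multicast("FF02::1")  # the "All-nodes" group
```

For the “All-nodes” group FF02::1, the second octet is 0x02, so the lifetime is permanent and the scope is Link.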

Figure 35. 128-Bit IPv6 Multicast Address

Note: All IPv6 nodes (hosts and routers) are required to join, or receive packets destined for, the following multicast groups:

“All-nodes” multicast group FF02:0:0:0:0:0:0:1 (scope is Link-Local, like the IPv4 address 224.0.0.1).

“All-routers” multicast group FF02:0:0:0:0:0:0:2 (scope is Link-Local, like the IPv4 address 224.0.0.2).

“Solicited-node” multicast group FF02:0:0:0:0:1:FF00:0000/104, used for each of its assigned unicast and anycast addresses

The solicited-node multicast address is a multicast group that corresponds to an IPv6 unicast or anycast address. The IPv6 solicited-node multicast address has the prefix FF02:0:0:0:0:1:FF00:0000/104 concatenated with the 24 low-order bits of a corresponding IPv6 unicast or anycast address.

For example, the solicited-node multicast address corresponding to the IPv6 address 2037::01:800:200E:8C6C is FF02::1:FF0E:8C6C. Solicited-node addresses are used in IPv6 “neighbor solicitation” messages.
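The mapping from a unicast address to its solicited-node group is a simple bit operation: keep the 24 low-order bits and place them under the /104 prefix. A minimal sketch with the Python standard-library ipaddress module (the solicited_node function name is illustrative):

```python
import ipaddress

# FF02:0:0:0:0:1:FF00:0000/104, as an integer with the low 24 bits clear.
SOLICITED_NODE_PREFIX = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(unicast):
    """Return the solicited-node multicast address for a unicast or anycast address."""
    low24 = int(ipaddress.IPv6Address(unicast)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_PREFIX | low24)


example = solicited_node("2037::01:800:200E:8C6C")
print(example)  # ff02::1:ff0e:8c6c
```

This reproduces the example above: the low-order 24 bits 0E:8C6C are appended to the FF02::1:FF prefix.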

Note: There are no broadcast addresses in IPv6. Link-local IPv6 multicast addresses are used instead of broadcast addresses.

Configuring a site-local or global IPv6 address on an interface automatically configures a link-local address and activates IPv6 for that interface. Additionally, the configured interface automatically joins the required multicast groups for that link.

IPv6 multicast routing uses the MRIB/MFIB forwarding architecture. MRIB is the L3 routing database on the MSFC5, while the MFIB is the L3/L2 forwarding infrastructure on the PFC4/DFC4. The MRIB/MFIB database mainly contains (*,G), (S,G) or (*,G/m) entries, with a list of IPv6 PIM interfaces hanging off of these entries.

The 12.2(18)SXE and later IOS software supports the following IPv6 protocols to implement IPv6 multicast routing:

Multicast Listener Discovery (MLD). MLD is a link-local protocol used by IPv6 routers to discover multicast listeners (receivers), on directly attached links. There are two supported versions of MLD:

1. MLD Version 1, which is based on Internet Group Management Protocol (IGMP) Version 2 (*,G) joins and leaves for IPv4 PIM-SM.

2. MLD Version 2, which is based on IGMP Version 3 (S,G) joins and leaves for IPv4 PIM-SSM.

Note: Cisco IOS software supports both MLD Version 2 and MLD Version 1. MLD Version 2 is fully backward-compatible with MLD Version 1. Hosts that support only MLD Version 1 will interoperate with a Cisco router running MLD Version 2.

PIM Sparse Mode (PIM-SM). This is the same basic sparse-based forwarding model as that used with IPv4. It requires an RP to be defined for any-source multicast (*,G) RP-based distribution tree forwarding. Once the last-hop PIM router learns the source IP address, it can build a shortest-path (S,G) distribution tree.

PIM Source-Specific Multicast (PIM-SSM). Similar to PIM-SM, this is based entirely on shortest-path source-based (S,G) distribution trees. This requires the last-hop PIM router to have knowledge of the receivers’ preferred source IP address. MLD version 2 is required for SSM to operate.

The PFC/DFC3 hardware supports the following IPv6 Multicast features:

RPR and RPR+ redundancy mode. This is known as COLD high-availability, because the PIM and MLD processes must be restarted, and all forwarding entries relearned, after a supervisor switch-over.

Note: SSO redundancy mode is not supported on PFC/DFC3, because the IPv6 multicast hardware entries are too large to fit into the FIB TCAM.

Multicast Listener Discovery Version 2 (MLDv2) Snooping. This supports source-specific MLD joins and leaves (for PIM-SSM), on L2 IPv6 subnets.

Note: MLDv1 Snooping, which provides backwards-compatibility for any MLDv1 hosts, is not supported on PFC/DFC3.

IPv6 Multicast Hardware Rate Limiters. This provides basic packet matching and threshold-based drop capabilities for IPv6 multicast frames, to minimize the processing impact on the CPU and DRAM.

IPv6 Multicast Bootstrap Router (BSR). This provides a dynamic protocol to distribute IPv6 PIM Rendezvous Point (RP) group mappings to IPv6 PIM routers.

PIM-SSM Mapping for IPv6. The IPv6 SSM address range is “FF3X::/32”, where X represents the scope bits.

IPv6 Access Services (DHCPv6, ICMPv6 and Neighbor Discovery). These are link-local multicast functions, used for both unicast and multicast IPv6 routing.

Pro Tip: [Learn more about PFC/DFC3-based IPv6 Multicast http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/mcastv6.html]

Now, the primary challenge with hardware-based IPv6 multicast forwarding is the sheer size of the necessary hardware entry. Since each unique IPv6 address is 128 bits long, and IP Multicast is represented in hardware as (Source IP, Group IP), this requires 256+ bits of address space. All PFC/DFC FIB “TCAM” (Ternary [3-part] Content Addressable Memory) hardware entries are (KEY + ADDRESS + MASK). In the case of IP Multicast, it is (KEY + S,G + MASK).

Note: For IPv4, this allocation design requires 144 bits of TCAM memory space. For IPv6, this requires 288 bits of TCAM memory space.

The PFC/DFC3 FIB TCAM could not accommodate 288 bits in a single entry, so all IPv6 multicast HW entries are programmed into the NetFlow TCAM instead. This approach is functional, but due to its larger memory size and programming/searching process, the NetFlow TCAM is slower and the lookup results must be collated afterwards, which requires additional processing time.

Is that really bad? No. By definition, this is still considered hardware-based IPv6 multicast forwarding, since (after NetFlow-based shortcut installation) subsequent forwarding decisions are performed by the PFC/DFC, with no CPU-based processing. Hence, this design is still fundamentally faster and more scalable than software-based forwarding.

However, the NetFlow-based IPv6 multicast forwarding decisions are notably slower than their FIB-based IPv4 multicast equivalent. This means a slightly slower multicast routing (distribution tree) setup rate, when the traffic flows are first established.

With the PFC/DFC3, the maximum IPv6 forwarding lookup-rate is ~20 Mpps for centralized (PFC-based) and ~24 Mpps for distributed (DFC-based), with ~10-20 µs latency.

Today’s Solutions

With the new Supervisor 2T and PFC/DFC4, the new FIB TCAM table size is now 288 bits wide, specifically to support IPv6 multicast routing. This allows the IPv6 MFIB infrastructure to install full (KEY + S,G + MASK) information, into the TCAM memory.

The PFC/DFC4 FIB TCAM can support either 512 K (non-XL) or 1 M (XL) entries, but the IOS software limits the IP Multicast allocation to 128 K or 256 K respectively. Similar to FIB TCAM allocation, the adjacency table can support either 512 K or 1 M entries, divided into a statistics and non-statistics region.

Figure 36. IPv6 MFIB Lookup Process

Multicast adjacencies (both flow and replication adjacencies) are allocated from the statistics region. The statistics region is further divided into several reserved regions for various applications such as Control Plane Policing (CoPP), IFE/OFE, replication, and more.

Among the 512 K statistics adjacencies (for the 1 M table size), ~454 K adjacencies are available for dynamic allocation. The IPv6 multicast adjacencies include both flow adjacencies (adjacency index in the FIB DRAM) and replication adjacencies (adjacency index in the MET entries).

Note: In the Egress Replication Mode case, two adjacencies have to be allocated for every flow adjacency, one for PI and one for non-PI.

With the PFC/DFC4, the maximum IPv6 forwarding lookup-rate is ~30 Mpps for centralized (PFC-based) and distributed (DFC-based), with ~6-10 µs latency.

How Does That Help You?

The ability to process IPv6 multicast in the FIB TCAM provides much faster L2/L3 IPv6 forwarding lookups. This helps enable IPv6 multicast traffic paths to be established faster, which allows the traffic to be delivered sooner.

IP Multicast is used to distribute mission-critical IP data to multiple hosts simultaneously. Thus, one of the most important aspects of IP Multicast forwarding is the speed at which the flow is established between multicast sources and receivers.

This new hardware capability raises the maximum IPv6 Multicast forwarding lookup rate to ~30 Mpps (up to 50 percent faster than the ~20 Mpps centralized PFC3 rate), and also provides the basis for the unified IPv4 and IPv6 MFIB-based forwarding infrastructure.

Enhanced Multicast HA Using New Infrastructure

High availability, built on top of the new infrastructure, optimizes stateful switchover.

Yesterday’s Challenges

The notion of high availability (HA) was first introduced with redundant supervisor engines. One supervisor is elected the “active”, while the other supervisor becomes the “standby”. During operation, the active supervisor will then synchronize (all or some of) its L2/L3 forwarding information to the standby supervisor.

This provided a mechanism to switchover from a failed active supervisor to the waiting standby supervisor. Of course, due to the complexity of synchronization, the exact extent of what information (and how much) can be synchronized depends on which hardware and software is operational.

With the PFC/DFC3 architecture, the following HA operating modes are supported:

Route Processor Redundancy (RPR). This HA mode is considered “COLD”, and provides basic synchronization of the IOS image and configuration. If the active supervisor fails, the standby must boot, then reinitialize all IOS subsystems (using the synchronized IOS image and configuration), and then relearn all forwarding entries.

Route Processor Redundancy Plus (RPR+). This HA mode is considered “WARM”, and (beyond RPR) the standby is already booted, with all IOS subsystems in a passive (non-operational) mode. If the active supervisor fails, the standby must simply reinitialize the IOS subsystems and relearn the forwarding entries.

Stateful Switch-Over (SSO). This HA mode is considered “HOT”, and (beyond RPR+) the standby is completely booted, with all IOS subsystems in a semi-passive mode. All hardware forwarding entries and state-related software information are continuously synchronized. If the active supervisor fails, the standby supervisor is fully capable of forwarding (after changing from standby to active mode).

The L3 routing “Non-Stop Forwarding” (NSF) feature is normally associated with SSO, on hardware-based platforms. NSF leverages the SSO hardware infrastructure, and if an active supervisor fails, it does not report a routing topology change. This is only possible because SSO allows the standby supervisor to continue transmitting frames, using the old forwarding entries.

Note: NSF and SSO are technically separate features, and NSF is not a requirement for SSO operation. However, if a L3 routing adjacency fails, the neighbor will remove its routing information from its database. This process negates the hardware forwarding entries, and thus NSF and SSO are usually used together.

With 12.2(18)SXE and later IOS software, the default redundancy mode is SSO. This is the default HA mode for both stand-alone (or single chassis, dual supervisors) Catalyst 6500, as well as Virtual Switching System (VSS). Multicast HA support is enabled by default, if "ip multicast-routing" is configured.

Pro Tip: [Learn more about PFC/DFC3-based SSO/NSF http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/nsfsso.html]

IP Multicast HA support is different than HA support for other routing protocols, because multicast routing (mroute) state is dynamic (or data-driven), and depends on the presence of sources and receivers.

For IP Multicast, there are several important components that are necessary to support SSO in hardware. These components are consistent with the normal mode of operation, and can be divided into control-plane and data-plane functions.

Multicast NSF/SSO. This helps ensure that all necessary information, such as RP mapping information, mroute (*,G) and (S,G) state, and hardware forwarding entries (for example, multicast FIB and adjacency, MET and LTL/FPOE indices) are synchronized (or “check-pointed”) between supervisors (MSFC and PFC).

This allows multicast data traffic to continue over the same physical path through previously learned forwarding entries, while the routing control-plane reconverges.

Multicast HA Checkpoints:

Group-to-RP mappings dynamically learned from either Auto-RP or BSR (IPv4 only)

PIM Bi-dir designated forwarder (DF) information and PIM Bi-dir RP route information (IPv4 only)

Multicast Call Admission Control (MCAC) reservation information (IPv6 only)

Multicast VPN (MVPN) and MVPN Extranet MDT tunnel information (IPv4 only)

PIM Register tunnel information (IPv6 only)

Multicast forwarding state, created by data-driven events (IPv4/IPv6).

Figure 37. Basic Multicast HA

Note: PIM is technically not NSF/SSO aware. However, the default PIM query-interval (hello) of 30 seconds (3 * hello = 90 second hold time) allows the entire NSF/SSO process to complete, without losing the PIM neighbor relationship. Hence, it is recommended to leave the default PIM query-interval in multicast HA environments.

PIM Triggered Joins. This feature is used to trigger adjacent PIM neighbors on a PIM interface to send PIM join messages for all (*, G) and (S, G) mroutes that use the interface as a Reverse Path Forwarding (RPF) interface.

PIM hellos now have a new Generation ID (GenID) value (defined in RFC 4601), which changes after a switchover. PIM neighbors that receive a changed GenID will then trigger new PIM join messages for all mroutes associated with that interface.

GenID. A GenID is a randomly generated 32-bit value that is regenerated each time PIM forwarding is started or restarted on an interface. In order to process the GenID value in PIM hello messages, PIM neighbors must be running Cisco IOS software that is compliant with RFC 4601.
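The GenID mechanism can be sketched from the point of view of a neighboring PIM router. This is an illustration only, not Cisco IOS code: the PimNeighborState class and its method names are hypothetical, and the mroute tuples are placeholder values.

```python
import random


def new_gen_id():
    """Generate a random 32-bit GenID, as done when PIM (re)starts on an interface."""
    return random.getrandbits(32)


class PimNeighborState:
    """Hypothetical record that one router keeps about a PIM neighbor."""

    def __init__(self):
        self.last_gen_id = None

    def on_hello(self, gen_id, mroutes_via_neighbor):
        """Return the mroutes that need a triggered join after this hello."""
        # The first hello only records the GenID; a later hello carrying a
        # different GenID means the neighbor restarted (or switched over).
        changed = self.last_gen_id is not None and gen_id != self.last_gen_id
        self.last_gen_id = gen_id
        return list(mroutes_via_neighbor) if changed else []


neighbor = PimNeighborState()
mroutes = [("*", "239.1.1.1"), ("10.1.1.1", "232.1.1.1")]
neighbor.on_hello(0x1234, mroutes)             # first hello: nothing to do
neighbor.on_hello(0x1234, mroutes)             # same GenID: nothing to do
triggered = neighbor.on_hello(0x9999, mroutes)  # new GenID: rejoin all mroutes
```

The key design point is that the neighbor needs no knowledge of the switchover itself; a changed GenID in an otherwise ordinary hello is sufficient to trigger the joins.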

Figure 38. Basic Multicast HA

When a supervisor switchover event occurs, PIM triggered joins (using GenID) help ensure that all PIM routing state information (known by downstream neighbors) will be refreshed. Meanwhile, multicast SSO helps ensure that the hardware forwarding entries remain intact, so that all previously installed traffic flows can continue to transmit. As the control-plane re-converges, new hardware entries are installed and stale entries are removed.

When a Standby Supervisor is first installed, or during system bootup, Multicast HA software performs a “Bulk Synchronization” of information corresponding to events that modify the multicast forwarding state. During normal operation (steady state), the software performs “Periodic Synchronization” updates, triggered by events that cause internal database changes to the multicast forwarding state (for example, RPF change, new RP Mapping, new Multicast forwarding entries, and more).
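The relationship between bulk and periodic synchronization can be modeled with a few lines of code. This is a simplified sketch under stated assumptions: the ActiveSupervisor and StandbyMirror classes are hypothetical, the state is reduced to a dictionary keyed by (source, group), and real check-pointing covers far more than this.

```python
class StandbyMirror:
    """Hypothetical standby-side copy of the multicast forwarding state."""

    def __init__(self):
        self.state = {}

    def bulk_sync(self, snapshot):
        # Bulk Synchronization: replace everything with a full snapshot.
        self.state = dict(snapshot)

    def apply_update(self, key, value):
        # Periodic Synchronization: apply one event-driven change.
        if value is None:
            self.state.pop(key, None)
        else:
            self.state[key] = value


class ActiveSupervisor:
    """Hypothetical active-side owner of the multicast forwarding state."""

    def __init__(self):
        self.state = {}
        self.standby = None

    def attach_standby(self, standby):
        # A newly installed (or newly booted) standby gets a bulk sync first.
        self.standby = standby
        standby.bulk_sync(self.state)

    def change(self, key, value):
        # Any database change (RPF change, new entry, removal) is relayed.
        if value is None:
            self.state.pop(key, None)
        else:
            self.state[key] = value
        if self.standby is not None:
            self.standby.apply_update(key, value)


active = ActiveSupervisor()
active.change(("*", "239.1.1.1"), "rpf=Te1/1")          # learned before standby exists
standby = StandbyMirror()
active.attach_standby(standby)                          # bulk synchronization
active.change(("10.1.1.1", "239.1.1.1"), "rpf=Te1/2")   # periodic update
```

After both steps the standby holds the same entries as the active supervisor, which is the precondition for forwarding to continue after a switchover.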

Pro Tip: [Learn more about PFC/DFC3-based Multicast SSO/NSF http://www.cisco.com/en/US/docs/ios/ipmulti/configuration/guide/imc_high_availability.html]

Hence, PFC/DFC3-based Multicast SSO/NSF provides switchover recovery in ~3-6 seconds, for most configurations. However, due to the overall complexity of Multicast HA, several notable limitations apply:

No NSF/SSO support for IGMP/MLD/PIM Snooping. This is the L2-specific state-related information, necessary to forward frames within a given VLAN. This can lead to temporary forwarding loops, or a large Join/Leave delay.

No NSF/SSO support for IPv6 Multicast. This is because of the PFC/DFC3-based hybrid hardware and software interaction. IPv6 Multicast HA is supported in RPR mode.

Note: In addition to known limits, there are also several less-specific (but still notable) caveats associated with PFC/DFC3-based Multicast HA.

The Supervisor 720 and MSFC3 hardware architecture supports a separated Switch Processor (SP) and Route Processor (RP) model, which separates the L2 “Switching” functions from the L3 “Routing” functions. This model combines the (previously) physically separate “switches” and “routers” into a single L2/L3+ forwarding platform.

This split architecture model has its processing advantages, but also requires more communication between SP and RP CPUs to properly learn forwarding information and provide correct forwarding lookup results. The more forwarding load that is required, the more this inter-processor communication increases.

The current IPv4 hardware infrastructure (known as Multicast Multi-Layer Switching or MMLS) is a tightly coupled combination of L2/L3 hardware and software components. Under considerable load (for example, a fully populated system, with many PIM neighbors and tens of thousands of IP mroutes), the CPU may become overwhelmed.

If the CPU is very busy (or overwhelmed), it is possible for the MMLS software to miss IP mroute state changes, or even fail to properly update hardware forwarding entries. This can result in temporary packet forwarding loops, or result in temporary packet loss, until the entire multicast topology has reconverged fully.

Finally, because the Supervisor 720 and Supervisor 720-10GE integrate the Switch Fabric functionality, a Supervisor engine switchover will force a fabric switchover as well. During the fabric switchover, data will be lost for a minimal period of between 0.5 seconds and 1.5 seconds.

When a Supervisor 720 or Supervisor 720-10GE is installed in a Catalyst 6500 “E-Series Chassis” (such as the WS-C6513-E), IOS release 12.2(33)SXH and later incorporates a new fabric switchover mechanism, called Enhanced Hot-Standby Fabric Switchover, to reduce the data loss to a period of between 50 milliseconds and 0.5 seconds for feature-capable modules.

Is that really bad? Yes and no. In a small to medium-size multicast environment, the current PFC/DFC3-based Multicast HA design is fully capable of SSO, and traffic will re-converge fast enough that it will go essentially unnoticed by most IP Multicast applications.

However, in a highly scaled multicast environment, or in a large L2 or IPv6 multicast environment, these caveats may result in delayed reconvergence. This long convergence period may result in temporary packet duplication or loss, which will have a negative impact on IP Multicast applications.

Today’s Solutions

All the fundamental Multicast HA functionality (SSO/NSF, PIM Triggered Joins, and more) and related operational behavior remain the same with the Supervisor 2T.

Because IP Multicast forwarding is inherently data-driven, there will always be some interaction between the software tables and hardware tables. The new Supervisor 2T offers many enhancements to both the multicast control plane and data plane.

The new MSFC5 (control-plane) supports a new Dual-Core CPU, operating @ 1.5 GHz (per Core), and runs a new single-combined IOS image. Hence, it can perform operations twice as fast as its predecessor (MSFC3 SP/RP @ 600 MHz), and it also eliminates the need for inter-processor communications between multiple (separated) CPUs.

All of these MSFC5 enhancements reduce the load on the multicast control plane. The RP CPU can now spend more cycles processing IP Multicast control plane updates, programming hardware forwarding information, and performing SSO/NSF synchronization between Active and Standby Supervisors.

The new PFC4 (data plane) supports SSO synchronization of both IPv4 and IPv6 L2 (CAM) and L3 (FIB TCAM) forwarding entries, in the hardware. It can perform IPv4 Multicast forwarding lookups @ ~60 Mpps and IPv6 Multicast lookups @ ~30 Mpps. This is a significant increase over its predecessor (PFC3 @ ~30 Mpps for IPv4 and ~20 Mpps for IPv6).

In addition to overall increased forwarding throughput, the PFC4 supports SSO synchronization of IPv4 and IPv6 Multicast RPF interfaces, IPv4 and IPv6 (PIM-SM) register tunnels, and (PIM-BIDIR) RPDF entries. It also supports LIF/BD and LTL/FPOE port indices.

The new 2 Tbps Switch Fabric supports SSO through dedicated (back-to-back) redundant channels between the active and standby supervisors, and the new WS-X6900 (CEF2T) switching modules also support redundant channels (one to the active and one to the standby) to minimize fabric switchover time.

The Supervisor 2T and latest IOS software also support the “Enhanced Hot-Standby Fabric Switchover” HA feature. This minimizes the duration of time necessary for the SSO standby Switch Fabric to begin forwarding packets (~50 - 200 ms).

Note: These enhancements set the foundation for the unified IPv4/IPv6 MFIB-based forwarding infrastructure and EDC-based multicast replication model.

The new MFIB-based forwarding infrastructure is a platform-independent and routing-protocol-independent IOS library (API) for IP Multicast software. Its main purpose is to provide Cisco IOS with basic interface information and state notifications, which is used by the IP Multicast routing table (MRIB) software to make forwarding decisions.

It is a simplified model, in regards to how it handles multicast interface status. The software MFIB simply tracks the operational status of all multicast-enabled interfaces, and provides the MRIB with this information through simple interface flags (which follow strict semantics). The hardware MFIB only needs to maintain next-hop L2/L3 address information, based on the information from the software MRIB/MFIB database.

Distributed MFIB (dMFIB) uses a server/client model, with the MSFC5 CPU as the (software) MFIB server and the PFC/DFC4 forwarding-engines as the MFIB clients.

The dMFIB distributes a full copy of the MFIB database to all modules, and then relays data-driven protocol events (using flags) from the modules, to the MRIB/MFIB control plane. It also includes the ability to switch a multicast packet to the software (for example, to trigger a data-driven event) and upload traffic statistics.

This combination of software-based MRIB/MFIB and hardware-based MFIB forwarding infrastructure greatly reduces software and hardware processing, and provides a simplified server and client model that is more compatible with the SSO and ISSU HA infrastructure.

The new EDC-based multicast replication model is the hardware forwarding component that performs the actual multicast replication. The fundamental IP Multicast requirement is to transmit an exact copy of a given source datagram to multiple receiver hosts. To accomplish this in the hardware requires both an ASIC capable of making multiple frame copies, and a (software-managed) distribution model.

The MSFC5 and PFC/DFC4 support both the Ingress and Egress Replication Modes, but the default model is Egress Replication, in order to optimize OIF scalability and minimize switch fabric utilization. This is consistent with all earlier hardware replication principles.

The new EDC replication model uses a more consistent Egress Replication Mode programming IOS library (API), to program and synchronize Multicast Expansion Table (MET) entries and LTL/FPOE port indices.

EDC does this using the new LIF/BD capabilities to build (per IP mroute) internal egress broadcast domains, to optimize the delivery of multicast frames to the correct outgoing (egress) modules.

The EDC-based replication model also uses a server/client model, to provide a highly reliable and scalable (egress replication-mode) ASIC programming, which is architecturally more compatible with the SSO and ISSU HA infrastructure.

Note: In addition to these new and enhanced multicast HA features, the Supervisor 2T and latest IOS also provide SSO/NSF hardware support for L2 IGMP/MLD/PIM Snooping, as well as IPv6 Multicast. The PFC/DFC4 hardware is capable of supporting these entries, but the software functionality will be available in the 15.0SY IOS release.

How Does That Help You?

Building on all existing capabilities of previous generations, and adding new capabilities, the Supervisor 2T provides you with not only the highest-performing IP Multicast platform, but simultaneously, the most highly available platform on the market.

The new MSFC5 (control plane) and PFC4 (data plane) hardware, combined with the new MFIB-based forwarding infrastructure and EDC-based replication model, provides you with a whole new level of IP Multicast performance and high availability.

Figure 39. MFIB-based Multicast HA

All these capabilities are automatically enabled without any additional administrative overhead. This new level of Multicast HA will help ensure that your next-generation IP Multicast network will remain operational, even in the unfortunate event of catastrophic Supervisor engine failure.

Hardware Integration with VPLS, H-VPLS and EoMPLS

Gain built-in IP Multicast support for advanced L2 VPN network designs.

Yesterday’s Challenges

Virtual Private LAN Services (VPLS) emulate a L2 LAN over an IP/MPLS network. VPLS also allows dynamic learning of customer MAC addresses from different sites, and uses this for bridging L2 customer traffic. Advanced VPLS provides an enhanced CLI, and adds Ethernet Pseudowire (PW) support.

Note: To create an Ethernet Pseudowire, MPLS appends an additional label, called a flow label, which contains flow information for each Virtual Circuit (VC).

Figure 40. Basic VPLS Topology

Pro Tip: [Learn more about VPLS and Advanced VPLS http://www.cisco.com/en/US/docs/ios/mpls/configuration/guide/mp_l2vpn_advvanced_ps6017_TSD_Products_Configuration_Guide_Chapter.html]

Hierarchical VPLS (or H-VPLS) allows two Network-facing Provider Edge (N-PE) routers to provide redundancy services to a User-facing Provider Edge (U-PE) router, within a hierarchical VPLS network. Having redundant N-PE routers provides improved stability and reliability against link and node failures.

In the H-VPLS architecture, Ethernet Access Islands (EAIs) work in combination with a VPLS network (MPLS as the underlying transport). EAIs operate like standard Ethernet networks, between CE routers.

Traffic from any CE devices within the EAI is switched locally within the EAI by U-PE devices, along the computed spanning-tree path. Each U-PE device is connected to one or more N-PE devices using VPLS PWs. Any traffic local to the U-PE is not forwarded to the N-PE devices.

Figure 41. H-VPLS Topology

Pro Tip: [Learn more about H-VPLS http://www.cisco.com/en/US/docs/ios/mpls/configuration/guide/mp_hvpls_npe_red_ps6017_TSD_Products_Configuration_Guide_Chapter.html]

Ethernet over MPLS (EoMPLS) is another method of transporting Ethernet (802.3) protocol data over an IP/MPLS network, and established many of the fundamental L2VPN concepts. EoMPLS is actually a subset of Any Transport over MPLS (AToM), and as with VPLS, the essential L2 transport of Ethernet frames over the MPLS network is done via the Ethernet Pseudowire (PW) feature.

Pro Tip: [Learn more about EoMPLS http://www.cisco.com/en/US/docs/interfaces_modules/shared_port_adapters/configuration/6500series/76cfgeth.html - Scalable_EoMPLS]

Note: VPLS provides multipoint-to-multipoint (MP2MP) or point-to-multipoint (P2MP) services, compared to EoMPLS that provides point-to-point (P2P) service.

L2VPN domains are a set of Virtual Circuits (VCs) connecting the MPLS PEs, which interconnect remote customer CEs. These virtual circuits are defined either by manual configuration or by Targeted LDP sessions. They identify each customer (or customer VLAN) connection to the MPLS core.

In the MPLS core, these virtual circuits are switched using a specific VC label for the particular customer traffic. Traffic originating from a source CE, when transported over the MPLS core, has two labels assigned to it. The outer label is known as the Tunnel Label or IGP Label, and the inner label is known as the VC Label. The outer Tunnel Label identifies the PE-to-PE tunnel (setup between the PE routers), while the inner VC Label separates each emulated LAN.
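The two-label encapsulation can be sketched using the standard 32-bit MPLS label stack entry layout (20-bit label, 3-bit traffic class, 1-bit bottom-of-stack flag, 8-bit TTL). This is a simplified illustration, not the PFC/DFC4 rewrite logic: the function names and the label values 100 and 200 are hypothetical, and optional fields such as a control word are omitted.

```python
import struct


def label_stack_entry(label, tc=0, bottom=False, ttl=255):
    """Encode one 32-bit MPLS label stack entry in network byte order."""
    if not 0 <= label < 2**20:
        raise ValueError("MPLS labels are 20 bits wide")
    word = (label << 12) | ((tc & 0x7) << 9) | (int(bottom) << 8) | (ttl & 0xFF)
    return struct.pack("!I", word)


def l2vpn_encap(tunnel_label, vc_label, ethernet_frame):
    """Prepend the outer Tunnel Label and inner VC Label to a customer frame."""
    outer = label_stack_entry(tunnel_label, bottom=False)  # identifies the PE-to-PE tunnel
    inner = label_stack_entry(vc_label, bottom=True)       # identifies the emulated LAN / VC
    return outer + inner + ethernet_frame


# Hypothetical label values, for illustration only.
packet = l2vpn_encap(tunnel_label=100, vc_label=200, ethernet_frame=b"frame")
```

Core routers switch on the outer label only; the inner VC Label is examined by the egress PE to select the correct emulated LAN.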

Hence, these L2VPN-emulated LAN services provide the same essential behavior as a true Ethernet-based LAN, such as MAC-based address learning and L2 switching. This virtualized networking capability allows both L2-based unicast and multicast forwarding to operate over an IP/MPLS network.

Once an IP host’s MAC address is learned over the VC, subsequent packets will be sent over that VC, the same as with normal L2-based switching. L2VPN multicast replication behaves similarly to unicast, but uses a special (MET) adjacency.

Note: L2VPN services are available today on Supervisor 720 (PFC/DFC3), but require the use of additional SIP/SPA hardware and 12.2(33)SXH or later IOS software.

Is that really bad? Yes and no. The essential L2VPN technology allows customers to provide emulated LAN services across their existing IP/MPLS core. This provides a valuable new capability, which was not available previously.

The primary challenge with the current L2VPN technology is two-fold. First, it requires additional hardware (SIP/SPA), which increases cost and management burden. Second, it suffers from many of the well-known limitations of normal LAN technologies (for example, L2 multicast flooding).

Today’s Solutions

The Supervisor 2T (PFC/DFC4) and latest IOS software provide native hardware integration for advanced L2VPN services (VPLS and EoMPLS) on all Ethernet ports, which simplifies the overall network design (configuration and monitoring points) and reduces operational costs (new and replacement hardware).

In addition to the obvious benefits of hardware integration, the PFC/DFC4 architecture also provides additional L2VPN Multicast-specific enhancements, which include:

IGMP/PIM snooping (per-VC)

Multicast source-only bridging (per-VC)

Ordering of multicast entries in FIB (for VPLS)

Note: In an L2 multicast environment, the default switching behavior is to flood traffic to all ports in the same bridge domain. In order to conserve network bandwidth and resources, it is essential to restrict (or at least limit) L2 multicast flooding.

IGMP Snooping (Per-VC)

IGMP group membership and multicast router (mrouter) information (for the same customer VPN) can now be dynamically learned at the PE routers by snooping incoming IGMP frames and PIM packets (from the same customer sites).

Based on this L2 snooping information, the source PE is able to build an L2 multicast forwarding table (per VC). Once this L2 snooping table is built, a particular multicast flow will be forwarded to only remote PEs that have interested receivers (or mrouters), in the connected customer sites.

Note: In earlier implementations of L2VPN (without IGMP snooping), any multicast packets originating from the source CE device will be sent (flooded) to all remote CE devices, irrespective of whether a receiver exists for this multicast traffic.

This optimization saves core network bandwidth. The IGMP snooping optimization is only possible if qualified learning happens at the PE routers (specifically, each different VLAN, from the same CE device, is mapped to different VPLS domains).

Otherwise, since L2 snooping occurs at the PE routers on a per-VPLS VC (domain) basis, multiple customer VLANs mapped to one VPLS domain (unqualified learning) will negate the benefit of snooping.
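The per-VC snooping behavior described above can be sketched as a small table model. This is only an illustration under stated assumptions, not the PFC/DFC4 implementation: the class `VplsSnoopingTable`, its method names, and the VC names are hypothetical.

```python
from collections import defaultdict


class VplsSnoopingTable:
    """Hypothetical model of an L2 snooping table for one VPLS domain."""

    def __init__(self, all_vcs):
        self.all_vcs = set(all_vcs)          # every VC (pseudowire) in the domain
        self.receivers = defaultdict(set)    # group -> VCs with IGMP receivers
        self.mrouter_vcs = set()             # VCs on which an mrouter was learned

    def learn_igmp_join(self, group, vc):
        self.receivers[group].add(vc)

    def learn_mrouter(self, vc):
        self.mrouter_vcs.add(vc)

    def egress_vcs(self, group, snooping_enabled=True):
        # Without snooping, the frame is flooded to every VC in the domain.
        if not snooping_enabled:
            return set(self.all_vcs)
        # With snooping, only VCs with receivers or mrouters get a copy.
        return self.receivers.get(group, set()) | self.mrouter_vcs


table = VplsSnoopingTable(["vc-pe2", "vc-pe3", "vc-pe4"])
table.learn_igmp_join("239.1.1.1", "vc-pe2")
table.learn_mrouter("vc-pe4")
print(sorted(table.egress_vcs("239.1.1.1")))
print(sorted(table.egress_vcs("239.1.1.1", snooping_enabled=False)))
```

In this sketch, the remote PE behind `vc-pe3` receives the flow only when snooping is disabled, which is the core bandwidth that snooping saves.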

Figure 42. PFC/DFC4 L2VPN Flood Protection

Multicast Source-Only Bridging (Per-VC)

When there are no L2 receivers for incoming multicast traffic, this source-only traffic is bridged on the ingress VLAN (BD) to all L3 multicast router (or mrouter) ports, so that it can be routed to other L3 OIFs. Within a regular VLAN (BD), this bridging behavior is provided by the OMF (optimized multicast flooding) entry.

Since we can now learn L3 mrouters across the L2VPN core, it is possible to provide similar source-only entry behavior for each L2VPN virtual circuit. In this case, if there are no L2 receivers for multicast traffic coming from CE devices, the traffic will be forwarded only on the VCs on which we have learned L3 mrouters.

Within the PFC/DFC4 hardware, this L2VPN OMF entry is programmed as a (*, *, BD) entry in FIB. This FIB entry will drive a specific MET entry, which will have the VCs on which we have learned multicast routers (similar to normal OMF MET entries).

Ordering of Multicast Entries in FIB for VPLS

The following diagram shows the ordering of the various multicast-specific VPLS FIB entries. The ordering starts from the top with the highest order, and moves down with decreasing order.

The (S, G, BD) will be highest order (or most specific entry), and the (*, *, BD) OMF entry will be the lowest order (or least specific entry). The (*, 224.0.0.x/24) entry is the control-plane entry that is used to flood the frame on VCs in the VPLS domain.

Note: The control-plane entry is used for L3 routing-protocol frames (for example, OSPF) which use IP Multicast (224.0.0.x) addresses, to exchange adjacency information.
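The lookup order can be illustrated with a short sketch. The text only fixes the two ends of the order, (S, G, BD) first and (*, *, BD) last; the presence of a (*, G, BD) entry and its position relative to the control-plane entry are assumptions made for illustration, as are the function name and the dictionary layout.

```python
import ipaddress

CONTROL_PLANE_RANGE = ipaddress.ip_network("224.0.0.0/24")


def vpls_fib_lookup(fib, source, group, bd):
    """Return (matched entry type, result), trying the most specific entry first.

    `fib` is a hypothetical dict keyed by (source, group, bd) tuples.
    """
    if (source, group, bd) in fib:                  # (S, G, BD): most specific
        return "(S,G,BD)", fib[(source, group, bd)]
    if ("*", group, bd) in fib:                     # (*, G, BD): assumed entry
        return "(*,G,BD)", fib[("*", group, bd)]
    if ipaddress.ip_address(group) in CONTROL_PLANE_RANGE:
        return "(*,224.0.0.x/24)", "flood-all-vcs"  # control-plane flood entry
    if ("*", "*", bd) in fib:                       # (*, *, BD): OMF, least specific
        return "(*,*,BD)", fib[("*", "*", bd)]
    return "miss", None


fib = {
    ("10.1.1.1", "239.1.1.1", 100): ["vc-pe2"],
    ("*", "*", 100): ["vc-pe4"],    # OMF entry: VCs with learned mrouters
}
print(vpls_fib_lookup(fib, "10.1.1.1", "239.1.1.1", 100))
print(vpls_fib_lookup(fib, "10.9.9.9", "224.0.0.5", 100))
print(vpls_fib_lookup(fib, "10.9.9.9", "239.2.2.2", 100))
```

The third lookup shows source-only traffic falling through to the OMF entry, so it is sent only on the VC where an mrouter was learned.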

Figure 43. Special FIB Ordering for VPLS entries

Finally, the multicast-specific L2VPN services also benefit indirectly from all of the other enhancements provided by Supervisor 2T and PFC/DFC4 hardware, such as integration with LIF/BD port mapping, LTL/MET index sharing, MFIB-based forwarding lookups, EDC-based egress replication, and more.

With these new enhancements, the L2VPN Multicast implementation is able to replicate and send multicast traffic only to PEs (within a given VC) that have interested receivers or mrouters. This saves network bandwidth on PEs (and transit P devices), which do not need to forward the traffic.

Figure 44. Optimized L2VPN Topology

How Does That Help You?

The ability to perform L2VPN in hardware will simplify your overall network design (configuration and monitoring points) and simultaneously reduce operational costs (new and replacement hardware).

You will no longer need to purchase and support a (separate) SIP module and SPA interfaces, to leverage the benefits of L2VPN, and you can build a simplified end-to-end Ethernet-based solution, which uses well-known technologies.

For IP Multicast, the hardware integration also uses network bandwidth more efficiently, through L2 snooping and special MFIB programming. This makes the L2VPN architecture consistent with the fundamental tenets of IP Multicast.

CoPP Exception Cases and Granular Multicast Rate-Limits

Improve control plane protection for multicast traffic destined to the CPU.

Yesterday’s Challenges

In order to prevent Denial of Service (DoS) attacks, as well as misconfigurations or software defects, it is a generic operational best practice to control (or “rate-limit”) any traffic that is destined to the CPU. This is particularly important for IP Multicast, which is inherently data-driven.

The vast majority of data traffic within a particular network system is simply in transit, being forwarded on to another network system, along the forwarding path from source to receiver. This traffic (and associated hardware processing) is normally called the “data-plane”.

However, in order for these network systems to understand exactly where to send the data traffic to, these devices communicate with each other. This exchange of forwarding information (and associated software processing) is normally called the “control-plane”.

The Supervisor 720 (MSFC3) architecture uses 2 separate CPUs, namely the SP and RP. The Switch Processor (SP) handles all L2 (or MAC-based) “switching” functions, while the Route Processor (RP) handles all L3+ (or IP/MPLS-based) “routing” functions.

Examples of L2 control-plane traffic include:

Switching protocol packets: STP/RSTP/MST, VTP, DTP, etc.

Adjacency packets: CDP, LLDP

Link management packets: LACP, PAGP, UDLD, etc.

Authentication packets: 802.1x, etc.

Multicast snooping: IGMP, MLD, and PIM

Examples of L3 control-plane traffic include:

Routing protocol packets: BGP, OSPF, EIGRP, IS-IS, RIP, and more

First-Hop Redundancy Protocol packets: HSRP, GLBP and VRRP

Reachability packets: ARP, ICMP, IP-SLA Probes, etc.

Management packets: SNMP, SSH, Telnet, TFTP, NTP, and more

Multicast routing packets: PIM, IGMP/MLD, Auto-RP, BSR, etc.

Certain types of data-plane traffic may actually need to be processed by the CPU in the software as well. This type of traffic is normally referred to as "Punt" traffic.

Examples of software-processed data-plane packets include:

Packets with IP options (for example, Router Alert)

Packets with Time-to-Live (TTL) field = 1

Packets that require ACL logging

Packets requiring fragmentation (for example, MTU mismatch)

Packets that are not classified by the hardware (for example, AppleTalk, IPX, DECNET, and more)

Packets for which the destination IP prefix cannot be found in the L3 routing table, also referred to as "FIB-Miss"

Packets that cannot be switched in hardware, because a non-hardware-supported feature is applied to the packet (for example, Multicast NAT)

Hence, it should be clear that the CPU (control-plane) must be protected from DoS attacks, or misconfigurations, in order to preserve the integrity of the network. For the Catalyst 6500 series, there are two basic options available:

Hardware rate limiters

CoPP

Note: With the Supervisor 720 (PFC/DFC3) architecture, both options have their advantages and disadvantages, and are usually best deployed together.

PFC/DFC3 Hardware Rate Limiters

Built into the PFC/DFC3 hardware are 10 hardware registers that can be used to match various specific packet types and apply a packet rate-limiting action. A rate limiter is enabled by defining a packets-per-second (pps) threshold for a given type of traffic destined to the CPU.

This includes registers to match both L2 and L3 types of traffic, and both L2 and L3 rate-limiters can be applied at the same time. Any traffic matching the configured traffic type, in excess of the defined pps threshold (rate), is simply dropped. This occurs on the PFC/DFC3 itself, and dropped traffic never reaches the CPU.
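The drop-above-threshold behavior can be sketched as follows. The text does not describe how the hardware measures the rate, so the one-second fixed window used here is an assumption, and the class name `PpsRateLimiter` is hypothetical.

```python
class PpsRateLimiter:
    """Hypothetical fixed-window packets-per-second limiter."""

    def __init__(self, pps_threshold):
        self.pps_threshold = pps_threshold
        self.window = None      # current one-second window (integer seconds)
        self.count = 0          # packets admitted in the current window

    def allow(self, timestamp):
        """Return True if the packet may reach the CPU, False if it is dropped."""
        window = int(timestamp)
        if window != self.window:       # a new one-second window begins
            self.window = window
            self.count = 0
        if self.count < self.pps_threshold:
            self.count += 1
            return True
        return False                    # over threshold: never reaches the CPU


limiter = PpsRateLimiter(pps_threshold=3)
arrivals = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1]
print([limiter.allow(t) for t in arrivals])
# [True, True, True, False, False, True]
```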

Figure 45. PFC/DFC3-Based Hardware Rate Limiters and Software CoPP

On PFC/DFC3, the following hardware rate limiters for IP Multicast are available:

Table 4. PFC/DFC3 Hardware Rate Limiters

| Rate-Limiter | Default | Threshold | Description |
|---|---|---|---|
| Multicast Partial-SC | ON | 100,000 pps | Some multicast flows may be partially software switched, if special processing is required. It is desirable to rate-limit these flows destined to the Multilayer Switching Feature Card (MSFC). Note: This rate-limiter uses a special register that is not accounted for in the available 10 hardware registers. It is applied globally, not on a per-forwarding-engine basis. |
| Multicast Default Adjacency | ON | 100,000 pps | Limits multicast traffic requiring software processing because of an FIB miss (for example, if multicast destination address does not match an entry in the hardware mroute table). |
| Multicast Non-RPF | OFF | - | Limits multicast packets that fail the RPF check (for example, non-DR on a shared LAN). Note: This option is only supported on the PFC/DFC3B and PFC/DFC3C. |
| Multicast IP Options | OFF | - | Limits multicast packets with IP options (for example, IGMP Router Alert) that are sent to the RP CPU for further processing. Note: This option is only supported on the PFC/DFC3B and PFC/DFC3C. |
| TTL Failure | OFF | - | Applies to both unicast and multicast. Limits packets sent to the RP CPU, because of a Time-to-Live (TTL) check failure. |
| MTU Failure | OFF | - | Applies to both unicast and multicast. Limits packets sent to the RP CPU because of Maximum Transmission Unit (MTU) failure. |
| Multicast IGMP | OFF | - | Limits IGMP control messages sent to the SP CPU for IGMP snooping. This rate limiter should be used when IGMP snooping is enabled. Note: This hardware register is shared with the L2 PDU register, and only one can be enabled at a time. Note: This rate limiter is not supported when the switch is in truncated fabric-switching mode. |

The Catalyst 6500 hardware rate-limiter capability is extremely effective, but it is also very explicit. If a very low threshold is used, it will also drop valid frames, thus creating other problems. Care must be taken to determine the correct rate-limit threshold for each system, based on the legitimate (or expected) rate of incoming frames.

Note that there are a finite number of hardware registers available (10). Eight of these registers are present in the L3 forwarding-engine and two of these registers are present in the L2 forwarding-engine. Furthermore, some of these registers are shared between packet types (for example, L2 PDU and IGMP), meaning that they cannot both be used at the same time.
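The register constraints can be made concrete with a small allocation model. It is a sketch only: the class `RateLimiterRegisters`, the limiter names, and the error messages are hypothetical, and only the counts (eight L3, two L2) and the L2 PDU/IGMP sharing come from the text.

```python
class RateLimiterRegisters:
    """Hypothetical model of a fixed pool of rate-limiter registers."""

    # Limiters that share one register, so only one may be active at a time.
    SHARED_GROUPS = [{"l2-pdu", "multicast-igmp"}]

    def __init__(self, l3_registers=8, l2_registers=2):
        self.free = {"l3": l3_registers, "l2": l2_registers}
        self.enabled = {}   # limiter name -> engine ("l2" or "l3")

    def enable(self, name, engine):
        if name in self.enabled:
            return          # already enabled, no extra register consumed
        for group in self.SHARED_GROUPS:
            conflict = (group - {name}) & set(self.enabled)
            if name in group and conflict:
                raise ValueError(f"{name} shares a register with {sorted(conflict)}")
        if self.free[engine] == 0:
            raise ValueError(f"no free {engine} registers")
        self.free[engine] -= 1
        self.enabled[name] = engine


regs = RateLimiterRegisters()
regs.enable("multicast-igmp", "l2")
try:
    regs.enable("l2-pdu", "l2")
except ValueError as err:
    print(err)
```

The rejected second call mirrors the limitation that the L2 PDU and IGMP rate limiters cannot be used at the same time.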

Pro Tip: [Learn more about PFC/DFC3-based hardware rate limiters. http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/dos.html]

MSFC3-Based CoPP

CoPP provides a broader level of protection than that offered by hardware-based rate-limiters. However, the Supervisor 720 architecture implements CoPP on the Route Processor (RP) inband controller, which is analogous to an external router’s physical Ethernet interface.

CoPP introduces a new virtual interface to the system: the control-plane interface. This new interface type is similar to existing virtual interfaces such as the Loopback Interfaces, Tunnel Interfaces, Port Channel Interfaces, and VLAN Interfaces (SVI). CoPP uses a dedicated control plane configuration through the Modular QoS CLI (MQC).

When the control-plane interface is enabled, it allows a software (QoS-based) rate-limiter (police) to be applied. This CoPP rate-limiting action applies to any traffic selected by a user-configured ACL, which is destined to the RP control-plane. Using ACLs to match packet types provides more flexible coverage, for better protection against a global level control-plane attack.
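The classify-then-police idea can be modeled in a few lines. On the switch this is configured through MQC, not in Python; the sketch below only illustrates how ACL-style matching selects a policer rate. The class names, match criteria, and rates are hypothetical, and first-match-wins ordering is assumed.

```python
import ipaddress

# Hypothetical CoPP policy: ordered classes, first match wins.
COPP_POLICY = [
    # (class name, protocol, destination network, police rate in pps)
    ("copp-pim",  "pim",  ipaddress.ip_network("224.0.0.13/32"), 1000),
    ("copp-igmp", "igmp", ipaddress.ip_network("224.0.0.0/4"),   500),
]
CLASS_DEFAULT = ("class-default", 100)


def classify(protocol, dst_ip):
    """Return (class name, police rate) for a packet destined to the CPU."""
    dst = ipaddress.ip_address(dst_ip)
    for name, proto, network, rate in COPP_POLICY:
        if protocol == proto and dst in network:
            return name, rate
    return CLASS_DEFAULT    # anything unmatched falls into the default class


print(classify("pim", "224.0.0.13"))
print(classify("igmp", "239.1.1.1"))
print(classify("ospf", "224.0.0.5"))
```

Because matching is driven by ACL-like criteria, new traffic types can be covered by adding a class, without needing another dedicated hardware register.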

Figure 46. PFC/DFC3-Based Hardware Rate Limiters and Software CoPP

Pro Tip: [Learn more about MSFC3-based CoPP http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/copp.html]

Unfortunately, the Supervisor 720-based CoPP functionality is implemented in the software on the RP inband controller. Hence, the CPU-bound traffic has already crossed the switching backplane (Data Bus or Switch Fabric) and consumed finite interface resources.

In addition, the current CoPP implementation does not protect the SP CPU. This can normally be mitigated by implementing the L2 hardware rate limiters, but both the L2 PDU and IGMP rate-limits share the same register (and hence, cannot coexist).

Is that really bad? No. The existing capabilities are extremely useful, and provide a previously unavailable level of control-plane protection. Some competitors do not even support this level of protection.

The primary challenge of (PFC/DFC3) hardware rate limiters is that the total number of traffic types that can be matched is limited, and care must be taken to configure a threshold that balances good and bad control-plane traffic.

The primary challenge of (MSFC3) CoPP is that it is implemented primarily as a function of the RP inband controller (and not the SP), and the traffic may still consume other system resources before being dropped on the control-plane interface.

Otherwise, if used in conjunction with the hardware rate limiters, CoPP provides significant protection against many specific CPU-bound traffic storms. For example, by combining a CoPP ACL that uses the “fragment” keyword with the “IP options” hardware rate limiter, it is possible to protect the RP CPU from two types of traffic that are rare in normal operation but commonly used for DoS attacks.

Today’s Solutions

The Supervisor 2T (MSFC5 and PFC/DFC4) hardware and the latest IOS software provide several fundamental enhancements to the existing control-plane protection capabilities, including the following:

Single (Dual-Core) RP CPU and Inband controller

Hardware integration and special exception-cases for CoPP

32 L3 and 12 L2 registers for hardware rate limiters

The MSFC5 single RP CPU and Inband controller eliminates the need for a separate SP and RP based CoPP implementation, and thus creates a single “control-plane” interface.

The CPU control-plane interface is treated the same as any other interface in the system. Because the control-plane interface operates outside of the data-plane, the transit switching performance is not affected. This allows CoPP functionality to be implemented directly into the PFC/DFC4 hardware (for any forwarding lookups that select the CPU control plane as the destination).

Note: The PFC/DFC4 L3 forwarding-engine is now responsible for applying CoPP policies, using the classification (or QoS/ACL) TCAM. This functionality is similar to how security ACLs and QoS are applied to normal data-plane traffic flows.

In addition to the single control-plane interface, there are several other general control-plane protection hardware enhancements that improve the granularity and flexibility, for both unicast and multicast control-plane messages. These are listed below:

PFC/DFC4-Based Hardware Rate Limiters

Configurable on a per-packet or per-byte basis

Ability to leak the first packet, when the threshold is exceeded

Counters for Forward and Drop, on a packet or byte basis

PFC/DFC4-Based CoPP

CoPP for output exceptions like TTL/MTU failures

Ability to specify Punt clause in a CoPP policy

Ability to selectively ignore certain exceptions

Flexibility to apply either a packet or byte based policy

Capability to count exception packets, at the flow granularity (using NetFlow)
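The hardware rate limiter options listed above (packet or byte basis, first-packet leak, and forward/drop counters) can be combined in one sketch. The text does not define the leak semantics precisely, so this model assumes that one over-threshold packet per interval is still forwarded. The class name and the explicit `new_interval` call are hypothetical.

```python
class Pfc4StyleRateLimiter:
    """Hypothetical limiter with packet/byte modes, first-packet leak, counters."""

    def __init__(self, threshold, mode="packet", leak_first=False):
        if mode not in ("packet", "byte"):
            raise ValueError("mode must be 'packet' or 'byte'")
        self.threshold = threshold      # packets or bytes allowed per interval
        self.mode = mode
        self.leak_first = leak_first
        self.used = 0                   # budget consumed in this interval
        self.leaked = False             # at most one leak per interval (assumed)
        self.forwarded = {"packets": 0, "bytes": 0}
        self.dropped = {"packets": 0, "bytes": 0}

    def new_interval(self):
        self.used = 0
        self.leaked = False

    def offer(self, size_bytes):
        cost = 1 if self.mode == "packet" else size_bytes
        if self.used + cost <= self.threshold:
            self.used += cost
            verdict = True
        elif self.leak_first and not self.leaked:
            self.leaked = True          # assumed meaning of "leak the first packet"
            verdict = True
        else:
            verdict = False
        counters = self.forwarded if verdict else self.dropped
        counters["packets"] += 1
        counters["bytes"] += size_bytes
        return verdict


limiter = Pfc4StyleRateLimiter(threshold=3000, mode="byte", leak_first=True)
print([limiter.offer(1500) for _ in range(5)])
print(limiter.forwarded, limiter.dropped)
```

Keeping separate forward and drop counters in both packets and bytes is what lets an operator judge whether a threshold is too strict for legitimate traffic.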

Figure 47. PFC/DFC4-Based Hardware Rate Limiters and CoPP

All of these principles apply for both unicast and multicast control-plane protection. However, there are also some notable enhancements specifically for multicast, including:

CoPP for L2 IGMP/MLD/PIM Snooping

With L2 snooping enabled, IGMP/MLD/PIM packets are L2 redirected to the CPU using the L2 forwarding-engine redirection logic (as with PFC/DFC3). Within the L3 forwarding-engine, these redirected frames bypass the L3 FIB lookup, but the engine still processes them in order to apply any additional features such as PACL, VACL, PVACL, and more.

It is at this point that CoPP is implemented (using the classification TCAM, and a special Egress Logical Interface, or ELIF). CoPP uses a special adjacency entry for redirected frames, destined to the control-plane. The final forwarding decision (to copy the packet to CPU or not) thus depends on the IGMP/MLD or PIM traffic rate and the configured CoPP policy.

If the frame rate is below the rate configured in the CoPP policy, the frames are returned to the L2 forwarding-engine with the CPU forwarding index, and are sent to the CPU for further processing. Frames that exceed the configured rate are dropped.
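The conform-or-drop decision for redirected snooping frames can be sketched with a policer model. The text does not state which policing algorithm the hardware uses; a single-rate token bucket is assumed here, and the class name, rate, and burst values are hypothetical.

```python
class TokenBucketPolicer:
    """Hypothetical single-rate policer: conform -> CPU, exceed -> drop."""

    def __init__(self, rate_pps, burst_packets):
        self.rate_pps = rate_pps
        self.burst = burst_packets
        self.tokens = float(burst_packets)   # the bucket starts full
        self.last = 0.0                      # arrival time of the previous frame

    def police(self, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return "send-to-cpu"    # frame goes back with the CPU forwarding index
        return "drop"


policer = TokenBucketPolicer(rate_pps=2, burst_packets=2)
print([policer.police(t) for t in [0.0, 0.0, 0.0, 1.0]])
# ['send-to-cpu', 'send-to-cpu', 'drop', 'send-to-cpu']
```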

CoPP Multicast FIB-Miss Exception Handling

In those cases when a destination FIB entry is not present, the CPU must process the multicast packets to make a forwarding decision and install a new hardware shortcut. For this case, the PFC/DFC4 will simply use the special FIB (*,G/m) entry, which is associated with the CoPP ELIF index. As with the other cases, this forwarding lookup result will subject the packets to any CoPP policies, and limit the rate of packets sent to the CPU.

CoPP PIM-SM Source Register Handling

Multicast traffic for a previously unlearned (new) source needs to be sent to the CPU, for the PIM-SM “Source Register” process. For this case, the PFC/DFC4 will also simply use the special FIB (*,G/m) entry, which is associated with the CoPP ELIF index. As with the other cases, this forwarding lookup result will subject the packets to any CoPP policies, and limit the rate of packets sent to the CPU.
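Both the FIB-miss and source-register cases rely on a catch-all (*,G/m) entry that points at the CoPP ELIF. A minimal sketch of that fallback follows, assuming a simple exact-match table plus longest-prefix match on the group. The function name, table layout, and the `copp-elif` label are hypothetical.

```python
import ipaddress

COPP_ELIF = "copp-elif"   # hypothetical label for the CoPP egress logical interface


def mfib_lookup(exact_routes, prefix_routes, source, group):
    """Look up (S,G), then (*,G), then the longest matching (*,G/m) prefix."""
    if (source, group) in exact_routes:
        return exact_routes[(source, group)]
    if ("*", group) in exact_routes:
        return exact_routes[("*", group)]
    addr = ipaddress.ip_address(group)
    matches = [net for net in prefix_routes if addr in net]
    if matches:
        best = max(matches, key=lambda net: net.prefixlen)
        return prefix_routes[best]
    return None


exact = {("10.1.1.1", "239.1.1.1"): "hw-shortcut"}
prefixes = {ipaddress.ip_network("224.0.0.0/4"): COPP_ELIF}

print(mfib_lookup(exact, prefixes, "10.1.1.1", "239.1.1.1"))  # known flow
print(mfib_lookup(exact, prefixes, "10.2.2.2", "239.1.1.1"))  # new source
```

In this model a packet from the new source resolves to the CoPP ELIF, so it is punted to the CPU only at the rate the CoPP policy allows.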

Note: In most cases, when CoPP is configured, it overrides configured rate limiters.

In some cases, CoPP cannot be applied. These cases are listed below:

PIM register packets arriving at the RP must be sent to the CPU (which is the usual case). This cannot be achieved using CoPP, since it would also affect forwarded packets. Instead, a special copy_mask is driven from an ACL that recognizes the packet as a PIM register packet, and a rate limiter can subsequently be applied.

During PIM register de-encapsulation, a de-encapsulated packet that hits a multicast route entry must be copied to the CPU using a rate limiter, because the software needs to know that native multicast packets have started arriving. This allows it to start sending register-stop messages in response to subsequent PIM register messages.

New (or Enhanced) IP Multicast Hardware Rate Limiters on PFC/DFC4

Table 5. PFC/DFC4 New Hardware Rate Limiters

| Rate-Limiter | Default | Description |
|---|---|---|
| Multicast Punt | OFF | Limits (*,G/m) punt packets and last hop (*,G) packets which need signaling. |
| Multicast SPT | OFF | Limits packets coming from the source-tree during SPT switchover (where RPF towards the shared-tree is still the primary RPF). |
| Multicast Register | OFF | Limits packets coming from the First Hop Designated Router (DR) to the Rendezvous Point (RP) for source register processing. |
| Routing Control | OFF | Limits multicast-based routing protocol control-messages (for example, OSPF and EIGRP). |

How Does That Help You?

The ability to perform full hardware-based Control-Plane Protection is an important differentiator over similar switching platforms. The switching control-plane is arguably more important than the data-plane (which is reliant on the control-plane). As a result, protecting the CPU is absolutely vital.

DoS attacks continue to be a serious threat to enterprise and service provider (SP) networks. These attacks can disrupt mission-critical services, prevent data transfer between devices, and decrease overall productivity.

The Supervisor 2T, along with the new MSFC5 RP CPU and the PFC/DFC4-based hardware rate limiters and CoPP, provides an unprecedented level of protection against DoS attacks on the Cisco Catalyst 6500 Series switches.

This will simultaneously reduce unnecessary load on the system (which will increase system availability and stability), while also reducing your stress levels.

NetFlow (v9) Special Fields and Processing for Multicast

Get NFv9 + Flexible NetFlow and Egress NetFlow Data Export (NDE) support for multicast flows.

Yesterday’s Challenges

NetFlow accounting is a very powerful monitoring tool that network administrators can use to collect information on the traffic patterns, link utilizations, per-user billing, and more. The NetFlow feature collects traffic statistics about individual packets that flow through the Catalyst 6500 and stores these statistics in the NetFlow table. The NetFlow table on the MSFC captures statistics for flows routed in software, and the NetFlow table on the PFC (and on each DFC) captures statistics for flows routed in hardware.

Several “hardware-assisted” features use the NetFlow table. Features such as Network Address Translation (NAT) and IPv6 Multicast use NetFlow to modify the forwarding result, while other features (such as QOS microflow policing) use statistics from the NetFlow table.

Pro Tip: [Learn more about PFC/DFC3-based NetFlow http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/NetFlow.html]

In addition to providing this “network flow” information to the user directly (through CLI or SNMP), it can also be exported externally and then analyzed for a wide variety of uses. This process is known as NetFlow Data Export (NDE). The NDE feature provides the ability to export the statistics to an external device (called a NetFlow Collector).

NDE is a sophisticated traffic-management mechanism that network administrators can use to monitor vital statistical information on the overall network health, such as the device and link utilizations, the security of the network, the traffic patterns, and more.

Pro Tip: [Learn more about NetFlow Data Export (NDE) http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/nde.html]

For IP Multicast, the NetFlow feature lets the user capture multicast-specific data (both packets and bytes) for individual multicast flows. For example, the user can capture the packet-replication factor for a specific flow as well as for each outgoing stream.

Note: Multicast NetFlow support provides complete end-to-end usage information about network traffic for a complete multicast traffic billing solution.

Multicast NetFlow also allows the user to enable NetFlow statistics that account for all packets that fail the Reverse Path Forwarding (RPF) check, which are normally dropped. Accounting for RPF-failed packets provides more accurate traffic statistics and patterns.

Finally, Multicast NetFlow also helps in capturing per-source information for PIM-BIDIR groups (which have no knowledge, and hence reporting capability, of individual Multicast sources). NetFlow is the only mechanism that can capture this vital information, and with hundreds or thousands of multicast sources, this information would be vitally important in troubleshooting and accounting.

Today’s Solutions

The new Supervisor 2T and PFC/DFC4, along with the new IOS code, provide a new NetFlow architecture known as Flexible NetFlow (or FNF). It maintains all previously supported aspects of NetFlow (including all versions of NDE), while adding new capabilities that make NetFlow accounting and export more flexible for the user.

Within the PFC/DFC4, the NetFlow module is responsible for collecting flow-based information on IFE/OFE processed packets in the L3 forwarding engine. It takes input from the Classification (CL) module and outputs to the L3 forwarding module, where the ACL, NetFlow, and L3 lookup results are combined. The CL module performs the classification lookup and provides a NetFlow profile ID.

The NetFlow profile ID, which contains the key and flow mask information, is used for the NetFlow entry lookup. Flows are created dynamically by hardware, or installed by software through the CPU or an inband packet. The NetFlow module consists of the NetFlow lookup table, the NetFlow entry table, and the NetFlow statistics table.

The NetFlow entry table (NT) and the NetFlow statistics table (NS) are related. Whenever an entry is created in the NT table, a corresponding entry is created in the NS table. The NT table contains the flow characteristics or pattern information (such as the key fields), while the NS table maintains the matching one-to-one statistics for each active flow, such as packet and byte counts, the last-used timestamp, and so on.

For NetFlow to start collecting the flow information, classifying logic has to be programmed with appropriate information. The LIF corresponding to the interface where the ingress multicast NetFlow configuration is made will point to an ACL, which will point to the profile ID which will, in turn, point to the NetFlow table and statistics.
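The chain from LIF to ACL to profile ID to the NetFlow tables can be sketched as a set of lookups. This is an illustrative model, not the hardware data layout: the class `NetFlowModel`, its field names, and the example key fields are hypothetical. Only the pointer chain and the one-to-one NT/NS relationship are taken from the text.

```python
class NetFlowModel:
    """Hypothetical model: LIF -> ACL -> profile ID -> NT/NS tables."""

    def __init__(self):
        self.lif_to_acl = {}        # LIF -> ACL name
        self.acl_to_profile = {}    # ACL name -> profile ID
        self.profiles = {}          # profile ID -> tuple of key field names
        self.nt = {}                # NetFlow entry table: flow key -> index
        self.ns = []                # NetFlow statistics table, 1:1 with NT

    def configure(self, lif, acl, profile_id, key_fields):
        self.lif_to_acl[lif] = acl
        self.acl_to_profile[acl] = profile_id
        self.profiles[profile_id] = tuple(key_fields)

    def account(self, lif, packet, timestamp):
        acl = self.lif_to_acl.get(lif)
        if acl is None:
            return None             # NetFlow is not configured on this LIF
        profile_id = self.acl_to_profile[acl]
        key = (profile_id,) + tuple(packet[f] for f in self.profiles[profile_id])
        if key not in self.nt:      # new flow: create NT and NS entries together
            self.nt[key] = len(self.ns)
            self.ns.append({"packets": 0, "bytes": 0, "last_used": None})
        stats = self.ns[self.nt[key]]
        stats["packets"] += 1
        stats["bytes"] += packet["length"]
        stats["last_used"] = timestamp
        return stats


nf = NetFlowModel()
nf.configure("lif-10", "mcast-acl", 1, ["src", "dst"])
pkt = {"src": "10.1.1.1", "dst": "239.1.1.1", "length": 1000}
nf.account("lif-10", pkt, 1.0)
nf.account("lif-10", pkt, 2.0)
print(nf.ns)
```

The profile's key fields determine what counts as one flow, which is why changing the profile changes the granularity of the collected statistics.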

Figure 48. PFC/DFC4 NetFlow Processing

Statistics are one of the most frequently and widely used tools for a variety of purposes, including accounting, billing, debugging, and in some cases, even for protocol operations. There are different types of statistics serving different requirements.

The PFC4/DFC4 supports the following:

Forwarding statistics

Adjacency statistics

Accounting statistics (Label to IP, IP to IP, Aggregate label)

NetFlow statistics

LIF statistics

Apart from these, there are many other counters that are useful for debugging purposes (these counters are not discussed in this document).

Forwarding statistics are a set of packet counters used to count FIB forward packets based on different packet types, such as IPv4 Unicast, IPv4 multicast, MPLS, and more.

There are 32 forwarding counters, and each counter is 32 bits wide. The forwarding engine will decide on which counter to update, based on the forwarding result. The statistics will not be updated if the packet is dropped either by forwarding decision or by rate limiter drop.

These counters give an aggregate count of packets forwarded, based on the packet type.

Adjacency statistics are maintained by a pair of counters: a 40-bit counter for bytes and a 31-bit counter for packets. The adjacency statistics table can hold up to 512 K adjacency counter pairs. These counters are very useful for checking byte and packet counts on a per-forwarding-adjacency basis. They are not updated if the forwarding operation results in a drop, rate-limit, or exception.

These counters can be used when debugging a specific flow, to check whether the expected adjacency is taken. Additionally, these statistics play a very important role in multicast: software multicast entries are kept alive based on the adjacency statistics for the corresponding multicast hardware flow.

Accounting statistics are similar to adjacency statistics, and contain both 44-bit byte counters and 35-bit packet counters. As the name suggests, these statistics can be used for accounting and billing purposes. These counters are a limited resource and can each hold up to 4 K entries. These statistics are controlled by an ACL for a specific interface and traffic pattern.
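The counter widths quoted above translate into how long a counter can run before reaching its maximum value. The short calculation below is a worked example under assumed conditions: a single entry carrying 10 Gb/s of 1,000-byte packets, with no software read-and-clear in between. The traffic rate and packet size are hypothetical.

```python
def seconds_until_full(bits, rate_per_second):
    """Time for a counter of the given width to reach its maximum value."""
    return (2 ** bits) / rate_per_second


# Hypothetical load: 10 Gb/s of 1,000-byte packets hitting one entry.
BYTES_PER_SECOND = 10e9 / 8
PACKETS_PER_SECOND = BYTES_PER_SECOND / 1000

widths = {
    "forwarding packets (32 bit)": (32, PACKETS_PER_SECOND),
    "adjacency bytes (40 bit)": (40, BYTES_PER_SECOND),
    "adjacency packets (31 bit)": (31, PACKETS_PER_SECOND),
    "accounting bytes (44 bit)": (44, BYTES_PER_SECOND),
    "accounting packets (35 bit)": (35, PACKETS_PER_SECOND),
}
for name, (bits, rate) in widths.items():
    print(f"{name}: {seconds_until_full(bits, rate):,.0f} s")
```

Under these assumptions the 40-bit adjacency byte counter fills in roughly 880 seconds, while the wider 44-bit accounting byte counter lasts 16 times longer.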

The NetFlow statistics table holds 512 K entries and corresponds on a 1:1 basis with the NetFlow table. Each statistics entry is 27 bytes and contains byte and packet counters, a last-accessed timestamp, a sticky TCP flag, and more. The most important use of these statistics is NDE; refer to the “NetFlow Architecture” documentation for a detailed description of the NetFlow statistics process.

The LIF statistics table holds 128 K entries, each formed as a set of eight counters (per LIF), where each counter includes both a 40-bit byte counter and a 31-bit packet counter. These eight counters (per LIF) provide eight different protocol-specific interface statistics, as each of the eight counters can be programmed to represent a different traffic type (such as bridged unicast packets, bridged multicast packets, routed IPv4 unicast, routed IPv6 multicast packets, and more). From the end-user perspective, these eight counters are preprogrammed in such a way that each counter provides one particular type of statistics.

Note: All of the L3 forwarding-engine statistics mentioned above (forwarding, adjacency, accounting, and NetFlow statistics) are preprogrammed to provide statistics for the specific application. They are generic for all classes of packets hitting the corresponding entry.

As a generic rule, there will be individual software components to manage these counters, which, in turn, can be used by applications such as multicast to fetch appropriate counters. The exceptions to this rule are the LIF statistics on the L2 forwarding engine, which require application-level programming to select and choose among the eight available counters per LIF.

The LIF statistics table is a 128 K (LIFs) * 8 (counters) * 72 bits data structure, which resides within 2 external 36 MB QDR SRAM memory banks, attached to the PFC/DFC4. The logical view of the LIF statistic counters is as follows:

Figure 49. LIF Statistics and NetFlow Accounting
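The stated dimensions of the LIF statistics table (128 K LIFs, 8 counters per LIF, 72 bits per counter) can be checked with a short calculation. It assumes that K means 1,024 and uses binary megabits and megabytes.

```python
LIFS = 128 * 1024          # 128 K logical interfaces
COUNTERS_PER_LIF = 8
BITS_PER_COUNTER = 72      # stated width; 40-bit bytes + 31-bit packets = 71 bits

total_bits = LIFS * COUNTERS_PER_LIF * BITS_PER_COUNTER
total_mbit = total_bits / (1024 * 1024)
total_mbyte = total_bits / 8 / (1024 * 1024)

print(total_bits)      # 75497472
print(total_mbit)      # 72.0
print(total_mbyte)     # 9.0
```

The result, 72 megabits (9 megabytes), would equal the combined capacity of the two memory banks if their stated size is read as 36 megabits each; that reading is an inference from the arithmetic, not something the text confirms.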

Using LIF statistics allows additional visibility and per-protocol management information, for critical L2/L3 packet forwarding. The eight available LIF statistics are summarized below:

Bridged unicast packets

Bridged multicast packets

Bridged broadcast packets

Routed IPv4 Unicast packets

Routed IPv4 Multicast packets

Routed IPv6 Unicast packets

Routed IPv6 Multicast packets

Routed others (MPLS, etc.)

This allows the administrator to separate previously combined interface statistics into unique per-protocol categories, for easier management and debugging. These statistics can be accessed through the CLI or the NDE. This is very powerful information, which is only available on the Supervisor 2T and PFC/DFC4 hardware.
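The eight per-LIF counters can be modeled as a small per-interface table. This is a sketch under assumptions: the class `LifStatistics`, the traffic-type labels, and the wrap-around on overflow are hypothetical. The eight categories and the 40-bit and 31-bit widths come from the text.

```python
LIF_COUNTER_TYPES = [
    "bridged-unicast", "bridged-multicast", "bridged-broadcast",
    "routed-ipv4-unicast", "routed-ipv4-multicast",
    "routed-ipv6-unicast", "routed-ipv6-multicast", "routed-other",
]


class LifStatistics:
    """Hypothetical per-LIF statistics: eight (bytes, packets) counter pairs."""

    BYTE_MASK = (1 << 40) - 1      # 40-bit byte counter
    PACKET_MASK = (1 << 31) - 1    # 31-bit packet counter

    def __init__(self):
        self.table = {}    # LIF -> {traffic type -> [bytes, packets]}

    def update(self, lif, traffic_type, size_bytes):
        if lif not in self.table:
            self.table[lif] = {t: [0, 0] for t in LIF_COUNTER_TYPES}
        pair = self.table[lif][traffic_type]
        # Wrap-around at the counter width is an assumption.
        pair[0] = (pair[0] + size_bytes) & self.BYTE_MASK
        pair[1] = (pair[1] + 1) & self.PACKET_MASK

    def report(self, lif):
        """Return only the traffic types that have counted at least one packet."""
        return {t: tuple(c) for t, c in self.table[lif].items() if c[1]}


stats = LifStatistics()
stats.update("vlan100", "routed-ipv4-multicast", 1200)
stats.update("vlan100", "routed-ipv4-multicast", 800)
stats.update("vlan100", "bridged-broadcast", 64)
print(stats.report("vlan100"))
```

A single combined interface counter would report three packets here; the per-type view shows that two of them were routed IPv4 multicast.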

How Does That Help You?

The new capabilities of Flexible NetFlow (FNF) allow much better accounting and exporting of IP Multicast traffic flows, by increasing the granularity and applicability of the statistics that your company requires (and eliminates those that it does not).

If you are interested in the billing aspect of FNF and NDE, this will provide for more accurate accounting of individual multicast flows, and reduce the processing load of export by eliminating unnecessary flow data.

If you are interested in the debugging aspect of FNF and NDE, the Supervisor 2T and PFC/DFC4 provide you with more granular details, and new details that were previously unavailable.

If you are interested in the traffic management (or capacity planning) aspect of FNF and NDE, the Supervisor 2T and PFC/DFC4 provide you with the flexibility to create special-case exporting reports, to focus on different areas and flows within your network.

Learn More

As noted earlier, this document only addresses the important new and enhanced IP Multicast features supported on the new Supervisor 2T.

However, this is only a subset of the entire IP Multicast feature list, supported on the Catalyst 6500. For more information on IP Multicast and the Catalyst 6500 Series, refer to http://www.cisco.com/en/US/products/hw/switches/ps708/prod_literature.html.

Conclusion

IP Multicast is an amazingly beneficial packet-forwarding model, which can efficiently distribute a single (source) IP data stream to multiple (receiver) IP hosts simultaneously.

This capability opens a wide range of possibilities, which may be too cost-prohibitive with a generic unicast or broadcast network. A wide variety of customers, from financial markets to service providers; from security agencies to transportation authorities, can (and should) use multicast traffic to enhance their network operations.

It is also amazingly complex, and thus requires specialized multicast software and hardware capabilities. These are critical details to consider when selecting your network platform.

Considering all of the existing and new capabilities, it should be clear why the Catalyst 6500 with Supervisor 2T hardware is the industry’s only true next-generation networking platform for IP Multicast:

Best performing

Highest scaling

Most highly available

Most flexible (modular) platform

For More Information

[Catalyst 6500 Series http://www.cisco.com/go/6500]

[Catalyst 6500 (Sup 720) IP Multicast Configuration - 12.2SX http://www.cisco.com/en/US/docs/ios/ipmulti/configuration/guide/12_2sx/imc_12_2sx_book.html]

[Catalyst 6500 (Sup 720 and earlier) and Catalyst 4500 (Sup 6E and earlier) IP Multicast Architecture and Troubleshooting PPT (PDF format) http://www.slideshare.net/CiscoSystems/catalyst-6500-and-4500-ip-multicast-architecture-and-troubleshooting]