The Internet Protocol Journal, Volume 16, No. 2

Link-State Protocols in Data Center Networks

by Alvaro Retana, Cisco Systems, and Russ White, Verisign

With the advent of cloud computing [6, 7], the pendulum has swung from focusing on wide-area or global network design toward a focus on Data Center network design. Many of the lessons we have learned in the global design space will be relearned in the data center space before the pendulum returns and wide-area design comes back to the fore.

This article examines three extensions to the Open Shortest Path First (OSPF) protocol that did not originate in the data center field but have direct applicability to efficient and scalable network operation in highly meshed environments. Specifically, it considers extensions to OSPF that reduce flooding in Mobile Ad Hoc Networks (MANETs) [1], demand circuits designed to support on-demand links in wide-area networks [2], and OSPF stub router advertisements designed to support large-scale hub-and-spoke networks [3]. Each is applied to a typical data center network design to show how these sorts of protocol improvements could affect the scaling of data center environments.

Each of the improvements examined has the advantage of being available in shipping code from at least one major vendor. All of them have been deployed and tested in real-world networks, and have proven effective for solving the problems they were originally designed to address. Note, as well, that OSPF is used throughout this article, but each of these improvements is also applicable to Intermediate System-to-Intermediate System (IS-IS), or any other link-state protocol.

Defining the Problem

Figure 1 illustrates a small Clos [0] fabric, which might be one piece of a much larger network design. Although full-mesh fabrics have fallen out of favor with data center designers, Clos and other styles of fabrics are in widespread use. A Clos fabric configured with edge-to-edge Layer 3 routing has three easily identifiable problems.

The flooding rate is the first problem a link-state protocol used in this configuration must deal with. Router B (and the other routers in spine 2), for instance, will receive four type 1 Link State Advertisements (LSAs) from the four routers in spine 1. Each of the routers in spine 2 will reflood each of these type 1 LSAs into spine 3, so the other routers in spines 3, 4, and 5 will each receive four copies of each type 1 LSA originated by routers in spine 1, a total of 16 type 1 LSAs in all.
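The arithmetic above can be checked with a short sketch. It assumes a simplified model in which every router in a spine connects to every router in the adjacent spines and refloods each LSA exactly once toward the next spine; the function name and the five-spine, four-router layout are illustrative assumptions, not details taken from Figure 1.

```python
def lsa_arrivals(spine_width: int, spine_count: int) -> dict[int, int]:
    """Type 1 LSA arrivals at ONE router in each spine, counting only
    LSAs originated by the routers in spine 1.

    Simplified model (an assumption): adjacent spines are fully meshed
    and each router refloods every LSA once toward the next spine.
    """
    arrivals = {}
    for spine in range(2, spine_count + 1):
        # Spine 2 hears each LSA directly from its originator; later
        # spines hear one copy from every router in the previous spine.
        copies_per_lsa = 1 if spine == 2 else spine_width
        # Spine 1 originates one type 1 LSA per router.
        arrivals[spine] = copies_per_lsa * spine_width
    return arrivals


if __name__ == "__main__":
    for spine, count in lsa_arrivals(spine_width=4, spine_count=5).items():
        print(f"router in spine {spine}: {count} type 1 LSAs received")
```

Under these assumptions a router in spine 2 receives 4 LSAs and a router in spines 3 through 5 receives 16, matching the counts in the text; the number grows with the square of the spine width.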

Figure 1: A Clos Fabric with Layer 3 to the Top of Rack

To make matters worse, OSPF is designed to time out every LSA originated in the network once every 20 to 30 minutes. This feature was originally put in OSPF to provide for recovery from bit and other transmission errors in older transport mechanisms with little or no error correction. So a router in spine 5 will receive 16 copies of each type 1 LSA generated by routers in spine 1 every 20 minutes.

A single link failure and recovery can also cause massive reflooding. The process of bringing the OSPF adjacency back into full operation requires a complete exchange of local link-state databases. If the link between router A and router B fails and then is recovered, the entire database must be transferred between the two routers, even though router B clearly has a complete copy of the database from other sources.

Finally, the design of this network produces some challenges for the Shortest Path First (SPF) algorithm, which link-state protocols use to determine the best path to each reachable destination in the network. Every router in spine 1 appears to be a transit path to every other destination in the network. This outcome might not be the intent of the network designer, but SPF calculations deal with available paths, not intent.

This set of problems has typically swayed network designers away from using link-state protocols in such large-scale environments. Some large cloud service providers use the Border Gateway Protocol (BGP) (see [4]), with each spine being a separate Autonomous System, so they can provide scalable Layer 3 connectivity edge-to-edge in large Clos network topologies. Others have opted for simple controls, such as removing all control-plane protocols and relying on reverse-path-forwarding filters to prevent loops.

The modifications to OSPF discussed in this article, however, make it possible for a link-state protocol to not only scale in this type of environment, but also to be a better choice.

Reducing Flooding Through MANET Extensions

MANETs are designed to be "throw and forget": a collection of devices is deployed into a rapidly changing situation on the ground, where the devices connect over short- and long-haul wireless links and "just work." One of the primary scaling (and operational) requirements in these environments is to minimize link usage wherever possible, including for the control plane.

The "Extensions to OSPF to Support Mobile Ad Hoc Networking" [1] were developed to reduce flooding in single-area OSPF networks to the minimum necessary, while providing fast recovery and guaranteed delivery of control-plane information. The idea revolves around the concept of an overlapping relay, which reduces flooding by accounting for the network topology, specifically groups of overlapping nodes.

Let's examine the process from the perspective of router A shown in Figure 2.

Figure 2: Ad Hoc Extensions to OSPF

Router A begins the process by not only discovering that it is connected to routers B, C, D, and E, but also that its two-hop neighborhood contains routers F and G. By examining the list of two-hop neighbors, and the directly connected neighbors that can reach each of those two-hop neighbors, router A can determine that if router D refloods any LSAs router A floods, every router in the network will receive the changes. Given this information, router A notifies routers B, C, and E to delay the reflooding of any LSAs received from router A itself.

When router A floods an LSA, router D will reflood the LSA to routers F and G, which will then acknowledge receiving the LSA to routers B, C, D, and E. On receiving this acknowledgement, routers B, C, and E will remove the changed LSA from their reflood lists.

Routers F and G, then, will receive only one copy of the changed LSA, rather than four.

Applying this process to the Clos design in Figure 1 and using this extension would dramatically reduce the number of LSAs flooded through the network in the case of a topology change. If router A, for instance, flooded a new type 1 LSA, the routers in spine 2 would each receive one copy. The routers in spines 3, 4, and 5 would also receive only one copy each, rather than 4 or 16.
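The relay choice router A makes can be illustrated with a greedy cover over its two-hop neighborhood. This is only a sketch of the idea, not the selection procedure specified in [1]; the function name is hypothetical, and the reachability table is an assumed reading of Figure 2 (which neighbor reaches which two-hop router is not stated in the text, beyond router D reaching both F and G).

```python
def choose_relays(reach: dict[str, set[str]], two_hop: set[str]) -> list[str]:
    """Greedily pick one-hop neighbors until every two-hop neighbor is covered.

    reach maps each directly connected neighbor to the two-hop
    neighbors it can reach.
    """
    uncovered = set(two_hop)
    relays = []
    while uncovered:
        # Sorting gives a deterministic tie-break between equal candidates.
        best = max(sorted(reach), key=lambda n: len(reach[n] & uncovered))
        gained = reach[best] & uncovered
        if not gained:
            break  # the remaining two-hop nodes are unreachable via any neighbor
        relays.append(best)
        uncovered -= gained
    return relays


# Assumed reachability for router A's neighbors (hypothetical values).
reach_from_a = {
    "B": {"F"},
    "C": {"F"},
    "D": {"F", "G"},
    "E": {"G"},
}

relays = choose_relays(reach_from_a, {"F", "G"})
delayed = sorted(set(reach_from_a) - set(relays))
print("reflood immediately:", relays)
print("delay reflooding:   ", delayed)
```

With this assumed topology, router D alone covers both two-hop neighbors, so routers B, C, and E are the ones asked to delay reflooding, as described above.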

Reducing Flooding Through Demand Circuits

Network engineers have long had to consider links that are connected only when traffic is flowing in their network and protocol designs. Dial-up links, for instance, or dynamically configured IP Security (IPsec) tunnels, have always been a part of the networking landscape. Part of the problem with such links is that the network needs to draw traffic to destinations reachable through the link even though the link is not currently operational.

With protocols that rely on neighbor adjacencies to maintain database freshness, such as OSPF, links that can be disconnected in the control plane and yet still remain valid in the data plane pose a unique set of difficulties. The link must appear to be available in the network topology even when it is, in fact, not available.

To overcome this challenge, the OSPF working group in the IETF extended the protocol to support demand links. Rather than attacking the problem at the adjacency level, OSPF attacks the problem at the database level. Any LSA learned over a link configured as a demand link is marked with the DoNotAge (DNA) bit; such LSAs are exempt from the normal aging process that periodically removes LSAs from the link-state database.

How does this situation relate to scaling OSPF in data center network design?

Every 20 minutes or so, an OSPF implementation will time out all the locally generated LSAs, replacing them with newly generated (and identical) LSAs. These newly generated LSAs will be flooded throughout the network, replacing the timed-out copy of the LSA throughout the network. In a data center network, these refloods are simply redundant; there is no reason to refresh the entire link-state database periodically.

To reduce flooding, then, data center network designers can configure all the links in the data center as demand circuits. Although these links are, in reality, always available, configuring them as demand circuits causes the DNA bit to be set on all the LSAs generated in the network. This process, in turn, disables periodic reflooding of this information, reducing control-plane overhead.
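The effect of the DNA bit on the database can be sketched as follows. The sketch assumes the DoNotAge flag is carried as the high bit of the LS age field and that LSAs are flushed at a maximum age of one hour, which is how the OSPF specifications are commonly summarized; the class and function names are hypothetical, and the periodic refresh by the originating router is deliberately left out.

```python
from dataclasses import dataclass, replace

MAX_AGE = 3600       # seconds; assumed maximum LSA age
DO_NOT_AGE = 0x8000  # assumed position of the DoNotAge flag in LS age


@dataclass(frozen=True)
class Lsa:
    lsa_id: str
    ls_age: int  # LS age in seconds, possibly with DO_NOT_AGE set


def age_database(database: list[Lsa], elapsed: int) -> list[Lsa]:
    """Advance the database by `elapsed` seconds and return the survivors."""
    survivors = []
    for lsa in database:
        if lsa.ls_age & DO_NOT_AGE:
            # Learned over a demand circuit: never aged, never flushed.
            survivors.append(lsa)
            continue
        new_age = lsa.ls_age + elapsed
        if new_age < MAX_AGE:
            survivors.append(replace(lsa, ls_age=new_age))
        # Otherwise the LSA has reached MAX_AGE and is dropped.
    return survivors


db = [Lsa("normal", 3000), Lsa("demand", 3000 | DO_NOT_AGE)]
print([lsa.lsa_id for lsa in age_database(db, 1200)])
```

An LSA without the flag must be regenerated and reflooded by its originator before it reaches the maximum age; an LSA with the flag simply stays in the database, which is why no periodic reflood is needed.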

Reducing Control-Plane Overhead by Incremental Database Synchronization

When a link fails and then recovers, the OSPF protocol specifies a lengthy procedure through which the two newly adjacent OSPF processes must pass to ensure their databases are exactly synchronized. In the case of data center networks, however, there is little likelihood that a single link failure (or even multiple link failures) will cause two adjacent OSPF processes to have desynchronized databases.

For instance, in Figure 1, if the link between routers A and B fails, routers A and B will still receive any and all link-state database updates from some other neighbor they are still fully adjacent with. When the link between routers A and B is restored, there is little reason for routers A and B to exchange their entire databases again.

This situation is addressed by another of the MANET extensions to OSPF, called Unsynchronized Adjacencies. Rather than sending an entire copy of the database when a link recovers and waiting until this exchange is complete before forwarding traffic, this extension states that OSPF processes do not need to synchronize their databases if they are already synchronized with other nodes in the network. If needed, the adjacency can be synchronized out of band at a later time.

The application of the MANET OSPF extensions [1] to a data center network means links can be pressed into service very quickly on recovery, and it provides a reduction in the amount of control-plane traffic required for OSPF to recover.

Reducing Processing Overhead Through Stub Routers

The SPF calculation that link-state protocols use to determine the best path to any given destination in the network treats all nodes and all edges on the graph as equal. Returning to Figure 2, router B will calculate a path through router A to routers D, E, and C, even if router A is not designed to be a transit node in the network. This failure to differentiate between transit and nontransit nodes in the network graph increases the number of paths SPF must explore when calculating the shortest-path tree to all reachable destinations.

Although modern implementations of SPF do not suffer from problems with calculation overhead or processor usage, in large-scale environments, such as a data center network with tens of thousands of nodes in the shortest-path tree and virtualization requirements that cause a single node to run SPF hundreds or thousands of times, small savings in processing power can add up.

The "OSPF Stub Router Advertisement" [3] mechanism allows network administrators to mark an OSPF router as nontransit in the shortest-path tree. This mechanism would, for instance, prevent router A in Figure 1 from being considered a transit path between router B and some other router in spine 2. You would normally want to consider this option only for any actual edge routers in the network, such as the top-of-rack routers shown here. Preventing these routers from being used for transit can reduce the amount of redundancy available in the network, and, if used anyplace other than a true edge, prevent the network from fully forming a shortest-path tree.
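A small SPF sketch shows the effect. It models a stub router the way [3] is generally understood to work, by having the router advertise its own links at the maximum metric so that paths through it become unattractive while the router itself stays reachable. The four-node topology, link costs, and function name are illustrative assumptions, not taken from Figure 1.

```python
import heapq

MAX_LINK_METRIC = 0xFFFF  # assumed metric a stub router advertises on its links


def spf(links, source, stub_routers=frozenset()):
    """Dijkstra over links[node][neighbor] = cost.

    Returns {node: (cost, path)}. Links advertised BY a stub router are
    treated as costing MAX_LINK_METRIC.
    """
    best = {}
    queue = [(0, source, [source])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node in best:
            continue  # already settled with a cost no worse than this one
        best[node] = (cost, path)
        for neighbor, metric in links[node].items():
            if node in stub_routers:
                metric = MAX_LINK_METRIC
            if neighbor not in best:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return best


# Hypothetical fragment: A is a top-of-rack router, B and C sit in the
# next spine, X is one spine further in.
links = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "X": 2},
    "C": {"A": 1, "X": 2},
    "X": {"B": 2, "C": 2},
}

print(spf(links, "B")["C"])                      # transit through A
print(spf(links, "B", stub_routers={"A"})["C"])  # A avoided as transit
print(spf(links, "B", stub_routers={"A"})["A"])  # A itself still reachable
```

Without the stub marking, router B reaches router C through router A; with it, the path goes through X at a higher cost, and router A remains reachable at its normal cost of 1.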

Advantages and Disadvantages of Link-State Protocols in the Data Center

Beyond the obvious concerns of convergence speed and simplicity, there is one other advantage to using a link-state protocol in data center designs: equal-cost load sharing. OSPF and IS-IS both load share across all available equal-cost links automatically (subject to the limitations of the forwarding table in any given implementation). No complex extensions (such as [5]) are required to enable load sharing across multiple paths.
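Equal-cost load sharing falls out of SPF naturally, as the following sketch suggests: while running Dijkstra, a router keeps every first hop that leads to a destination at the lowest cost instead of only one. The function name and the small fabric fragment are hypothetical, and the sketch assumes strictly positive link costs.

```python
import heapq


def ecmp_next_hops(links, source):
    """Return {destination: set of first hops on equal-cost shortest paths}.

    links[node][neighbor] = cost; costs are assumed to be positive.
    """
    dist = {source: 0}
    hops = {source: set()}
    done = set()
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node in done:
            continue
        done.add(node)
        for neighbor, metric in links[node].items():
            new_cost = cost + metric
            # Leaving the source, the first hop is the neighbor itself;
            # further out, it is inherited from the node being expanded.
            via = {neighbor} if node == source else hops[node]
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                hops[neighbor] = set(via)
                heapq.heappush(queue, (new_cost, neighbor))
            elif new_cost == dist[neighbor]:
                hops[neighbor] |= via  # another path at the same cost
    return hops


# Hypothetical fragment: A connects to four routers in the next spine,
# each of which connects to Z.
fabric = {
    "A": {"B": 1, "C": 1, "D": 1, "E": 1},
    "B": {"A": 1, "Z": 1},
    "C": {"A": 1, "Z": 1},
    "D": {"A": 1, "Z": 1},
    "E": {"A": 1, "Z": 1},
    "Z": {"B": 1, "C": 1, "D": 1, "E": 1},
}

print(sorted(ecmp_next_hops(fabric, "A")["Z"]))
```

In this fragment router A ends up with four next hops toward Z, one per uplink; how many of them are actually installed depends on the forwarding table limits mentioned above.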

One potential downside to using a link-state protocol in a data center environment must be mentioned, however. Although BGP allows route filtering at any point in the network (because it is a path-vector protocol), link-state protocols can filter or aggregate reachability information only at flooding domain boundaries. This limitation makes it more difficult to manage traffic flows through a data center network using OSPF or IS-IS to advertise routing information. This problem has possible solutions, but this area is one of future, rather than current, work.

Conclusion

Many improvements have been made to link-state protocols over the years to improve their performance in specific situations, such as MANETs, and when interacting with dynamically created links or circuits. Many of these improvements are already deployed and tested in real network environments, so using them in a data center environment is a matter of application rather than new work. All of these improvements are applicable to link-state control planes used for Layer 2 forwarding, as well as Layer 3 forwarding, and they are applicable to OSPF and IS-IS.

These improvements, when properly applied, can make link-state protocols a viable choice for use in large-scale, strongly meshed data center networks.

References

[0] http://en.wikipedia.org/wiki/Clos_network

[1] Roy, A., "Extensions to OSPF to Support Mobile Ad Hoc Networking," RFC 5820, March 2010.

[2] Abhay Roy and Sira Panduranga Rao, "Detecting Inactive Neighbors over OSPF Demand Circuits (DC)," RFC 3883, October 2004.

[3] Alvaro Retana, Danny McPherson, Russ White, Alex D. Zinin, and Liem Nguyen, "OSPF Stub Router Advertisement," RFC 3137, June 2001.

[4] Petr Lapukhov and Ariff Premji, "Using BGP for routing in large-scale data centers," Internet Draft, work in progress, April 2013, draft-lapukhov-bgp-routing-large-dc-04

[5] Daniel Walton, John Scudder, Enke Chen, and Alvaro Retana, "Advertisement of Multiple Paths in BGP," Internet Draft, work in progress, December 2012, draft-ietf-idr-add-paths-08

[6] T. Sridhar, "Cloud Computing—A Primer, Part 1: Models and Technologies," The Internet Protocol Journal, Volume 12, No. 3, September 2009.

[7] T. Sridhar, "Cloud Computing—A Primer, Part 2: Infrastructure and Implementation Topics," The Internet Protocol Journal, Volume 12, No. 4, December 2009.

RUSS WHITE is a Principal Research Engineer at Verisign, where he works on the intersection of naming and routing. In the more than 20 years since he first began working in computer networking, he has co-authored 8 technical books and more than 30 patents; he has participated in the writing, editing, and guiding of numerous Internet Standards, and he has written a fiction novel. He is currently working on The Art of Network Architecture, to be published by Cisco Press in 2013. Russ splits his time between the Raleigh, N.C., area and Oak Island, N.C.; he teaches in a local homeschool coop and attends Shepherds Theological Seminary. He is a regular blogger and guest on the Packet Pushers podcast. E-mail: riwhite@verisign.com

ALVARO RETANA is a Distinguished Engineer in Cisco Technical Services, where he works on strategic customer enablement. Alvaro is widely recognized for his expertise in routing protocols and network design and architecture; he has CCIE® and CCDE® certifications, and he is one of a handful of people who have achieved the CCAR® certification. Alvaro is an active participant in the IETF, where he co-chairs the Routing Area Working Group (rtgwg), is a member of the Routing Area Directorate, and has authored several RFCs on routing technology. Alvaro has published 4 technical books and has been awarded more than 35 patents by the U.S. Patent and Trademark Office. His current interests include software-defined networking, energy efficiency, infrastructure security, routing protocols, and other related topics. E-mail: aretana@cisco.com