The Internet Protocol Journal - Volume 7, Number 1

High Availability in Routing

by Russ White, Cisco Systems

A network is a complex system of interacting pieces, as anyone who has ever worked with a large-scale network "in the wild" can tell you. So, when businesses begin asking for a network that can converge in something under 1 second, especially in a large network, network engineers begin to scratch their heads, and wonder what their counterparts in the business world are thinking about. Just about everyone in the network engineering business knows scale and speed are, generally speaking, contradictory goals. The faster a network converges, the less stable it is likely to be; fast reactions to changes in the network topology tend to create positive feedback loops that result in a network that simply will not converge.

But recent experience has shown that subsecond convergence in a network—even a large network in the wild—is definitely possible. How do we go about building a large-scale network that can converge in times that were, until recently, considered impossible, or improbable at best? We approach the problem the same way we approach network systems themselves: we break the problem down into smaller pieces and try to solve each piece individually. When we have solved each of the smaller pieces, we recombine them and see what needs to be adjusted to make it all work together properly.

What pieces of a network do we need to be concerned about when considering subsecond (fast) convergence? Generally, we are concerned with the physical layer (how fast can a down link be detected?), routing protocol convergence (how fast can a routing protocol react to the topology change?), and finally, forwarding (how fast can the forwarding engine on each router in the network adjust to the new paths calculated by the routing protocol?). This article focuses on routing protocol convergence, with some discussion of fast down detection as well, specifically the interior gateway protocols: Enhanced Interior Gateway Routing Protocol (EIGRP), Intermediate System-to-Intermediate System (IS-IS), and Open Shortest Path First (OSPF).

Network Meltdowns
Before beginning to tune a network to converge quickly, we need to set some realistic expectations. As mentioned previously, a routing protocol configured to react very quickly to changes in network topology tends to develop positive feedback loops, which result in a network that will not converge at all. Using the following example, consider how a single problem can produce feedback that causes a failure to cascade through the network.

Figure 1: Positive Feedback Loops in a Network



Suppose the link between routers D and G flaps, meaning that it cycles between the down and up states slowly enough for a routing adjacency to be formed across the link, or for the new link to be advertised as part of the topology, but too quickly for the link to actually be used. In this situation, the adjacency (or neighbor relationship) between routers D and G forms and tears down as quickly as the routing protocol will allow.

While this is occurring, the routing information at routers E, F, and G is changing as quickly as the adjacency between D and G can form and tear down. This change in routing information is, in turn, passed on to C, which then must process it as fast as it possibly can. It is possible that the routing information presented to router C will overwhelm its processor, causing router C to fail, or to drop its neighbor adjacencies.

At the same time, the constantly changing routing information at router B will also cause problems, possibly causing it to periodically drop its adjacencies, specifically with routers C and D. At this point, if routers B, C, and D are all consuming a large amount of memory and processing power adjusting to apparent topology changes caused by changing adjacency states, the flapping link between routers D and G, which originally caused the problem, can be removed from the network, and the routing protocol will still not converge. This is what network engineers consider a classic meltdown in the routing system.

Solving the Meltdown
Typically, when a network engineer faces a network in this condition, the first step is to simply remove routing information from the system until the network "settles." This typically involves removing parallel (redundant) links from the view that the routing protocol has of the topology until the routing protocol converges. At this point, the network would be examined, routers reloaded as needed, and the parallel links brought back up. The network design might then be reviewed, in an attempt to prevent recurrence of a meltdown.

Routing protocol designers and developers would also like to move the point at which a routing protocol "melts" as far out along the curve of network size and complexity as possible.

Of course, it is impossible to prevent all network meltdowns through protocol design; there are limits in any system where the implementation steps outside the "state machine," and the system will simply fail. But how would a routing protocol designer work around this sort of a problem in the protocol itself? The answer is actually very simple: Slow down.

The main problem here, from a protocol designer's point of view, is that routers D and G are simply reacting too fast to the changing topology. If they were to react more slowly, the network would not fall into this positive feedback loop, and the network would not melt. And, in fact, slowing down is really quite simple. Various methods of slowing down include:
Not reporting all interface transitions from the physical layer up to the routing protocol. This is called debouncing the interface; most interface types wait some number of milliseconds before reporting a change in the interface state (a simple sketch of this idea follows the list).
Using slow neighbor down timers. For instance, the amount of time a router waits without hearing from a given neighbor before declaring that the neighbor has failed is generally on the order of tens of seconds in most routing protocols. The dead timer does not affect down-neighbor detection on point-to-point links, because when the interface fails, the neighbor is assumed to be down, but there are other "slow-down" timers here, as well.
Slowing down the distribution of information about topology changes.
Slowing down the rate at which the routing protocol reacts to information about topology changes.
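
To make the first method concrete, here is a minimal Python sketch of interface debouncing. It is an illustration only, not how any particular router implements the behavior; the class and callback names, and the 100-millisecond delay, are assumptions made for the example.

import threading

class DebouncedInterface:
    # Report an interface state change only after the interface has
    # held its new state for the debounce delay (a simplified model;
    # the 100 ms figure is an arbitrary example value).

    def __init__(self, report_change, delay_ms=100):
        self.report_change = report_change   # called with the settled state
        self.delay = delay_ms / 1000.0
        self.state = "up"
        self._timer = None

    def on_transition(self, new_state):
        # Cancel any pending report; the interface is still bouncing.
        if self._timer is not None:
            self._timer.cancel()
        # Report the new state only if it survives the delay.
        self._timer = threading.Timer(self.delay, self._settle, [new_state])
        self._timer.start()

    def _settle(self, new_state):
        if new_state != self.state:
            self.state = new_state
            self.report_change(new_state)

# A link that flaps faster than the debounce delay is never reported:
iface = DebouncedInterface(lambda state: print("interface is now", state))
iface.on_transition("down")
iface.on_transition("up")   # cancels the pending "down" report

A flap that completes within the delay produces no report at all; a transition that holds produces exactly one.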

All four of these methods are typically used in routing protocol design and implementation to provide stability within a routing system. For instance:
In IS-IS, a timer regulates how often an intermediate system (router) may originate new routing information, and how often a router may run the shortest path first (SPF) algorithm used to calculate the best paths through the network.
In OSPF, similar timers regulate the rate at which topology information can be transmitted, and how often the shortest path first algorithm may be run.
In EIGRP, the simple rule: "no route may be advertised until it is installed in the local routing table" dampens the speed at which routing information is propagated through the network, and routing information is also paced when being transmitted through the network based on the bandwidth between two routers.

It seems like the simplest place to look when trying to decrease the time a routing protocol requires to converge, then, is at these sorts of timers: reduce the amount of time an interface waits before reporting the transition to a down state, reduce the amount of time a router must wait before advertising topology information, and so on. But when we implement such changes, we remove much of the stability we have all come to expect in routing systems—the size to which a network can grow without melting down decreases below an acceptable threshold, even with modern processors, more memory, and implementation improvements in place.

There is another place to attack this problem: the frequency of changes within the network. This is the same concept—speed—from a different angle. How does looking at it from a different angle help us? By allowing us to see that it is not the speed of the network changes that causes the positive feedback loop, but rather how often the changes take place. If we could report changes quickly when they occur slowly, report them more slowly when they occur quickly, or simply not report some events at all, routing could converge much faster and still provide the stability we expect.

The two options we want to examine, then, are not reporting every event, and slowing down as the network speeds up. First we will discuss these two options, and then discuss speeding up the reporting of network events, which plays a large role in decreasing convergence times.

Do Not Report Everything You See (NSF and GR)
It sounds simple just to say that a router should not report every event within the network it is aware of, but it becomes more complicated as we consider the issues involved. What we need to do is sort out which events are important, in some sense, and which are not. For instance, if a router loses contact with an adjacent router because the adjacent router restarted for some reason, do not report the resulting change in topology until you are certain the neighbor is not coming back.

But the classic questions follow: How long do you wait before deciding the problem is real? What happens to traffic you would normally forward to that neighbor while you are waiting? And finally, how do you reconnect in a way that allows the network to continue operating correctly? Two technologies recently incorporated in routing protocols, Graceful Restart (GR) and Non-Stop Forwarding (NSF), combine to answer these questions.

Let's start at the bottom of the Open Systems Interconnection (OSI) model, at the physical and data link layers, and discuss the second question, what happens to traffic that would normally be forwarded while a router is restarting? Normally, this traffic would be dropped, and any applications impacted would need to retransmit lost data. How could we prevent this? We can take advantage of the separation between the control plane and the forwarding plane in a large number of modern routers.

Figure 2: Control and Data Plane Interaction in a Router



In some routers, such as the Cisco® 12000, 10000, 7600, and others, the actual switching, or forwarding, of packets is performed by different processors and physical circuitry than the control plane processes run on (such as routing protocol processes, routing table calculation, and other processes). Therefore, if the control plane fails or restarts for any reason, the data plane could continue forwarding traffic based on the last known good information.

NSF, implemented through Stateful Switchover (SSO) and Stateful Switchover+ (SSO+) in Cisco products, allows this continuous forwarding, regardless of the state of the control plane, to take place. Normally, when the control plane resets, it sends a signal to the data plane that it should clear its tables out, and reset, as well. With NSF enabled, this signal from the control plane simply acts as a signal to mark the current data as stale, and to begin aging the information out.
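
As a rough sketch of this stale-and-age behavior, consider the following Python model. It is not Cisco's implementation; the table structure and the 120-second aging period are assumptions made for illustration.

import time

class ForwardingTable:
    # A minimal model of an NSF-capable data plane: a control plane
    # restart marks entries stale instead of flushing them, and stale
    # entries keep forwarding until they age out or are refreshed.

    MAX_STALE_SECS = 120   # assumed aging period for the example

    def __init__(self):
        self.routes = {}   # prefix -> (next_hop, stale_since or None)

    def install(self, prefix, next_hop):
        # A (re)converged control plane installs or refreshes entries.
        self.routes[prefix] = (next_hop, None)

    def control_plane_restarted(self):
        # With NSF: do not clear the table; mark everything stale.
        now = time.monotonic()
        self.routes = {p: (nh, now) for p, (nh, _) in self.routes.items()}

    def lookup(self, prefix):
        entry = self.routes.get(prefix)
        if entry is None:
            return None
        next_hop, stale_since = entry
        if stale_since is not None and \
                time.monotonic() - stale_since > self.MAX_STALE_SECS:
            del self.routes[prefix]   # aged out before being refreshed
            return None
        return next_hop

Without NSF, control_plane_restarted() would simply clear the table, and every packet would be dropped until the routing protocol reconverged.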

Now we need to be able to bring the control plane back up, resynchronize the routing protocol databases, and rebuild the routing table, all without disturbing the packets still being switched by the data plane on the router. This is accomplished through GR. GR starts by assuming two critical things:

The normal hold times are acceptable, within this network environment, for reporting a network event or topology change. In other words, if a router's control plane fails, the event wouldn't be reported until the routing protocol's default hold or dead timers expire, whether or not GR is configured.
The control plane on the router can reload and begin processing data within the hold or dead time of the routing protocol.

Let's examine how, in principle, GR works, so we can put these two requirements into context, and understand where GR is best deployed in a live network. Consider the following figure, which shows how GR works between two peers of any generic routing protocol.

Figure 3: The Process of Graceful Restart



When two routers begin forming an adjacency (or neighbor relationship, or begin peering, depending on which routing protocol is being run between them), they exchange some form of signaling noting that they are capable of understanding GR signaling, and responding to it correctly.

[Note that this does not imply the router is GR-capable, only that it can support a neighboring router performing a GR. For instance, the Cisco 7200 supports switching modes only where the control and data planes are not cleanly separated, so it cannot fully support GR. It can, however, support the signaling necessary for a neighboring router to gracefully restart.]

Assume some time passes, and router B is transmitting Hello packets to router A normally, on a periodic basis. Each time router A receives one of these Hello (or keepalive) packets, it resets the hold, or dead, timer for router B, indicating that it should wait that amount of time before declaring router B down if it stops receiving Hellos. Now, at some point, after sending a Hello packet, the router B control plane resets. While the control plane is down, the router A hold timer is still counting down; the routing protocol does not reset the session. This is, in fact, normal routing protocol operation, which would normally result in the packets forwarded by router A toward router B being dropped. Because router B is NSF-capable, however, its data plane is still forwarding this traffic to the correct destination, even though the control plane is down.

When the router B control plane comes back up, it begins sending Hellos to router A again, before the router A hold timer expires. When router A receives such a Hello, it acts as though it has received a normal Hello, and simply keeps its adjacency with router B up. In other words, although router B may not know what the network it is connected to looks like at this point, router A does not report the failure to the rest of the network. Convergence time is, from a network standpoint, effectively reduced to 0.

When the router B control plane completes its reset, it then signals router A to begin resynchronizing their databases. The two routers then use some method specific to each protocol to resynchronize their databases, and begin operating normally, in a stable condition once again.
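
The whole exchange can be summarized in a short, protocol-neutral Python sketch. The hold time, method names, and the restart flag are generic placeholders; each real protocol carries its restart signaling differently (OSPF, for example, uses grace LSAs).

import time

HOLD_TIME = 40.0   # seconds; an OSPF-like dead interval, for example

class GracefulNeighbor:
    # One router's view of a GR-capable adjacency.

    def __init__(self, report_down):
        self.report_down = report_down   # notifies the rest of the network
        self.deadline = time.monotonic() + HOLD_TIME
        self.restarting = False

    def on_hello(self, restart_flag=False):
        # Every Hello pushes the hold (dead) timer out again.
        self.deadline = time.monotonic() + HOLD_TIME
        if restart_flag:
            # The neighbor's control plane has restarted: keep the
            # adjacency up, resynchronize, and tell no one.
            self.restarting = True

    def on_resync_complete(self):
        self.restarting = False   # back to normal, stable operation

    def tick(self):
        # Called periodically; only a hold timer that expires with no
        # restart in progress is reported as a real topology change.
        if time.monotonic() > self.deadline and not self.restarting:
            self.report_down()

Note that the sketch reports nothing as long as Hellos resume before the deadline, which is exactly the second assumption above: the control plane must reload within the hold or dead time.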

Slow Down When the Network Speeds Up
The second option we discussed originally was to attack the problem by reducing the frequency, rather than the number, of updates. What we want to do is slow down the reporting of events when they occur more frequently (or when they occur rapidly), and speed up the reporting of events when they occur less frequently (or when they occur slowly). This is possible through a series of features built into Cisco IOS® Software within the last year or two, applying the concept of exponential timers.

An exponential timer changes the amount of delay between an event occurring and the reporting of that event by the frequency at which the event occurs—possibly not reporting the event at all, in some situations. Two implementations of exponential timers are exponential backoffs and dampening. Let's examine each of these individually, and then consider where they are implemented in Cisco IOS Software.

Exponential Backoffs
Consider the following figure to examine how exponential backoff works.

Figure 4: Exponential Backoff



When the first event occurs, a timer is set to the initial time, 1 second in this case, meaning that the router waits for one second before notifying other routers in the network about the event. When the notification is sent, the router adds the increment (also 1 second in this example) to the initial timer, and sets a timer for this period. We call this timer the backoff timer.

When the second event occurs, the backoff timer is still running; the router waits until this timer expires to send the notification about this event occurring. When this notification is sent, the backoff timer is set to twice the previous setting or the maximum backoff time, whichever one is shorter. In this case, doubling the backoff timer results in 4 seconds, so it is set to 4 seconds.

When the third event occurs, the backoff timer is still running; the router waits until the timer expires before sending any notification of the event occurring. Again, the timer is doubled, this time to 8 seconds, and compared to the maximum time, which is 5 seconds. The shorter of the two times is taken, so the backoff timer is now set for 5 seconds.

At this point, any future events will be reported only at 5-second intervals, as long as one event occurs at least every 5 seconds. If no events occur for an interval of 10 seconds, the timers are all reset to their initial condition, so the initial timer is set to 1 second, and the backoff timer is not set at all.
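
This walk-through translates directly into code. Here is a minimal Python sketch using the values from the example; the 1-second increment is inferred from the figure (1 second initial plus 1 second increment gives the 2-second timer that doubles to 4).

class ExponentialBackoff:
    # Sketch of the event-reporting backoff described above.

    INITIAL = 1.0     # seconds to wait before the first report
    INCREMENT = 1.0   # inferred: 1s initial + 1s increment = 2s timer
    MAXIMUM = 5.0     # the backoff timer is never set longer than this
    QUIET = 10.0      # a quiet period this long resets everything

    def __init__(self):
        self.backoff = self.INITIAL
        self.timer_expires = None   # backoff timer; None if not running
        self.last_event = None

    def report_time(self, now):
        # Returns the time at which an event occurring at `now`
        # will be reported to the rest of the network.
        if self.last_event is not None and now - self.last_event >= self.QUIET:
            self.backoff = self.INITIAL    # quiet network: start over
            self.timer_expires = None
        self.last_event = now

        if self.timer_expires is None:
            # First event of a burst: wait the initial time, then set
            # the backoff timer to initial + increment.
            sent = now + self.INITIAL
            self.backoff = self.INITIAL + self.INCREMENT
        else:
            # Backoff timer running: hold the report until it expires,
            # then double the timer, capped at the maximum.
            sent = max(now, self.timer_expires)
            self.backoff = min(self.backoff * 2, self.MAXIMUM)
        self.timer_expires = sent + self.backoff
        return sent

b = ExponentialBackoff()
print([b.report_time(t) for t in (0, 2, 5)])   # [1.0, 3.0, 7.0]

Tracing the example: the first event is reported at 1 second, the second at 3 seconds (when the 2-second backoff timer expires), and the third at 7 seconds, after which the timer is pinned at the 5-second maximum.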

Dampening
Dampening, or damping, is another exponential timer mechanism, similar to the exponential backoff algorithm we examined previously. The primary difference is that dampening is applied to events that have a Boolean component: a route that is either advertised or withdrawn, an interface that is either up or down, and so on. Exponential backoff simply deals with events in general, whereas dampening assigns a value based on the type of event, as well as the frequency at which the event occurs. Consider the following figure to understand dampening.

Figure 5: Dampening Over Time



In dampening, the desirability of reporting an event is set using the penalty; the higher the penalty applied to a given item, such as a route or an interface, the less desirable it is to advertise changes in the state of that item. Dampening always leaves the item in the "off," or "down," state when it stops reporting state changes; this is called the dampened state. A penalty is normally added when transitioning from "down" to "up" in most dampening systems.

Here, we start at time 0, with a penalty of 0; when the first event occurs, a penalty of 1000 is added, making the total penalty 1000. As time passes without another event occurring, the penalty is decreased, based on the half life. Each time the half life passes, in this case 15 seconds, the current penalty is halved; so after 15 seconds, the penalty has decayed to 500.

A few seconds later, while the penalty is still decreasing, the second event occurs; 1000 is added to the current penalty, making the total penalty 1400. Again, as time passes, the penalty decays exponentially, reaching 1000 before the third event occurs. When the third event occurs, 1000 is again added to the total penalty, so it reaches 2000, which is above the damp threshold; future events are dampened by simply leaving the interface or route in the down state.

Again, as time passes, the penalty is cut in half for each passing half life, reaching 1100 before the fourth event occurs. When the fourth event occurs, 1000 is again added, making the penalty 2100, and leaving us in the dampened state until the penalty decays again. Over time, the penalty finally drops to 1000 (at around 60 seconds in the example), which is the reuse threshold. At this point, state changes in the item being tracked are once again reported as they occur, unless the penalty reaches the damp threshold at some future point.
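
A minimal Python sketch of this accounting follows, using the figure's values (a penalty of 1000 per event, a 15-second half life, a damp threshold of 2000, and a reuse threshold of 1000). This illustrates the mechanism only; it is not the exact arithmetic of any particular implementation.

import math

class Dampener:
    # Tracks the penalty for one item (a route or an interface).

    PENALTY = 1000.0     # added on each state change
    HALF_LIFE = 15.0     # seconds for the penalty to halve
    DAMP = 2000.0        # stop reporting changes above this
    REUSE = 1000.0       # resume reporting below this

    def __init__(self):
        self.penalty = 0.0
        self.last_update = 0.0
        self.dampened = False

    def _decay(self, now):
        # Exponential decay: the penalty halves once per half life.
        elapsed = now - self.last_update
        self.penalty *= math.exp(-elapsed * math.log(2) / self.HALF_LIFE)
        self.last_update = now
        if self.dampened and self.penalty < self.REUSE:
            self.dampened = False    # changes are reported once again

    def on_event(self, now):
        # Returns True if this state change should be reported, or
        # False if the item stays in the dampened ("down") state.
        self._decay(now)
        self.penalty += self.PENALTY
        if self.penalty >= self.DAMP:
            self.dampened = True
        return not self.dampened

Replaying the figure: the first three events push the penalty to roughly 1000, 1400, and 2000; the third crosses the damp threshold, so that change, and the one after it, are suppressed until the penalty decays back below the reuse threshold.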

So, dampening reacts to events by simply not reporting events if they occur too frequently, whereas exponential backoff reacts to events by reporting each event that occurs, but slowing down the reporting of events as they occur more frequently.

Speeding Up the Reporting of Events
Once we have methods in place to prevent a network meltdown when events occur, we can consider ways to discover events faster. These techniques are used primarily in conjunction with exponential backoff and dampening.

There are two ways to detect a down neighbor or link: polling and event-driven notification. We will briefly discuss each of these, along with some of the techniques available in each case.

Polling
One method commonly used for detecting a link or adjacency failure is polling, or periodically sending Hello packets to the adjacent device, and expecting a periodic Hello packet in return. The speed at which Hello packets are transmitted and the number of Hello packets missed before declaring a link or adjacency as failed are the two determining factors in the speed at which polling can discover a failed link or device.

Normally, a neighbor or link is declared down if three Hello packets are lost, meaning that the hold time, or the dead time, will always be about three times the Hello time, or polling interval. Normally, for Layer 2 links and routing protocols, the Hello interval is measured in seconds. For instance:
EIGRP running over a point-to-point link sends one Hello every 5 seconds, and declares a neighbor down if no Hellos are heard for 15 seconds.
EIGRP running over a lower-speed point-to-multipoint link sends one Hello every 60 seconds, and declares a neighbor down if no Hellos are received in 180 seconds.
OSPF normally sends a Hello every 10 seconds, and declares a neighbor down if no Hellos are heard for 40 seconds.
Frame Relay Local Management Interface (LMI) messages, the equivalent of a Hello, are transmitted every 10 seconds. If an LMI is not received in 30 seconds, the circuit is assumed to have failed.
High-Level Data Link Control (HDLC) keepalive messages are transmitted every 10 seconds. If a keepalive message is not received within 30 seconds, the circuit is assumed to have failed.

Fast Hellos can decrease these timers to Hello intervals on the order of 300 milliseconds, and dead timers of around 1 second.
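
The polling logic itself is only a hold timer; here is a brief Python sketch with OSPF-like defaults. Fast Hello support is simply a matter of shrinking the two constants (to roughly 0.3 seconds and 1 second).

import time

HELLO_INTERVAL = 10.0   # OSPF-like default Hello interval
DEAD_INTERVAL = 40.0    # declared down after this long without Hellos

class PolledNeighbor:
    # Detects a dead neighbor purely by the absence of Hellos.

    def __init__(self):
        self.last_hello = time.monotonic()

    def on_hello(self):
        # Each received Hello restarts the dead (hold) timer.
        self.last_hello = time.monotonic()

    def is_dead(self):
        # Worst-case detection time is the full dead interval, which
        # is why shrinking these timers speeds up convergence.
        return time.monotonic() - self.last_hello > DEAD_INTERVAL

The cost of shrinking the timers is the processing load described next: every neighbor must be polled, so the work grows with both the number of neighbors and the Hello rate.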

The primary problem with fast Hellos is scaling, particularly in receiving and processing fast Hellos from a large number of neighboring routers. For instance, if a router has 1000 neighbors and is using a Hello interval of 330 milliseconds, the router has to be able to receive and process 3000 Hellos per second and send 1000 Hellos per second. Timers in this range leave little room for processes that consume a router processor for long periods of time, short-term packet loss on a link due to congestion, and other factors.

Event Driven
Rather than polling at a fast interval, event-driven notifications rely on devices within the network that can sense the state of a link through lower layers (electrical, electronic, or optical state) to notify the routers attached to the link when the link has failed. SONET links are probably the best example of media with built-in capabilities for sensing link failures and notifying attached devices. This Tech Note on Cisco Online:

http://www.cisco.com/en/US/tech/tk482/tk607/technologies_tech_note09186a0080094522.shtml

... provides information about SONET alarms. There are also techniques that can be used to speed up the reporting of failed links in Frame Relay circuits, and techniques are being developed for allowing switches to notify devices attached to an Ethernet VLAN about a loss of connection to an attached device.

Implementations
Now that we have discussed what exponential backoff and dampening are, we can consider how they are implemented, and how their implementation helps you build highly available networks (through fast convergence) without risking major network instability along the way. We start by examining where dampening is implemented, and then follow that with a discussion about where exponential backoff is implemented. These sections do not provide a great deal of detail on the implementation of these features; vendor documentation and other sources of information (such as the forthcoming book Designing to Scale) should be consulted for technical details.

Dampening
Dampening is currently implemented in two places:
Border Gateway Protocol (BGP) route flap dampening
Interface dampening

BGP route flap dampening is a well-known technology, deployed in the Internet on a wide scale to increase the stability of the Internet routing table.

Interface dampening allows the network engineer to prevent rapidly flapping interfaces from having a wide-ranging impact on the entire network. When an interface fails and comes back up numerous times within a short time period, the interface is placed in the down state from an IP perspective, and not advertised within routing protocols, or used for forwarding packets.

It is important to note that the interface is allowed to change states freely at Layer 2; an interface that continues to change state rapidly continues to accumulate penalties, and continues to appear as down to the IP subsystem.

Exponential Backoff
Exponential backoff is implemented in several places in link state protocols at this point, including:
The ability to exponentially back off the amount of time between a change in the network topology being detected and the transmission of a link state packet reporting the change; exponential backoff has been applied to the link state generation timer.
The ability to exponentially back off the amount of time between receiving a link state packet reporting a change in the network topology, and running SPF to recalculate the path to each reachable destination in the network; exponential backoff has been applied to the SPF timer.

Fast Hellos
Each routing protocol has a different limit on how quickly Hellos can be transmitted and how often they must be received for a neighbor to be considered alive. OSPF and IS-IS have implemented the fastest Hellos, with a minimum Hello interval of 330 milliseconds and a dead interval of 1 second.

EIGRP can run with Hellos as fast as one per second, with a 3-second dead time. BGP can use similar timers, with a keepalive interval of 1 second.

Caution should be used when configuring Fast Hellos on a network. Congestion, high processor use, and other problems can cause false down indications that may cause higher rates of network failure than would normally occur.

Deploying GR and Fast Convergence Technologies
We now have a full range of options we can use to improve network availability, including GR and NSF, event dampening, and fast convergence techniques. How can we deploy these in a real network to improve network uptime? Generally, the technologies can be placed in one of three categories:

Fast reaction to node or link failure, to route around the failure. We use Layer 2 techniques and Fast Hellos to quickly determine when an adjacent node, or a link to that node, has failed.
Slow reaction to node or link failure, combined with routing through the failure. We rely on moderate-speed reactions to node failures to allow resynchronization of routing data while forwarding of traffic continues.
Fast recalculation of the best path when a topology change has been reported.

As we can see, the first two approaches are mutually exclusive; we could not deploy both of them in the same place in the network. The third, fast recalculation, can be deployed with either (or both of) the fast reaction and slow reaction techniques to increase network availability. The primary question then becomes: Which of these two techniques do you deploy in your network, and where?

The basic premise behind making this design decision follows:
If there is a fast, online backup available to a node or a link, it probably makes more sense to route around any problems that occur as rapidly as possible.
If any existing backup is going to take a good deal of time to bring online, or there is no backup path (such as a single-homed remote office, or a remote office with dial backup), it probably makes more sense to route through any problems.

In general, then, we want to deploy techniques that improve network convergence time everywhere—techniques that reduce the time between a failure occurring, the failure being detected, and a new path being calculated. At the same time, we want to evaluate each point in the network we would like to protect from failure, and determine the best means of protecting it: redundancy with fast down detection, or GR and NSF.

Fast, stable networks are possible with today's routing techniques; some large networks, with several hundred routers, measure their convergence times in milliseconds, with 1 second as their outside convergence goal.

RUSS WHITE is on the Cisco Systems Routing DNA Team in Research Triangle Park, North Carolina, specializing in the deployment and architecture of routing protocols. He has coauthored books on routing protocols, network design, and router architecture, regularly speaks at the Cisco Networkers conference, and is active in the Internet Engineering Task Force. Russ can be reached at riw@cisco.com. This article offers a high-level overview of material covered in depth in a forthcoming network design book, Designing to Scale, being published through Cisco Press.