LDP (Label Distribution Protocol) graceful restart provides a control plane
mechanism to ensure high availability and allows detection and recovery from
failure conditions while preserving Nonstop Forwarding (NSF) services. Graceful
restart is a way to recover from signaling and control plane failures without
impacting forwarding.
Without LDP graceful
restart, when an established session fails, the corresponding forwarding states
are immediately cleaned up on both the restarting and peer nodes. In this case,
LDP forwarding restarts from the beginning, causing a potential loss of data
and connectivity.
The LDP graceful
restart capability is negotiated between two peers during session
initialization, in the FT Session TLV. In this type-length-value (TLV), each
peer advertises the following information to its peers:
- Reconnect time
The maximum time that the other peer will wait for this LSR to reconnect after
a control plane failure.
- Recovery time
The maximum time that the other peer has on its side to reinstate or refresh
its states with this LSR. This time is used only during session
reestablishment after an earlier session failure.
- FT flag
Indicates whether this LSR could preserve its (local) node state across a
restart.
Once the graceful
restart session parameters are conveyed and the session is up and running,
graceful restart procedures are activated.
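The negotiated parameters travel in the FT Session TLV. The sketch below shows a plausible encoding of its wire layout (based on RFC 3478, which carries FT flags, a reserved field, and the reconnect and recovery times in milliseconds); the helper names and the lack of error handling are illustrative, not an actual LDP implementation:

```python
import struct

FT_SESSION_TLV_TYPE = 0x0503  # FT Session TLV type code (RFC 3478)

def encode_ft_session_tlv(ft_flags: int, reconnect_ms: int, recovery_ms: int) -> bytes:
    """Pack an FT Session TLV: type, length, FT flags, reserved field,
    FT reconnect timeout and recovery time (both in milliseconds)."""
    body = struct.pack("!HHII", ft_flags, 0, reconnect_ms, recovery_ms)
    return struct.pack("!HH", FT_SESSION_TLV_TYPE, len(body)) + body

def decode_ft_session_tlv(data: bytes) -> dict:
    """Unpack a TLV produced by the encoder above (no validation in this sketch)."""
    tlv_type, length = struct.unpack("!HH", data[:4])
    ft_flags, _reserved, reconnect_ms, recovery_ms = struct.unpack(
        "!HHII", data[4:4 + length])
    return {"type": tlv_type, "ft_flags": ft_flags,
            "reconnect_ms": reconnect_ms, "recovery_ms": recovery_ms}
```

A peer receiving this TLV reads the reconnect and recovery times to arm the corresponding timers on its side of the session.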
When configuring the
LDP graceful restart process in a network with multiple links, targeted LDP
hello adjacencies with the same neighbor, or both, make sure that graceful
restart is activated on the session before any hello adjacency times out in
case of neighbor control plane failures. One way of achieving this is by
configuring a lower session hold time between neighbors such that session
timeout occurs before hello adjacency timeout. It is recommended to set LDP
session hold time using the following formula:
Session Holdtime <= (Hello holdtime - Hello interval) * 3
This means that for
default values of 15 seconds and 5 seconds for link Hello holdtime and interval
respectively, session hold time should be set to 30 seconds at most.
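As a quick sanity check of the formula above, a hypothetical helper (not a router command) that computes the recommended upper bound:

```python
def max_session_holdtime(hello_holdtime: int, hello_interval: int) -> int:
    """Recommended upper bound for the LDP session holdtime, in seconds:
        Session Holdtime <= (Hello holdtime - Hello interval) * 3
    Keeping the session holdtime at or below this value makes the session
    time out before the hello adjacency does."""
    return (hello_holdtime - hello_interval) * 3

print(max_session_holdtime(15, 5))  # default link hello values -> 30
```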
The graceful restart
mechanism is divided into different phases:
- Communication failure detection
A communication failure is detected when the system detects any of the
following: missed LDP hello discovery messages, missed LDP keepalive protocol
messages, or Transmission Control Protocol (TCP) disconnection with a peer.
- State maintenance during failure
Forwarding states at each LSR are preserved through persistent storage
(checkpointing) by the LDP control plane. While the control plane is in the
process of recovering, the forwarding plane keeps the forwarding states, but
marks them as stale. Similarly, the peer control plane also keeps (and marks as
stale) the installed forwarding rewrites associated with the node that is
restarting. The combination of local node and remote node forwarding
plane states ensures NSF and no disruption in the traffic.
- Control state recovery
Recovery occurs when the session is reestablished and label bindings are
exchanged again. This process allows the peer nodes to synchronize and to
refresh stale forwarding states.
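The stale-marking behavior in the phases above can be sketched as follows; the class and method names are illustrative, not an actual LDP implementation:

```python
class ForwardingState:
    """Toy model of an LSR's label forwarding entries during graceful restart."""

    def __init__(self):
        self.entries = {}  # local label -> {"rewrite": str, "stale": bool}

    def install(self, label, rewrite):
        self.entries[label] = {"rewrite": rewrite, "stale": False}

    def mark_all_stale(self):
        # Session failed: keep forwarding (NSF) but flag every entry stale.
        for entry in self.entries.values():
            entry["stale"] = True

    def refresh(self, label, rewrite):
        # Binding re-learned during recovery: reinstall and clear the stale flag.
        self.entries[label] = {"rewrite": rewrite, "stale": False}

    def purge_stale(self):
        # Recovery timer expired: drop whatever was never refreshed.
        self.entries = {lbl: e for lbl, e in self.entries.items()
                        if not e["stale"]}
```

Entries refreshed during the recovery window survive; anything still stale when the recovery timer expires is deleted, matching the control state recovery phase.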
When a control plane
failure occurs, connectivity can be affected. The forwarding states installed
by the router control planes are lost, and in-transit packets could be
dropped, thus breaking NSF. The following figures show the process and results
of a control plane failure leading to loss of connectivity, and of recovery
using graceful restart.
Figure 3. Control Plane Failure
Figure 4. Recovering with Graceful Restart
1. The R4 LSR control plane restarts.
2. The Label Information Base (LIB) is lost when the control plane restarts.
3. The forwarding states installed by the R4 LDP control plane are immediately
deleted.
4. In-transit packets flowing from R3 to R4 (still labeled with L4) arrive at
R4.
5. The forwarding plane at R4 performs a lookup on local label L4, which fails.
Because of this failure, the packet is dropped and NSF is not met.
6. The R3 LDP peer detects the failure of the control plane channel and deletes
its label bindings from R4.
7. The R3 control plane stops using outgoing labels from R4 and deletes the
corresponding forwarding state (rewrites), which in turn causes forwarding
disruption.
8. LSPs through R4 are terminated at R3, resulting in broken end-to-end LSPs
from R1 to R4.
9. LSPs through R4 are terminated at R3, resulting in broken end-to-end LSPs
from R2 to R4.
When the LDP control
plane recovers, the restarting LSR starts its forwarding state hold timer and
restores its forwarding state from the checkpointed data. This action
reinstates the forwarding state entries and marks them as stale.
The restarting LSR
reconnects to its peer and indicates, in the FT Session TLV, whether it was
able to restore its state successfully. If it was, the bindings are
resynchronized.
When the restarting peer reconnects, the peer LSR stops the neighbor reconnect
timer (started when the session failed) and starts the neighbor recovery
timer. The peer LSR checks the FT Session TLV to determine whether the
restarting peer was able to restore its state successfully. If so, it
reinstates the corresponding forwarding state entries and receives the
bindings from the restarting peer. When the recovery timer expires, any
forwarding state that is still marked as stale is deleted.
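The peer-side timer sequence described above can be sketched as a small state machine. This is a simplified illustration under assumed names and states; a real implementation drives these transitions from actual timers armed with the advertised reconnect and recovery times:

```python
from enum import Enum, auto

class PeerState(Enum):
    UP = auto()
    WAIT_RECONNECT = auto()   # reconnect timer running, state kept stale
    RECOVERING = auto()       # recovery timer running, bindings resynchronizing

class GracefulRestartPeer:
    """Toy peer-side view of LDP graceful restart timer handling."""

    def __init__(self, forwarding):
        # `forwarding` is any object with mark_all_stale() / purge_stale().
        self.state = PeerState.UP
        self.forwarding = forwarding

    def on_session_failure(self):
        # Keep forwarding state but mark it stale; start the reconnect timer.
        self.forwarding.mark_all_stale()
        self.state = PeerState.WAIT_RECONNECT

    def on_peer_reconnect(self, peer_restored_state: bool):
        # Stop the reconnect timer; the FT Session TLV says whether the
        # restarting peer preserved its state.
        if peer_restored_state:
            self.state = PeerState.RECOVERING  # start recovery timer, resync
        else:
            self.forwarding.purge_stale()      # nothing to recover
            self.state = PeerState.UP

    def on_reconnect_timeout(self):
        # Peer never came back within the advertised reconnect time.
        self.forwarding.purge_stale()
        self.state = PeerState.UP

    def on_recovery_timeout(self):
        # Recovery window closed: delete anything still stale.
        self.forwarding.purge_stale()
        self.state = PeerState.UP
```

Whichever timer fires first without a successful resynchronization, the result is the same: stale forwarding state is purged and the peer returns to normal operation.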
If the restarting
LSR fails to recover (restart), its forwarding state entries eventually time
out and are deleted, while neighbor-related forwarding state entries are
removed by the peer LSR on expiration of the reconnect or recovery timers.