Overview of High Availability Version 3
Feature History Table
Release | Description |
---|---|
Cisco IOS XE Gibraltar 16.11.1 | High availability version 3 introduced. |
Cisco IOS XE Amsterdam 17.3.1 | The |
The High Availability feature is supported for Cisco CSR 1000V Routers running on Microsoft Azure, Google Cloud Platform (GCP), and Amazon Web Services (AWS). A typical use case for the Cisco CSR 1000V is to interconnect two subnets within a virtual network. You can deploy Cisco CSR 1000V routers between the front-end (public) and the back-end (private) subnets. The Cisco CSR 1000V router represents a single point of failure for access to back-end resources. To mitigate this single point of failure, you must deploy two Cisco CSR 1000V routers between the two subnets.
The back-end subnet contains a routing table with entries pointing to the next-hop router, which is one of the two Cisco CSR 1000V instances. The peer Cisco CSR 1000V routers communicate with one another over a tunnel using the Bidirectional Forwarding Detection (BFD) protocol. If the connection between a router and its peer is lost, BFD generates an event. This event causes the router that is still operating to update the entries in the route table so that the affected routes, such as the default route, point to it as the next hop.
The routing table controls the upstream traffic of the Cisco CSR 1000V router and the routing protocol configured on the router determines the path of the downstream traffic.
In cloud environments, it is common for virtual networks to implement a simple routing mechanism that is based on a centralized route table. However, you can also create multiple route tables, where each route table has a subnet assigned. This subnet acts as the source of route information, and the route table is populated automatically with one or more individual routes, depending on the network topology. You can also configure routes in the route table yourself.
A subnet has a centralized route table, which allows two Cisco CSR 1000V routers to operate in a redundant mode. You can deploy two Cisco CSR 1000V routers in the same virtual network with their interfaces directly connected to subnets in the virtual network. You can add routes to the route table to point to one of the two redundant Cisco CSR 1000V routers. At any given time, one of the two Cisco CSR 1000V routers serves as the next-hop router for a subnet. This router is the active router for the subnet. The peer router is referred to as the passive router. The active router is the next hop for a given route destination.
The Cisco CSR 1000V router uses the Bidirectional Forwarding Detection (BFD) protocol to detect whether a peer router is operating properly. An IP tunnel is created between the two peer routers, and each router periodically sends a BFD protocol message to the other router. If a router fails to receive a BFD message from its peer for a specific period, that router concludes that the peer router has failed.
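The timeout behavior described above can be sketched as a small liveness monitor. This is an illustrative model only, not the router's BFD implementation: the class name `PeerMonitor`, its fields, and the example timer values are hypothetical, and the "specific period" is assumed to be the message interval multiplied by a tolerated miss count, as is usual for BFD.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Tracks peer liveness from the arrival time of its last message."""

    interval_s: float          # assumed spacing between peer messages
    multiplier: int            # assumed number of missed intervals tolerated
    last_seen_s: float = 0.0   # timestamp of the most recent peer message

    @property
    def detection_time_s(self) -> float:
        # Silence longer than this is treated as a peer failure.
        return self.interval_s * self.multiplier

    def record_message(self, now_s: float) -> None:
        """Call whenever a message from the peer arrives."""
        self.last_seen_s = now_s

    def peer_failed(self, now_s: float) -> bool:
        """True once the peer has been silent longer than the detection time."""
        return (now_s - self.last_seen_s) > self.detection_time_s


monitor = PeerMonitor(interval_s=0.5, multiplier=3)
monitor.record_message(now_s=10.0)
print(monitor.peer_failed(now_s=11.0))  # False: 1.0 s of silence, limit is 1.5 s
print(monitor.peer_failed(now_s=12.0))  # True: 2.0 s of silence
```

Passing timestamps in explicitly, rather than reading the clock inside the class, keeps the sketch deterministic and easy to test.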
If the active router fails, the route table for the subnet can be dynamically updated to change the next hop address for one or more routes so that they refer to the passive router. If the peer router detects the failure of the active router, the peer router uses the programmatic API to update the route table entries.
For a route table entry, configure which of the two Cisco CSR 1000V routers is the “primary” router. The other router is the passive router if it is configured as a “secondary” router. By default, all routes are configured as secondary.
The subnet on the right has an address block of 10.1.0.0/24. The two Cisco CSR 1000V routers that are connected to this subnet provide a redundant path for traffic leaving this leaf subnet. The subnet is associated with a route table which provides the route information to the virtual machines attached to the subnet.
Consider this scenario: Initially, the default route in the route table has the IP address of CSR A (10.1.0.4) as its next hop. All the traffic leaving the subnet goes through CSR A, which is currently the active router for the default route. When CSR A fails, CSR B detects the failure because it stops receiving BFD protocol messages from CSR A. CSR B writes to the route table through a REST API to change the next hop of the default route to the interface of CSR B on the 10.1.0.0/24 subnet, which has the IP address 10.1.0.5. CSR B then becomes the active router for the default route.
Step | Description |
---|---|
A | CSR A with address 10.1.0.4 is the active router for the 10.1.0.0 network. |
B | CSR A fails. CSR B detects the failure using the BFD protocol. |
C | CSR B uses an HTTP request to the Azure REST API. |
D | Azure updates the 10.1.0.0 route in the user-defined route table to the IP address of CSR B. |
E | Virtual machines see the route table update. |
F | Packets from the virtual machines are now directed to CSR B. |
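The failover steps above can be sketched with an in-memory stand-in for the route table. This is a simplified illustration under stated assumptions: the dictionary replaces the cloud route table, the function `on_peer_failure` is hypothetical, and the prefix `0.0.0.0/0` is assumed to represent the default route. A real deployment would update the table through the cloud provider's API.

```python
# In-memory stand-in for the cloud route table (assumption for illustration).
# Step A: the default route points at CSR A.
route_table = {"0.0.0.0/0": "10.1.0.4"}

CSR_B_NEXT_HOP = "10.1.0.5"  # interface of CSR B on the 10.1.0.0/24 subnet


def on_peer_failure(table: dict, route: str, my_next_hop: str) -> bool:
    """Point `route` at this router. Returns True if the table changed."""
    if table.get(route) == my_next_hop:
        return False  # already the next hop, nothing to write
    table[route] = my_next_hop  # steps C and D: request and route update
    return True


# Step B: CSR B has detected that CSR A failed, so it takes over the route.
changed = on_peer_failure(route_table, "0.0.0.0/0", CSR_B_NEXT_HOP)
print(changed, route_table)  # True {'0.0.0.0/0': '10.1.0.5'}
```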
Topologies Supported
1-for-1 redundancy topology: If both the Cisco CSR 1000v routers have a direct connection to the same subnet, the routers provide a 1-for-1 redundancy. An example of 1-for-1 redundancy is shown in the preceding figure. All the traffic that is intended for a Cisco CSR 1000v only goes to one of the routers - the Cisco CSR 1000v that is currently active. The active Cisco CSR 1000v router is the next-hop router for a subnet. The other Cisco CSR 1000v router is the passive router for all the routes.
Load sharing topology: In this topology, both the Cisco CSR 1000v routers have direct connections to different subnets within the same virtual network. Traffic from subnet A goes to router A and traffic from subnet B goes to router B. Each of these subnets is bound to different route tables. If router A fails, the route table for subnet A is updated. Instead of router A being the next hop, the route entry is changed to router B as the next hop. If router B fails, the route table for subnet B is updated. Instead of router B being the next hop, the route entry is changed to router A as the next hop.
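As a rough sketch of the load sharing case, the two route tables can be modeled as dictionaries and a failure handled by repointing only the routes that used the failed router. The names `route_tables`, `fail_over`, `router-a`, and `router-b` are hypothetical placeholders, not identifiers from the product.

```python
# One route table per subnet, each initially served by a different router.
route_tables = {
    "subnet-a": {"0.0.0.0/0": "router-a"},
    "subnet-b": {"0.0.0.0/0": "router-b"},
}


def fail_over(tables: dict, failed: str, survivor: str) -> int:
    """Repoint every route that used `failed` at `survivor`.

    Returns the number of route entries that were changed.
    """
    changed = 0
    for routes in tables.values():
        for prefix, next_hop in routes.items():
            if next_hop == failed:
                routes[prefix] = survivor  # existing key, so iteration stays valid
                changed += 1
    return changed


# Router A fails: only subnet A's table is rewritten.
print(fail_over(route_tables, failed="router-a", survivor="router-b"))  # 1
print(route_tables["subnet-a"])  # {'0.0.0.0/0': 'router-b'}
print(route_tables["subnet-b"])  # {'0.0.0.0/0': 'router-b'}
```

The same function covers the opposite case: if router B fails instead, calling it with the arguments swapped rewrites only subnet B's table.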
Redundancy Nodes
A redundancy node is a set of configuration parameters that specifies an entry in a route table. The next hop of a route is updated when an active router fails. To configure a redundancy node, you require the following information:
- Route Table: The identity of the route table in the cloud. The identity includes the region or group in which the table was created, an identifier for the creator or the owner of the table, and a name or identifier for the specific table. Optionally, you can specify an individual route within the table. If you do not specify an individual route, the redundancy node represents all the routes in the table.
- Credentials: Authentication of the identity of the Cisco CSR 1000v router. Each cloud provider handles the process of obtaining and specifying the credentials differently.
- Next Hop: The next hop address that is written to the route entry when a trigger event occurs. The next hop is usually the interface of the Cisco CSR 1000v router on the subnet that is protected.
- Peer Router: Identifies the redundant router that forwards traffic for this route after a failure occurs on this router.
- Router Role: Identifies whether the redundancy node serves in a primary or secondary role. This is an optional parameter. If you do not specify this value, the router role defaults to a secondary role.
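The parameters listed above can be pictured as a single record. The following is a minimal sketch, not the data model used by the high availability package: the class `RedundancyNode`, its field names, and the example values are hypothetical. It only illustrates that the individual route and the role are optional and that the role defaults to secondary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RedundancyNode:
    """Hypothetical model of the information a redundancy node carries."""

    route_table: str              # identity of the route table in the cloud
    credentials: str              # provider-specific authentication reference
    next_hop: str                 # address written to the route on a trigger event
    peer_router: str              # the redundant router for this route
    route: Optional[str] = None   # None means all routes in the table
    role: str = "secondary"       # defaults to secondary when not specified

    def __post_init__(self) -> None:
        if self.role not in ("primary", "secondary"):
            raise ValueError(f"unknown role: {self.role!r}")

    def covers(self, route: str) -> bool:
        """True if this node applies to the given route entry."""
        return self.route is None or self.route == route


node = RedundancyNode(
    route_table="example-table",
    credentials="example-credentials",
    next_hop="10.1.0.5",
    peer_router="10.1.0.4",
)
print(node.role)                 # secondary
print(node.covers("0.0.0.0/0"))  # True: no individual route was specified
```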
Event Types
The high availability feature recognizes and responds to three types of events:
- Peer Router Failure: When the peer router fails, it is detected as a Peer Router Failure event. In response to this event, the event handler writes the route entry with the next hop address that is defined in the redundancy node. To enable this event to be generated, configure the BFD protocol to a peer router and associate the BFD peer under redundancy for cloud high availability.
- Revert to Primary Router: After a router recovers from a failure, the Revert to Primary Router event occurs. The purpose of this event is to ensure that the primary router for the route is re-established as the active router. This event is triggered by a timer and you need not configure this event. In the route table entry, the event handler changes the next hop address that is defined in the redundancy node only if it is different from the next hop address that is currently set for the route.
  This Revert to Primary Router event is generated periodically by a cron job in the guestshell environment. The job is scheduled to run every 5 minutes and checks whether each redundancy node that is configured in the primary mode has this router’s next hop interface set in the route table. If the route table entry already points to this router’s next hop interface, an update is not required. If the mode parameter of a redundancy node is set to secondary, the Revert to Primary Router event is ignored.
- Redundancy Node Verification: The event handler detects a Redundancy Node Verification event and reads the route entry that is specified by the redundancy node. The event handler writes the same data back to the route entry. This event is not generated automatically or algorithmically. This event verifies the ability of the event handler to execute its functions. Execute a script, manually or programmatically, to trigger the Redundancy Node Verification event. For further information about the verification event, see User-Defined Triggers, in the Advanced Programming for High Availability on Microsoft Azure section.
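The three event responses described above can be summarized in a small dispatch function. This is a sketch under stated assumptions, not the event handler shipped with the feature: the event names, the `Node` class, the `handle_event` function, and its return strings are hypothetical, and a dictionary stands in for the cloud route table.

```python
from dataclasses import dataclass

# Hypothetical event identifiers used only in this sketch.
PEER_FAIL, REVERT, VERIFY = "peer_fail", "revert", "verify"


@dataclass
class Node:
    route: str               # route entry this redundancy node protects
    next_hop: str            # this router's next hop address for the route
    mode: str = "secondary"  # "primary" or "secondary"


def handle_event(event: str, node: Node, table: dict) -> str:
    """Apply one event to the route table and report what happened."""
    if event == PEER_FAIL:
        # Write the next hop defined in the redundancy node.
        table[node.route] = node.next_hop
        return "written"
    if event == REVERT:
        if node.mode != "primary":
            return "ignored"    # secondary nodes ignore revert events
        if table.get(node.route) == node.next_hop:
            return "unchanged"  # already points at this router
        table[node.route] = node.next_hop
        return "written"
    if event == VERIFY:
        # Read the entry and write the same data back to prove access works.
        table[node.route] = table[node.route]
        return "verified"
    raise ValueError(f"unknown event: {event!r}")


table = {"0.0.0.0/0": "10.1.0.5"}  # the peer currently owns the route
primary = Node(route="0.0.0.0/0", next_hop="10.1.0.4", mode="primary")
print(handle_event(REVERT, primary, table))  # written
print(handle_event(REVERT, primary, table))  # unchanged
```

In this model, the periodic revert check amounts to calling `handle_event(REVERT, ...)` for every configured node on each timer tick.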
High Availability Versions and OS Compatibility
Choose one of the following deployment options for High Availability on Cisco IOS XE Gibraltar 16.11.x and later:
- High Availability Version 1: This version is supported until the Cisco IOS XE 16.11.1 release. From Cisco IOS XE 16.11.x, if you attempt to configure redundancy nodes in Cisco IOS, you receive a warning that the configuration is deprecated.
- High Availability Version 2 with Redundancy Node Configuration in Cisco IOS XE: You can configure High Availability Version 2 for Cisco CSR1000V 16.11.1 or later running on Microsoft Azure using CLI commands. This version provides access to the new features in HA version 2 and allows you to continue using your existing redundancy node configurations. However, this deployment option is deprecated from Cisco IOS XE 16.12.1.
- High Availability Versions 2 and 3 with Redundancy Node Configuration in the guestshell: You can configure high availability version 2 using guestshell-based Python scripts from the Cisco IOS XE 16.9.1 release. High availability version 3 is available from the Cisco IOS XE 16.11.1 release and is the recommended version.
If you currently use redundancy node configuration in Cisco IOS XE, we recommend that you use the redundancy node configuration in the guestshell with the Cisco IOS XE Gibraltar 16.11.1 release.
Differences in High Availability across Cisco IOS XE Releases
The following table specifies the functionalities that are supported across the high availability versions.
Feature/Functionality | Support in High Availability Version 1 | Support in High Availability Version 2 | Support in High Availability Version 3 |
---|---|---|---|
Redundancy Node Configuration in IOS XE | Yes | Yes | Yes |
Redundancy Node Configuration in Guestshell | No | Yes | Yes |
Revert to the primary router after recovery | No | No | Yes |
CSR1000V authentication by an application in Azure Active Directory | Yes | Yes | Yes |
CSR1000V authentication by Managed Identity (formerly known as Managed Service Identity) | No | Yes | Yes |
By default, in Cisco IOS XE Gibraltar 16.11.x, the Cisco CSR 1000v on AWS runs HA Version 2. To run HA Version 3, you must manually install and enable guestshell, and install the csr_aws_ha Python package in guestshell.
What’s New in High Availability Version 3
The first version of high availability in the AWS cloud was introduced in Cisco IOS XE 16.3.1. The second version of high availability or HA Version 2 was released in Cisco IOS XE Fuji 16.5.1.
HA Version 3 is released in Cisco IOS XE Gibraltar 16.11.1. This high availability version supports several new features, a new configuration, and a deployment mechanism. Here’s an overview of what’s new in high availability version 3:
- Cloud Agnostic: This version of high availability is functional on CSR 1000v routers running on any cloud service provider. While there are some differences in the cloud terminology and parameters, the set of functions and scripts used to configure, control, and show the high availability features are common across the different cloud service providers. High Availability Version 3 (HAv3) is supported in CSR 1000v routers running on AWS, Azure, and GCP. Support for the GCP provider has been added in 16.11.1. Check with Cisco for current support of high availability in the individual provider’s clouds.
- Active/active operation: You can configure both Cisco CSR 1000v routers to be active simultaneously, which allows for load sharing. In this mode of operation, each route in a route table has one of the two routers serve as the primary router and the other router as the secondary router. To enable load sharing, take all the routes and split them between the two Cisco CSR 1000v routers. Note that this functionality is new for AWS-based clouds.
- Reversion to Primary CSR After Fault Recovery: You can designate a Cisco CSR 1000v as the primary router for a given route. While this Cisco CSR 1000v is up and running, it is the next hop for the route. If this Cisco CSR 1000v fails, the peer Cisco CSR 1000v takes over as the next hop for the route, maintaining network connectivity. When the original router recovers from the failure, it reclaims ownership of the route and is the next hop router. This functionality is also new for the AWS-based clouds.
- User-supplied Scripts: The guestshell is a container in which you can deploy your own scripts. HAv3 exposes a programming interface to user-supplied scripts. This implies that you can now write scripts that can trigger both failover and reversion events. You can also develop your own algorithms and triggers to control which Cisco CSR 1000v provides the forwarding services for a given route. This functionality is new for AWS-based clouds.
- New Configuration and Deployment Mechanism: The implementation of HA has been moved out of the Cisco IOS XE code. High availability code now runs in the guestshell container. For further information on guestshell, see the "Guest Shell" section in the Programmability Configuration Guide. In HAv3, the configuration of redundancy nodes is performed in the guestshell using a set of Python scripts. This feature has now been introduced for AWS-based clouds.