Geo Redundancy Switchover

Switchover overview

Switchover is the process of interchanging the roles of the active cluster and standby cluster in the event of a failure. During a failure, the system performs several preliminary checks (e.g., heartbeat count, connectivity checks, HTTP and SSH login checks) and raises alarms if any of these checks fail.

The switchover can be performed either automatically or manually. Automatic switchover is enabled using the auto-arbitration setting.

For a manual switchover, if an alarm is raised, you must verify the authenticity of the alarm by checking both clusters before initiating the switchover.


Attention


  • If a switchover operation is completed on a standby VM before the sync operation, no rows or entries are displayed on the Publish Details for tech-support jobs. This happens because the tech-support history is written to ETCD, which is not synced across geo redundancy setups. This is expected system behavior.

  • All services provisioned on AZ1 are continuously synced to AZ2 through live asynchronous replication.

  • After a switchover, the topology is discovered more quickly on AZ2 because it is built via resync rather than from scratch.

  • After a switchover operation, you must perform an on-demand sync operation. Failure to do so may result in inaccurate state information being displayed for the Postgres and Timescale data stores.

  • If a new application is installed after the last sync operation, a new synchronization must be performed before initiating a switchover.

  • L2 link discovery may take up to 2 hours on a scaled system during a geo redundancy switchover.


Switchover triggers

A switchover can be triggered by various events, including manual intervention or system failures. These are the key triggers that initiate the switchover process:

On-demand switchover

This occurs when the user manually decides to trigger the switchover process.

Cluster node failure

This occurs when two or more hybrid nodes fail in a multicluster setup. The failure triggers the switchover.

Crosswork Data Gateway failure

Crosswork Network Controller continuously monitors and polls the operational state of the Data Gateways in the active cluster. If all Data Gateways in the active cluster remain in the error state longer than 30 minutes or the configured arbitration window, the Crosswork Network Controller triggers an auto-switchover. This mechanism also includes Data Gateways in the maintenance mode provided they are in the error state.

Because Crosswork polls the Data Gateway status only at scheduled times, it might miss or delay detecting failures that happen just before or after a polling cycle. For information about the delays, considerations, and alarms raised on detecting a delay, see Data Gateway failure triggers considerations, conditions, and alarms.

The arbitration duration is configurable using the arbitration-window API. For detailed information about the API, refer to Cisco DevNet.
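The three triggers described in this section can be summarized as one decision function. The following is a minimal sketch for illustration only, not Crosswork code or its API: `ClusterSnapshot`, its fields, and `switchover_reason` are hypothetical names, and the sketch assumes the error duration of each Data Gateway is already known at the moment of evaluation.

```python
from dataclasses import dataclass
from datetime import timedelta

# Default arbitration window stated in this chapter (configurable in the product).
DEFAULT_ARBITRATION_WINDOW = timedelta(minutes=30)


@dataclass
class ClusterSnapshot:
    """Hypothetical summary of the active cluster at one evaluation instant."""
    on_demand_requested: bool
    failed_hybrid_nodes: int
    # Time each Data Gateway has spent in the error state; None if it is healthy.
    # Gateways in maintenance mode are included, as described above.
    gateway_error_durations: list


def switchover_reason(snap, window=DEFAULT_ARBITRATION_WINDOW):
    """Return the name of the first trigger that applies, or None."""
    if snap.on_demand_requested:
        return "on-demand"
    if snap.failed_hybrid_nodes >= 2:
        return "cluster-node-failure"
    durations = snap.gateway_error_durations
    # Every gateway must have been in error for longer than the window.
    if durations and all(d is not None and d > window for d in durations):
        return "data-gateway-failure"
    return None
```

In this sketch, a single healthy Data Gateway is enough to prevent the Data Gateway trigger, which mirrors the requirement that all Data Gateways in the active cluster remain in the error state.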

Data Gateway failure triggers considerations, conditions, and alarms

Key considerations

  • The arbitration window is 30 minutes by default.

  • Data Gateway health is evaluated only at the polling intervals.

  • Changes in the Data Gateway status may not trigger immediate detection or switchover if they occur outside the configured polling interval and arbitration window.

  • Data Gateways in the maintenance mode are considered for a switchover.

Detection delays in edge cases

Crosswork evaluates Data Gateway health only during scheduled polling intervals. A switchover is triggered if the detection criteria are met, for example, if all Data Gateways in the active availability zone remain in the error state continuously throughout the arbitration window. However, detection can be delayed due to timing mismatches between Data Gateway state changes and polling cycles:

  • Data Gateway failure just before polling:

    • Scenario: Data Gateways transition to an error state just before the polling begins.

    • Example: Data Gateways enter the error state at 9:59 p.m., but the polling is scheduled for 10:00 p.m.

    • Outcome: Crosswork does not consider these Data Gateways for a switchover because they do not meet the 30-minute arbitration window when the polling starts.

  • Data Gateway failure just after polling:

    • Scenario: Data Gateways transition to an error state shortly after the polling begins.

    • Example: Polling occurs at 10:00 p.m., and Data Gateways enter the error state at 10:05 p.m.

    • Outcome: Crosswork does not detect these Data Gateways as faulty until the next polling cycle because they do not meet the 30-minute arbitration window.
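The two edge cases above can be worked through numerically. The following sketch is an illustrative model, not the product's implementation. It assumes that all Data Gateways enter the error state at the same instant and stay there, that the error duration is measured from that instant, and that polling repeats at a fixed interval. The 30-minute polling interval and the function name `first_detecting_poll` are hypothetical, since the polling schedule is not specified here.

```python
from datetime import datetime, timedelta


def first_detecting_poll(error_start, first_poll, poll_interval,
                         window=timedelta(minutes=30), max_polls=1000):
    """Return the first poll at which the error has lasted longer than `window`.

    Returns None if no such poll occurs within `max_polls` polling cycles.
    """
    poll = first_poll
    for _ in range(max_polls):
        # A poll before the failure sees healthy gateways and detects nothing.
        if poll >= error_start and poll - error_start > window:
            return poll
        poll += poll_interval
    return None


# Hypothetical 30-minute polling interval, starting at 10:00 p.m.
interval = timedelta(minutes=30)
ten_pm = datetime(2024, 1, 1, 22, 0)

# Failure just before polling: error at 9:59 p.m., poll at 10:00 p.m.
before = first_detecting_poll(datetime(2024, 1, 1, 21, 59), ten_pm, interval)
# Failure just after polling: poll at 10:00 p.m., error at 10:05 p.m.
after = first_detecting_poll(datetime(2024, 1, 1, 22, 5), ten_pm, interval)

print(before.strftime("%H:%M"), after.strftime("%H:%M"))
```

Under these assumptions, the first failure meets the criteria at the 10:30 p.m. poll (31 minutes after the failure) and the second at the 11:00 p.m. poll (55 minutes after the failure), which shows how the effective delay can exceed the arbitration window by up to nearly one polling interval.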

Alarms for Data Gateway

A critical alarm is raised to indicate that the Crosswork Network Controller is about to trigger a geo switchover. You must clear these alarms manually.


Note


Alarms are local to each availability zone (AZ) and are not shared across AZs.

Auto-arbitration workflow

This topic explains the high-level steps in the automatic switchover process. For more information on auto-arbitration, see Auto-arbitration in Crosswork Network Controller.

  1. Auto-arbitration is enabled in the arbitration settings. For more information, see Configure Cross Cluster Settings.

  2. The intervals for heartbeat and failure detection are configured.

  3. The leader election process is initiated, and one of the clusters is elected as the leader.

  4. The automatic switchover is triggered upon the occurrence of a switchover event in the system, provided the condition persists beyond the failure detection interval.

Perform switchover manually

Follow these steps to perform the switchover on the Crosswork Network Controller UI:

Before you begin

Ensure that both clusters have the same application versions and resource footprints.

Procedure


Step 1

Log in to the standby cluster.

Step 2

From the main menu, choose Administration > Cross Cluster. The Cross Cluster window is displayed.

Step 3

The switchover can be performed using any one of these options.

  • Click Actions > Switchover to initiate the switchover. This is a one-click control that performs the three steps of the switchover process. It automates setting the two roles and updating the DNS in a single function.

    The system forwards the switchover request to the cluster acting as the leader and creates a switchover job. To monitor the job progress, click the Cross cluster jobs tab.

  • Click Actions > Set cluster role. This option allows you to assign the cluster roles manually.

    Important

     

    The Set cluster role option is particularly useful if the clusters become unresponsive during an automatic switchover process and both clusters attain the same status. In this situation, you can use the Set cluster role option to override the switchover process and manually assign the cluster roles.

    1. The Switch Cluster Role dialog box is displayed with the initial state of the clusters. In this example, the Bengaluru cluster (cluster-blr) is in Active state and the San Jose cluster (cluster-sjc) is in Standby state.

      Figure 1. Switch cluster role
    2. Click the San Jose cluster to change it to Active state. Click Save to confirm the change.

      Figure 2. Switch standby cluster to active
    3. Update the DNS server records of Management FQDN and Data FQDN to point to the new active cluster.

    4. Log in to the Bengaluru cluster, which is still in Active state. In the Cross Cluster window, click Actions > Set cluster role.

      Note

       

      Until you change the cluster state, both clusters are in Active state.

    5. In the Switch Cluster Role dialog box, click on the cluster to change it to Standby state.

      Figure 3. Switch active cluster to standby

      Click Save to confirm the change.

      Note

       

      Wait for the device reachability to converge before resuming operations on the standby cluster.

    6. After a few minutes, log in to the first cluster and confirm that the switchover is complete.

Step 4

Post-switchover, verify the following:

  1. Verify the cluster health and device status to ensure the system is functioning properly.

  2. Check the health status of the Crosswork Data Gateway to ensure it is functioning properly.

  3. Check the status of the HA pool.

  4. Check the Collection status and confirm that traffic is flowing smoothly to the newly active cluster.
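The order of the manual Set cluster role steps in Step 3 matters, because both clusters are briefly in Active state between promoting the standby cluster and demoting the former active cluster. The following sketch only models that sequence of states for illustration. It does not call any Crosswork interface, and the function name `set_cluster_role_sequence` is hypothetical; the cluster names are the ones used in the example above.

```python
ACTIVE, STANDBY = "Active", "Standby"


def set_cluster_role_sequence(roles, old_active, new_active):
    """Yield (description, roles, dns_target) after each manual step."""
    roles = dict(roles)
    dns_target = old_active
    yield "initial state", dict(roles), dns_target

    # Promote the standby cluster; both clusters are now Active.
    roles[new_active] = ACTIVE
    yield "promote standby", dict(roles), dns_target

    # Point the Management FQDN and Data FQDN records at the new active cluster.
    dns_target = new_active
    yield "update DNS", dict(roles), dns_target

    # Demote the former active cluster to finish the role change.
    roles[old_active] = STANDBY
    yield "demote former active", dict(roles), dns_target


steps = list(set_cluster_role_sequence(
    {"cluster-blr": ACTIVE, "cluster-sjc": STANDBY},
    old_active="cluster-blr", new_active="cluster-sjc"))

for description, roles, dns in steps:
    print(f"{description}: {roles} (DNS -> {dns})")
```

The intermediate entries make the transient dual-Active state explicit, which is why the procedure asks you to complete the remaining steps and wait for device reachability to converge before resuming operations.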


Crosswork Optimization Engine license count after a switchover

For Crosswork Optimization Engine, the Smart Licenses page reflects the correct license count only after 24 hours or by 1:00 am after a switchover.

If you cannot wait 24 hours or until 1:00 am, there are two methods to force a license update:

  • You can disable or enable feature packs (Bandwidth on Demand, Circuit Style Manager, or Local Congestion Manager).

  • You can detach and add devices back again.