Table Of Contents
Cisco Unified CallManager Failure, Failover, and Recovery
Unified CallManager Failover Behavior
Scenario A: Non-Planned Subscriber Failure
Scenario C: Fallback of Phones to Their Primary Unified CallManager
Unified CallManager Failover Testing
Test 1: Disable/Re-Enable Cisco Unified CallManager Service
Test 3: Failover/Failback of H.323 Resources Associated with One Node
Test 4: Failover/Failback of MGCP Resources Associated with One Node
Test 5: Failover/Failback of SCCP Resources Associated with One Node
Test 6: Failover/Failback of SIP Resources Associated with One Node
Unified CallManager Disaster Recovery
Cisco Unified CallManager Failure, Failover, and Recovery
This chapter provides an overview of the failover testing that was performed for the Cisco Unified CallManager during Cisco Unified Communications Release 5.0(2) system testing.
Unified CallManager Failover Behavior
Three distinct scenarios for CallManager 4.1(3) failover are described in this document:
•
Unplanned failure - can be triggered by hardware failure, software failure, or loss of connectivity.
•
Planned failure - typically used as part of an operating system or application upgrade, or a server hardware maintenance process.
•
Fallback - mechanism for failing back to a normal operating configuration after either a planned or unplanned failure.
The CallManager Publisher hosts the database for the entire CallManager cluster. All provisioning updates to the cluster (for example, phones, trunks, service parameters, and call features) are stored initially in the Publisher. The CallManager Subscribers' replicated databases are continuously updated with any changes to the Publisher database. When the Publisher becomes unavailable, the Subscribers' databases reflect the latest state of the database prior to the Publisher failure.
Cisco's recommended configuration for large enterprise customers is a dedicated Publisher, with separate Subscribers running the CallManager application. The CallManager application is responsible for all call control, including signaling of endpoints, feature invocation, calling restrictions, etc. Large-scale configurations typically use paired redundant Subscriber servers, running an active-active configuration, with endpoints evenly distributed across the two servers. Another component of the cluster is the TFTP server, which provides configuration files for the endpoint devices and the associated firmware loads. Large enterprise configurations typically utilize redundant TFTP servers.
A KeepAlive mechanism is an essential part of any Softswitch or IP PBX solution. KeepAlives are used to ensure that endpoints (phones, gateways) retain their communications path to the call control server. KeepAlives are used not only to determine when the primary call control server is no longer available, but also to determine when the site has been completely isolated from a centralized call control system and therefore needs to revert to some form of remote survivability capability (e.g., Cisco's IOS-based SRST). KeepAlives are designed to avoid any initial lag in call establishment associated with finding an available call processing server. For example, with a KeepAlive mechanism, phones are able to continuously monitor the status of call processing servers, dynamically adapting to changes in network and/or server availability. Even with a KeepAlive mechanism, there is still a chance that the phone's understanding of the state of the network and call processing servers will not be accurate, because a change may have occurred since the most recent KeepAlive exchange, but the chances of this mismatched state are significantly reduced.
CallManager 4.1 supports a set of server-configurable KeepAlive parameters through its Service Parameters configuration pages. StationKeepaliveInterval configures the number of seconds between KeepAlive messages sent from the Cisco IP Phones (stations) to the Primary Subscriber server. The minimum configurable value for the parameter is 10 seconds, with the default value being 30 seconds. StationandBackupServerKeepaliveInterval configures the number of seconds between KeepAlive messages sent from the Cisco IP Phones to the Secondary Subscriber server. These KeepAlives are sent at 60-second intervals.
Cisco's CallManager supports a range of different media gateway control protocols, including MGCP and H.323. MGCP is based on a centralized architecture, utilizing intelligence and resources embedded within CallManager. H.323 is based on a more distributed architecture, utilizing intelligence in the gateways themselves. Each protocol comes with a set of pros and cons; the "best" protocol for a given deployment depends on the specific set of requirements.
As a general rule-of-thumb, CallManager will prioritize Gateway failover handling over phone re-registration, since gateways tend to be shared resources utilized by a number of endpoints.
All failure scenarios described on the pages that follow assume a fully loaded CallManager cluster, supporting a maximum of 30,000 phones across 4 Subscriber server pairs. The individual Subscriber server pairs, each supporting a maximum of 7,500 phones, can be configured in either an Active/Standby or an Active/Active configuration. Cisco's recommendation is to configure the Subscriber server pairs as Active/Active, with each individual Subscriber server handling 3,750 phones in normal operating mode. This approach minimizes the worst-case impact of any failure, with a maximum of 3,750 phones being impacted. In an Active/Standby configuration, the worst-case failure scenario would impact the full 7,500 phones.
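The sizing arithmetic above can be reproduced with a short sketch. The function and constant names are illustrative assumptions, not part of any Cisco tool; only the phone and pair counts come from the text.

```python
# Figures taken from the text: 30,000 phones across 4 Subscriber server pairs.
CLUSTER_PHONES = 30_000
SUBSCRIBER_PAIRS = 4


def worst_case_phones_impacted(mode: str) -> int:
    """Return how many phones must re-home when a single Subscriber fails."""
    phones_per_pair = CLUSTER_PHONES // SUBSCRIBER_PAIRS  # 7,500 per pair
    if mode == "active/active":
        # Each server in the pair is primary for half of the pair's phones.
        return phones_per_pair // 2
    if mode == "active/standby":
        # The active server is primary for every phone in the pair.
        return phones_per_pair
    raise ValueError(f"unknown mode: {mode}")


print(worst_case_phones_impacted("active/active"))   # 3750
print(worst_case_phones_impacted("active/standby"))  # 7500
```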
Scenario A: Non-Planned Subscriber Failure
This is defined as an unplanned loss of connectivity between end user devices (hard phones, soft phones, etc.) and the CallManager Subscriber server to which those devices are currently registered. This could happen for a number of reasons, examples of which are hardware failure of the CallManager server or IP connection failure.
Based on the Service Parameter StationKeepAliveInterval, the phones send KeepAlive messages to the CallManager at a 10-second interval (this value is configurable via CallManager Service Parameters, and 10 seconds is the lowest setting). If the phone fails to receive an ACK response to the KeepAlive, the phone retries two more times, at which point it re-homes to its backup Subscriber. The phone initiates the re-home process only after it has waited the full 10-second interval after sending the third KeepAlive. Therefore, in the worst case, it will take a full 40 seconds, or four intervals, before the phone re-homes to its backup Subscriber. Once the phone begins the registration process with the backup Subscriber, the process typically takes about 1-2 seconds for an individual phone.
On average, any given phone will be halfway through its initial 10-second KeepAlive interval when the failure occurs. That is, on average an individual phone will be midway through the interval between the last successful KeepAlive and the first unsuccessful KeepAlive. Therefore, for an individual phone the average time to re-register is 36-37 seconds (35 seconds to perform the retries and associated timeouts, and 1-2 seconds to perform the re-registration process with the backup Subscriber).
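The timing described for Scenario A can be worked through in a few lines. This is a minimal sketch of the arithmetic only, assuming the 10-second interval and three unanswered KeepAlives stated above; the function name and its parameters are illustrative.

```python
def seconds_until_rehome(interval_s: float = 10.0,
                         attempts: int = 3,
                         elapsed_s: float = 0.0) -> float:
    """Time from server failure until the phone starts re-homing.

    elapsed_s is how far the phone already was into its current KeepAlive
    interval when the server failed (0 = a KeepAlive was just acknowledged).
    """
    if not 0.0 <= elapsed_s <= interval_s:
        raise ValueError("elapsed_s must lie within one interval")
    # Remainder of the current interval before the first unanswered KeepAlive.
    wait_for_first_unanswered = interval_s - elapsed_s
    # Each of the unanswered KeepAlives is followed by a full interval wait.
    return wait_for_first_unanswered + attempts * interval_s


print(seconds_until_rehome(elapsed_s=0.0))   # 40.0 (worst case)
print(seconds_until_rehome(elapsed_s=5.0))   # 35.0 (average case)
print(seconds_until_rehome(elapsed_s=10.0))  # 30.0 (best case)
```

Adding the 1-2 seconds of registration time to the average case gives the 36-37 seconds quoted above.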
If an entire server loses service, all of its phones must follow the above scenario. However, it is important to realize that the backup Subscriber can process registrations at a rate of approximately 140-145 per second on an idle, current MCS-7845 server (3.0 GHz with 2 GB of RAM). Higher performance is achievable through more powerful computing platforms. Active call processing loads on the servers will correspondingly reduce the re-registration rates. Therefore, the first 140 phones will register after 31-32 seconds, with an equivalent number each second thereafter until all are registered. CallManager Subscriber servers implement a low-priority call processing queue to handle the re-registration process, capable of handling a fully loaded (7,500 phones) re-registration. If you assume a deployment scenario with 7,500 devices on the primary and 0 on the backup, it will take approximately 31-32 seconds for the first phone to register and approximately 95-96 seconds for the last phone to register. If, however, the 7,500 phones are split between the two servers, each acting as a primary for 3,750 phones and as a backup for the other 3,750 phones, then only 27 seconds is required for the registration process for the 3,750 phones, for a total of 68-69 seconds.
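The totals quoted above are consistent with adding the worst-case 40-second detection time, the time to drain the registration backlog at 140 registrations per second, and the 1-2 seconds each phone needs to register. The sketch below reproduces that arithmetic; the function names are illustrative, and the rate is the lower bound of the 140-145 per second figure.

```python
import math

REGISTRATIONS_PER_SECOND = 140  # lower bound of the 140-145/s figure in the text


def drain_time_s(phones: int, rate: int = REGISTRATIONS_PER_SECOND) -> int:
    """Whole seconds needed for the backup Subscriber to admit every phone."""
    return math.ceil(phones / rate)


def last_phone_registered_s(phones: int, detection_s: int,
                            per_phone_s: tuple = (1, 2)) -> tuple:
    """(low, high) seconds after the failure at which the last phone registers."""
    drain = drain_time_s(phones)
    return tuple(detection_s + drain + t for t in per_phone_s)


# Worst-case detection time of 40 seconds (four 10-second intervals).
print(last_phone_registered_s(7500, detection_s=40))  # (95, 96)
print(last_phone_registered_s(3750, detection_s=40))  # (68, 69)
```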
Scenario B: Planned Shutdown
Graceful shutdowns are employed when doing any sort of maintenance on a CallManager Subscriber server. The graceful shutdown is invoked by going to the CallManager Serviceability page and stopping the CallManager service. Any non-active calls will be immediately dropped (in-progress calls, calls on hold, etc.).
The key difference with a graceful shutdown is that the Subscriber server immediately closes all open TCP connections to phones by issuing a TCP FIN. The Subscriber server will also tear down any active H.245 connections, thereby clearing all calls between CCM and H.323-controlled gateways. The phones immediately recognize that the connection to their Primary Subscriber is no longer available, and do not go through the 3 successive KeepAlives before failing over to the Secondary Subscriber server. Each phone will already have a TCP connection established to its Secondary Subscriber, and will have been exchanging KeepAlives with that server at the configurable StationandBackupServerKeepaliveInterval interval. The phones will immediately initiate the re-registration process with the Secondary Subscriber. The Secondary Subscriber imposes the same limit of 140 phone registrations per second as during a non-graceful failover. If you assume a deployment scenario with 7,500 devices on the primary and 0 on the backup, it will take approximately 1-2 seconds for the first phone to register and approximately 55-56 seconds for the last phone to register. If, however, the 7,500 phones are split between the two servers, each acting as a primary for 3,750 phones and as a backup for the other 3,750 phones, then only 27 seconds is required for the registration process for the 3,750 phones, for a total of 28-29 seconds.
Scenario C: Fallback of Phones to Their Primary Unified CallManager
This occurs when the primary CallManager is brought back to an active state. During the fallback scenario, telephony is not affected, since individual phones fall back only when there is adequate CPU to accommodate the registration.
The phones that are now running on their backup Subscriber continue to try to re-open a TCP connection to their primary Subscriber at a set 10-second interval. Once the primary Subscriber becomes available, phones will connect to it within 5 seconds on average. The phones will then send a token registration message to the CallManager. The CallManager will either accept the token registration or deny it. If it is accepted, the phone is put into the registration queue. The CallManager will allow a maximum of 10 phones in the registration queue at any given time during the fallback. On a system with 7,500 registered phones, the fallback could complete in a minimum of 54 seconds, based on the processing limit of 140 phone registrations per second. Again, the fallback is set at a lower priority than processing of calls; therefore, on a busy system the fallback will take longer. If the Primary CallManager needs to reject the token registration, it will inform the phone to retry the connection at a staggered interval, based on the active queue depth and the number of outstanding registration requests. During this time, the phone is still communicating with the Secondary server, and therefore call processing is not impacted. Call processing will only be minimally impacted once the registration is accepted by the Primary server and the phone starts communicating with the Primary.
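The token admission behavior described above can be illustrated with a small sketch. Only the queue limit of 10 comes from the text. The class, its method names, and in particular the retry-delay formula are hypothetical, since the actual staggering calculation used by CallManager is not described here.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_QUEUE_DEPTH = 10  # registration queue limit stated in the text


@dataclass
class TokenGate:
    """Hypothetical model of token-based admission during fallback."""
    max_depth: int = MAX_QUEUE_DEPTH
    queue: deque = field(default_factory=deque)
    outstanding_rejections: int = 0

    def request_token(self, phone_id: str) -> tuple:
        """Return ("accept", 0) or ("retry", delay_in_seconds)."""
        if len(self.queue) < self.max_depth:
            self.queue.append(phone_id)
            return ("accept", 0)
        self.outstanding_rejections += 1
        # ASSUMPTION: the delay grows with queue depth and with the number of
        # phones already told to retry. The real formula is not documented here.
        delay = len(self.queue) + self.outstanding_rejections
        return ("retry", delay)

    def complete_one(self):
        """Finish the oldest queued registration, freeing a queue slot."""
        return self.queue.popleft() if self.queue else None


gate = TokenGate()
answers = [gate.request_token(f"phone-{i}") for i in range(12)]
print(answers[9])   # ('accept', 0)
print(answers[10])  # ('retry', 11)
print(answers[11])  # ('retry', 12)
```

A rejected phone stays registered with its backup Subscriber while it waits, which is why call processing is unaffected during the retry period.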
It is important to note that during these failures, active calls from an IP phone to another IP phone will last for the duration of the conversation. Assuming that H.323 gateways are used, calls from an IP phone to the PSTN will last approximately 3 minutes before the call is terminated (on average). Calls across these gateways will be dropped after 4x H.225 KeepAlive messages expire on the gateway. This timer is not configurable. In the best case, the call will remain up for up to 5 minutes, based on the keepalive retransmissions.
The above assumes that either the phone or the gateway was registered to the CallManager that was impacted by the outage. Calls taking place on other CallManager servers are not affected.
Unified CallManager Failover Testing
During Cisco Unified Communications Release 5.0(2) system testing, several Cisco Unified CallManager failure ........
This section includes the following topics:
•
Test 1: Disable/Re-Enable Cisco Unified CallManager Service
•
Test 3: Failover/Failback of H.323 Resources Associated with One Node
•
Test 4: Failover/Failback of MGCP Resources Associated with One Node
•
Test 5: Failover/Failback of SCCP Resources Associated with One Node
•
Test 6: Failover/Failback of SIP Resources Associated with One Node
Test Conditions
The following conditions existed for the Cisco Unified CallManager failover testing:
The Cisco Unified CallManager cluster in this site model consists of the following:
•
One publisher
•
Eight subscribers (four in each site)
•
Two music on hold servers (one in each site)
•
Two TFTP servers (one in each site)
•
One centralized TFTP cluster
•
Cisco CallManager servers: MCS-7845-H1 with 4 GB RAM
Test 1: Disable/Re-Enable Cisco Unified CallManager Service
Test
Verify that de-activating a Cisco Unified CallManager service while the system is processing calls results in minimal call loss and no permanent system degradation after the service has been restored.
- Resources should immediately re-register to their backup due to the CCM server notification.
- After restoring the service, all devices should immediately re-register to their Primary, and all resources should be available.
Results
Start processing 10K BHCA worth of calls through the system for 30 minutes.
While calls are being placed, de-activate the CCM service on a Subscriber.
Verify calls continue to be placed through the Backup subscriber.
When the failover has been fully completed, re-activate the CCM service on the Subscriber.
Allow the run to complete and verify the CSR for the run.
Test 2: Failover/Failback of Phones and Gateways Associated with Half of a Unified CallManager Cluster
Test
Verify that the failure of half of a CCM cluster results in all resources associated with the failed nodes failing over to their backup. Verify also that when the cluster is restored, all associated resources fail back to their primary.
- When CCM nodes are failed, all resources should be registered to their backup nodes.
- When Primary CCM nodes have been restored, all resources should immediately fail back.
Results
Fail half of the Subscribers in the network by disconnecting the Ethernet cables to the machines, ensuring that the nodes that are failed are Primary for at least 5K phones, 1 HW conf bridge, and 1 IOS Gateway.
Verify that the resources fail over to their backup Subscribers, and that they are still useable.
Restore the Ethernet connections to the failed Subscribers.
Verify that all the resources fail back to their Primary Subscribers, and that they are still useable.
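Before running Test 2, it can help to confirm that the set of Subscribers chosen for failure really is Primary for the minimum resources listed above. The following is a minimal sketch of such a check. The inventory format (pairs of device type and Primary node name) and all names are assumptions made for illustration; they do not correspond to any CallManager export or API.

```python
from collections import Counter

# Minimums taken from the test description: 5K phones, 1 hardware conference
# bridge, and 1 IOS gateway homed on the nodes that will be failed.
REQUIRED = {"phone": 5000, "hw_conf_bridge": 1, "ios_gateway": 1}


def failed_set_meets_preconditions(devices, failed_nodes) -> bool:
    """devices: iterable of (device_type, primary_node) pairs (assumed format)."""
    counts = Counter(dtype for dtype, primary in devices
                     if primary in failed_nodes)
    return all(counts[dtype] >= minimum for dtype, minimum in REQUIRED.items())


# Hypothetical inventory used only to exercise the check.
inventory = ([("phone", "sub1")] * 5000
             + [("hw_conf_bridge", "sub2"), ("ios_gateway", "sub1")])
print(failed_set_meets_preconditions(inventory, {"sub1", "sub2"}))  # True
print(failed_set_meets_preconditions(inventory, {"sub1"}))          # False
```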
Test 3: Failover/Failback of H.323 Resources Associated with One Node
Test
Verify that the failure of a single CCM node results in all H.323 resources associated with the node failing over to their backup. Verify also that when the node is restored, all associated H.323 resources fail back to their primary.
- When CCM node is failed, all H.323 resources should be registered to their backup nodes.
- When Primary CCM node has been restored, all H.323 resources should immediately fail back.
Set up a cluster with at least one Publisher and two Subscribers.
Register some H.323 resources to a Device Pool that has Subscriber 1 as the Primary, and Subscriber 2 as the Backup.
Results
While H.323 resources are registered to their Primary Subscriber 1, fail Subscriber 1 by unplugging the Ethernet cable for the machine.
Verify that all H.323 resources failed over to their backup Subscriber 2, and that they are still useable.
Restore the Ethernet connection to the Primary Subscriber 1.
Verify that all H.323 resources fail back to their Primary Subscriber 1, and that they are still useable.
Test 4: Failover/Failback of MGCP Resources Associated with One Node
Test
Verify that the failure of a single CCM node results in all MGCP resources associated with the node failing over to their backup. Verify also that when the node is restored, all associated MGCP resources fail back to their primary.
- When CCM node is failed, all MGCP resources should be registered to their backup nodes.
- When Primary CCM node has been restored, all MGCP resources should immediately fail back.
Set up a cluster with at least one Publisher and two Subscribers.
Register some MGCP resources to a Device Pool that has Subscriber 1 as the Primary, and Subscriber 2 as the Backup.
Results
While MGCP resources are registered to their Primary Subscriber 1, fail Subscriber 1 by unplugging the Ethernet cable for the machine.
Verify that all MGCP resources failed over to their backup Subscriber 2, and that they are still useable.
Restore the Ethernet connection to the Primary Subscriber 1.
Verify that all MGCP resources fail back to their Primary Subscriber 1, and that they are still useable.
Test 5: Failover/Failback of SCCP Resources Associated with One Node
Test
Verify that the failure of a single CCM node results in all SCCP resources associated with the node failing over to their backup. Verify also that when the node is restored, all associated SCCP resources fail back to their primary Unified CallManager node.
- When CCM node is failed, all SCCP resources should be registered to their backup nodes.
- When Primary CCM node has been restored, all SCCP resources should immediately fail back.
Set up a cluster with at least one Publisher and two Subscribers.
Register some SCCP resources to a Device Pool that has Subscriber 1 as the Primary, and Subscriber 2 as the Backup.
Results
While SCCP resources are registered to their Primary Subscriber 1, fail Subscriber 1 by unplugging the Ethernet cable for the machine.
Verify that all SCCP resources failed over to their backup Subscriber 2, and that they are still useable.
Restore the Ethernet connection to the Primary Subscriber 1.
Verify that all SCCP resources fail back to their Primary Subscriber 1, and that they are still useable.
Test 6: Failover/Failback of SIP Resources Associated with One Node
Test
Verify that the failure of a single CCM node results in all SIP resources associated with the node failing over to their backup. Verify also that when the node is restored, all associated SIP resources fail back to their primary.
Set up a cluster with at least one Publisher and two Subscribers.
Register some SIP resources to a Device Pool that has Subscriber 1 as the Primary, and Subscriber 2 as the Backup.
Results
While SIP resources are registered to their Primary Subscriber 1, fail Subscriber 1 by unplugging the Ethernet cable for the machine.
Verify that all SIP resources failed over to their backup Subscriber 2, and that they are still useable.
Restore the Ethernet connection to the Primary Subscriber 1.
Verify that all SIP resources fail back to their Primary Subscriber 1, and that they are still useable.
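The verification steps in Tests 3 through 6 share one shape: after the failure, every resource should be registered to its Backup, and after the restore, every resource should be registered to its Primary. A minimal sketch of that comparison follows. The snapshot dictionary and its keys are assumptions for illustration, and how the registration state is actually collected is outside the scope of this sketch.

```python
def misplaced_resources(snapshot: dict, expected_role: str) -> list:
    """Return names of resources not registered where the test expects.

    snapshot maps a resource name to a dict with the keys "primary",
    "backup", and "registered_to" (an assumed, illustrative format).
    expected_role is "backup" after the failure, "primary" after the restore.
    """
    return sorted(name for name, res in snapshot.items()
                  if res["registered_to"] != res[expected_role])


# Hypothetical state captured after Subscriber 1 has been failed.
after_failure = {
    "gw-1": {"primary": "sub1", "backup": "sub2", "registered_to": "sub2"},
    "gw-2": {"primary": "sub1", "backup": "sub2", "registered_to": "sub1"},
}
print(misplaced_resources(after_failure, "backup"))  # ['gw-2']
```

An empty list after each phase indicates that the failover or failback completed for every resource in the snapshot.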
Unified CallManager Disaster Recovery
The Disaster Recovery System (DRS), which can be invoked from Cisco Unified CallManager 5.0 Administration, provides full data backup and restore capabilities for all servers in a Cisco Unified CallManager cluster. The Disaster Recovery System allows you to perform regularly scheduled automatic or user-invoked data backups. DRS supports only one backup schedule at a time.
The Cisco Disaster Recovery System performs a cluster-level backup, which means that it collects backups for all servers in a Cisco Unified CallManager cluster to a central location and archives the backup data to a physical storage device.
When performing a system data restoration, you can choose which nodes in the cluster you want to restore.
The Disaster Recovery System includes the following capabilities:
•
A user interface for performing backup and restore tasks.
•
A distributed system architecture for performing backup and restore functions.
•
A scheduling engine to initiate tasks at user-specified times.
•
Archiving of backups to a physical tape drive or remote SFTP server.
The Disaster Recovery System contains two key functions, Master Agent (MA) and Local Agent (LA). The Master Agent coordinates backup and restore activity with all the Local Agents. The system automatically activates both the Master Agent and the Local Agent on all nodes in the cluster. However, you can only access the Master Agent functions on the first node of the cluster.
For more information on the Cisco Unified CallManager Disaster Recovery System, see the Disaster Recovery System Administration Guide.