Table Of Contents
Cisco Unified CallManager Failure, Failover, and Recovery
Unified CallManager High Availability Features
Cisco Unified CallManager Clusters
Cisco Unified CallManager Redundancy Groups
Unified CallManager Failover Behavior
Fallback of Phones to Their Primary Unified CallManager
Unified CallManager Failover Testing
Test 1: Disable/Re-Enable Cisco Unified CallManager Service
Test 2: Failover/Failback of Phones and Gateways Associated with Half of a Unified CallManager Cluster
Test 3: Failover/Failback of H.323 Resources Associated with One Node
Test 4: Failover/Failback of MGCP Resources Associated with One Node
Test 5: Failover/Failback of SCCP Resources Associated with One Node
Test 6: Failover/Failback of SIP Resources Associated with One Node
Unified CallManager Disaster Recovery
Cisco Unified CallManager Failure, Failover, and Recovery
This chapter provides an overview of the failover testing that was performed for the Cisco Unified CallManager during Cisco Unified Communications Release 5.0 system testing. It also discusses the features built into Unified CallManager to provide high availability for client phones, describes the behavior that occurs during a failure, and identifies the disaster recovery mechanism used to restore operation.
This chapter includes the following sections:
•
Unified CallManager High Availability Features
•
Unified CallManager Failover Behavior
•
Unified CallManager Failover Testing
•
Unified CallManager Disaster Recovery
Unified CallManager High Availability Features
The Cisco Unified CallManager plays the key role in maintaining call processing following a failure in an IP telephony environment. This section describes the following high availability features built into Cisco Unified CallManager:
•
Cisco Unified CallManager Clusters
•
Cisco Unified CallManager Redundancy Groups
Cisco Unified CallManager Clusters
A cluster comprises a set of Cisco Unified CallManager servers (or nodes) that share the same database and resources. You can configure the servers in a cluster in various ways to perform the following functions:
•
Database server (only one database server in the cluster)
•
TFTP server
•
Application software server
Using the Service Activation window in the Cisco Unified CallManager Serviceability application, you can specify which server performs which function for the cluster. You can dedicate a particular server to one function or combine several functions on one server, depending on the size of your system and the level of redundancy that you want. Each cluster can have only one database server (first node) and usually one TFTP server (either separate or combined).
Cisco Systems recommends that large enterprise customers deploy a dedicated Cisco Unified CallManager database server, with other servers running the Cisco Unified CallManager application software. The Cisco Unified CallManager application software performs all call control, including signaling of endpoints, feature invocation, and calling restrictions. Large-scale configurations typically use paired redundant application software servers running in an active-active configuration, with endpoints evenly distributed across the two servers. The TFTP server provides configuration files for the endpoint devices and the associated firmware loads. Large enterprise configurations typically use redundant TFTP servers.
All provisioning updates to the cluster (for example, phones, trunks, service parameters, and call features) are stored initially in the database server. The replicated databases on all other Unified CallManager nodes are continuously updated with any changes made on the database server. If the database server becomes unavailable, the databases on the other Unified CallManager nodes reflect the latest state of the database before the database server failure.
In a very large cluster, the simultaneous initialization of services that occurs after a Unified CallManager failure can overload the database server. To limit the number of Unified CallManager services that initialize simultaneously, you can configure the "Max Simultaneous Cisco Unified CallManager Initializations" service parameter. This parameter defaults to 0 and, with this value, the number of Cisco Unified CallManager services that can initialize simultaneously is unlimited. Any non-zero value limits the number of services to that specific value.
Another service parameter that should be configured is the "Restart Cisco Unified CallManager on Initialization Exception" parameter. This parameter determines whether the Cisco Unified CallManager service restarts if an error occurs during initialization. This parameter defaults to TRUE and, with this value, the Cisco Unified CallManager initialization aborts when an error occurs during initialization. Setting the value to FALSE allows initialization to continue when an error is encountered. Service parameters are configured using the System Configuration menu in the Cisco Unified CallManager Administration application.
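The semantics of the two service parameters described above can be illustrated with a minimal sketch. The function names and return strings are hypothetical and exist only for this illustration; they are not part of any Cisco interface, and the real services apply these settings internally.

```python
def initialization_slots(max_simultaneous: int, pending: int) -> int:
    """Return how many pending services may initialize at the same time.

    Models "Max Simultaneous Cisco Unified CallManager Initializations":
    0 means unlimited, any non-zero value is the limit.
    """
    if max_simultaneous < 0 or pending < 0:
        raise ValueError("values must be non-negative")
    if max_simultaneous == 0:
        return pending  # default: no limit
    return min(max_simultaneous, pending)


def on_initialization_error(restart_on_exception: bool = True) -> str:
    """Model "Restart Cisco Unified CallManager on Initialization Exception".

    TRUE (the default) aborts initialization; FALSE lets it continue.
    """
    return "abort" if restart_on_exception else "continue"


print(initialization_slots(0, 8))      # 8 (unlimited)
print(initialization_slots(3, 8))      # 3 (limited to the parameter value)
print(on_initialization_error(False))  # continue
```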
Cisco Unified CallManager Redundancy Groups
Groups and clusters form logical collections of Cisco Unified CallManager servers and their associated devices. Groups and clusters do not necessarily relate to the physical locations of any of their members. A group comprises a prioritized list of up to three Cisco Unified CallManager servers. You can associate each group with one or more device pools to provide call-processing redundancy. You use Cisco Unified CallManager Administration to define the groups, to specify which Cisco Unified CallManager servers belong to each group, and to assign a Cisco Unified CallManager group to each device pool. Each group must contain a primary Cisco Unified CallManager, and it may contain one or two backup Cisco Unified CallManager servers. The order in which you list the Cisco Unified CallManager servers in a group determines the priority order.
Cisco Unified CallManager groups provide both redundancy and recovery:
•
Failover—Occurs when the primary Cisco Unified CallManager in a group fails, and the devices reregister with the backup Cisco Unified CallManager in that group.
•
Fallback—Occurs when a failed primary Cisco Unified CallManager comes back into service, and the devices in that group reregister with the primary Cisco Unified CallManager.
Under normal operation, the primary Cisco Unified CallManager in a group controls call processing for all the registered devices (such as phones and gateways) that are associated with that group.
If the primary Cisco Unified CallManager fails for any reason, the first backup Cisco Unified CallManager in the group takes control of the devices that were registered with the primary Cisco Unified CallManager. If you specify a second backup Cisco Unified CallManager for the group, it takes control of the devices if both the primary and the first backup Cisco Unified CallManager servers fail.
When a failed primary Cisco Unified CallManager comes back into service, it takes control of the group again, and the devices in that group automatically reregister with the primary Cisco Unified CallManager.
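The priority behavior of a group can be sketched as follows. This is a simplified model under stated assumptions, not the product's implementation: the class name, method name, and server names are hypothetical, and the model ignores registration timing entirely.

```python
from typing import Optional, Sequence, Set


class CallManagerGroup:
    """A prioritized list of one primary and up to two backup servers."""

    MAX_SERVERS = 3

    def __init__(self, servers: Sequence[str]) -> None:
        if not 1 <= len(servers) <= self.MAX_SERVERS:
            raise ValueError("a group holds one to three servers")
        self.servers = list(servers)  # list order is the priority order

    def controlling_server(self, in_service: Set[str]) -> Optional[str]:
        """Return the highest-priority server that is currently in service."""
        for server in self.servers:
            if server in in_service:
                return server
        return None  # every server in the group has failed


group = CallManagerGroup(["cm-primary", "cm-backup1", "cm-backup2"])
print(group.controlling_server({"cm-primary", "cm-backup1", "cm-backup2"}))  # cm-primary
print(group.controlling_server({"cm-backup1", "cm-backup2"}))                # cm-backup1 (failover)
print(group.controlling_server({"cm-backup2"}))                              # cm-backup2
print(group.controlling_server({"cm-primary", "cm-backup2"}))                # cm-primary (fallback)
```

Because the selection always walks the list from the top, fallback to the primary needs no separate rule in this model: as soon as the primary is in service again, it is the first match.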
Keep Alive Mechanism
A KeepAlive mechanism is an essential part of an IP telephony solution. KeepAlives are used to ensure that endpoints (typically phones and gateways) retain their communications path to the Cisco Unified CallManager server. KeepAlives are used not only to determine when the primary Cisco Unified CallManager server is no longer available, but also to determine when the site has become completely isolated from a centralized call control system and must revert to some form of remote survivability capability such as Cisco Unified SRST. KeepAlives are designed to avoid any initial delay in establishing a call associated with finding an available Cisco Unified CallManager server.
For example, with a KeepAlive mechanism, Cisco Unified IP phones continuously monitor the status of Cisco Unified CallManager servers, dynamically adapting to changes in network and/or server availability. Cisco Unified CallManager supports a set of configurable KeepAlive service parameters in the System Configuration menu in the Cisco Unified CallManager Administration application:
•
StationKeepaliveInterval configures the number of seconds between KeepAlive messages sent from Cisco Unified IP Phones (stations) to their primary Cisco Unified CallManager server. The minimum configurable value for the parameter is 10 seconds, with the default value being 30 seconds.
•
StationandBackupServerKeepaliveInterval configures the number of seconds between KeepAlive messages sent from Cisco Unified IP Phones to their secondary Cisco Unified CallManager server. These KeepAlives are sent at 60-second intervals.
Unified CallManager Failover Behavior
This section describes the automatic failover actions that occur on a Cisco Unified CallManager following one of three types of failover scenarios:
•
Unplanned failure triggered by hardware failure, software failure, or loss of connectivity.
•
Planned shutdown that occurs as part of an operating system or application upgrade, or server hardware maintenance process.
•
Fallback of phones to their primary Unified CallManager after either a planned or unplanned failure.
Unplanned Failure
An unplanned failure is a loss of connectivity between endpoints and the Cisco Unified CallManager application software server to which they are currently registered. This loss of connectivity could be due to Cisco Unified CallManager server hardware failure or a problem in the IP network infrastructure.
During a failure, active calls from an IP phone to another IP phone are maintained for the duration of the conversation. If H.323 gateways are involved in the call, calls from an IP phone to the PSTN are maintained for approximately three minutes, on average, before they are terminated. Calls across these gateways are dropped after four H.225 KeepAlive messages expire on the gateway. In the best case, the calls remain up for as long as 5 minutes, based on KeepAlive retransmissions.
Based on the StationKeepaliveInterval, Cisco Unified IP Phones send KeepAlive messages to the Cisco Unified CallManager at least every 10 seconds. If a phone fails to receive an ACK response to the KeepAlive, the phone retries two more times before attempting to re-home to its backup Cisco Unified CallManager server. The phone initiates the re-homing process only after waiting the full 10-second interval after sending the third KeepAlive. Therefore, in the worst case, a full 40 seconds can elapse before the phone re-homes to its backup Cisco Unified CallManager server. Once the phone starts to register with the backup Cisco Unified CallManager server, the process typically takes about 1 to 2 seconds.
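The 40-second worst case can be reproduced with simple arithmetic. The sketch below assumes one reading of the paragraph above: the failure occurs just after an acknowledged KeepAlive, so up to one full interval passes before the first unanswered KeepAlive is sent, and each of the three KeepAlives is then followed by a full interval of waiting. The function and parameter names are hypothetical.

```python
def worst_case_rehome_delay(keepalive_interval_s: int = 10,
                            keepalive_attempts: int = 3) -> int:
    """Worst-case seconds between a server failure and the start of re-homing."""
    # Assumption: the server fails right after acknowledging a KeepAlive,
    # so one full interval elapses before the first unanswered KeepAlive.
    wait_before_first_attempt = keepalive_interval_s
    # The phone waits a full interval after each of its KeepAlive attempts.
    wait_for_attempts = keepalive_attempts * keepalive_interval_s
    return wait_before_first_attempt + wait_for_attempts


print(worst_case_rehome_delay())  # 40
```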
If an entire Cisco Unified CallManager server loses service, all associated phones need to re-register to a backup server. A Cisco Unified CallManager server can process 140 simultaneous registrations. Active call processing loads on the server, however, can reduce registration rates. As a general rule of thumb, Cisco Unified CallManager prioritizes gateway failover handling over phone re-registration, because gateways tend to be shared resources used by a number of endpoints. Cisco Unified CallManager servers use a low-priority call processing queue to handle the re-registration process, which is capable of handling the re-registration of a fully loaded server (7500 phones). However, by splitting phones across a pair of Cisco Unified CallManager servers (so that half the phones use the first server as their primary and the second server as their backup, while the other half use the first server as their backup and the second server as their primary), you can reduce the volume of re-homing that occurs when a single Cisco Unified CallManager server fails and reduce the overall re-registration time.
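The effect of splitting phones across a server pair can be sketched as follows. The helper functions, device names, and server names are hypothetical; in a real deployment the split is achieved through Cisco Unified CallManager groups and device pools rather than code.

```python
from typing import Dict, List, Sequence, Tuple


def assign_groups(phones: Sequence[str],
                  server_a: str,
                  server_b: str) -> Dict[str, Tuple[str, str]]:
    """Alternate phones between (primary, backup) orders A/B and B/A."""
    assignment = {}
    for index, phone in enumerate(phones):
        if index % 2 == 0:
            assignment[phone] = (server_a, server_b)
        else:
            assignment[phone] = (server_b, server_a)
    return assignment


def phones_to_rehome(assignment: Dict[str, Tuple[str, str]],
                     failed_server: str) -> List[str]:
    """Phones whose primary is the failed server must re-register."""
    return [phone for phone, (primary, _backup) in assignment.items()
            if primary == failed_server]


phones = [f"phone-{n:04d}" for n in range(7500)]
assignment = assign_groups(phones, "cm-a", "cm-b")
print(len(phones_to_rehome(assignment, "cm-a")))  # 3750
```

With the split in place, a single server failure forces only half of the 7500 phones to re-home; the other half stay registered to their primary.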
Planned Shutdown
Planned or "graceful" shutdowns can be performed whenever maintenance is required on a Cisco Unified CallManager application software server. A graceful shutdown can be initiated by accessing the Cisco Unified CallManager Serviceability administration page and stopping the Cisco Unified CallManager service. When the service is stopped, any non-active calls (in the process of establishing or on hold, for example) are immediately dropped.
The key difference between a graceful shutdown and an unplanned shutdown is that the Cisco Unified CallManager server immediately closes all open TCP connections to IP phones by issuing a TCP FIN message. The Cisco Unified CallManager server also tears down any active H.245 connections, which clears all calls between Cisco Unified CallManager and H.323-controlled gateways. Cisco Unified IP Phones immediately recognize that the connection to their primary Cisco Unified CallManager server is no longer available and do not issue the three successive KeepAlives before failing over to the secondary server. The phones already have a TCP connection established to the secondary server and have been exchanging KeepAlives with that server at the configurable StationandBackupServerKeepaliveInterval period. The phones immediately initiate the re-registration process with the secondary server. The secondary Cisco Unified CallManager server imposes the same limit of 140 concurrent phone registrations as during an unplanned failover.
Fallback of Phones to Their Primary Unified CallManager
Cisco Unified IP phones that are now running on their backup Cisco Unified CallManager server continually try to re-open a TCP connection to their primary server at 10-second intervals. Once a primary Cisco Unified CallManager server becomes available again, the phones send a token registration message to the server. The Cisco Unified CallManager server either accepts the token registration or denies it. If the token is accepted, the phone is put into the registration queue. A Cisco Unified CallManager server allows a maximum of 10 phones in the registration queue at any given time during a fallback.
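The accept-or-deny decision for token registrations can be modeled with a bounded queue. This is only an illustrative sketch: the class and method names are hypothetical, and the actual server logic is not documented here beyond the 10-phone queue limit.

```python
from collections import deque
from typing import Deque, Optional


class FallbackRegistrar:
    """Admit token registrations while the queue holds fewer than 10 phones."""

    MAX_QUEUE = 10  # queue limit during fallback, as stated in the text

    def __init__(self) -> None:
        self.queue: Deque[str] = deque()

    def request_token(self, phone: str) -> bool:
        """Accept the token and enqueue the phone, or deny if the queue is full."""
        if len(self.queue) >= self.MAX_QUEUE:
            return False  # denied: the phone stays on its backup and retries later
        self.queue.append(phone)
        return True

    def complete_registration(self) -> Optional[str]:
        """Register the oldest queued phone, freeing one queue slot."""
        return self.queue.popleft() if self.queue else None


registrar = FallbackRegistrar()
results = [registrar.request_token(f"phone-{n}") for n in range(12)]
print(results.count(True), results.count(False))  # 10 2

registrar.complete_registration()
print(registrar.request_token("phone-10"))  # True, a slot is free again
```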
On a system with the maximum 7500 registered phones, the fallback requires a minimum of 54 seconds, based on the processing limit of 140 phone registrations per second. Fallback is set at a lower priority than call processing, so on a busy system the fallback takes longer. If the primary Cisco Unified CallManager server needs to reject the token registration, the server informs the phone to retry the connection at a staggered interval, based on the active queue depth and the number of outstanding registration requests. During this time, the phone continues to communicate with the secondary server and call processing is not affected. Call processing is only minimally impacted once the primary server accepts the registration and the phone starts communicating with it.
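The 54-second minimum follows directly from the registration rate. The sketch below reproduces the arithmetic; the function name is hypothetical, and the result is a lower bound that ignores call processing load.

```python
import math


def min_fallback_seconds(phones: int, registrations_per_second: int = 140) -> int:
    """Lower bound on fallback time, rounded up to whole seconds."""
    return math.ceil(phones / registrations_per_second)


print(min_fallback_seconds(7500))  # 54 (7500 / 140 is about 53.6)
```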
Unified CallManager Failover Testing
During Cisco Unified Communications Release 5.0 system testing, several Cisco Unified CallManager failure scenarios were tested:
•
Test 1: Disable/Re-Enable Cisco Unified CallManager Service
•
Test 2: Failover/Failback of Phones and Gateways Associated with Half of a Unified CallManager Cluster
•
Test 3: Failover/Failback of H.323 Resources Associated with One Node
•
Test 4: Failover/Failback of MGCP Resources Associated with One Node
•
Test 5: Failover/Failback of SCCP Resources Associated with One Node
•
Test 6: Failover/Failback of SIP Resources Associated with One Node
Test 1: Disable/Re-Enable Cisco Unified CallManager Service
Test
Verify that de-activating a Cisco Unified CallManager service while the system is processing calls results in minimal call loss and no permanent system degradation after the service has been restored.
- Resources should immediately re-register to their backup due to the CCM server notification.
- After restoring the service, all devices should immediately re-register to their Primary, and all resources should be available.
Results
Start processing 10K BHCA worth of calls through the system for 30 minutes.
While calls are being placed, de-activate the CCM service on a Subscriber.
Verify calls continue to be placed through the Backup subscriber.
When the failover has been fully completed, re-activate the CCM service on the Subscriber.
Allow the run to complete and verify the CSR for the run.
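To put the load figure in this test into perspective, the following sketch converts a busy hour call attempts (BHCA) rate into the number of call attempts expected during the run. It assumes the load is spread evenly over time, which the test description does not state; the function names are hypothetical.

```python
def call_attempts(bhca: int, minutes: float) -> float:
    """Call attempts generated in the given number of minutes at a BHCA rate."""
    return bhca * minutes / 60


def attempts_per_second(bhca: int) -> float:
    """Average call attempts per second at a BHCA rate."""
    return bhca / 3600


print(call_attempts(10_000, 30))              # 5000.0
print(round(attempts_per_second(10_000), 2))  # 2.78
```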
Test 2: Failover/Failback of Phones and Gateways Associated with Half of a Unified CallManager Cluster
Test
Verify that the failure of half of a CCM cluster results in all resources associated with the failed nodes failing over to their backup. Verify also that when the cluster is restored, all associated resources fail back to their primary.
- When CCM nodes are failed, all resources should be registered to their backup nodes.
- When Primary CCM nodes have been restored, all resources should immediately fail back.
Results
Fail half of the Subscribers in the network by disconnecting the Ethernet cables to the machines, ensuring that the nodes that are failed are Primary for at least 5K phones, 1 HW conference bridge, and 1 IOS Gateway.
Verify that the resources fail over to their backup Subscribers, and that they are still usable.
Restore the Ethernet connections to the failed Subscribers.
Verify that all the resources fail back to their Primary Subscribers, and that they are still usable.
Test 3: Failover/Failback of H.323 Resources Associated with One Node
Test
Verify that the failure of a single CCM node results in all H.323 resources associated with the node failing over to their backup. Verify also that when the node is restored, all associated H.323 resources fail back to their primary.
- When CCM node is failed, all H.323 resources should be registered to their backup nodes.
- When Primary CCM node has been restored, all H.323 resources should immediately fail back.
Set up a cluster with at least one Publisher and two Subscribers.
Register some H.323 resources to a Device Pool that has Subscriber 1 as the Primary, and Subscriber 2 as the Backup.
Results
While H.323 resources are registered to their Primary Subscriber 1, fail Subscriber 1 by unplugging the Ethernet cable for the machine.
Verify that all H.323 resources fail over to their backup Subscriber 2, and that they are still usable.
Restore the Ethernet connection to the Primary Subscriber 1.
Verify that all H.323 resources fail back to their Primary Subscriber 1, and that they are still usable.
Test 4: Failover/Failback of MGCP Resources Associated with One Node
Test
Verify that the failure of a single CCM node results in all MGCP resources associated with the node failing over to their backup. Verify also that when the node is restored, all associated MGCP resources fail back to their primary.
- When CCM node is failed, all MGCP resources should be registered to their backup nodes.
- When Primary CCM node has been restored, all MGCP resources should immediately fail back.
Set up a cluster with at least one Publisher and two Subscribers.
Register some MGCP resources to a Device Pool that has Subscriber 1 as the Primary, and Subscriber 2 as the Backup.
Results
While MGCP resources are registered to their Primary Subscriber 1, fail Subscriber 1 by unplugging the Ethernet cable for the machine.
Verify that all MGCP resources fail over to their backup Subscriber 2, and that they are still usable.
Restore the Ethernet connection to the Primary Subscriber 1.
Verify that all MGCP resources fail back to their Primary Subscriber 1, and that they are still usable.
Test 5: Failover/Failback of SCCP Resources Associated with One Node
Test
Verify that the failure of a single CCM node results in all SCCP resources associated with the node failing over to their backup. Verify also that when the node is restored, all associated SCCP resources fail back to their primary Unified CallManager node.
- When CCM node is failed, all SCCP resources should be registered to their backup nodes.
- When Primary CCM node has been restored, all SCCP resources should immediately fail back.
Set up a cluster with at least one Publisher and two Subscribers.
Register some SCCP resources to a Device Pool that has Subscriber 1 as the Primary, and Subscriber 2 as the Backup.
Results
While SCCP resources are registered to their Primary Subscriber 1, fail Subscriber 1 by unplugging the Ethernet cable for the machine.
Verify that all SCCP resources fail over to their backup Subscriber 2, and that they are still usable.
Restore the Ethernet connection to the Primary Subscriber 1.
Verify that all SCCP resources fail back to their Primary Subscriber 1, and that they are still usable.
Test 6: Failover/Failback of SIP Resources Associated with One Node
Test
Verify that the failure of a single CCM node results in all SIP resources associated with the node failing over to their backup. Verify also that when the node is restored, all associated SIP resources fail back to their primary.
Set up a cluster with at least one Publisher and two Subscribers.
Register some SIP resources to a Device Pool that has Subscriber 1 as the Primary, and Subscriber 2 as the Backup.
Results
While SIP resources are registered to their Primary Subscriber 1, fail Subscriber 1 by unplugging the Ethernet cable for the machine.
Verify that all SIP resources fail over to their backup Subscriber 2, and that they are still usable.
Restore the Ethernet connection to the Primary Subscriber 1.
Verify that all SIP resources fail back to their Primary Subscriber 1, and that they are still usable.
Unified CallManager Disaster Recovery
The Disaster Recovery System (DRS), which can be invoked from Cisco Unified CallManager 5.0 Administration, provides full data backup and restore capabilities for all servers in a Cisco Unified CallManager cluster. The Disaster Recovery System allows you to perform regularly scheduled automatic or user-invoked data backups. DRS supports only one backup schedule at a time.
The Cisco Disaster Recovery System performs a cluster-level backup, which means that it collects backups for all servers in a Cisco Unified CallManager cluster to a central location and archives the backup data to a physical storage device. When performing a system data restoration, you can choose which nodes in the cluster you want to restore.
The Disaster Recovery System includes the following capabilities:
•
A user interface for performing backup and restore tasks.
•
A distributed system architecture for performing backup and restore functions.
•
A scheduling engine to initiate tasks at user-specified times.
•
The ability to archive backups to a physical tape drive or remote SFTP server.
The Disaster Recovery System contains two key functions: the Master Agent (MA) and the Local Agent (LA). The Master Agent coordinates backup and restore activity with all the Local Agents. The system automatically activates both the Master Agent and the Local Agent on all nodes in the cluster. However, you can access the Master Agent functions only on the first node of the cluster.
For more information on the Cisco Unified CallManager Disaster Recovery System, see the Disaster Recovery System Administration Guide.