Failure, Failover, and Recovery

The contact center environment is intended to be redundant and self-healing. In many cases, this functionality makes a failover and recovery from a failure nearly invisible.

This subject is discussed in greater detail in the following documents:

ICM Administration Guide for Cisco ICM/IPCC Enterprise and Hosted Editions (Fault Tolerance chapter):

http://www.cisco.com/en/US/docs/voice_ip_comm/cust_contact/contact_center/icm_enterprise/icm_enterprise_7_0/maintenance/guide/icme70ag.pdf

Cisco Unified Communications Manager System Guide:

http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/admin/6_0_1/ccmsys/accm.pdf

Cisco Unified Communications Manager Features and Services Guide:

http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/admin/6_0_1/ccmfeat/fsgd.pdf

This topic provides a description of specific test cases that were executed as a part of Cisco Unified Communications System Release 6.0(1) failover testing. This testing was done to verify the redundancy and failover capabilities of specific components such as gatekeepers, WAN access routers, and the private connection between the Roggers (CallRouters and Loggers) in the data centers.

For additional failover testing that was conducted in the contact center environment, see the Failover test cases and results at: http://www.cisco.com/cisco/web/docs/iam/unified/ipcc601/System_Test_Results.html.


Note: Much of the testing was done with test tools, simulated phones, and simulated agent desktops. In an actual customer setting, real agents with real phones might attempt to shut down their desktops, force logins, and so on. Such events are not generally factored into the testing done with simulation tools.

This topic includes the following sections:

Failure with Failover Testing

Cisco Unified Customer Voice Portal Post-Routed Call Flow

Failure without Failover Testing

WAN Access Router

Private Connection Between Roggers

Failure with Failover Testing

This section discusses the failover testing that was performed with contact center components that have redundancy capabilities in the event of a failure.

Cisco Unified Customer Voice Portal Post-Routed Call Flow

Test 1: Cisco Unified Customer Voice Portal Gatekeeper Failure

Pre-Test Conditions

The following describes the test conditions for this particular test:

Test site involved is Site3.

Inbound calls at Site3 are routed from the Cisco Unified Customer Voice Portal (Unified CVP) gateway to the Unified CVP Voice Browser.

Two gatekeepers (GK1 and GK2) in Site1 are deployed in an HSRP redundancy model.

Cisco Unified Intelligent Contact Management (Unified ICM) script is set up to first connect the call to a Type 7 VRU (VXML-enabled Gateway) and collect digits.

All agents in the target skill group are flagged as NOT_READY.

The Unified ICM script is then set up to queue the incoming call at the VRU.

The status for an agent in that skill group is set to become READY after the call has been queued for 30 seconds.

Test

The following describes the failover testing that was performed for active Unified CVP Post-Routed calls:

1. Place calls from a PSTN call generator to the Unified CVP gateway at Site3.

2. Verify that the VRU plays the appropriate media files and provides queue treatment for the incoming calls.

3. Disable the primary gatekeeper (GK1) in the HSRP cluster.

4. Place additional calls from the PSTN call generator, which are now routed through the secondary gatekeeper (GK2) in the HSRP cluster.

5. Make agents available in Site3 for the skill groups selected by the caller and answer the calls from a Cisco Agent Desktop (CAD).
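The failover exercised in steps 3 and 4 can be pictured with a small simulation. This is a minimal sketch, not a model of Cisco IOS behavior: the `Gatekeeper` and `HsrpPair` classes, their fields, and the priority values are hypothetical names introduced only for illustration. It captures a single idea, that callers address the pair rather than an individual gatekeeper, so new calls are admitted by the secondary once the primary is disabled.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Hypothetical stand-in for one gatekeeper in the redundant pair."""
    name: str
    priority: int          # illustrative value; the higher priority is active
    enabled: bool = True


class HsrpPair:
    """Toy model: the enabled gatekeeper with the highest priority is active."""

    def __init__(self, *gatekeepers: Gatekeeper) -> None:
        self.gatekeepers = list(gatekeepers)

    def active(self) -> Gatekeeper:
        candidates = [gk for gk in self.gatekeepers if gk.enabled]
        if not candidates:
            raise RuntimeError("no gatekeeper available")
        return max(candidates, key=lambda gk: gk.priority)

    def route_call(self, call_id: int) -> str:
        # Callers never choose GK1 or GK2 directly; they reach the active one.
        return f"call {call_id} admitted by {self.active().name}"


gk1 = Gatekeeper("GK1", priority=110)
gk2 = Gatekeeper("GK2", priority=100)
pair = HsrpPair(gk1, gk2)

before = pair.route_call(1)   # GK1 is active
gk1.enabled = False           # step 3: disable the primary gatekeeper
after = pair.route_call(2)    # step 4: new calls go through GK2
print(before)
print(after)
```

In the sketch the caller-facing `route_call` interface does not change across the failover, which matches the test's expectation that no call failures occur while the gatekeepers switch roles.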

Results

The following results were verified in this test:

All active calls remained active and no calls were dropped.

Call failures did not occur during the failover between the Unified CVP gatekeepers.

Announcements (Unified IP IVR) and music were heard by the callers.

Pop-up errors were not displayed on the agent desktop.

The voice path was operational in both directions, to and from the PSTN.

Failure without Failover Testing

This section discusses the failure testing that was performed with contact center components that do not have redundancy capabilities in the event of a failure.

WAN Access Router

Pre-Test Conditions

The following describes the test conditions for this test:

Test site involved is Site3.

All links for Site3 are up and active.

A WAN router is deployed at this site for communication with other sites across the Frame Relay cloud.

Calls at Site3 are in progress in the IOS Voice Browser and on CAD systems with at least one supervisor monitoring an agent call.

There is no backup implemented for the serial interface on the WAN router at Site3.

Test

The following describes the failover testing done for the WAN access router (without a backup WAN link):

1. Disable the serial interface of the WAN router at Site3 and observe the impact on system behavior.

2. Verify the results of the above procedure as described in After Disabling the Serial Interface.

3. Enable the serial interface of the WAN router at Site3 and observe the effect on system behavior.

4. Verify the results of this procedure as described in After Enabling Serial Interface.

Results

The following results were verified in this test.


Note: No effect is expected on Site1 and Site5 call processing or agent states. All failures occur at Site3.

After Disabling the Serial Interface

For new calls generated from the PSTN:

New inbound calls to the IOS Voice Browser (Gateway) can behave in one of two ways, depending upon the gateway configuration:

They can fail with a busy tone.

They can be re-routed or "hairpinned," using a Gateway Redirect Dial Peer, as new calls to the Unified CVP at another inter-cluster site.

For transient calls (already in the IOS Voice Browser and being transferred to agents):

If the call was re-directed by the IOS Voice Browser to an agent, the call continued the transfer and rang at the agent phone. This situation occurred only if the call reached the agent phone before the phone realized that it had lost the WAN connection with its Cisco Unified Communications Manager (Unified CM).

If the call had not yet been re-directed by the IOS Voice Browser to an agent, the call either failed or, if a backup re-direct was programmed in the gateway, the call was hairpinned to another Unified CVP site within the cluster.

If the call was in queue on the IOS Voice Browser, the call either failed, or, if a backup re-direct was programmed in the gateway, the call was hairpinned to another Unified CVP site within the cluster.


Note: In all cases, the H.323 call survivability timer affected the life of the call. All calls may be removed from the ingress gateway after this timer expires.

For existing calls being handled by agents:

Calls stayed active and operational until the call ended normally or until the H.323 timer expired.
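The call outcomes listed above can be summarized as a small decision function. This is an illustrative sketch only: the `CallState` enum, the function name, and its parameters are hypothetical, and the returned strings paraphrase the recorded results rather than any gateway output. Where the results do not state an outcome, the function says so instead of guessing.

```python
from enum import Enum


class CallState(Enum):
    NEW = "new call from the PSTN"
    TRANSFERRING = "already re-directed to an agent"
    NOT_YET_REDIRECTED = "in the Voice Browser, not yet re-directed"
    QUEUED = "in queue on the Voice Browser"
    WITH_AGENT = "existing call handled by an agent"


def outcome_after_wan_loss(state: CallState,
                           backup_redirect: bool = False,
                           reached_phone_in_time: bool = False) -> str:
    """Summarize the observed behavior after the serial interface is disabled."""
    if state is CallState.WITH_AGENT:
        return "active until normal end or H.323 timer expiry"
    if state is CallState.TRANSFERRING:
        # The call rang only if it reached the phone before the phone
        # noticed the lost WAN connection to its Unified CM.
        if reached_phone_in_time:
            return "rings at agent phone"
        return "not stated in the test results"
    # NEW, NOT_YET_REDIRECTED, and QUEUED depend on the gateway configuration.
    if backup_redirect:
        return "hairpinned to Unified CVP at another site"
    return "busy tone" if state is CallState.NEW else "call fails"


for state in CallState:
    print(f"{state.name}: {outcome_after_wan_loss(state)}")
```

The H.323 call survivability timer is deliberately left out of the function signature because the results give no timer value; it applies to every branch as an upper bound on the life of the call.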

The following effect on the agent state was observed for agents in Site3:

Agents lost connectivity to the CAD server at Site3 and the agent desktops were in a NOT_READY state until the WAN connection was restored.

Agents already engaged on calls were not able to perform any agent state changes or telephony functions, such as hold, conference, and transfer, during the outage.

Agents were not able to log in to the system during this event.

Unified ICM showed the remote agents at Site3 as being no longer available (the Webview report showed the remote agents as being logged out).

The following effect on supervisor monitoring was observed:

A supervisor already in a monitor session with the agent and caller remained in the call for the duration of the call, or until the H.323 timer expired and the call was terminated by the gateway.

The supervisor was not able to barge in, intercept, or perform any conference or chat functions on the Supervisor desktop.

After Enabling Serial Interface

Phones were reset and re-registered with their target Unified CM.

Agent desktops reconnected to the CAD Servers and, depending upon the phone state (reset or not), allowed the agents to become READY without having to log in again.

Agents who were not logged in were now able to log in (if the Cisco Unified IP Phones were reset properly).

Any calls that were in the gateway prior to the failure and were hairpinned to another site remained in this state until the call was terminated normally. In this situation, the gateway does not automatically terminate the calls.

Private Connection Between Roggers

Pre-Test Conditions

The following describes the test conditions for this test:

Test sites involved are Site1 and Site5.

Rogger A is located at Site1 and Rogger B is located at Site5.

All links between the two sites are up and active.

Calls are in progress between the Site1 and Site5 Unified CM clusters.

There is no backup implemented for the private connection between the two Roggers at Site1 and Site5.

Test

The following describes the failover testing that was performed for the private connection between the Roggers (without a backup connection):

1. Simulate a failure of the private link between Rogger A and Rogger B.

2. Verify the system behavior immediately after the simulated private link failure as described in After the Private Link Failure.

3. Place calls from a PSTN call generator and route them between Site1 and Site5.

4. Verify the system behavior after the private link was restored as described in After the Private Link was Restored.

Results

The following results were verified in this test:

After the Private Link Failure

In the Event Viewer, both Roggers indicated a loss of heartbeats on the private network after missing five consecutive 100 ms heartbeats.

The Roggers sent Test Other Side (TOS) messages to the Peripheral Gateway, which responded with either Rogger A or Rogger B as the enabled side of the system.

Based on the Rogger that was considered the enabled side, the other Rogger became disabled.

The enabled Rogger then initiated the Enabled Simplex operation (visible in the MDS process window).

There was no observed effect on system operation or behavior.

There was no loss of calls or agent state across the system during this failure.
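The detection rule and the side selection described above can be sketched as follows. This is a simplified illustration, not the actual MDS implementation: the function names and the list-of-booleans input are hypothetical, and only the two constants (five consecutive misses, 100 ms interval) come from the test results. With those values, a complete loss of the private link is declared 5 x 100 ms = 500 ms after the last heartbeat that arrived.

```python
from typing import Dict, Optional, Sequence

HEARTBEAT_INTERVAL_MS = 100   # from the test results
MISSED_LIMIT = 5              # consecutive misses before loss is declared


def detect_loss(received: Sequence[bool]) -> Optional[int]:
    """Return the elapsed time in ms at which loss is declared, or None.

    Each entry in `received` says whether the heartbeat for that
    100 ms tick arrived (hypothetical input format).
    """
    missed = 0
    for tick, arrived in enumerate(received, start=1):
        missed = 0 if arrived else missed + 1
        if missed == MISSED_LIMIT:
            return tick * HEARTBEAT_INTERVAL_MS
    return None


def sides_after_tos(enabled_side: str) -> Dict[str, str]:
    """Apply the Peripheral Gateway's answer: one side runs simplex, the other is disabled."""
    if enabled_side not in ("A", "B"):
        raise ValueError("enabled_side must be 'A' or 'B'")
    states = {"A": "disabled", "B": "disabled"}
    states[enabled_side] = "enabled simplex"
    return states


# Two good heartbeats, then the link fails: loss is declared at tick 7.
print(detect_loss([True, True, False, False, False, False, False]))
# A heartbeat arriving mid-run resets the counter, so no loss is declared here.
print(detect_loss([True, False, False, True, False, False, False, False]))
print(sides_after_tos("A"))
```

Counting only consecutive misses means an isolated dropped heartbeat does not trigger the Test Other Side exchange; the counter resets as soon as any heartbeat arrives.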

After the Private Link was Restored

The Roggers observed the presence of a duplex partner and performed a state transfer operation from the active side to the inactive side call router.

Upon completion of the state transfer operation, the MDS processes reported that both Roggers were in an active duplex operation.

No effect on call processing was observed during this event window.