This document outlines the recommended steps and trace levels needed to troubleshoot an out-of-sync event between duplexed Cisco Unified Contact Center Intelligent Contact Management (ICM) CallRouters.
Cisco recommends that you have knowledge of these topics:
High-level understanding of ICM Central Controller functionality
The information in this document is based on Cisco ICM version 5.x and later.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
Refer to Cisco Technical Tips Conventions for more information on document conventions.
In rare instances, an ICM CallRouter can become out-of-sync with its duplexed partner. The main trigger for this event is a failed checksum between the CallRouter and its peer. When the CallRouters become out-of-sync, a Standard Object Dump (SOD) file is generated, which is a memory dump of the router at the point of failure.
An out-of-sync event can lead to calls being misrouted by the CallRouter.
Any of these methods can be used to check for out-of-sync conditions:
The CallRouters automatically perform a sync check between the two sides every 15 seconds. If a CallRouter detects an out-of-sync condition, it creates a SOD file within this directory:
<drive>:\icm\<instance>\ra and <drive>:\icm\<instance>\rb
A message is generated in the Application Log within the Windows Event Viewer on the CallRouter. Here are the message details:
The router has detected that it is no longer synchronized with its partner
An SNMP trap is also generated.
From the CallRouter (rtr) logs (example only):
ra-rtr The router has detected that it is no longer synchronized with its partner
ra-rtr Trace: RunningSyncCheck failure: SideA reported 0A7FDF68, B reported FF1319C5
ra-rtr Trace: Wrote 719296 records to sync32932.sod, total length = 1871522788 bytes
ra-rtr Trace: Router dump created in sync32932.sod
rb-rtr The router has detected that it is no longer synchronized with its partner
rb-rtr Trace: RunningSyncCheck failure: Side A reported 0A7FDF68, B reported FF1319C5
rb-rtr Trace: Wrote 719296 records to sync32932.sod, total length = 187152790 bytes
rb-rtr Trace: Router dump created in sync32932.sod
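When you review exported rtr logs from both sides, it can help to pull out the checksum each side reported and the name of the SOD file that was written. The following is a minimal sketch that assumes the log lines look like the example above; real logs may carry timestamps or other prefixes, and the function names here are hypothetical helpers, not part of any Cisco tool.

```python
import re

# Pattern for the RunningSyncCheck trace line shown in the example log.
# "Side ?A" tolerates both spellings ("SideA" and "Side A") seen in the sample.
SYNC_FAILURE = re.compile(
    r"(?P<process>r[ab]-rtr)\s+Trace:\s+RunningSyncCheck failure:\s+"
    r"Side ?A reported (?P<side_a>[0-9A-Fa-f]+),\s*"
    r"B reported (?P<side_b>[0-9A-Fa-f]+)"
)

# Pattern for the line that names the SOD file that was written.
SOD_FILE = re.compile(r"Router dump created in (?P<sod>\S+\.sod)")


def find_sync_failures(log_text):
    """Return one dict per RunningSyncCheck failure line found in log_text."""
    failures = []
    for match in SYNC_FAILURE.finditer(log_text):
        failures.append({
            "process": match.group("process"),
            "side_a": match.group("side_a").upper(),
            "side_b": match.group("side_b").upper(),
        })
    return failures


def find_sod_files(log_text):
    """Return the distinct SOD file names mentioned in log_text, in order."""
    names = []
    for match in SOD_FILE.finditer(log_text):
        name = match.group("sod")
        if name not in names:
            names.append(name)
    return names
```

Run both functions over the log text from each CallRouter side. Both sides should report the same pair of checksums and the same SOD file name, which confirms that the logs you collected describe the same event.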
Note: The standard SOD files generated when the CallRouters go out-of-sync have limitations, and at times engineering requires more granular debugging to isolate the cause. If you run ICM 5.0(0) SR8 or later, you can enable two registry keys to increase the level of debug detail in the SOD files.
Enable these registry debugs on both CallRouters:
HKEY_LOCAL_MACHINE\SOFTWARE\Cisco Systems, Inc.\ICM\<cust_instance>\RouterX\Router\CurrentVersion\Debug
There are two entries, MessageTrackingEnabled and MessageTrackingLimit.
Set these values:
MessageTrackingEnabled = 1
MessageTrackingLimit = 10000 (decimal value)
Note: These are dynamic values and take effect immediately. Setting them does not cause any abnormal behavior in ICM. With these trace bits set, the CallRouter produces a more detailed SOD file if another out-of-sync condition occurs. There is no need to disable the two trace bits; they should remain on. However, they revert to their default value (off) if setup is run on the CallRouters, in which case you must re-enable them manually.
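If you prefer to script the change instead of editing the registry by hand, the two values can be expressed as Windows reg add commands. The sketch below only builds the command strings so that you can review them before you run anything. It rests on assumptions that are not stated in this document: that RouterX stands for RouterA or RouterB depending on the side, and that both entries are REG_DWORD values. Verify both in the Registry Editor first. The function name and the instance name in the usage note are hypothetical.

```python
# Registry path from the steps above, with the instance and side left open.
# ASSUMPTION: "RouterX" in the path stands for RouterA or RouterB.
DEBUG_KEY_TEMPLATE = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Cisco Systems, Inc.\ICM"
    r"\{instance}\Router{side}\Router\CurrentVersion\Debug"
)

# Values from the steps above; the limit is a decimal number.
TRACKING_VALUES = {
    "MessageTrackingEnabled": 1,
    "MessageTrackingLimit": 10000,
}


def build_reg_commands(instance, side):
    """Return the reg add command lines for one CallRouter side.

    ASSUMPTION: both entries are REG_DWORD values; check the existing
    entries in the Registry Editor before you run the commands.
    """
    if side not in ("A", "B"):
        raise ValueError("side must be 'A' or 'B'")
    key = DEBUG_KEY_TEMPLATE.format(instance=instance, side=side)
    return [
        f'reg add "{key}" /v {name} /t REG_DWORD /d {value} /f'
        for name, value in TRACKING_VALUES.items()
    ]
```

For example, build_reg_commands("acme", "A") returns two command lines for a hypothetical instance named acme. Run the commands for side A on CallRouter A and the commands for side B on CallRouter B, because the debugs must be enabled on both CallRouters.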
Provide this information when you request Cisco TAC support for the outage:
Note the exact time of failure.
Collect the CallRouter logs from both sides (rtr, mds, nm, ccag) for the timeframe of the outage.
Collect the Event Viewer System and Application logs in text format. To export a log, right-click the respective log folder, choose Save As, and then choose Text from the Save As Type pull-down.
Collect the SOD files from both CallRouters.
Collect the CallTypeHalfHour, TCD, and RCD records that span from 2.5 hours before the routers went out-of-sync to 1 hour after they recover.
These records must be in tab-delimited format and MUST be dumped from both sides of the Loggers.
This is an example SQL query:
SELECT * FROM Call_Type_Half_Hour
WHERE DateTime >= 'yyyy-mm-dd hh:mm' /* At least 2.5 hours before the out of sync error occurred */
AND DateTime < 'yyyy-mm-dd hh:mm' /* At least 1 hour after the out of sync error occurred or less if run within an hour of the problem happening */

SELECT * FROM Termination_Call_Detail
WHERE DateTime >= 'yyyy-mm-dd hh:mm' /* At least 2.5 hours before the out of sync error occurred */
AND DateTime < 'yyyy-mm-dd hh:mm' /* At least 1 hour after the out of sync error occurred or less if run within an hour of the problem happening */

SELECT * FROM Route_Call_Detail
WHERE DateTime >= 'yyyy-mm-dd hh:mm' /* At least 2.5 hours before the out of sync error occurred */
AND DateTime < 'yyyy-mm-dd hh:mm' /* At least 1 hour after the out of sync error occurred or less if run within an hour of the problem happening */
Collect vrutrace files from every Voice Response Unit Peripheral Interface Manager (VRUPIM) on both sides of the Peripheral Gateways (PGs). The files must cover the timeframe from at least 1 hour before the routers went out-of-sync to 30 minutes after they recover.
Refer to How to Use the vrutrace Utility for more information.
Run the dumpcfg utility against both Logger databases for the period from before the CallRouters went out-of-sync to after they recovered.
Refer to Use the dumpcfg Administration Tool to Track ICM Configuration Changes for more information.
Use ICMDBA in order to export the configuration from both Loggers.
Export the entire ICM registry branch from both sides of the CallRouters.
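The collection list above uses two different time windows: one for the database records (2.5 hours before the failure to 1 hour after recovery) and one for the vrutrace files (1 hour before the failure to 30 minutes after recovery). The sketch below computes both windows from the failure and recovery times and fills in the placeholders of the example SQL query. The function names and the timestamps in the usage note are hypothetical; the table names and margins come from this document.

```python
from datetime import datetime, timedelta

# Margins taken from the collection list above.
DB_BEFORE = timedelta(hours=2, minutes=30)
DB_AFTER = timedelta(hours=1)
VRU_BEFORE = timedelta(hours=1)
VRU_AFTER = timedelta(minutes=30)

# Tables named in the example SQL query.
TABLES = ("Call_Type_Half_Hour", "Termination_Call_Detail", "Route_Call_Detail")


def collection_windows(out_of_sync_at, recovered_at):
    """Return the (start, end) window for each kind of data to collect."""
    if recovered_at < out_of_sync_at:
        raise ValueError("recovery time is earlier than the failure time")
    return {
        "db_records": (out_of_sync_at - DB_BEFORE, recovered_at + DB_AFTER),
        "vrutrace": (out_of_sync_at - VRU_BEFORE, recovered_at + VRU_AFTER),
    }


def build_queries(out_of_sync_at, recovered_at):
    """Return one SELECT statement per table for the database window."""
    start, end = collection_windows(out_of_sync_at, recovered_at)["db_records"]
    start_text = start.strftime("%Y-%m-%d %H:%M")
    end_text = end.strftime("%Y-%m-%d %H:%M")
    return [
        f"SELECT * FROM {table} "
        f"WHERE DateTime >= '{start_text}' AND DateTime < '{end_text}'"
        for table in TABLES
    ]
```

For example, for a failure at 14:00 and a recovery at 14:20 on the same day, build_queries(datetime(2024, 1, 15, 14, 0), datetime(2024, 1, 15, 14, 20)) produces queries that span 11:30 to 15:20. Run the same queries against both sides of the Loggers.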
These are the two workaround options:
Cycle both CallRouters by shutting down both CallRouter processes and then starting them back up again. This is the cleanest way to work around this condition.
Restart one side of the CallRouters.
Both of these options cause the CallRouters to resynchronize and run in sync. This means both CallRouter sides will again route calls the same way.
Option 1 is the preferred method and results in a higher likelihood of both CallRouters routing all calls correctly when restarted. However, if you cannot take the chance of having both CallRouters down at the same time, option 2 can be used instead.
Option 2 can result in the same level of success as option 1, because the CallRouters resynchronize and both sides route calls the same way. However, if the CallRouter that was not restarted holds an incorrect state, the state on both sides is incorrect after resynchronization. In that case both CallRouters, although synchronized, can route some calls incorrectly. The chance that this occurs might be slightly higher than with option 1.
Note: Cisco highly recommends that you schedule a maintenance window for these recovery actions in order to lessen the impact on production call routing.