This document describes how Instant Message and Presence (IM&P) High Availability (also known as Redundancy) works in an enterprise IM and Presence environment and how to troubleshoot it.
Cisco recommends that you have knowledge of these topics:
Cisco Unified IM&P
Cisco Jabber clients
The information in this document is based on these software versions:
Cisco Unified IM&P 10.0 and above
Cisco Jabber clients 9.6 and above
The information in this document was created from the components in a specific lab environment. All of the components used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
IM and Presence High Availability (HA)
IM and Presence offers high availability, or redundancy, in the form of logical server groups in the CUCM configuration. This configuration is passed to IM and Presence and then used to provide redundancy in the event of an IM and Presence service or server failure. When an HA event takes place, the end user's sessions are moved from the failed server to the backup. When the server has been restored to a normal state, user sessions are moved back either automatically or manually by the administrator.
Redundancy Group Configuration
The redundancy group is the logical server pair that allows for the assignment of a server to the IM and Presence subcluster as well as the configuration for HA. This portion of the configuration is found on the CUCM server web page:
System > Presence Redundancy Groups
When the administrator adds the IM&P Publisher to the System > Server configuration on CUCM and the IM&P server is saved, the DefaultCUPSubcluster redundancy group is created with the Publisher assigned to it.
When created, the Redundancy Group will look like this:
This Redundancy Group translates to the IM and Presence subcluster. With the Redundancy Group configuration in CUCM in its current state, this is how it looks in the IM and Presence Cluster Topology web page:
We see that the IM&P Publisher is assigned to the DefaultCUPSubcluster and the Subscriber server is not. This is because the IM&P Subscriber server is not assigned to the Redundancy Group in the CUCM configuration.
Assign the Subscriber to the Redundancy Group:
In order to assign the Subscriber server to the Redundancy Group, simply select the subscriber server from the dropdown then Save the configuration change.
After the IM&P Subscriber is added to the Redundancy Group:
After the secondary node (the Subscriber) is added, the High Availability option appears. In order to enable High Availability, select the Enable High Availability checkbox and Save the configuration change.
After High Availability is enabled:
The page auto-refreshes the server state and reason. When the server is in an initialization state, the two servers are able to communicate. The servers then verify service status before the state transitions to Normal. If the two servers can connect to each other and all monitored services are up on both, the result is a Normal-Normal state. This means that all monitored services are active on the IM&P servers.
Normal-Normal Redundancy Group State:
Normal-Normal High Availability State in the IM&P Topology Page:
Monitored IM and Presence Services
Because a customer can have various deployment models (IM only, IM with SIP/XMPP federation, IM with compliance, IM with persistent chat, Remote Call Control only, and so on), the actual list of processes to monitor is dynamic. By default, these items are always monitored when HA is enabled:
Presence Engine (if activated)
The Server Recovery Manager (SRM) checks whether compliance (Message Archiver), persistent chat (Text Conference Manager), SIP federation (SIP Federation Connection Manager), and XMPP federation (XMPP Federation Connection Manager) are configured and activated. Each of these services that is both configured and activated is monitored by SRM as well.
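The rule above (an optional service is monitored only when it is both configured and activated) can be sketched in a few lines of Python. This is an illustration only, not SRM code: the dictionary keys, the FeatureState type, and the function name are hypothetical names chosen for the example; only the service names come from the text.

```python
from dataclasses import dataclass

# Optional features paired with the services named in the text.
# The dictionary keys are hypothetical labels, not product identifiers.
OPTIONAL_SERVICES = {
    "compliance": "Message Archiver",
    "persistent_chat": "Text Conference Manager",
    "sip_federation": "SIP Federation Connection Manager",
    "xmpp_federation": "XMPP Federation Connection Manager",
}


@dataclass
class FeatureState:
    configured: bool
    activated: bool


def optional_monitored_services(features):
    """Return the optional services that are both configured and activated."""
    return [
        OPTIONAL_SERVICES[name]
        for name, state in features.items()
        if name in OPTIONAL_SERVICES and state.configured and state.activated
    ]


if __name__ == "__main__":
    deployment = {
        "compliance": FeatureState(configured=True, activated=True),
        "persistent_chat": FeatureState(configured=True, activated=False),
    }
    # Only compliance satisfies both conditions in this sample deployment.
    print(optional_monitored_services(deployment))
```

In this sample, persistent chat is configured but not activated, so only Message Archiver is added to the monitored set.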
User Failover Process
When a failover takes place (automatic or manual), the major point to remember is that the user account is not moved from one server to the other; only the user session in the Presence Engine is moved. In pre-10 versions of IM and Presence, the user assignment was moved from one server to the other. This user move consumed significant server resources and added to the load on the server. In 10.X and later, the user stays homed on the server that they are assigned to, and the backend user session in the Presence Engine is moved from the failed node to the functional node. The user does not have to exit Jabber and log in again when Server Recovery Manager (SRM) makes the change.
Jabber Client Re-Login Timer
In order for the user session to become fully active on the secondary IM&P node after a failover event, the user must attempt to log in to that server via SOAP (Client Profile Agent). This happens automatically with the one-time password that is passed from the IMDB database. Since log ins are extremely expensive to resources on the IM and Presence server, there must be a way to throttle log ins when a failover event occurs. This throttle, or buffer, allows all users to log in to the secondary node without service disruption for users already on the secondary node. The mechanism used to throttle user log ins is the pair of Server Recovery Manager (SRM) service parameters Client Re-Login Lower Limit and Client Re-Login Upper Limit.
Client Re-Login Lower Limit - the parameter that defines the minimum amount of time (in seconds) that the Jabber client waits before it attempts to log in to the secondary server in the event of an HA event.
Client Re-Login Upper Limit - the parameter that defines the maximum amount of time (in seconds) that the Jabber client waits before it attempts to log in to the secondary server in the event of an HA event.
The Jabber client receives these parameters at log in to the server and caches the values for future use. When the client receives an HA event from the IM&P server, it chooses a random number of seconds between the lower and upper limits and waits that amount of time before it attempts to log in to the secondary. Once the timer expires, the client attempts a SOAP log in to the secondary node.
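The randomized wait described above can be sketched as follows. This is a minimal illustration of the idea, not Jabber's actual implementation: the function name is hypothetical, the limits passed in the usage lines are arbitrary example values rather than documented defaults, and the sketch assumes the delay is a whole number of seconds drawn uniformly between the two limits.

```python
import random


def relogin_delay(lower_limit, upper_limit, rng=random):
    """Pick a random wait, in whole seconds, between the two limits (inclusive).

    Assumption: a uniform integer draw. The text only says the client
    chooses a random number of seconds between the limits.
    """
    if lower_limit < 0 or upper_limit < lower_limit:
        raise ValueError("expected 0 <= lower_limit <= upper_limit")
    return rng.randint(lower_limit, upper_limit)


if __name__ == "__main__":
    # Arbitrary example limits, in seconds (not taken from the product).
    lower, upper = 60, 300
    # Each simulated client draws its own delay, which spreads the
    # SOAP log in attempts across the window instead of all at once.
    delays = sorted(relogin_delay(lower, upper) for _ in range(10))
    print(delays)
```

Because every client draws its delay independently, a wider gap between the lower and upper limits spreads the same number of log ins over a longer period, which lowers the peak load on the secondary node at the cost of a longer recovery time for some users.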
IM and Presence Fallback Types
If there is user failover, there must be user fallback when service is restored on the problematic server. There are two types of server fallback:
Manual fallback (the default configuration for Server Recovery Manager) takes place when service has been restored and the redundancy group offers the Fallback button. When this button is selected, the user sessions that were moved to the secondary node are moved back to their homed node. The Jabber client applies the re-log in upper and lower limits for the fallback.
Automatic fallback takes place when the Server Recovery Manager (SRM) service, which monitors the services, automatically falls users back to their homed nodes. The key in this configuration is that the SRM service waits for a previously failed service/server to remain active for 30 minutes before an automatic fallback is initiated. Once this 30 minute up time is established, user sessions are moved back to their homed nodes. The Jabber client applies the re-log in upper and lower limits for the fallback.
Automatic fallback is not the default configuration, but it can be enabled. To enable automatic fallback, change the Enable Automatic Fallback parameter in the Server Recovery Manager Service Parameters to value True.
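The automatic fallback rule described above (fallback only when the feature is enabled and the restored node has stayed up for 30 minutes) can be expressed as a short check. This is a sketch under stated assumptions, not SRM's logic: the function and argument names are hypothetical, and the sketch assumes the up time is measured from a single restoration timestamp.

```python
from datetime import datetime, timedelta

# The 30 minute up time requirement comes from the text.
REQUIRED_UPTIME = timedelta(minutes=30)


def should_fall_back(auto_fallback_enabled, restored_at, now):
    """Return True when an automatic fallback would be due.

    restored_at is the time the failed service/server came back,
    or None if it is still down (hypothetical representation).
    """
    if not auto_fallback_enabled or restored_at is None:
        return False
    return now - restored_at >= REQUIRED_UPTIME


if __name__ == "__main__":
    restored = datetime(2024, 1, 1, 12, 0)
    # 29 minutes after restoration: too early for automatic fallback.
    print(should_fall_back(True, restored, restored + timedelta(minutes=29)))
    # 30 minutes after restoration: the up time requirement is met.
    print(should_fall_back(True, restored, restored + timedelta(minutes=30)))
```

With Enable Automatic Fallback left at its default, the first argument is False and the check never passes, which corresponds to the manual Fallback button being the only way to move sessions back.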
Troubleshoot IM and Presence High Availability
In order to troubleshoot issues with HA on IM and Presence, collect this information:
Server Recovery Manager (SRM) logs from before and after the failover event (debug level if possible)
Output of this command, run from the IM&P command line interface: run sql select * from enterprisesubcluster
The enterprisesubcluster table in IM&P houses the Redundancy Group configuration
Output of this command, run from the IM&P command line interface: run sql select * from enterprisenode
The enterprisenode table will display the node information and subcluster assignment of the node