This document describes information that can be used to troubleshoot your configuration.
Cisco IP Phones use an application-level keep-alive mechanism in addition to the network-level TCP keep-alive mechanism. The keep-alive mechanism for Skinny Call Control Protocol (SCCP) and Session Initiation Protocol (SIP) devices ensures that the device stays registered with call control. Keep-alives are also meant to re-establish the connection between devices and call control.
There are no specific requirements for this document.
This document is not restricted to specific software and hardware versions.
SCCP Keep-alives and Failover Mechanism
SCCP uses TCP for transport, on port 2000 and on port 2443 for secured connections, to connect to the Call Manager. An SCCP phone must establish a TCP connection with the Cisco Unified Communications Manager (CUCM) before it registers. To do so, a TCP three-way handshake takes place on port 2000 to establish a communication channel. The phone initiates the connection by sending a SYN (synchronize) to CUCM, and CUCM responds with a SYN, ACK (acknowledgement). The phone in turn responds with an ACK, and the TCP connection is established.
There are two keep-alive methods: application level (SKINNY keep-alive) and network level (TCP keep-alive).
In an ideal scenario, an SCCP phone keeps a TCP connection established to the primary CUCM and the first backup CUCM. The SCCP phone sends keep-alives to every CUCM to which it has an established TCP connection. The primary server then responds to the SCCP keep-alive. The keep-alive interval is 30 seconds to the primary server and 60 seconds to the backup server.
The primary CUCM responds with an SCCP keep-alive ACK, which acknowledges both the SCCP and the TCP connection. The backup CUCM just sends a TCP ACK to the keep-alive sent by the phone. When the phone fails over to the backup CUCM, because the Call Manager service is not available or the TCP connection to the primary CUCM is itself unavailable, it uses one of two mechanisms to detect the primary CUCM failure: normal and delayed.
The normal failover method uses an algorithm that calculates the average time taken by the CUCM to acknowledge the previous keep-alives.
For example, if CUCM took an average of X seconds to respond to the past 10000 keep-alives, the phone waits X seconds before it declares the CUCM failed. It then tries to register to the backup CUCM.
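As a rough illustration of the averaging described above, the following Python sketch keeps a rolling window of acknowledgement times and derives the wait time from their mean. The class name, the `window` argument, and the use of a plain arithmetic mean are assumptions made for illustration; the algorithm actually used by the phone firmware is not specified in this document.

```python
from collections import deque


class AckTimeTracker:
    """Rolling average of keep-alive acknowledgement times (illustrative only)."""

    def __init__(self, window=10000):
        # Assumption: only the most recent `window` samples are averaged.
        self.samples = deque(maxlen=window)

    def record(self, ack_time_s):
        # Store the time (in seconds) CUCM took to acknowledge one keep-alive.
        self.samples.append(ack_time_s)

    def failure_wait_s(self):
        # Average response time X; the phone waits X seconds before it
        # treats the primary CUCM as failed.
        if not self.samples:
            raise ValueError("no keep-alive samples recorded yet")
        return sum(self.samples) / len(self.samples)


# Small window so the effect of the rolling average is easy to see.
tracker = AckTimeTracker(window=3)
for ack_time in (0.2, 0.4, 0.6, 0.8):
    tracker.record(ack_time)
print(round(tracker.failure_wait_s(), 3))  # mean of the last three samples
```

With a window of 10000 the oldest samples would drop out in the same way, only much later.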
In the delayed failover mechanism, the phone waits for three keep-alive intervals before it declares the primary CUCM failed.
In networks where the transit time of packets fluctuates, delayed failover helps avoid unnecessary unregistration.
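The delayed failover rule can be sketched as a simple time comparison. The function name and parameters below are hypothetical, the 30-second default is the primary keep-alive interval mentioned earlier, and whether the phone uses "greater than" or "greater than or equal" at the boundary is an assumption.

```python
def primary_declared_failed(now_s, last_ack_s,
                            keepalive_interval_s=30.0, missed_intervals=3):
    """Return True once no ACK has arrived for `missed_intervals` intervals."""
    # Delayed failover: tolerate short gaps, fail only after several
    # keep-alive intervals have passed without an acknowledgement.
    return (now_s - last_ack_s) >= missed_intervals * keepalive_interval_s


# Two intervals without an ACK: still considered registered.
print(primary_declared_failed(now_s=60.0, last_ack_s=0.0))
# Three intervals without an ACK: the primary is treated as failed.
print(primary_declared_failed(now_s=90.0, last_ack_s=0.0))
```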
Example of Transit Time Fluctuation (Note the time delay for ping response):
64 bytes from 10.106.97.150: icmp_seq=1 ttl=63 time=0.100 ms
64 bytes from 10.106.97.150: icmp_seq=2 ttl=63 time=200 ms
64 bytes from 10.106.97.150: icmp_seq=3 ttl=63 time=0.180 ms
64 bytes from 10.106.97.150: icmp_seq=4 ttl=63 time=0.678 ms
64 bytes from 10.106.97.150: icmp_seq=5 ttl=63 time=590 ms
64 bytes from 10.106.97.150: icmp_seq=6 ttl=63 time=0.100 ms
64 bytes from 10.106.97.150: icmp_seq=7 ttl=63 time=345 ms
64 bytes from 10.106.97.150: icmp_seq=8 ttl=63 time=456 ms
64 bytes from 10.106.97.150: icmp_seq=9 ttl=63 time=0.345 ms
Delayed failover can be used in delay-sensitive networks.
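To spot this kind of fluctuation quickly, the round-trip times can be extracted from the ping output and compared against a threshold. The sketch below parses the sample shown above; the function names and the 100 ms threshold are assumptions chosen for illustration, not values recommended by Cisco.

```python
import re

PING_SAMPLE = """\
64 bytes from 10.106.97.150: icmp_seq=1 ttl=63 time=0.100 ms
64 bytes from 10.106.97.150: icmp_seq=2 ttl=63 time=200 ms
64 bytes from 10.106.97.150: icmp_seq=3 ttl=63 time=0.180 ms
64 bytes from 10.106.97.150: icmp_seq=4 ttl=63 time=0.678 ms
64 bytes from 10.106.97.150: icmp_seq=5 ttl=63 time=590 ms
64 bytes from 10.106.97.150: icmp_seq=6 ttl=63 time=0.100 ms
64 bytes from 10.106.97.150: icmp_seq=7 ttl=63 time=345 ms
64 bytes from 10.106.97.150: icmp_seq=8 ttl=63 time=456 ms
64 bytes from 10.106.97.150: icmp_seq=9 ttl=63 time=0.345 ms
"""

# Matches the icmp_seq and time fields of one ping reply line.
REPLY_RE = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")


def parse_rtts(ping_output):
    """Return a list of (icmp_seq, rtt_ms) tuples found in the ping output."""
    return [(int(seq), float(rtt)) for seq, rtt in REPLY_RE.findall(ping_output)]


def find_spikes(rtts, threshold_ms=100.0):
    """Return the replies whose round-trip time exceeds the threshold."""
    return [(seq, rtt) for seq, rtt in rtts if rtt > threshold_ms]


rtts = parse_rtts(PING_SAMPLE)
for seq, rtt in find_spikes(rtts):
    print(f"icmp_seq={seq} took {rtt} ms")
```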
The SIP phone registers to the CUCM and sends a keep-alive every 120 seconds, as per the settings in CUCM. When the phone sends the initial REGISTER to the primary CUCM, it sets the Expires timer to 3600 seconds (the default set in the SIP profile applied to the phone). CUCM acknowledges the request and changes the timer to 120 seconds, as per the value set in the service parameter.
The phone therefore refreshes its registration before the 120-second timer expires. The actual keep-alive interval is 115 seconds: 120 seconds minus the delta value configured in the SIP profile, which is 5 seconds by default.
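The arithmetic behind the 115-second interval can be written out as a small helper. The function and parameter names are hypothetical; the values 120 and 5 are the ones quoted above.

```python
def sip_keepalive_interval_s(granted_expires_s, register_delta_s=5):
    """Seconds between REGISTER refreshes: granted Expires minus the delta."""
    if register_delta_s >= granted_expires_s:
        raise ValueError("delta must be smaller than the granted Expires value")
    return granted_expires_s - register_delta_s


# The phone asked for 3600 seconds, but CUCM granted 120 seconds.
print(sip_keepalive_interval_s(granted_expires_s=120))
```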
The SIP phone sends a REGISTER message to the backup CUCM with the Expires field set to 0.
Collect captures from both ends to verify whether the keep-alive sent by the phone actually reaches the CUCM.
The sequence number of a TCP packet makes it easy to track the TCP traffic between the phone and CUCM in a sniffer capture.
Capture from Phone
The phone sends a packet with sequence number 2991996107. Verify that this packet reaches the CUCM.
Capture from CUCM
The sequence number seen in the phone sniffer capture should also be seen in the CUCM capture.
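Once the TCP sequence numbers have been exported from both captures (for example as plain lists), the comparison can be automated. The following sketch assumes the sequence numbers are already available as integers; the function name is hypothetical, and every value other than 2991996107 is made up for the example.

```python
def missing_at_receiver(sent_seqs, received_seqs):
    """Return sequence numbers seen at the sender but not at the receiver."""
    received = set(received_seqs)
    # Preserve the sender's order so the first lost packet is listed first.
    return [seq for seq in sent_seqs if seq not in received]


# Sequence numbers taken from the phone capture (sender side).
phone_capture = [2991996043, 2991996075, 2991996107]
# Sequence numbers taken from the CUCM capture (receiver side).
cucm_capture = [2991996043, 2991996075]

for seq in missing_at_receiver(phone_capture, cucm_capture):
    print(f"sequence number {seq} never reached CUCM")
```

An empty result means every packet sent by the phone was also captured on CUCM, which points away from a drop in the network path.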
Case Study 1.2
SCCP phones keep restarting at regular intervals.
The Event Viewer Application log indicates that the phones kept restarting due to missing keep-alives, with an error code of 13.
Event Viewer Message.
Collect packet captures from the IP phone and CUCM. In this scenario, the last keep-alive sent from the IP phone did not reach CUCM.
The keep-alive is dropped for this reason:
When the phone sent an ARP request to get the MAC address of CUCM, a response came from the ARP proxy with the ASA MAC address. Clearly, the first response was not from CUCM. However, because the phone receives it first, it sends the frame to the switch with the MAC address of the other device.
This happens mostly when ARP proxy is enabled on the ASA.
Disable ARP proxy on the ASA to address the problem.
Case Study 2.
Cisco IP Phone model 8961 phones reset every 16 minutes and register to the secondary CUCM. After 2 minutes, the phones fall back to the primary CUCM, and the cycle continues.
Collect packet captures from the phone and CUCM traces. The unregistration was due to SIP keep-alives missed by the IP phone.
The SIP phone registers to the CUCM and sends a keep-alive every 120 seconds, as per the settings in CUCM.
When the phone sends the initial REGISTER, it sets the Expires timer to 3600 seconds (the default set in the SIP profile applied to the phone). CUCM acknowledges it and changes the timer to 120 seconds, as per the value set in the service parameter.
The actual keep-alive interval is 115 seconds: 120 seconds minus the delta value configured in the SIP profile, which is 5 seconds by default.
In this problem scenario, the phone sends the first keep-alive at 115 seconds and it is dropped in the network. As a result, the phone retransmits the keep-alive after 0.1 seconds (100 ms). It then gets a response from CUCM for the REGISTER request.
The phone sends the second keep-alive after another 115 seconds, and it is also dropped in the network. The phone now increases its REGISTER retry interval to 0.2 seconds (200 ms).
Every time the phone sends a keep-alive after 115 seconds, it is dropped in the network, which makes the phone retransmit the packet. The phone also exponentially increases its retry interval. After a few such keep-alives, the phone's retry interval increases to 14 seconds.
The phone retransmits after 14 seconds and gets an ACK from the CUCM.
The next time the phone sends a keep-alive, it is lost, and the phone retransmits the REGISTER request after 28 seconds. The CUCM cannot wait 28 seconds; it waits only 15 seconds (after the 115 seconds) and then sends the unregister signal.
The keep-alive intervals and the retransmission timeouts (RTO) add up to 16 minutes and a few seconds.
After 16 minutes, because of the unregister signal from CUCM, the phones register to the secondary CUCM. After 2 minutes they register back to the primary, and the cycle continues.
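The interaction between the growing retry interval and the fixed 15-second wait on CUCM can be modelled with a short sketch. It assumes the retry interval simply doubles after each dropped keep-alive; the function name is hypothetical, and the 7-second starting value is chosen only so that the last two steps match the 14-second and 28-second retries described above.

```python
def retry_schedule(initial_retry_s, grace_s, factor=2.0):
    """Return retry intervals up to and including the first that exceeds grace_s."""
    if initial_retry_s <= 0 or factor <= 1:
        raise ValueError("initial retry must be positive and factor above 1")
    schedule = []
    retry_s = initial_retry_s
    while True:
        schedule.append(retry_s)
        if retry_s > grace_s:
            # This retransmission arrives too late: CUCM has already
            # unregistered the phone.
            return schedule
        retry_s *= factor


schedule = retry_schedule(initial_retry_s=7.0, grace_s=15.0)
print(schedule)
print(f"registration is lost when the retry interval reaches {schedule[-1]} seconds")
```

Every interval except the last still fits inside the wait window, which is why the phone stays registered for several keep-alive cycles before it finally drops.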
Cause for the keep-alive drops
The switch port was configured with port security, and port aging was configured with an inactivity timer. The timer was set to one minute, which is less than the SIP keep-alive timer. As a result, the switch port flushed the phone MAC address every minute. The packets keep being dropped because the SIP keep-alive interval is 2 minutes.
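The root cause comes down to one comparison: the inactivity aging time is shorter than the gap between keep-alives, so the MAC entry expires while the phone is idle. A minimal sketch of that check follows, with a hypothetical function name and the values from this case study.

```python
def mac_ages_out_between_keepalives(aging_time_s, keepalive_interval_s):
    """Return True if the MAC entry expires before the next keep-alive is sent."""
    # With inactivity aging, the entry is removed after `aging_time_s`
    # seconds without traffic from the phone.
    return aging_time_s < keepalive_interval_s


# One-minute aging timer versus the 115-second SIP keep-alive interval.
print(mac_ages_out_between_keepalives(aging_time_s=60, keepalive_interval_s=115))
```

An aging time longer than the keep-alive interval would keep the MAC entry in place between keep-alives.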