Cisco Content Delivery Engine Series

Field Notice: FN - 63739 - VDS - CDE250-K9 Server Might Hang - Workaround Available

Revised May 1, 2014
April 8, 2014


NOTICE:

THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.

Revision History

Revision  Date         Comment
1.1       01-MAY-2014  Updated the Workaround/Solution Section
1.0       08-APR-2014  Initial Public Release

Products Affected

CDS-TV - 2.5.2 
CDS-TV - 2.5.3 
CDS-TV - 2.5.5
CDS-TV - 2.5.6 
CDS-TV - 2.5.7 
CDS-TV - 3.0.1 
CDS-TV - 3.0.2 
CDS-TV - 3.0.3 
CDS-TV - 3.2.1 
CDS-TV - 3.2.2 
CDS-TV - 3.2.3 
CDS-TV - 3.2.4 
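
When auditing a group of servers, the release list above can be encoded as a small lookup. The following is a minimal sketch only: the function name is_affected_release is hypothetical, and the release string is passed in as an argument because this notice does not say how to read it from a running system.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Cisco tooling): succeeds if the
# given CDS-TV release string appears in the Products Affected list.
is_affected_release() {
    case "$1" in
        2.5.2|2.5.3|2.5.5|2.5.6|2.5.7) return 0 ;;
        3.0.1|3.0.2|3.0.3)             return 0 ;;
        3.2.1|3.2.2|3.2.3|3.2.4)       return 0 ;;
        *)                             return 1 ;;
    esac
}
```

For example, is_affected_release 3.0.2 succeeds, while a release that is not listed (such as 2.5.4) fails.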

Problem Description

A CDE250 (deployed as a vault, Caching Gateway (CGW), or streamer) that runs a release listed in the Products Affected section might exhibit any of these conditions:

  • Vital Statistics in the Virtual Video Infrastructure Manager (VVIM) are blank and are not displayed.

  • The Super Doctor Tool (SDT) produces all zero results.

  • The system might exhibit any of these symptoms:

    • It might be unresponsive.
    • It might not respond to keyboard input.
    • It might not have any video output.
    • The reset button and/or power button might not work when pressed.
    • The internal Solid State Drives (SSDs) might become read-only.

  • The power cables might need to be removed from the system in order to restore normal operation.

Background

In mid-2013, there was an increasing number of CDE250 vaults that would hang during operation and were submitted for a Return Material Authorization (RMA). As the number of deployed vaults increased, so did the number of hangs that were observed and reported.

The root cause has been isolated and a fix has been determined.

This field notice informs both field personnel and customers about this issue and explains how to resolve it in order to maintain the integrity of the systems.

Problem Symptoms

The issue has been identified as a conflict between two hardware sub-systems that each try to take control of the I2C management bus. When both sub-systems try to take control of the bus at the same time, the system might hang and become inoperable. Another common symptom is that the internal SSDs become read-only. In that state, the system does not accept new streaming work because the database cannot be written to and the log files cannot be updated.

This conflict condition, which might lead to the hang or SSDs becoming read-only, is due to the Intelligent Platform Management Interface (IPMI) driver not being loaded.
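
Before applying the workaround, it can be useful to check whether any IPMI kernel module is currently loaded. The following is a minimal sketch that assumes the driver appears in lsmod output under a module name starting with "ipmi_", as the standard Linux IPMI modules do. This notice does not state which modules enable_ipmi.sh loads, so treat a negative result as a hint rather than proof. The function name ipmi_modules_loaded is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads lsmod-style output on stdin and succeeds
# if any module name in the first column starts with "ipmi_".
# Assumption: the IPMI driver is a loadable module with such a name.
ipmi_modules_loaded() {
    awk '$1 ~ /^ipmi_/ { found = 1 } END { exit !found }'
}

# Usage on a live system:
#   lsmod | ipmi_modules_loaded && echo "IPMI module present"
```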

Logs might be retrieved with this Linux command:

#ipmiutil sel

Retrieved logs indicate a variety of errors, which include the errors listed here (this is not an exhaustive list, only a sample of the type of errors seen):

Mar 18 04:02:01.000000 VAULT1 SEL: 0001 01/04/07 15:10:06 BMC 05 Platform Security #aa Chassis Intrusion 6f [00 ff ff]
Mar 18 04:02:01.000000 VAULT1 SEL: 01e1 01/04/07 15:15:03 BMC 02 Voltage #10 Lo NoRec thresh act=41 thr=78
Mar 18 04:02:01.000000 VAULT1 SEL: 01e2 01/04/07 15:15:06 BMC 02 Voltage #11 Lo Noncrit thresh act=41 thr=7a
Mar 18 04:02:01.000000 VAULT1 SEL: 01e3 01/04/07 15:15:06 BMC 02 Voltage #11 Lo Crit thresh act=41 thr=79
Mar 18 04:02:01.000000 VAULT1 SEL: 01e4 01/04/07 15:15:06 BMC 02 Voltage #11 Lo NoRec thresh act=41 thr=78
Mar 18 04:02:02.000000 VAULT1 SEL: 0200 01/04/07 15:20:52 BMC 04 Fan #16 Lo Noncrit thresh act=00 thr=05
Mar 18 04:02:02.000000 VAULT1 SEL: 0201 01/04/07 15:20:52 BMC 04 Fan #16 Lo Crit thresh act=00 thr=04
Mar 18 04:02:02.000000 VAULT1 SEL: 0202 01/04/07 15:20:53 BMC 04 Fan #16 Lo NoRec thresh act=00 thr=03
Mar 18 04:02:02.000000 VAULT1 SEL: 0203 01/04/07 15:20:53 BMC 04 Fan #17 Lo Noncrit thresh act=00 thr=05
Mar 18 04:02:02.000000 VAULT1 SEL: 0204 01/04/07 15:20:53 BMC 04 Fan #17 Lo Crit thresh act=00 thr=04
Mar 18 04:02:02.000000 VAULT1 SEL: 0205 01/04/07 15:20:54 BMC 04 Fan #17 Lo NoRec thresh act=00 thr=03
Mar 18 04:02:02.000000 VAULT1 SEL: 0206 01/04/07 15:20:54 BMC 04 Fan #19 Lo Noncrit thresh act=00 thr=05
Mar 18 04:02:02.000000 VAULT1 SEL: 0207 01/04/07 15:20:54 BMC 04 Fan #19 Lo Crit thresh act=00 thr=04
Mar 18 04:02:02.000000 VAULT1 SEL: 0208 01/04/07 15:20:55 BMC 04 Fan #19 Lo NoRec thresh act=00 thr=03
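
A long event log is easier to triage when the records are counted per sensor. The following is a minimal sketch, assuming the log text has the same shape as the sample above, where each record names a sensor as "Fan #NN" or "Voltage #NN". The function name summarize_sel is hypothetical, and other sensor types (such as the Platform Security record) are ignored on purpose.

```shell
#!/bin/sh
# Hypothetical helper: reads SEL log text on stdin and prints one line
# per Fan or Voltage sensor in the form "<type> #<id> <record count>".
# Requires a grep that supports -o (GNU and BSD grep both do).
summarize_sel() {
    grep -oE '(Fan|Voltage) #[0-9a-f]+' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $3, $1 }'
}

# Usage on a live system:
#   ipmiutil sel | summarize_sel
```

Several threshold records that report act=00 for the same fan sensor, as in the sample, stand out immediately in the summary.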

Workaround/Solution

For an immediate solution to this issue, enter this command on each server: 

/usr/sbin/enable_ipmi.sh

Once that is completed, in order to ensure that the IPMI driver is loaded on every boot, edit the /etc/rc.local file and find this line:

#/usr/sbin/enable_ipmi.sh

Remove the '#' at the start of the line. This uncomments the line and allows the command to be executed each time the system is rebooted. Failure to do this could cause the system to hang as described in this document.

Note: On vaults, the /usr/sbin/enable_ipmi.sh line does not exist in the rc.local file and must be inserted manually before the statsd line.
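
The manual edit described above can be scripted when many servers need the change. The following is a minimal sketch, not a Cisco-provided tool: the function name enable_ipmi_in_rc_local is hypothetical, it assumes the statsd line is the first uncommented line that contains "statsd", and it writes a backup copy with a .bak suffix before it changes anything. Review the resulting file by hand before you rely on it.

```shell
#!/bin/sh
# Hypothetical helper: makes sure /usr/sbin/enable_ipmi.sh is an active
# line in rc.local. Pass a file path to operate on a copy for testing.
enable_ipmi_in_rc_local() {
    rc_file="${1:-/etc/rc.local}"
    cmd='/usr/sbin/enable_ipmi.sh'

    # Already active: nothing to do.
    if grep -q "^[[:space:]]*${cmd}" "$rc_file"; then
        return 0
    fi

    # Keep a backup, then rewrite the file from it.
    cp -p "$rc_file" "${rc_file}.bak" || return 1

    if grep -q "^[[:space:]]*#[[:space:]]*${cmd}" "$rc_file"; then
        # Streamer/CGW case: remove the leading '#' to uncomment the line.
        sed "s|^[[:space:]]*#[[:space:]]*\(${cmd}\)|\1|" \
            "${rc_file}.bak" > "$rc_file"
    elif grep -q '^[^#]*statsd' "$rc_file"; then
        # Vault case: insert the command before the first statsd line.
        awk -v cmd="$cmd" '
            !done && /^[^#]*statsd/ { print cmd; done = 1 }
            { print }
        ' "${rc_file}.bak" > "$rc_file"
    else
        # Neither the commented line nor a statsd line was found.
        return 1
    fi
}
```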

Sample Configuration

These are the last few lines of the rc.local file:

# Uncomment the following line for kdump
echo 1 > /proc/sys/kernel/panic
#Uncomment the following line for CDE250 Hardware Monitoring
/usr/sbin/enable_ipmi.sh

/home/stats/statsd -d eth0

Note: If the system is already in the hung state, you must also reboot the system and remove the power cables after you perform this workaround.
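
After the file has been edited, the result can be checked against the sample configuration. The following is a minimal sketch with a hypothetical function name, rc_local_ipmi_ok. It succeeds only if an uncommented /usr/sbin/enable_ipmi.sh line exists and, when an uncommented statsd line is present, the IPMI line comes before it.

```shell
#!/bin/sh
# Hypothetical check: verifies that enable_ipmi.sh is active in rc.local
# and is placed before the statsd line, as in the sample configuration.
rc_local_ipmi_ok() {
    awk '
        /^[ \t]*\/usr\/sbin\/enable_ipmi\.sh/ { if (!ipmi) ipmi = NR }
        /^[^#]*statsd/                         { if (!statsd) statsd = NR }
        END { exit !(ipmi && (!statsd || ipmi < statsd)) }
    ' "${1:-/etc/rc.local}"
}

# Usage:
#   rc_local_ipmi_ok && echo "rc.local looks correct"
```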

DDTS

To follow the bug ID link below and see detailed bug information, you must be a registered customer and you must be logged in.

DDTS                                     Description
CSCuo15002 (registered customers only)   If IPMI is not loaded it can cause CDE250 hangs

For More Information

If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC).
