
Cisco Catalyst 6000 Series Switches

Field Notice: Cat6xxx Switches And 76xx Series Routers Running A Sup2 With CatOS Versions 7.6(1) Through 7.6(4) May Hang After Running For A Period Of Time


Revised July 27, 2004

July 14, 2004


Products Affected

Product             Comment
WS-X6K-S2-MSFC2     Specific to Sup2 only running CatOS 7.6(1) through 7.6(4)
WS-X6K-S2-PFC2      Specific to Sup2 only running CatOS 7.6(1) through 7.6(4)
WS-X6K-S2U-MSFC2    Specific to Sup2 only running CatOS 7.6(1) through 7.6(4)
WS-X6K-SUP2-2GE     Specific to Sup2 only running CatOS 7.6(1) through 7.6(4)

Problem Description

Cat6xxx Switches and 76xx Series Routers running a Sup2 with CatOS versions 7.6(1) through 7.6(4) may hang after running for a period of time.

Background

This issue was first seen at one site with two different switches running the same software with similar configurations. They were not started at the same time, so they did not hang at the same time, but both had been running for approximately seven months. Both became stuck: no IP connectivity with sc0, no console prompt through the terminal server, and no LED activity. The only workaround was to power cycle the boxes.

  1. This bug affects only 6xxx Switches and 76xx Routers with a Sup2. Sup720 or any version of Sup1 is not affected.

  2. This bug affects only 7.6(1) through 7.6(4).

  3. This bug is fixed via CSCeb37694 (registered customers only). The fix for CSCeb37694 (registered customers only) is present only in 7.6(5) and later.

There are two triggers to this bug:

Trigger A - Present only in 7.6(1) and 7.6(4). A 5-second polling timer was introduced via CSCeb38474 (registered customers only) in 7.6(1), removed in 7.6(2), 7.6(2a), 7.6(3), and 7.6(3a), and re-introduced in 7.6(4) via another bug. With the 5-second polling timer in place, 7.6(1) and 7.6(4) run into CSCeb37694 (registered customers only) sooner.

The root cause of this caveat is CSCeb37694 (registered customers only) and not the two bugs which changed the polling timer. The timer changes simply exposed the bug sooner.

Trigger B - Present in 7.6(1), 7.6(2), 7.6(2a), 7.6(3), 7.6(3a) and 7.6(4):

Cisco recently uncovered another trigger that can expose the file system bug. This trigger is "user induced" and occurs when many squeeze operations are performed on the bootflash. The device could hang after as few as two or three squeeze operations, or only after as many as 100.

The uptime is not relevant for trigger B, so reloading the affected devices is not a valid workaround.

Note: Both Trigger A and Trigger B are present in 7.6(1) and 7.6(4); hence, the chances of running into the problem with 7.6(1) or 7.6(4) are higher.
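As an illustration only, the version lists given above for the two triggers can be written as a small lookup. This is a minimal sketch: the function name triggers_for and the version-string format are assumptions made for this example, and the result applies only to a Sup2.

```python
# Version lists copied from the Trigger A and Trigger B descriptions in this notice.
TRIGGER_A = {"7.6(1)", "7.6(4)"}
TRIGGER_B = {"7.6(1)", "7.6(2)", "7.6(2a)", "7.6(3)", "7.6(3a)", "7.6(4)"}


def triggers_for(catos_version):
    """Return the triggers ("A", "B") present in a CatOS version on a Sup2.

    Hypothetical helper: the version is expected as a string such as "7.6(3a)".
    """
    version = catos_version.strip()
    candidates = (("A", TRIGGER_A), ("B", TRIGGER_B))
    return [name for name, versions in candidates if version in versions]


print(triggers_for("7.6(1)"))   # both triggers present
print(triggers_for("7.6(3a)"))  # Trigger B only
print(triggers_for("7.6(6)"))   # neither trigger
```

An empty list means that neither trigger is listed for that version in this notice.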

The caveat CSCeb37694 (registered customers only) is a regression introduced due to a fix for a previous bug which changed some Mistral Drivers for file system access. Since the previous bug's fix was only in 7.6.x, no other 7.x.x releases such as 7.1.x to 7.5.1 were affected.

Miscellaneous information:

When this condition is hit, why do customers report a repeated pattern of different characters like repeated patterns of 'r' , '.' or some other character when they connect a console?

Answer:

When the CPU overwhelms the 9600 baud serial link, the character stream coming out of Mistral (the system controller) is contiguous and steady, with no idle gaps between characters. The terminal driver looks for a start bit (a high-to-low transition), treats the next eight bits as data, and then expects a stop bit. The next high-to-low transition signals the start of the next character. With a steady stream from Mistral, depending on when HyperTerminal or another console program starts sampling the incoming stream, the driver may mistake one of the data bits of the eight-bit character ("." or "space") for the actual start bit. Hence the pattern can be different each time you connect.
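The framing effect described in this answer can be reproduced with a small simulation. This is a sketch under stated assumptions, not a model of the Mistral hardware: it assumes 8N1 framing (one start bit, eight data bits sent least significant bit first, one stop bit), a gap-free stream of the '.' character, and a simplified receiver that ignores framing errors. The names frame_8n1 and decode_from are made up for this example.

```python
def frame_8n1(byte):
    """Return the 10 line bits for one byte: start (0), 8 data bits LSB first, stop (1)."""
    return [0] + [(byte >> n) & 1 for n in range(8)] + [1]


def decode_from(bits, start):
    """Decode bytes as a simple receiver would if it began listening at index `start`."""
    decoded = []
    i = start + 1
    while i + 9 < len(bits):
        if bits[i - 1] == 1 and bits[i] == 0:
            # High-to-low transition: assume this is a start bit and take
            # the next eight bits as data (stop bit is not validated here).
            data = bits[i + 1:i + 9]
            decoded.append(sum(bit << n for n, bit in enumerate(data)))
            i += 10  # resume the edge search right after the stop-bit slot
        else:
            i += 1
    return decoded


# A contiguous stream of '.' characters with no idle time between frames.
stream = frame_8n1(ord(".")) * 50

for offset in (8, 5, 0):
    values = sorted(set(decode_from(stream, offset)))
    print(offset, [hex(v) for v in values])
```

In this simulation the receiver locks onto a different falling edge depending on the offset at which it starts listening. Starting at offset 8 it decodes the true character '.' (0x2e), starting at offset 5 it decodes a steady stream of 'r' (0x72), and starting at offset 0 it decodes the byte 0xc9.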

Problem Symptoms

A certain type of symptom has been reported by some customers, only with Sup2 running 7.6(1) code.

The symptom includes all of the following:

  1. Unable to reach the switch via telnet or ping

  2. Unable to access the switch via snmp or other management applications.

  3. Able to reach the MSFC via telnet or ping

  4. When connected via console, either nothing is output or garbled characters are output (a repeating 'R', '.', or some other character).

  5. System status LED is normal (green) and Backplane utilization LED at zero percent. In some cases, the Traffic meter LED may be at 100 percent.

  6. System has been up for approximately seven months.

Workaround/Solution

For Trigger A:

The fix has been implemented in 7.6(5), but due to a 7.6(5) issue, 7.6(6) is the recommended image.

For additional information on the 7.6(5) issue refer to the Cisco Security Advisory: Cisco CatOS Telnet, HTTP and SSH Vulnerability.

Are you running into this issue?

  1. Do you have a Sup2 running 7.6(1) through 7.6(4)?

  2. Has the switch been up for approximately 220 days?

  3. Confirm symptoms described above.

  4. Proactively monitor your other switches running 7.6(1) or later. If any switches are nearing this window (the safe window is around 150 days), do the following during a scheduled maintenance window:

Upgrade to CatOS 7.6(6) or later.

If an upgrade is not possible at this time do the following:

For Trigger B:

Do you have a Sup2 running 7.6(1) through 7.6(4)?

Upgrade to CatOS 7.6(6) or later.

Avoid performing a squeeze operation on the affected devices until the software is upgraded to 7.6(6) or later.

If an upgrade is not possible at this time do the following:

Single Supervisor:

Schedule a maintenance window to reset the supervisor when the uptime is closer to 150 days. The uptime can be viewed via show system or show version.
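The uptime figures quoted in this notice (a safe window of around 150 days, and hangs after approximately 220 days) can be turned into a simple check for a monitoring script. This is a minimal sketch: the function name, the constant names, and the status strings are assumptions made for this example, both thresholds are approximate, and the check is relevant only to Trigger A because uptime does not matter for Trigger B. The uptime in days must be read from show system or show version by other means.

```python
# Thresholds taken from this notice; treat them as approximate, not exact.
SAFE_WINDOW_DAYS = 150  # "safe window would be around 150 days"
HANG_WINDOW_DAYS = 220  # hangs reported after approximately 220 days of uptime


def uptime_status(uptime_days):
    """Classify a Sup2 on CatOS 7.6(1) through 7.6(4) by uptime (Trigger A only)."""
    if uptime_days < SAFE_WINDOW_DAYS:
        return "inside safe window"
    if uptime_days < HANG_WINDOW_DAYS:
        return "schedule upgrade or reset"
    return "at or past approximate hang window"


for days in (90, 160, 225):
    print(days, uptime_status(days))
```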

Dual Supervisor:

In Dual Supervisor scenarios, the other supervisor will take over in ten minutes.

However, if the required maintenance window can be scheduled, then do the following:

If HA is not enabled, enable HA using set system highavailability enable. Wait for HA to sync. show system highavailability should say 'ON' for Highavailability Operational-status.

Switch to the other supervisor using switch supervisor.

Note: To accurately determine the uptime of the box prior to hitting this condition, do a show log, look at the Reboot History, and subtract the second most recent reboot time from the most recent one.

Here is an example from a box where the primary supervisor (in slot 1) switched over after 225 days.

***************** show log ****************** 

Network Management Processor (STANDBY NMP) Log: 

Reset count: 3 

Re-boot History: Dec 10 2003 13:35:48 0, Apr 27 2003 13:15:31 0 <- 225 days! 
Apr 10 2003 13:17:20 0
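The subtraction between the last two reboot entries can be done with a short script. This is a sketch that assumes the timestamp layout seen in the sample log above ("Dec 10 2003 13:35:48") and an English month-name locale; the function name days_between_reboots is made up for this example.

```python
from datetime import datetime

# Timestamp layout assumed from the Re-boot History sample above.
REBOOT_FORMAT = "%b %d %Y %H:%M:%S"


def days_between_reboots(earlier, later, fmt=REBOOT_FORMAT):
    """Return the number of whole days between two reboot timestamps."""
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.days


# The two most recent entries from the sample log.
print(days_between_reboots("Apr 27 2003 13:15:31", "Dec 10 2003 13:35:48"))
```

For the two most recent entries in the sample, exact calendar arithmetic gives 227 whole days, in the same range as the roughly 225 days annotated in the log.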

DDTS

To follow the bug ID link below and see detailed bug information, you must be a registered user and you must be logged in.

DDTS                                      Description
CSCeb37694 (registered customers only)    cat6000 hangs after running approx 7 months

For More Information

If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC) by one of the following methods:
