Cisco 12016, Cisco 12416, and Cisco 12816 Router Installation and Configuration Guide
Chapter 5 - Troubleshooting the Installation


Table of Contents

Troubleshooting the Installation

Troubleshooting Overview

Troubleshooting Using a Subsystem Approach

Normal Router Startup Sequence

Identifying Startup Issues

Troubleshooting the Power Subsystem

Troubleshooting the AC-Input Power Subsystem

Troubleshooting the DC-Input Power Subsystem

Additional Power Subsystem Troubleshooting Information

Troubleshooting the Power Distribution System

Troubleshooting the Processor Subsystem

Troubleshooting the RP

Troubleshooting Using the RP Alphanumeric Display

Troubleshooting Line Cards

Troubleshooting Using the Line Card Alphanumeric Display

Troubleshooting Using the Alarm Cards

Monitoring Critical, Major, and Minor Alarm Status

Troubleshooting the Switch Fabric

Analyzing the Data

crc16 Output

Grant Parity and Request Errors

Properly Seating Switch Fabric Cards

Troubleshooting the Cooling Subsystem

Blower Module Operation

Power Supply Operation

Overtemperature Conditions

Isolating Cooling Subsystem Problems


Troubleshooting the Installation


This chapter contains general troubleshooting information to help isolate the cause of difficulties you might encounter during the installation and initial startup of the system.

The procedures in this chapter assume that you are troubleshooting the initial startup of the router, as described in the "Powering On the Router and Observing the Boot Process" section on page 4-4, and that the system is running the original configuration. If you altered the original hardware configuration or changed any default configuration settings, the recommendations in this chapter may not apply.

Although an overtemperature condition is unlikely at initial startup, environmental monitoring functions are included in this chapter because they also monitor internal voltages.

Troubleshooting the installation is presented in the following sections:

Troubleshooting Overview

Troubleshooting the Power Subsystem

Troubleshooting the Processor Subsystem

Troubleshooting the Switch Fabric

Troubleshooting the Cooling Subsystem

Troubleshooting Overview

This section describes the methods used in troubleshooting the router. The troubleshooting methods are organized according to the major subsystems in the router.

If you are unable to solve a problem on your own, you can contact a Cisco customer service representative for assistance. When you call, have the following information ready:

Date you received the router and the chassis serial number (located on a label on the back of the chassis).

Installed line cards.

If possible, use the show hardware command to determine which line cards are installed.

Cisco IOS software release number.

If possible, use the show version command to determine the release number.

Brief description of the symptoms and steps you have taken to isolate and solve the issue.

Maintenance agreement or warranty information.

Troubleshooting Using a Subsystem Approach

To solve a system problem, try to isolate the problem to a specific subsystem by comparing current router behavior with expected router behavior. Because a startup issue is usually attributable to a single component, it is more efficient to isolate the problem to a subsystem than to troubleshoot each router component individually.

For troubleshooting purposes in this chapter, the router consists of the following subsystems:

Power subsystem—Includes the following components:

AC-input or DC-input power supplies, also called power entry modules (PEMs). The router is shipped with fully redundant PEMs installed in the chassis.

Chassis backplane power distribution. -48 VDC power from the power supplies is transferred to the chassis backplane. The -48 VDC is distributed to all of the cards through the backplane connectors. The blower module receives power from the chassis backplane through a wiring harness and passes MBus data back to the chassis backplane.

Processor subsystem—Includes redundant RPs, line cards, switch fabrics, and two alarm cards. The RP and line cards are equipped with onboard processors. The RP downloads a copy of the Cisco IOS image to each line card processor. The system uses an alphanumeric display (on each line card and RP) to display status and error messages, which can help in troubleshooting.

Cooling subsystem—Consists of two blower modules, which circulate air through the card cages to cool the cards, and a fan in each power module, which circulates cooling air through that module.

Normal Router Startup Sequence

You can generally determine when and where the router failed during the startup sequence by checking the status LEDs on the power modules and the alphanumeric displays on the RP and line cards.

In a normal router startup sequence, the following events and conditions occur:

1. The fans in the blower modules receive power and begin drawing air through the chassis.

The blower module OK indicator is on.

2. The fan in each PEM receives power and begins drawing air through the power supply.

The power supply PWR OK indicator is on.

3. As the power on and boot process progresses for the RP and each installed line card, the status of each card appears on the alphanumeric display on the front panel of the card:

The upper row of the display is powered by the DC-to-DC converter on the card.

The lower row of the display is powered by the +5 VDC provided through the backplane.

Identifying Startup Issues

Table 5-1 shows the contents of the alphanumeric displays on the RP and the line cards, as well as the normal LED states on the alarm card, the power modules (AC or DC), and the blower modules after a successful system startup.

Table 5-1 Alphanumeric Displays and LEDs at System Startup

Component | Type of Indicator | Display Contents/LED Status and Meaning
RP | Alphanumeric display | Upper row: MSTR; lower row: GRP or PRP. The RP is enabled and recognized by the system; a valid Cisco IOS software image is running.
Line cards | Alphanumeric display | Upper row: IOS; lower row: RUN. The line card is enabled and ready for use.
Alarm cards | Detected alarm severity | Critical: Off; Major: Off; Minor: Off
Alarm cards | Alarm card | Enabled: On; Fail: Off
Alarm cards | CSC 0 and 1 | Enabled: On; Fail: Off
Alarm cards | SFC 0, 1, 2, and 3 | Enabled: On; Fail: Off
2000 W AC power supplies | Power status | PWR OK: On; FAULT: Off; TEMP: Off; ILIM: Off. The correct power module voltages are present and no faults have been detected.
2500 W AC power supplies | Power status | PWR OK: On; FAULT: Off; TEMP: Off; OC: Off. The correct power module voltages are present and no faults have been detected.
2000 W DC power supplies | Power status | PWR OK: On; FAULT: Off; TEMP: Off. The correct power module voltages are present and no faults have been detected.
2400 W DC power supplies | Power status | PWR OK: On; FAULT: Off; TEMP: Off; OC: Off. The correct power module voltages are present and no faults have been detected.
Blower modules | Blower status | OK: On; FAIL: Off
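
The post-startup states in Table 5-1 can be treated as a lookup table when you walk the chassis. The following Python sketch is a hypothetical helper, not a Cisco tool: it assumes you record each LED by eye as "on" or "off" (and the display text for the RP and line cards), and the component and indicator names are labels chosen only for this illustration.

```python
from typing import Dict, List, Set, Tuple

# Allowed post-startup values, transcribed from Table 5-1. The component and
# indicator keys are hypothetical labels; LED values are recorded by eye.
EXPECTED: Dict[str, Dict[str, Set[str]]] = {
    "RP": {"upper": {"MSTR"}, "lower": {"GRP", "PRP"}},
    "line card": {"upper": {"IOS"}, "lower": {"RUN"}},
    "alarm severity": {"CRITICAL": {"off"}, "MAJOR": {"off"}, "MINOR": {"off"}},
    "alarm card": {"ENABLED": {"on"}, "FAIL": {"off"}},
    "CSC/SFC slot": {"ENABLED": {"on"}, "FAIL": {"off"}},
    "2000 W AC PEM": {"PWR OK": {"on"}, "FAULT": {"off"}, "TEMP": {"off"}, "ILIM": {"off"}},
    "2500 W AC PEM": {"PWR OK": {"on"}, "FAULT": {"off"}, "TEMP": {"off"}, "OC": {"off"}},
    "2000 W DC PEM": {"PWR OK": {"on"}, "FAULT": {"off"}, "TEMP": {"off"}},
    "2400 W DC PEM": {"PWR OK": {"on"}, "FAULT": {"off"}, "TEMP": {"off"}, "OC": {"off"}},
    "blower module": {"OK": {"on"}, "FAIL": {"off"}},
}

# (component, indicator, observed value, allowed values)
Mismatch = Tuple[str, str, str, Set[str]]


def find_mismatches(component: str, observed: Dict[str, str]) -> List[Mismatch]:
    """List every indicator whose observed value differs from Table 5-1.

    An indicator missing from `observed` is reported as "unrecorded".
    """
    mismatches: List[Mismatch] = []
    for indicator, allowed in EXPECTED[component].items():
        value = observed.get(indicator, "unrecorded")
        if value not in allowed:
            mismatches.append((component, indicator, value, allowed))
    return mismatches


if __name__ == "__main__":
    # A 2500 W AC PEM whose FAULT indicator is on.
    pem = {"PWR OK": "on", "FAULT": "on", "TEMP": "off", "OC": "off"}
    for item in find_mismatches("2500 W AC PEM", pem):
        print(item)
```

Keeping the expected states as data rather than as branching code makes it easy to check the table entry by entry against this chapter.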


Troubleshooting the Power Subsystem

This section contains information to help you troubleshoot the power subsystem:

Troubleshooting the AC-Input Power Subsystem

Troubleshooting the DC-Input Power Subsystem

Additional Power Subsystem Troubleshooting Information

Troubleshooting the Power Distribution System

Troubleshooting the AC-Input Power Subsystem

AC-input power supplies are monitored for internal temperature, voltage, and current load by the MBus module on the alarm cards, and by the master MBus module on the RP. If the router detects an extreme condition, it generates an alarm on the alarm card and logs the appropriate warning messages on the console.

Cisco 12016, Cisco 12416, and Cisco 12816 series routers are available with either an original (2000 W) or enhanced (2500 W) capacity AC power supply:

Figure 5-1 identifies the components of a 2000 W AC power supply.

Figure 5-1 2000 W AC Power Supply Components

Figure 5-2 identifies the components of a 2500 W AC power supply.

Figure 5-2 2500 W AC Power Supply Components

Callouts: 1 Ejector handle; 2 Captive screw


Use the following procedure to troubleshoot the AC power supply if it is not operating properly after installation.


Step 1 Make sure the power supply is seated properly:

Eject and reseat the PEM. Check that:

The ejector lever is locked into place by its spring clip (2000 W PEM), or the captive screw on the ejector lever is tightened securely (2500 W PEM).

Step 2 Make sure the router is powered on and that all power cords are connected properly:

Power cords on the back panel of the power shelf are secured in place with their retention clips.

Power cords at the power source end are securely plugged into their own AC power outlet.

The source AC circuit breaker is switched on.

Step 3 Check the power supply status LED indicators:

PWR OK (green)—Indicates the power supply is operating normally, and the source AC voltage is within the nominal operating range of 200 VAC to 240 VAC. This indicator lights when the power supply is properly seated in position.

FAULT (yellow)—Indicates the system detected a fault within the power supply or the incoming voltage is too low. This indicator remains off during normal operation.

If the indicator is on:

Check that the source voltage is within the correct range: 170 to 262 VAC.

Remove and then reapply power to the power supply by disconnecting and reconnecting its power cord. If the indicator remains on, replace the existing power supply with a spare.

If the spare power supply also fails, the problem could be a faulty power shelf backplane connector. Power off the router and contact a Cisco service representative for assistance.

TEMP (yellow)—Indicates that the power supply has shut down because of an overtemperature condition.


Note If the TEMP indicator is on, the FAULT indicator is also on.


Verify that the power supply fan is operating properly.

Verify that the blower modules are operating properly.

If the power supply fan and blower modules are operating properly, replace the existing power supply with a spare.

TEMP (flashing yellow—2500 W PEM only)—Indicates that a power supply fan is locked or malfunctioning.


Note If the TEMP indicator is flashing, the FAULT indicator is also on.


Check whether the fan is operating. Remove any obstructions to the fan.

If the fan is not operating, replace the power supply.

ILIM (yellow—2000 W PEM only)—Indicates the power supply is operating in a current-limiting condition.

Make sure that each power cord is connected to a dedicated AC power source.

Each AC power supply operating in the nominal range of 200 to 240 VAC requires a minimum service of 20 A in North America (13 A internationally).

OC (2500 W PEM only) (steady, or flashing yellow after 10 seconds)—Indicates the output current of the power supply has exceeded its limit and that an overload or short has occurred.


Note If the OC indicator is on or flashing, the FAULT indicator is also on.


Remove and then reapply power to the power supply by disconnecting and reconnecting its power cord.

If the indicator remains on, try reseating the power supply.

If the indicator remains on, replace the power supply.
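
The indicator checks in Step 3 amount to a small decision procedure. As a minimal sketch, assuming indicator states are recorded by hand and the source voltage has been measured with a meter, the following hypothetical Python helper orders the steps: the specific indicators (TEMP, OC, ILIM) first, then a FAULT indicator with no more specific cause, then seating and cabling. It restates this procedure only and is not part of any Cisco software.

```python
from typing import Dict, List

# Source-voltage tolerance quoted in the FAULT step of this procedure.
AC_MIN_VAC, AC_MAX_VAC = 170.0, 262.0


def ac_pem_actions(leds: Dict[str, str], source_vac: float) -> List[str]:
    """Map AC PEM indicator states to the corrective steps in this procedure.

    `leds` maps an indicator name to "on", "off", or "flashing"; indicators a
    model does not have (ILIM on the 2500 W PEM, OC on the 2000 W PEM) can be
    left out and are treated as off.
    """
    actions: List[str] = []
    temp = leds.get("TEMP", "off")
    if temp == "flashing":  # 2500 W PEM only: fan locked or malfunctioning
        actions.append("Check the PEM fan and clear obstructions; replace the PEM if the fan is not operating.")
    elif temp == "on":  # overtemperature shutdown
        actions.append("Verify the PEM fan and the blower modules; if both operate, replace the PEM.")
    if leds.get("OC", "off") in ("on", "flashing"):  # 2500 W PEM only
        actions.append("Cycle power at the power cord, then reseat the PEM, then replace it if OC stays on.")
    if leds.get("ILIM", "off") == "on":  # 2000 W PEM only
        actions.append("Give each power cord a dedicated AC circuit (20 A North America, 13 A international).")
    if leds.get("FAULT", "off") == "on" and not actions:
        # FAULT with no more specific indicator: check the source first.
        if not AC_MIN_VAC <= source_vac <= AC_MAX_VAC:
            actions.append(f"Source is {source_vac} VAC, outside {AC_MIN_VAC}-{AC_MAX_VAC} VAC; correct the source.")
        else:
            actions.append("Cycle power; if FAULT stays on, swap in a spare PEM. If the spare also fails, contact Cisco.")
    if leds.get("PWR OK", "off") != "on" and not actions:
        actions.append("Recheck seating, power cords, and the source circuit breaker (Steps 1 and 2).")
    return actions


if __name__ == "__main__":
    print(ac_pem_actions({"PWR OK": "off", "FAULT": "on", "TEMP": "flashing"}, 230.0))
```

Checking the source voltage before recommending a spare mirrors the order of the FAULT step, which avoids replacing a healthy PEM because of a low input.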


Because the AC-input power subsystem uses redundant power supplies, a problem with the DC output voltage to the backplane from only one power supply should not affect router operation: the router powers on and operates even if one power supply fails.

Troubleshooting the DC-Input Power Subsystem

DC-input power supplies are monitored for internal temperature, voltage, and current load by the MBus module on the alarm cards, and by the master MBus module on the RP. If the router detects an extreme condition, it generates an alarm on the alarm card and logs the appropriate warning messages on the console.

Cisco 12016, Cisco 12416, and Cisco 12816 series routers are available with either original or enhanced capacity DC power supplies:

Figure 5-3 identifies the components of a 2000 W DC power supply.

Figure 5-4 identifies the components of a 2400 W DC power supply.

Figure 5-3 2000 W DC Power Supply Components

Figure 5-4 2400 W DC Power Supply Components

Callouts: 1 Handle; 2 Fan; 3 Ejector lever; 4 Power switch


Use the following procedure to troubleshoot the DC PEM if it is not operating properly after installation.


Step 1 Make sure the PEM is seated properly:

Eject and reseat the PEM. Make sure:

The captive screw on the ejector lever is tightened securely.

The power switch is in the on (1) position (2400 W only).

Step 2 Make sure the router is powered on and that all power cords are connected properly. Check that:

Power cables are securely connected to their terminal studs on the back panel.

Power cables are connected to a dedicated 60 A DC service.

The source DC circuit breaker is switched on.

The PEM circuit breaker is switched on (2000 W only).

If the circuit breaker does not stay switched on, replace the PEM.

Step 3 Check the PEM status indicators:

PWR OK (green) — Indicates that the PEM is operating normally, and the source DC voltage is within the nominal operating range of -48 to -60 VDC. This indicator should light when the PEM circuit breaker is switched on.

FAULT (yellow) — Indicates that the system has detected a fault within the PEM or the incoming voltage is too low. This indicator remains off during normal operation.

Check that the source voltage is within the correct range: -40.5 to -75 VDC.

Toggle the PEM circuit breaker off and then on. If the indicator remains on after several attempts to power it on, replace the existing PEM with a spare.

If the spare PEM also fails, the problem could be a faulty power shelf backplane connector. Power off the router and contact a Cisco service representative for assistance.

TEMP (yellow)—Indicates that the PEM has shut down because of an overtemperature condition.


Note If the TEMP indicator is on, the FAULT indicator is also on.


Verify that the power supply fan is operating properly.

Verify that the blower modules are operating properly.

If the power supply fan and the blower modules are operating properly, replace the existing PEM with a spare.

TEMP (flashing yellow—2400 W PEM only)—Indicates that a power supply fan is locked or malfunctioning.


Note If the TEMP indicator is flashing, the FAULT indicator is also on.


Check whether the fan is operating. Remove any obstructions to the fan.

If the fan is not operational, replace the power supply.

OC (2400 W PEM only) (steady, or flashing yellow after 10 seconds)—Indicates the output current of the power supply has exceeded its limit and that an overload or short circuit has occurred.


Note If the OC indicator is on or flashing, the FAULT indicator is also on.


Remove and then reapply power to the power supply by disconnecting and reconnecting its power cord.

If the indicator remains on, try reseating the power supply.

If the indicator remains on, replace the power supply.
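
Because the nominal range (-48 to -60 VDC) and the tolerance range (-40.5 to -75 VDC) differ, it can help to classify a meter reading before you swap hardware. The Python sketch below is a hypothetical aid, not part of any Cisco software; ignoring the sign of the reading (so that -52 and 52 classify the same regardless of meter lead polarity) is an assumption made for convenience.

```python
# Ranges quoted in Step 3 of this procedure, as magnitudes in volts.
# Assumption: the sign is ignored so either meter polarity gives the same result.
DC_NOMINAL = (48.0, 60.0)
DC_TOLERANCE = (40.5, 75.0)


def classify_dc_input(volts: float) -> str:
    """Classify a measured -48 VDC source reading."""
    magnitude = abs(volts)
    if DC_NOMINAL[0] <= magnitude <= DC_NOMINAL[1]:
        return "nominal"
    if DC_TOLERANCE[0] <= magnitude <= DC_TOLERANCE[1]:
        return "within tolerance"
    return "out of range"


def fault_first_step(volts: float) -> str:
    """Suggest the first FAULT-indicator step for a given source reading."""
    if classify_dc_input(volts) == "out of range":
        return "Correct the DC source before replacing the PEM."
    return "Toggle the PEM circuit breaker off and on; replace the PEM if FAULT stays on."


if __name__ == "__main__":
    for reading in (-54.0, -42.0, -39.0, 52.0):
        print(reading, classify_dc_input(reading))
```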


Because the DC-input power subsystem uses redundant power supplies, a problem with the DC output voltage to the backplane from only one PEM should not affect router operation: the router powers on and operates even if one power supply fails.

Additional Power Subsystem Troubleshooting Information

This section contains additional troubleshooting information to help you isolate the cause of a power problem.

The MBus modules, which power the alphanumeric displays on the RP and line cards, are powered by +5 VDC from the backplane. The blower modules use -48 VDC from the backplane. If both the RP and the blower modules are operating, all of the correct internal DC voltages are present.

Enter the show environment command at the user EXEC mode prompt to display temperature and voltage information for each installed card, blower module, and PEM as shown in this example:

router#show environment
Slot #  Hot Sensor      Inlet Sensor
         (deg C)          (deg C)

1          38.0            32.5
3          36.5            39.0
5          37.0            37.0
7          36.0            32.0
16         26.0            26.0
17         27.5            27.5
18         27.0            27.5
19         0.0             0.0
20         27.0            27.5
21         28.0            28.0
22         28.0            28.0
24         47.0            NA
29         NA              22.0

Slot #  PEM Over Temperature Sensors

24      PEM1    OK
        PEM2    OK
Slot #  Hot Sensor      Inlet Sensor
         (deg C)          (deg C)
          
29         NA              22.0

Slot #  3V      5V      MBUS 5V
        (mv)    (mv)    (mv)

1       3296    5016    5048
3       3284    4976    5000
5       3308    5008    5048
7       3296    5016    5000
16      3300    NA      5064
17      3308    NA      5064
18      3292    NA      5056
19      3300    NA      5072
20      3288    NA      5056
21      3296    NA      5072
22      3292    NA      5064
24      NA      NA      5096
29      NA      NA      4920

Slot #          48V     AMP_48
                (Volt)  (Amp)

24      PEM1    56      2
        PEM2    55      2
Slot #  Fan 0   Fan 1   Fan 2
        (RPM)   (RPM)   (RPM)

29      3021    3090    2997
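
If you capture show environment output to a file, the voltage block can be checked mechanically. The Python sketch below is hypothetical: it assumes the column layout shown in this example (other Cisco IOS releases may format the output differently), infers the nominal values from the 3V and 5V column headings, and uses an arbitrary 5 percent tolerance that is not a documented Cisco threshold.

```python
from typing import Dict, List, Optional, Tuple

# Nominal rail values in millivolts, inferred from the column headings.
NOMINAL_MV = {"3V": 3300, "5V": 5000, "MBUS 5V": 5000}
COLUMNS = ["3V", "5V", "MBUS 5V"]

# Excerpt of the voltage block from the example output above.
SAMPLE = """\
Slot #  3V      5V      MBUS 5V
        (mv)    (mv)    (mv)

1       3296    5016    5048
24      NA      NA      5096
29      NA      NA      4920

Slot #          48V     AMP_48
"""

Readings = Dict[int, Dict[str, Optional[int]]]


def parse_voltage_rows(text: str) -> Readings:
    """Return {slot: {rail: millivolts or None}} from the 3V/5V/MBUS 5V block."""
    readings: Readings = {}
    in_block = False
    for line in text.splitlines():
        if line.startswith("Slot #"):
            if in_block:
                break  # the next section of the output has started
            in_block = "MBUS 5V" in line
            continue
        fields = line.split()
        if in_block and len(fields) == 4 and fields[0].isdigit():
            readings[int(fields[0])] = {
                rail: None if value == "NA" else int(value)
                for rail, value in zip(COLUMNS, fields[1:])
            }
    return readings


def out_of_tolerance(readings: Readings, tolerance_pct: float = 5.0) -> List[Tuple[int, str, int, float]]:
    """Flag readings more than tolerance_pct percent away from nominal.

    The 5 percent default is an assumption for illustration, not a Cisco limit.
    """
    flagged: List[Tuple[int, str, int, float]] = []
    for slot, rails in sorted(readings.items()):
        for rail, mv in rails.items():
            if mv is None:
                continue  # "NA": the card has no such rail
            deviation = 100.0 * (mv - NOMINAL_MV[rail]) / NOMINAL_MV[rail]
            if abs(deviation) > tolerance_pct:
                flagged.append((slot, rail, mv, round(deviation, 1)))
    return flagged


if __name__ == "__main__":
    print(out_of_tolerance(parse_voltage_rows(SAMPLE)))  # prints []
```

All of the readings in the excerpt are within about 2 percent of nominal, so nothing is flagged.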

Troubleshooting the Power Distribution System

The power distribution system consists of:

AC or DC PEMs, which supply -48 VDC to the backplane.

The chassis backplane, which carries voltage to the chassis components.

DC-to-DC converters, which convert -48 VDC from the backplane to the voltages required by the line cards.

Use the following procedure to troubleshoot the power distribution system.


Step 1 Check each power supply to make sure that:

The ejector lever is fully closed and properly secured by its captive screw (or, on the 2000 W AC PEM, by its spring clip).

The PWR OK indicator is on.

The FAULT and TEMP indicators are both off.

The ILIM indicator is off (2000 W AC only).

The OC indicator is off (2500 W AC and 2400 W DC only).

If the power supplies meet these criteria, the correct source power is present and within tolerance, and the power supplies are functioning properly.

Step 2 Make sure the blower modules are operating.

If the blower modules are functioning, then the -48 VDC from the chassis backplane and the cables from the backplane to the blower modules are functioning properly.

If a blower module is not functioning, there may be a problem with either the blower module itself, or the -48 VDC power supplied to the blower module. Eject and reseat the blower module.

If a blower module is still not operating, there could be a problem with the blower module controller card or cable. Replace the blower module.

Contact your Cisco representative if replacing the blower module does not fix the problem.


Troubleshooting the Processor Subsystem

The router processor subsystem consists of the RPs, line cards, and alarm cards. The RPs and line cards each have two onboard processors: one serves as the main (master) processor, and the other serves as the MBus module processor. The MBus module processor monitors the environment and controls the onboard DC-to-DC converters.


Note A minimally configured router must have an RP installed in slot 7 of the upper card cage. If the router is equipped with an optional, redundant RP, that RP must be installed in the far left slot in the lower card cage (slot 8).


This section contains information for troubleshooting the processor subsystem, including:

Troubleshooting the RP

Troubleshooting Line Cards

Troubleshooting Using the Alarm Cards

Troubleshooting the RP

When the router is powered on, the alphanumeric display on the RP indicates the following (Figure 5-5):

Upper row—Indicates which RP software component is running. At the end of a successful boot process, this display reads MSTR.

Lower row—Indicates the current phase of the boot process. At the end of a successful boot process, this display reads GRP or PRP depending on the RP type.

Figure 5-5 RP Alphanumeric Display

Troubleshooting Using the RP Alphanumeric Display

You can use the alphanumeric display to isolate a problem with the RP. The two rows on the alphanumeric display are powered separately:

The upper row receives power from the DC-to-DC converters on the RP.

The lower row is powered directly from the MBus on the RP through the chassis backplane.

If the lower row is not operating, the MBus module may be malfunctioning.

If the MBus module is operating, the lower row could be on even if the RP failed to power on.

If neither the upper nor the lower row is on, but the power modules and the blower modules are operational, the RP may not be installed properly, or the +5 VDC output from the chassis backplane is faulty.

Make sure that the system is powered on.

Initialize the RP by ejecting it from the chassis backplane and then reseating it.


Caution The soft reset (NMI) switch is not a mechanism for resetting the RP and reloading the Cisco IOS image. It is intended for software development use. To prevent system problems or loss of data, use the soft reset switch only when instructed by a Cisco certified service representative.

If both the upper and the lower displays are operating, check the meaning of the messages (see Table 5-2).

When the DC-to-DC converters are powered on by the MBus module, the RP processor begins the boot process and displays various status messages. Some messages appear briefly, while others appear for several seconds. If the messages appear to stop at a particular point, the boot process may have halted.

Make a note of the message.

Turn off power to the router, then turn it on again to reset the router and restart the boot process. If the router halts again, replace the RP (see the "Removing and Replacing Cards from the Chassis" section on page 7-79).
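
The row-by-row reasoning above (which the line card section later in this chapter repeats for line cards) can be summarized as a small decision function. This is a hypothetical Python sketch that assumes you have noted which display rows are lit and whether the PEMs and blower modules are operating; it only restates the checks in this section, and the branch for a failed power or cooling subsystem is an assumption.

```python
def diagnose_display(upper_on: bool, lower_on: bool,
                     power_ok: bool, blowers_ok: bool) -> str:
    """Interpret which rows of an RP or line card display are lit."""
    if upper_on and lower_on:
        # Both power paths work; the message text shows where the boot stopped.
        return "Both rows lit: look up the displayed message in the message table."
    if not upper_on and not lower_on:
        if power_ok and blowers_ok:
            return ("Neither row lit: the card may be unseated or the backplane +5 VDC "
                    "may be faulty. Confirm the system is on, then eject and reseat the card.")
        # Assumption: without working PEMs and blowers, start with the power subsystem.
        return "Neither row lit and power or cooling is down: troubleshoot the power subsystem first."
    if not lower_on:
        # The lower row is fed from the MBus through the chassis backplane.
        return "Lower row dark: the MBus module may be malfunctioning."
    # Only the lower row is lit: the MBus module runs, but the card may not have powered on.
    return "Only the lower row lit: the MBus module is running, but the card may have failed to power on."


if __name__ == "__main__":
    print(diagnose_display(upper_on=False, lower_on=True, power_ok=True, blowers_ok=True))
```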

Table 5-2 Troubleshooting Using the RP Alphanumeric Display Messages

Upper Row | Lower Row | Description
LMEM | TEST | Running low memory test
LCAH | TEST | Initializing lower 15K cache
BSS | INIT | Initializing main memory for ROM
NVRAM | INIT | Initializing NVRAM
EXPT | INIT | Initializing interrupt handlers
TLB | INIT | Initializing TLB
CACH | INIT | Initializing CPU data and instruction cache
CACH | PARY | Enabling CPU cache parity
MEM | INIT | Initializing main memory
NVRAM | SIZE | Detecting the NVRAM size
PCMC | INIT | Initializing the PCMCIA
EXIT | INIT | Exiting the initialization sequence
IOS | UP | Running Cisco IOS software
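
When the RP display stops on one message, the procedure above asks you to note it before power-cycling. The following hypothetical Python sketch turns Table 5-2 into a lookup that reports which phase the boot stopped in; treating the order in which the table lists the messages as the boot order is an assumption of this sketch.

```python
from typing import List, Tuple

# (upper row, lower row, description), transcribed from Table 5-2 in the order
# the table lists them. Assumption: that order is the boot order.
RP_BOOT_MESSAGES: List[Tuple[str, str, str]] = [
    ("LMEM", "TEST", "Running low memory test"),
    ("LCAH", "TEST", "Initializing lower 15K cache"),
    ("BSS", "INIT", "Initializing main memory for ROM"),
    ("NVRAM", "INIT", "Initializing NVRAM"),
    ("EXPT", "INIT", "Initializing interrupt handlers"),
    ("TLB", "INIT", "Initializing TLB"),
    ("CACH", "INIT", "Initializing CPU data and instruction cache"),
    ("CACH", "PARY", "Enabling CPU cache parity"),
    ("MEM", "INIT", "Initializing main memory"),
    ("NVRAM", "SIZE", "Detecting the NVRAM size"),
    ("PCMC", "INIT", "Initializing the PCMCIA"),
    ("EXIT", "INIT", "Exiting the initialization sequence"),
    ("IOS", "UP", "Running Cisco IOS software"),
]


def describe_halt(upper: str, lower: str) -> str:
    """Explain a message that has stayed on the RP display."""
    total = len(RP_BOOT_MESSAGES)
    for index, (up, low, description) in enumerate(RP_BOOT_MESSAGES):
        if (up, low) == (upper.upper(), lower.upper()):
            if index == total - 1:
                return f"{up} {low}: {description}; the phases in Table 5-2 are complete."
            return (f"{up} {low}: stopped at '{description}' (phase {index + 1} of {total}). "
                    "Power-cycle the router; replace the RP if it halts here again.")
    return f"{upper} {lower}: not in Table 5-2; record the message before power-cycling."


if __name__ == "__main__":
    print(describe_halt("NVRAM", "SIZE"))
```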


Troubleshooting Line Cards

When the line card is powered on, the display on the line card indicates the following (Figure 5-6):

Upper row—Indicates which software component is running. At the end of a successful boot process, this display reads IOS.

Lower row—Indicates the current phase of the boot process. At the end of a successful boot process, this display reads RUN.

Figure 5-6 Line Card Alphanumeric Display

Troubleshooting Using the Line Card Alphanumeric Display

You can analyze the alphanumeric displays to isolate a problem with the line card. The two rows of the alphanumeric display are powered separately:

The upper row receives power from the DC-to-DC converters on the line card.

The lower row is powered directly from the MBus on the line card through the chassis backplane.

If the lower row is not operating, the MBus module may be malfunctioning.

If the MBus module is operating, the lower row could be on even if the line card failed to power on.

If neither the upper nor the lower row is on, but the power modules and the blower modules are operational, the line card may not be installed properly, or the +5 VDC output from the chassis backplane is faulty.

Make sure that the system is powered on.

Initialize the line card by ejecting it from the chassis backplane and then reseating it.

If both the upper and lower rows are operating, check the status messages (see Table 5-3).

When the DC-to-DC converters are powered on by the MBus module, the line card processor begins the boot process and displays various status messages. Some messages appear briefly, while others appear for several seconds.

Table 5-3 Troubleshooting Using Alphanumeric Display Messages [1]

Upper Row | Lower Row | Meaning | Source
MROM | nnnn | MBus microcode executing, where nnnn is the microcode version number. | MBus controller
LMEM | TEST | Low memory on the line card is being tested. | Line card ROM monitor
LROM | RUN | Low memory test is complete. | Line card ROM monitor
BSS | INIT | Main memory is being initialized. | Line card ROM monitor
RST | SAVE | Contents of the reset reason register are being saved. | Line card ROM monitor
IO | RST | Reset I/O register is being accessed. | Line card ROM monitor
EXPT | INIT | Interrupt handlers are being initialized. | Line card ROM monitor
TLB | INIT | TLB is being initialized. | Line card ROM monitor
CACH | INIT | CPU data and instruction cache is being initialized. | Line card ROM monitor
MEM | INIT | Size of the main memory on the line card is being discovered. | Line card ROM monitor
LROM | RDY | ROM is ready for a software download attempt. | Line card ROM monitor
ROMI | GET | ROM image is being loaded into line card memory. | RP IOS software
ROM | VGET [3] | ROM image is receiving a response. | RP IOS software
FABI | WAIT | Line card is waiting for the fabric downloader. [2] | RP IOS software
FABM | WAIT [3] | Line card is waiting for the Fabric Manager to report that the fabric is usable. | RP IOS software
FABL | DNLD | Fabric downloader is being loaded into line card memory. | RP IOS software
FABL | STRT | Fabric downloader is being launched. | RP IOS software
FABL | RUN | Fabric downloader is launched and running. | RP IOS software
IOS | DNLD | Cisco IOS software is being downloaded into line card memory. | RP IOS software
IOS | FABW [3] | Cisco IOS software is waiting for the fabric to be ready. | RP IOS software
IOS | VGET [3] | Line card is obtaining the Cisco IOS release. | RP IOS software
IOS | RUN | Line card is enabled and ready for use. | RP IOS software
IOS | STRT | Cisco IOS software is being launched. | RP IOS software
IOS | TRAN | Cisco IOS software is transitioning to active. | RP IOS software
IOS | UP | Cisco IOS software is running. | RP IOS software

1 The LED initialization sequence shown in Table 5-3 may occur too quickly for you to read; therefore, the sequence is provided in this tabular form as a baseline for how a line card should function at startup.

2 The fabric downloader loads the Cisco IOS software image onto the line card.

3 This LED sequence only appears in Cisco IOS Release 12.0(24)S or later.


Table 5-4 Troubleshooting Using Other Alphanumeric Display Messages

Upper Row | Lower Row | Meaning | Source
MAL | FUNC | Line card malfunction reported by field diagnostics. | RP
MISM | ATCH [1] | Line card type mismatch in paired slots. | RP
PWR | STRT [1] | Line card is newly powered on. | RP
PWR | ON | Line card is powered on. | RP
IN | RSET | System is resetting. | RP
RSET | DONE | System reset complete. | RP
MBUS | DNLD | MBus agent is downloading. | RP
MBUS | DONE | MBus agent download complete. | RP
ROMI | DONE | Acquisition of ROM image complete. | RP
MSTR | WAIT | Waiting for mastership determination. | RP
CLOK | WAIT | Waiting for slot clock configuration. | RP
CLOK | DONE | Slot clock configuration complete. | RP
FABL | LOAD | Load of fabric downloader [2] complete. | RP
IOS | LOAD | Downloading of Cisco IOS software is complete. | RP
BMA | ERR | Cisco IOS software BMA error. | RP
FIA | ERR | Cisco IOS fabric interface ASIC configuration error. | RP
CARV | ERR | Buffer carving failure. | RP
DUMP | REQ | Line card requesting a core dump. | RP
DUMP | RUN | Line card dumping core. | RP
DUMP | DONE | Line card core dump complete. | RP
DIAG | MODE | Diagnostic mode. | RP
DIAG | LOAD | Downloading field diagnostics over the MBus. | RP
DIAG | F_LD | Downloading field diagnostics over the fabric. | RP
DIAG | STRT | Launching field diagnostics. | RP
DIAG | HALT | Cancel field diagnostics. | RP
DIAG | TEST | Running field diagnostics tests. | RP
DIAG | PASS [1] | Field diagnostics completed successfully. | RP
POST | STRT | Launching power-on self-test (POST). | RP
UNKN | STAT | Unknown state. | RP
ADMN | DOWN | Line card is administratively down. | RP
SCFG | PRES [1] | Incorrect hw-module slot srp command entered. | RP
SCFG [1] | REDQ | Required hw-module slot srp command not entered. | RP

1 This LED sequence only appears in Cisco IOS Release 12.0(24)S or later.

2 The fabric downloader loads the Cisco IOS software image onto the line card.
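
When many line cards are installed, a lookup that groups the messages in Tables 5-3 and 5-4 can help you spot the slots that need attention. The following hypothetical Python sketch covers only a selection of the table entries, assumes you transcribe each slot's two display rows by hand, and uses category labels ("ready", "booting", "error", and so on) that are this sketch's own grouping rather than Cisco terminology.

```python
from typing import Dict, Tuple

# Selected entries from Tables 5-3 and 5-4; the category labels are invented
# for this sketch.
LC_MESSAGES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("IOS", "RUN"): ("ready", "Line card is enabled and ready for use."),
    ("IOS", "UP"): ("ready", "Cisco IOS software is running."),
    ("FABM", "WAIT"): ("booting", "Waiting for the Fabric Manager to report that the fabric is usable."),
    ("IOS", "DNLD"): ("booting", "Cisco IOS software is being downloaded into line card memory."),
    ("MAL", "FUNC"): ("error", "Line card malfunction reported by field diagnostics."),
    ("MISM", "ATCH"): ("error", "Line card type mismatch in paired slots."),
    ("BMA", "ERR"): ("error", "Cisco IOS software BMA error."),
    ("FIA", "ERR"): ("error", "Cisco IOS fabric interface ASIC configuration error."),
    ("CARV", "ERR"): ("error", "Buffer carving failure."),
    ("ADMN", "DOWN"): ("admin", "Line card is administratively down."),
    ("DIAG", "TEST"): ("diagnostics", "Running field diagnostics tests."),
    ("DUMP", "RUN"): ("core dump", "Line card dumping core."),
}

UNLISTED = ("unlisted", "Not in this excerpt; check Tables 5-3 and 5-4.")


def classify(upper: str, lower: str) -> Tuple[str, str]:
    """Return (category, meaning) for a two-row line card display."""
    return LC_MESSAGES.get((upper.upper(), lower.upper()), UNLISTED)


def needs_attention(displays: Dict[int, Tuple[str, str]]) -> Dict[int, str]:
    """Return {slot: category} for slots showing an error or unlisted message."""
    flagged: Dict[int, str] = {}
    for slot, (upper, lower) in sorted(displays.items()):
        category, _ = classify(upper, lower)
        if category in ("error", "unlisted"):
            flagged[slot] = category
    return flagged


if __name__ == "__main__":
    print(needs_attention({1: ("IOS", "RUN"), 3: ("FIA", "ERR"), 5: ("FABM", "WAIT")}))
```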


Troubleshooting Using the Alarm Cards

The router is equipped with two alarm cards:

One card occupies the dedicated far left slot of the upper card cage.

A second alarm card occupies the dedicated far right slot of the lower card cage.

In both card cages, the alarm card slot differs from the other card cage slots: it is labeled as an alarm card slot, it is physically narrower, and it has a different backplane connector.

The following components and indicators are on the front panel of the alarm cards (Figure 5-7):

Critical (red), Major (red), and Minor (yellow) indicators that identify system-level alarm conditions detected by the system through the MBus.

These indicators are normally off.

Audio alarm cutoff switch.

25-pin cable connection to an external alarm.

Alarm card indicators:

ENABLED (green)—the alarm card is operational and functioning properly.

FAIL (yellow)—the alarm card in that slot is faulty.

A pair of status LEDs for each of the CSC and SFC card slots in the switch fabric and clock scheduler card cage:

ENABLED (green)
On—the card installed in that slot is operational and functioning properly.
Off—either the slot is empty or the card installed in that slot is faulty.

FAIL (yellow)—the card in that slot is faulty.

Figure 5-7 Status LEDs on the Alarm Card

Monitoring Critical, Major, and Minor Alarm Status

The alarms can warn of:

An overtemperature condition on a component in the card cage

A fan failure in a blower module

An overcurrent condition in a power supply

An out-of-tolerance voltage on one of the cards

The alarm LEDs are controlled by MBus software, which sets the threshold levels for triggering the different stages of alarms.

The RP continuously polls the system for temperature, voltage, current, and fan speed values. If a threshold value is exceeded, the RP sets the appropriate alarm severity level on the alarm card, which lights the corresponding LED and energizes the appropriate alarm display relays to activate any external audible or visual alarms wired to the alarm display. The RP also logs a message about the threshold violation on the system console.
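
The polling and threshold logic described above can be sketched as follows. This is a hypothetical Python model, not the MBus software: the sensor names and every threshold value are invented for illustration, and the sketch stops at computing a severity rather than driving LEDs, relays, or console messages. The sample readings (38.0 degrees C and 3021 RPM) come from the show environment example earlier in this chapter.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SEVERITY_ORDER = ["minor", "major", "critical"]


@dataclass(frozen=True)
class Thresholds:
    minor: float
    major: float
    critical: float
    rising: bool = True  # True: higher readings are worse; False: lower are worse


# Hypothetical limits for illustration only. The real thresholds are set by
# the MBus software and are not documented in this chapter.
LIMITS: Dict[str, Thresholds] = {
    "slot1_hot_c": Thresholds(minor=60.0, major=70.0, critical=80.0),
    "fan0_rpm": Thresholds(minor=2500.0, major=2000.0, critical=1500.0, rising=False),
}


def severity(value: float, t: Thresholds) -> Optional[str]:
    """Return the most severe threshold the reading crosses, or None."""
    for level in reversed(SEVERITY_ORDER):  # check critical first
        limit = getattr(t, level)
        crossed = value >= limit if t.rising else value <= limit
        if crossed:
            return level
    return None


def poll_once(readings: Dict[str, float]) -> Dict[str, str]:
    """One polling pass: map each sensor that crosses a threshold to its severity."""
    alarms: Dict[str, str] = {}
    for sensor, value in readings.items():
        level = severity(value, LIMITS[sensor])
        if level is not None:
            alarms[sensor] = level
    return alarms


def highest_severity(alarms: Dict[str, str]) -> Optional[str]:
    """The highest severity present in one pass, or None if nothing crossed."""
    return max(alarms.values(), key=SEVERITY_ORDER.index, default=None)


if __name__ == "__main__":
    print(poll_once({"slot1_hot_c": 38.0, "fan0_rpm": 3021.0}))  # prints {}
```

The `rising` flag lets one routine handle both kinds of sensor: temperature and current alarms trigger on high readings, while a failing fan shows up as a low speed.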


Note You can use the audio alarm cutoff switch to visually check that the alarm card indicators are operating properly. If no audible alarm is active, pressing the audio alarm cutoff switch temporarily lights all of the indicators on the alarm card front panel. If an indicator does not light, that LED is faulty.


Troubleshooting the Switch Fabric

This section describes the procedures needed to troubleshoot problems with the switch fabric. The RP and the line cards connect through the crossbar switch fabric, which provides a high-speed physical path for most inter-card communication. Messages passed between the RP and the line cards over the switch fabric include the packets being routed and received, forwarding information, traffic statistics, and most management and control information. The data collected by the following procedure is useful in diagnosing hardware-related failures.


Note This section is recommended only for advanced Cisco IOS software operators and system administration personnel. Refer to the appropriate Cisco IOS software publications for detailed Cisco IOS information.


Use the following procedure to collect the needed data from the RPs and line cards in order to troubleshoot the switch fabric.


Step 1 Enter the show controllers fia command for the primary and secondary RPs and save the output.

Step 2 Enter the attach <slot #> command to access a line card.


Note Use the attach command, rather than the execute-on command, to connect to the line card. The execute-on command depends on interprocess communication (IPC), which operates over the switch fabric; if you are having problems with IPC, commands that run remotely through the switch fabric can time out. The attach <slot #> command travels over the MBus and not over IPC.


Step 3 Enter the show controllers fia command for all installed line cards and save the output from each.

Step 4 Proceed to the next section, Analyzing the Data.


Analyzing the Data

Switch fabric problems can occur due to failures in any of the following components:

RPs

Line card hardware

The backplane

CSCs/SFCs

When troubleshooting switch fabric errors, look for patterns in which components are reporting errors. For example, by combining the show controllers fia output from all RPs and line cards, you can determine whether an error pattern exists. The following subsections describe the values in the output that can help you identify such patterns.

crc16 Output

The crc16 data line from the show controllers fia command is an important indicator of hardware problems. If a line card or a CSC/SFC has been removed and reinserted while the router was online, you can expect to see some crc16 errors. However, this number should not continue to increase; if it does, you may need to replace a faulty hardware component. It is very important to correlate the data from the primary RP, the secondary RP, and all installed line cards. The following example output shows the status of the primary RP. The crc16 data line shows errors from sfc1.

Router#show controllers fia 
Fabric configuration: Full bandwidth, redundant fabric
Master Scheduler: Slot 17  Backup Scheduler: Slot 16
From Fabric FIA Errors
-----------------------
redund fifo parity 0    redund overflow 0      cell drops 0         
crc32 lkup parity  0    cell parity     0      crc32      0         
Switch cards present    0x001F    Slots  16 17 18 19 20
Switch cards monitored  0x001F    Slots  16 17 18 19 20
Slot:     16         17         18         19         20
Name:    csc0       csc1       sfc0       sfc1       sfc2
       --------   --------   --------   --------   --------
los    0          0          0          0          0          
state  Off        Off        Off        Off        Off       
crc16  0          0          0          1345       0
To Fabric FIA Errors
-----------------------
sca not pres 0          req error     0          uni FIFO overflow 0         
grant parity 0          multi req     0          uni FIFO undrflow 0         
cntrl parity 0          uni req       0          crc32 lkup parity 0         
multi FIFO   0          empty dst req 0          handshake error   0         
cell parity  0

The following example output shows the status of the line card in slot 2, reached with the attach command. Its crc16 data line also shows errors from sfc1.

Router#attach 2
Entering Console for 4 port ATM Over SONET OC-3c/STM-1 in Slot: 2
Type "exit" to end this session
Press RETURN to get started!
LC-Slot2>
LC-Slot2>enable
LC-Slot2#show controllers fia
From Fabric FIA Errors
-----------------------
redund FIFO parity 0          redund overflow 0          cell drops 0         
crc32 lkup parity  0          cell parity     0          crc32      0         
Switch cards present    0x001F    Slots  16 17 18 19 20 
Switch cards monitored  0x001F    Slots  16 17 18 19 20 
Slot:     16         17         18         19         20
Name:    csc0       csc1       sfc0       sfc1       sfc2
       --------   --------   --------   --------   --------
Los    0          0          0          0          0          
state  Off        Off        Off        Off        Off       
crc16  0          0          0          1345       0
To Fabric FIA Errors
-----------------------
sca not pres 0          req error     0          uni fifo overflow 0         
grant parity 0          multi req     0          uni fifo undrflow 0         
cntrl parity 0          uni req       0          crc32 lkup parity 0         
multi fifo   0          empty DST req 0          handshake error   0         
cell parity  0
LC-Slot2#exit
Disconnecting from slot 2.
Connection Duration: 00:00:21
Router#
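When you collect show controllers fia output from many cards, it can help to pull out the crc16 counters programmatically and compare two snapshots to see whether the count is still increasing. The following Python sketch is a hypothetical helper, not a Cisco tool; the function names are illustrative, and it assumes the output has been saved as text with the Name: and crc16 lines laid out in whitespace-separated columns as shown above. The second snapshot value (1402) is made up for illustration.

```python
def parse_crc16(output: str) -> dict:
    """Map switch fabric card names (csc0, sfc1, ...) to their crc16 counts.

    Assumes the 'Name:' and 'crc16' lines of saved show controllers fia
    output list one whitespace-separated value per switch card, in the
    same column order.
    """
    names = []
    counts = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Name:":
            names = [field.lower() for field in fields[1:]]
        elif fields[0].lower() == "crc16":
            counts = [int(field) for field in fields[1:]]
    return dict(zip(names, counts))


def increasing_crc16(before: str, after: str) -> dict:
    """Return {card: increase} for crc16 counters that grew between snapshots."""
    old = parse_crc16(before)
    new = parse_crc16(after)
    return {card: count - old.get(card, 0)
            for card, count in new.items()
            if count > old.get(card, 0)}


# Two hypothetical snapshots from the same card, taken a few minutes apart.
first = """Name:    csc0       csc1       sfc0       sfc1       sfc2
crc16  0          0          0          1345       0"""
second = """Name:    csc0       csc1       sfc0       sfc1       sfc2
crc16  0          0          0          1402       0"""
print(increasing_crc16(first, second))  # {'sfc1': 57}
```

A counter that appears in the result is still incrementing and points at the switch fabric card to investigate; a stable nonzero count is consistent with an earlier card insertion or removal.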

After you have gathered the show controllers fia command data from the RPs and line cards, you can create a table similar to Table 5-5.

Table 5-5 Error Data Collection Table

Card Slot  CSC 0  CSC 1  SFC 0  SFC 1  SFC 2  SFC 3  SFC 4
0                               ERROR
1
2                               ERROR
3                               ERROR
4
5                               ERROR
6
7                               ERROR
8

Table 5-5 indicates that more than one line card is reporting errors from SFC 1. Therefore, the first step in correcting this problem is to check or replace SFC 1. Whenever a replacement is recommended, first verify that the card is correctly seated (see the "Properly Seating Switch Fabric Cards" section).


Note Always reseat the corresponding card as the first troubleshooting step to be sure it is correctly seated. If the crc16 errors are still increasing after you reseat the card, replace it.


The common failure patterns and recommended actions for crc16 errors are as follows. Perform one step at a time until the problem is resolved:

1. Errors indicated on more than one line card from the same switch fabric card:

a. Replace the switch fabric card in the slot corresponding to the errors.

b. Replace all switch fabric cards.

c. Replace the backplane.

2. Errors indicated on one line card from more than one switch fabric card:

a. Replace the line card.

b. If errors are incrementing, replace the current master CSC.

c. If errors are not incrementing and the current master is CSC0, replace CSC1.
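The two crc16 failure patterns can be checked mechanically once the data is recorded in a table like Table 5-5. The following Python sketch is a hypothetical helper, not part of Cisco IOS software; it assumes each card slot maps to the set of switch fabric cards whose crc16 counters are increasing on that card, and it reports only the first recommended step of each pattern. The later steps (replacing all switch fabric cards, the backplane, or a CSC) still follow the list above.

```python
from collections import defaultdict


def classify_crc16(errors):
    """Suggest the first crc16 action from a Table 5-5 style error table.

    errors maps a card slot number to the set of switch fabric cards
    (for example "SFC 1") whose crc16 counters are increasing on that card.
    Only the first step of each failure pattern is returned.
    """
    # Invert the table: which card slots report errors from each fabric card?
    by_fabric = defaultdict(set)
    for slot, fabric_cards in errors.items():
        for fabric in fabric_cards:
            by_fabric[fabric].add(slot)

    actions = []
    # Pattern 1: more than one card reports errors from the same fabric card.
    for fabric, slots in sorted(by_fabric.items()):
        if len(slots) > 1:
            actions.append(
                f"Pattern 1: reseat, then replace {fabric} (slots {sorted(slots)})")
    # Pattern 2: one card reports errors from more than one fabric card.
    for slot, fabric_cards in sorted(errors.items()):
        if len(fabric_cards) > 1:
            actions.append(
                f"Pattern 2: reseat, then replace the card in slot {slot}")
    return actions


# The error pattern recorded in Table 5-5.
table = {0: {"SFC 1"}, 1: set(), 2: {"SFC 1"}, 3: {"SFC 1"}, 4: set(),
         5: {"SFC 1"}, 6: set(), 7: {"SFC 1"}, 8: set()}
for action in classify_crc16(table):
    print(action)  # Pattern 1: reseat, then replace SFC 1 (slots [0, 2, 3, 5, 7])
```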

Grant Parity and Request Errors

The console logs, or the output of the show log command, provide another troubleshooting indicator in the form of grant parity and request errors. Look for messages like the following, which indicate a grant parity error:

%FABRIC-3-PARITYERR: To Fabric parity error was detected.
Grant parity error Data = 0x2.
SLOT 1:%FABRIC-3-PARITYERR: To Fabric parity error was detected.
Grant parity error Data = 0x1
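To see how many cards are reporting grant parity errors, you can group these messages by source. The following Python sketch is a hypothetical helper that works on saved log text; it assumes the two-line message format shown above, and it assumes that the SLOT n: prefix identifies the line card that logged the error, while messages without that prefix come from the RP whose log you are reading.

```python
import re
from collections import defaultdict

# Matches the two-line message shown above. The optional "SLOT n:" prefix is
# assumed to identify the line card that logged the error; messages without
# it are attributed to the RP whose log is being read.
PARITY_RE = re.compile(
    r"(?:SLOT (?P<slot>\d+):)?%FABRIC-3-PARITYERR: To Fabric parity error was detected\."
    r"\s+Grant parity error Data = (?P<data>0x[0-9A-Fa-f]+)"
)


def grant_parity_by_source(log):
    """Group grant parity error Data values by the card that reported them."""
    found = defaultdict(list)
    for match in PARITY_RE.finditer(log):
        source = f"slot {match.group('slot')}" if match.group("slot") else "RP"
        found[source].append(match.group("data"))
    return dict(found)


log = """%FABRIC-3-PARITYERR: To Fabric parity error was detected.
Grant parity error Data = 0x2.
SLOT 1:%FABRIC-3-PARITYERR: To Fabric parity error was detected.
Grant parity error Data = 0x1"""
print(grant_parity_by_source(log))  # {'RP': ['0x2'], 'slot 1': ['0x1']}
```

More than one source in the result corresponds to the "grant errors on more than one line card" pattern described below.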

You can also use the output from the show controllers fia command. In the following example, note the nonzero cell drops, crc16, req error, and grant parity counters:

Router#show controllers fia
Fabric configuration: Full bandwidth, redundant fabric
Master Scheduler: Slot 17     Backup Scheduler: Slot 16

From Fabric FIA Errors
-----------------------
redund FIFO parity 0   redund overflow 0   cell drops 76
crc32 lkup parity  0    cell parity 0   crc32 0
Switch cards present    0x001F    Slots  16 17 18 19 20
Switch cards monitored  0x001F    Slots  16 17 18 19 20
Slot:     16         17         18         19         20
Name:    csc0       csc1       sfc0       sfc1       sfc2
       --------   --------   --------   --------   --------
Los    0          0          0          0          0
state  Off        Off        Off        Off        Off
crc16  876        257        876        876        876
To Fabric FIA Errors
-----------------------
sca not pres 0          req error     1          uni fifo overflow 0
grant parity 1          multi req     0          uni fifo undrflow 0
cntrl parity 0          uni req       0          crc32 lkup parity 0
multi fifo   0          empty DST req 0          handshake error   0
cell parity  0

The common failure patterns and recommended actions for grant parity and request errors are as follows. Perform one step at a time until the problem is resolved:

1. Grant errors on more than one line card:

a. Replace the CSC (see the following note to determine which CSC to replace).

b. Replace the backplane.

2. Grant errors on one line card:

a. Replace the line card.

b. Replace the CSC (see the following note to determine which CSC to replace).

c. Replace the backplane.


Note If multiple line cards are reporting grant parity or request errors and the router is still functioning, then a CSC switchover has occurred. The failed CSC is the one that is currently the backup CSC (not the one listed as Master Scheduler in the show controllers fia output). If Halted is next to the heading From Fabric FIA Errors or To Fabric FIA Errors, or if the router is no longer forwarding traffic, then a CSC switchover has not occurred and the failing CSC is the one listed as Master Scheduler. By default, the CSC in slot 17 is the primary and the CSC in slot 16 is the backup.
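The decision in the preceding note can be written out as a small function. This Python sketch is illustrative only, and the function and parameter names are hypothetical. It assumes the CSCs are in slots 16 and 17 and that you have already read the Master Scheduler slot and any Halted marker from the show controllers fia output.

```python
from typing import Optional

# CSC slots described in the note: slot 17 primary, slot 16 backup by default.
CSC_SLOTS = (16, 17)


def failing_csc(master_slot: int, halted: bool, forwarding: bool,
                multiple_cards_reporting: bool) -> Optional[int]:
    """Return the CSC slot the note identifies as failed, or None.

    master_slot: slot listed as "Master Scheduler" in show controllers fia.
    halted: True if "Halted" appears next to a From/To Fabric FIA Errors heading.
    forwarding: True if the router is still forwarding traffic.
    multiple_cards_reporting: True if more than one line card reports grant
        parity or request errors.
    """
    if master_slot not in CSC_SLOTS:
        raise ValueError("CSCs occupy slots 16 and 17")
    if halted or not forwarding:
        # No switchover occurred: the current master is the failing CSC.
        return master_slot
    if multiple_cards_reporting:
        # A switchover occurred: the failed CSC is now the backup.
        return 17 if master_slot == 16 else 16
    # The note does not identify a failed CSC in this case.
    return None


print(failing_csc(16, halted=False, forwarding=True, multiple_cards_reporting=True))  # 17
print(failing_csc(17, halted=True, forwarding=False, multiple_cards_reporting=True))  # 17
```

In the first call, the master has moved from the default slot 17 to slot 16, so the CSC in slot 17 (now the backup) is the one to replace.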


Properly Seating Switch Fabric Cards

The switch fabric cards in the router can be difficult to insert and may require a small amount of force to seat correctly. If either CSC is not seated properly, you may see the following error messages:

%MBUS-0-NOCSC: Must have at least 1 CSC card in slot 16 or 17 
%MBUS-0-FABINIT: Failed to initialize switch fabric infrastructure


Note You may also get this error message if there are only enough CSCs and SFCs seated for quarter bandwidth configurations. Quarter bandwidth configurations are no longer supported on Cisco 12000 series routers.


When dealing with switch fabric and line card booting problems, it is important to verify that all CSCs and SFCs are correctly seated and powered on. The output from the show version and show controllers fia commands shows which switch fabric configuration the router is currently running. In the following example, note the Clock Scheduler Card and Switch Fabric Cards counts in the show version output and the Fabric configuration line in the show controllers fia output.

Router#show version 
Cisco Internetwork Operating System Software
IOS (tm) GS Software (GSR-P-M), Experimental Version 
12.0(20010505:112551)
Copyright (c) 1986-2001 by cisco Systems, Inc.
Compiled Mon 14-May-01 19:25 by tmcclure
Image text-base: 0x60010950, data-base: 0x61BE6000

ROM: System Bootstrap, Version 11.2(17)GS2, [htseng 180]
EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
BOOTFLASH: GS Software (GSR-BOOT-M), Version 12.0(15.6)S,
EARLY DEPLOYMENT MAINTENANCE INTERIM SOFTWARE

Router uptime is 17 hours, 53 minutes 
System returned to ROM by reload at 23:59:40 MET Mon Jul 2 2001 
System restarted at 00:01:30 MET Tue Jul 3 2001 
System image file is 
"tftp://172.17.247.195/gsr-p-mz.15S2plus-FT-14-May-2001"

cisco 12016/GRP (R5000) processor (revision 0x01) with 262144K bytes 
of memory. 
R5000 CPU at 200Mhz, Implementation 35, Rev 2.1, 512KB L2 Cache 
Last reset from power-on

2 Route Processor Cards 
1 Clock Scheduler Card 
3 Switch Fabric Cards 
1 8-port OC3 POS controller (8 POs). 
1 OC12 POS controller (1 POs). 
1 OC48 POS E.D. controller (1 POs). 
7 OC48 POS controllers (7 POs). 
1 Ethernet/IEEE 802.3 interface(s) 
17 Packet over SONET network interface(s) 
507K bytes of non-volatile configuration memory.

20480K bytes of Flash PCMCIA card at slot 0 (Sector size 128K). 
8192K bytes of Flash internal SIMM (Sector size 256K).
Router#show controller fia 
Fabric configuration: Full bandwidth nonredundant
Master Scheduler: Slot 17
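If you save this output from several routers, a short script can extract the relevant lines for comparison. The following Python sketch is a hypothetical helper; it assumes the show version line formats (n Clock Scheduler Card, n Switch Fabric Cards) and the Fabric configuration: line match the example above. In the example, one CSC and three SFCs are detected and the fabric reports Full bandwidth nonredundant, which is consistent with only one CSC being detected.

```python
import re


def fabric_summary(show_version, show_fia):
    """Extract CSC/SFC counts and the fabric configuration from saved output."""
    csc = re.search(r"^\s*(\d+) Clock Scheduler Cards?", show_version, re.MULTILINE)
    sfc = re.search(r"^\s*(\d+) Switch Fabric Cards?", show_version, re.MULTILINE)
    config = re.search(r"^Fabric configuration:\s*(.+?)\s*$", show_fia, re.MULTILINE)
    return {
        "csc": int(csc.group(1)) if csc else 0,
        "sfc": int(sfc.group(1)) if sfc else 0,
        "fabric": config.group(1) if config else "unknown",
    }


version_text = """2 Route Processor Cards
1 Clock Scheduler Card
3 Switch Fabric Cards"""
fia_text = """Fabric configuration: Full bandwidth nonredundant
Master Scheduler: Slot 17"""

summary = fabric_summary(version_text, fia_text)
print(summary)  # {'csc': 1, 'sfc': 3, 'fabric': 'Full bandwidth nonredundant'}
if summary["csc"] < 2 or "nonredundant" in summary["fabric"].lower():
    print("Only one CSC detected: verify that all CSCs and SFCs are seated and powered on.")
```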

Troubleshooting the Cooling Subsystem

The cooling subsystem of the router consists of an upper and lower blower module in the chassis and a fan in each of the power supplies. The blower modules and the power supply fans circulate air to maintain acceptable operating temperatures within the router (Figure 5-8).

This section contains information for troubleshooting the cooling subsystem and includes the following topics:

Blower Module Operation

Power Supply Operation

Overtemperature Conditions

Isolating Cooling Subsystem Problems

Figure 5-8 Cooling Air Flow

Blower Module Operation

The blower modules maintain acceptable operating temperatures for the internal components by drawing cooling air through a replaceable air filter into the card cages. The blower modules occupy bays near the top and at the bottom of the router.

Each blower module contains three fans, a controller card, and two front panel status LEDs. A snap-on plastic front cover fits over the front panel, but the LEDs remain visible through the cover. The LEDs indicate the following:

Green—The blower module is functioning properly.

Red—There is a fault detected in the blower module.

If the air temperature inside the chassis rises, blower speed increases to provide additional cooling air to the internal components.

If the internal air temperature continues to rise beyond the specified threshold, the system environmental monitor shuts down all internal power to prevent equipment damage due to excessive heat.

If the system detects that one or more of the fans in a blower module has failed, it displays a warning message on the system console and displays a blower failure message on the RP alphanumeric display. In addition, the remaining fans go to full speed to compensate for the loss of the failed fan.

Power Supply Operation

Each AC or DC power supply is equipped with a fan that draws cooler air in through the front of the power module and forces warmer air out the back of the power shelf.

If the power source is within the required range, the power supply fan remains on.

If the fan fails:

The power supply detects an internal overtemperature condition.

The Fault and Temp indicators light.

The power supply sends an overtemperature warning to the system and then shuts down the system.

For additional power supply troubleshooting information, see the "Troubleshooting the Power Subsystem" section.

Overtemperature Conditions

The following console error message indicates that the system has detected an overtemperature condition or out-of-tolerance power value inside the system:

Queued messages:
%ENVM-1-SHUTDOWN: Environmental Monitor initiated shutdown

The preceding message can also indicate a faulty component or temperature sensor. Enter the show environment or show environment all command at the user EXEC prompt to display information about the internal system environment. The information generated by these commands includes:

Voltage measurements on each card from the DC-to-DC converter

The +5 VDC for the MBus module

The operating voltage for the blower module

Temperature measurements received by two sensors on each card (one for inlet air temperature and one for the card's hot-spot temperature), as well as temperature measurements from sensors located in each power supply.

If an environmental shutdown results from an overtemperature or out-of-tolerance condition, the Fault indicator on the power supply lights before the system shuts down.

Although an overtemperature condition is unlikely at initial system startup, make sure that:

Heated exhaust air from other equipment in the immediate environment is not entering the chassis card cage vents.

You maintain a minimum of 6 inches (15.24 cm) of clearance at both the inlet and exhaust openings on the chassis and the power modules, so that cool air can enter freely and hot air can be expelled from the chassis.

Isolating Cooling Subsystem Problems

Use the following procedure to isolate a problem with the chassis cooling system if you have an overtemperature condition.


Step 1 Make sure the blower modules are operating properly when you power on the system.

To determine if a blower module is operating, check the two LED indicators on the blower module front panel:

OK (green)—The blower module is functioning properly and receiving -48 VDC power, indicating that the cables from the chassis backplane to the blower module are good.

Fail (red)—A fault is detected in the blower module. Replace the blower module.

If neither indicator is on and the blower is not operating, there may be a problem with either the blower module or the -48 VDC power supplied to the blower module. Go to Step 2.

Step 2 Eject and reseat the blower module, making sure the captive screws are securely tightened.

If the blower module still does not function, go to Step 3.

Step 3 Check for -48 VDC power by looking at the LED indicators on each power supply:

If the Pwr OK indicator is on and the Fault indicator is off on each power supply, the blower module is receiving -48 VDC.

If the blower module is still not functioning, there could be a problem with the blower module controller card or an undetected problem in the blower module cable. Replace the blower module.

If the new blower module does not function, contact a Cisco customer service representative for assistance.

If the Fault indicator is on, the power supply is faulty. Replace the power supply.

If the Temp and Fault indicators are on, an overtemperature condition exists.

Verify that the power supply fan is operating properly.

If the fan is not operating, replace the power supply.

Contact your Cisco representative if replacing the power supply does not fix the problem.
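The isolation procedure above can be summarized as a decision function that maps observed LED states to the next action. The following Python sketch is illustrative only: the class and function names are hypothetical, and escalating any LED combination the procedure does not cover is an assumption, not a documented rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlowerLeds:
    ok: bool    # green OK indicator
    fail: bool  # red Fail indicator


@dataclass
class PowerSupplyLeds:
    pwr_ok: bool
    fault: bool
    temp: bool


def cooling_next_step(blower: BlowerLeds, supplies: List[PowerSupplyLeds],
                      reseated: bool) -> str:
    """Return the next action from the blower isolation procedure.

    reseated: True once Step 2 (eject and reseat the blower module) is done.
    """
    # Step 1: read the blower module LEDs.
    if blower.fail:
        return "Replace the blower module."
    if blower.ok:
        return "Blower module is functioning properly."
    # Neither LED is on.
    if not reseated:
        return "Step 2: Eject and reseat the blower module; tighten the captive screws."
    # Step 3: check the power supply LEDs. Temp and Fault together are checked
    # first because that combination means an overtemperature condition.
    for supply in supplies:
        if supply.temp and supply.fault:
            return "Overtemperature: verify the power supply fan; replace the supply if the fan is not operating."
        if supply.fault:
            return "Replace the faulty power supply."
    # No Fault indicator is on at this point.
    if supplies and all(supply.pwr_ok for supply in supplies):
        return "Blower is receiving -48 VDC: replace the blower module."
    # Assumption: any other LED combination is escalated.
    return "Contact a Cisco customer service representative."


no_leds = BlowerLeds(ok=False, fail=False)
healthy = [PowerSupplyLeds(pwr_ok=True, fault=False, temp=False)] * 2
print(cooling_next_step(no_leds, healthy, reseated=True))
```

If the replacement blower module also fails to operate, the procedure above still applies: contact a Cisco customer service representative.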