Cisco MDS 9000 Family Troubleshooting Guide, Release 1.3
Troubleshooting Switch System Issues

Table Of Contents

Troubleshooting Switch System Issues

Recovering the Administrator Password

Troubleshooting System Restarts

Overview

Working with a Firmware Upgrade Failure

Working with Recoverable Restarts

Working with Unrecoverable System Restarts

Troubleshooting a Failed Supervisor


Troubleshooting Switch System Issues


This chapter describes how to identify and resolve problems that might occur when accessing or starting up a single Cisco MDS 9000 Family switch. It includes the following sections:

Recovering the Administrator Password

Troubleshooting System Restarts

Troubleshooting a Failed Supervisor

Recovering the Administrator Password

If you forget the administrator password for accessing a Cisco MDS 9000 Family switch, you can recover the password using a local console connection. For the latest instructions on password recovery, go to http://www.cisco.com/warp/public/474/ and click on "MDS 9000 Series Multilayer Directors and Fabric Switches" under Storage Networking Routers.
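
In outline, the recovery procedure resembles the following sketch. This is an illustration only; the exact prompts and commands can vary by release, and the system image name shown is hypothetical, so always verify against the instructions at the link above. Connect to the console port, reboot the switch, and press Ctrl-] during the boot sequence so that only the kickstart image loads. Then reset the password from the resulting boot prompt:

switch(boot)# config terminal
switch(boot-config)# admin-password <new_password>
switch(boot-config)# exit
switch(boot)# load bootflash:m9500-sf1ek9-mz.1.3.4a.bin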

Troubleshooting System Restarts

This section describes the different types of system crashes and how to respond to each type. It includes the following topics:

Overview

Working with a Firmware Upgrade Failure

Working with Recoverable Restarts

Working with Unrecoverable System Restarts

Overview

There are three different types of system restarts:

Recoverable—A process restarts and service is not affected.

Unrecoverable—A process cannot be restarted, or it has restarted more than the maximum allowed number of times within a fixed period of time and is not restarted again.

System hung/crashed—No communication of any kind is possible with the switch.

Most system restarts generate a Call Home event, but the condition causing a restart may become so severe that a Call Home event cannot be generated. Be sure to configure the Call Home feature properly, follow up on any initial messages regarding system restarts, and fix the problem before it becomes severe enough to prevent Call Home messages. For information about configuring Call Home, refer to the Cisco MDS 9000 Family Configuration Guide or the Cisco MDS 9000 Family Fabric Manager User Guide.
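
As an illustration only, a minimal Call Home configuration follows the general pattern below. The command names are an outline and the addresses are placeholders; check the configuration guide for the exact syntax in your release.

switch# config terminal
switch(config)# callhome
switch(config-callhome)# email-contact admin@example.com
switch(config-callhome)# destination-profile full-txt-destination email-addr noc@example.com
switch(config-callhome)# transport email smtp-server 192.0.2.10
switch(config-callhome)# enable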

Working with a Firmware Upgrade Failure

When you perform a hitless upgrade (either by using the install all command or by using Cisco Fabric Manager) and the upgrade fails with no error message displayed, contact technical support to determine the cause of the failure. Do not attempt to reboot the switch.
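
For reference, a hitless upgrade is normally started as shown below; the image names are hypothetical. If this command fails silently, stop at that point and open a support case rather than rebooting.

switch# install all kickstart bootflash:m9500-sf1ek9-kickstart-mz.1.3.4a.bin system bootflash:m9500-sf1ek9-mz.1.3.4a.bin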

Working with Recoverable Restarts

Every process restart generates a syslog message and a Call Home event. Even if the event is not service-affecting, you should identify and resolve the condition immediately, because future occurrences could cause a service interruption.

To respond to a recoverable system restart, follow these steps:


Step 1 Enter the following command to check the syslog file and see which process restarted and why.

switch# sh log logfile | include error

For information about the meaning of each message, refer to the Cisco MDS 9000 Family System Messages Guide.

The system output looks like the following:

Sep 10 23:31:31 dot-6 % LOG_SYSMGR-3-SERVICE_TERMINATED: Service "sensor" (PID 704) has finished with error code SYSMGR_EXITCODE_SY.
switch# show logging logfile | include fail
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, addr 0.0.0.0, in_classd=0 flags=1 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, addr 127.0.0.1, in_classd=0 flags=0 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, addr 127.1.1.1, in_classd=0 flags=1 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, addr 172.22.93.88, in_classd=0 flags=1 fails: Address already in use
Jan 27 23:18:59 88 % LOG_PORT-5-IF_DOWN: Interface fc1/13 is down (Link failure or not-connected)
Jan 27 23:18:59 88 % LOG_PORT-5-IF_DOWN: Interface fc1/14 is down (Link failure or not-connected)
Jan 28 00:55:12 88 % LOG_PORT-5-IF_DOWN: Interface fc1/1 is down (Link failure or not-connected)
Jan 28 00:58:06 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating port fc1/1 (VSAN 100)
Jan 28 00:58:44 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating port fc1/1 (VSAN 100)
Jan 28 03:26:38 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating port fc1/1 (VSAN 100)
Jan 29 19:01:34 88 % LOG_PORT-5-IF_DOWN: Interface fc1/1 is down (Link failure or not-connected)
switch#

Step 2 Enter the following command to identify the processes that are running and the status of each process.

switch# show processes 

The following codes are used in the system output for the State (process state):

D = uninterruptible sleep (usually IO)

R = runnable (on run queue)

S = sleeping

T = traced or stopped

Z = defunct ("zombie") process

NR = not-running

ER = should be running but currently not-running


Note ER is usually the state of a process that has been restarted too many times, has been detected by the system as faulty, and has been disabled.


The system output looks like the following (the output has been abbreviated to be more concise):

PID    State  PC        Start_cnt    TTY   Process
-----  -----  --------  -----------  ----  -------------
    1      S  2ab8e33e            1     -  init
    2      S         0            1     -  keventd
    3      S         0            1     -  ksoftirqd_CPU0
    4      S         0            1     -  kswapd
    5      S         0            1     -  bdflush
    6      S         0            1     -  kupdated
   71      S         0            1     -  kjournald
  136      S         0            1     -  kjournald
  140      S         0            1     -  kjournald
  431      S  2abe333e            1     -  httpd
  443      S  2abfd33e            1     -  xinetd
  446      S  2ac1e33e            1     -  sysmgr
  452      S  2abe91a2            1     -  httpd
  453      S  2abe91a2            1     -  httpd
  456      S  2ac73419            1    S0  vsh
  469      S  2abe91a2            1     -  httpd
  470      S  2abe91a2            1     -  httpd  

Step 3 Enter the following command to show the processes that have had abnormal exits and whether a stack trace or core dump was generated.

switch# show process log
Process           PID     Normal-exit  Stack-trace  Core     Log-create-time
----------------  ------  -----------  -----------  -------  ---------------
ntp               919               N            N        N  Jan 27 04:08
snsm              972               N            Y        N  Jan 24 20:50

Step 4 Enter the following command to show detailed information about a specific process that has restarted:

switch# show processes log pid 898

The system output looks like the following:

Service: idehsd
Description: ide hotswap handler Daemon
Started at Mon Sep 16 14:56:04 2002 (390923 us)
Stopped at Thu Sep 19 14:18:42 2002 (639239 us)
Uptime: 2 days 23 hours 22 minutes 22 seconds
Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGTERM (3)
Exit code: signal 15 (no core)
CWD: /var/sysmgr/work
Virtual Memory:
    CODE      08048000 - 0804D660
    DATA      0804E660 - 0804E824
    BRK       0804E9A0 - 08050000
    STACK     7FFFFD10
Register Set:
    EBX 00000003         ECX 0804E994         EDX 00000008
    ESI 00000005         EDI 7FFFFC9C         EBP 7FFFFCAC
    EAX 00000008         XDS 0000002B         XES 0000002B
    EAX 00000003 (orig)  EIP 2ABF5EF4         XCS 00000023
    EFL 00000246         ESP 7FFFFC5C         XSS 0000002B
Stack: 128 bytes. ESP 7FFFFC5C, TOP 7FFFFD10
0x7FFFFC5C: 0804F990 0804C416 00000003 0804E994 ................
0x7FFFFC6C: 00000008 0804BF95 2AC451E0 2AAC24A4 .........Q.*.$.*
0x7FFFFC7C: 7FFFFD14 2AC2C581 0804E6BC 7FFFFCA8 .......*........
0x7FFFFC8C: 7FFFFC94 00000003 00000001 00000003 ................
0x7FFFFC9C: 00000001 00000000 00000068 00000000 ........h.......
0x7FFFFCAC: 7FFFFCE8 2AB4F819 00000001 7FFFFD14 .......*........
0x7FFFFCBC: 7FFFFD1C 0804C470 00000000 7FFFFCE8 ....p...........
0x7FFFFCCC: 2AB4F7E9 2AAC1F00 00000001 08048A2C ...*...*....,...
PID: 898
SAP: 0
UUID: 0
switch#

Step 5 Enter the following command to determine whether the restart occurred recently:

switch# sh sys uptime 
Start Time: Fri Sep 13 12:38:39 2002
Up Time:    0 days, 1 hours, 16 minutes, 22 seconds

To determine if the restart is repetitive or a one-time occurrence, compare the length of time that the system has been up with the timestamp of each restart.

Step 6 Enter the following command to view the core files:

switch# show cores

The system output looks like the following:

Module-num  Process-name  PID   Core-create-time
----------  ------------  ----  ----------------
5           fspf          1524  Jan 9 03:11
6           fcc           919   Jan 9 03:09
8           acltcam       285   Jan 9 03:09
8           fib           283   Jan 9 03:08

This output shows all the cores presently available for upload from the active supervisor. The Module-num column shows the slot number on which the core was generated. In the example above, an fspf core was generated on the active supervisor module in slot 5, an fcc core was generated on the standby supervisor module in slot 6, and the line card in slot 8 generated acltcam and fib cores.

To copy the FSPF core dump in this example to a TFTP server with the IP address 1.1.1.1, enter the following command:

switch# copy core://5/1524 tftp://1.1.1.1/abcd

The following command displays detailed information about the process with PID 1473, which exited abnormally and produced a core dump:

switch# sh pro log pid 1473
======================================================
Service: ips
Description: IPS Manager

Started at Tue Jan  8 17:07:42 1980 (757583 us)
Stopped at Thu Jan 10 06:16:45 1980 (83451 us)
Uptime: 1 days 13 hours 9 minutes 9 seconds

Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
Exit code: signal 6 (core dumped)
CWD: /var/sysmgr/work

Virtual Memory:

    CODE      08048000 - 080FB060
    DATA      080FC060 - 080FCBA8
    BRK       081795C0 - 081EC000
    STACK     7FFFFCF0
    TOTAL     20952 KB

Register Set:

    EBX 000005C1         ECX 00000006         EDX 2AD721E0
    ESI 2AD701A8         EDI 08109308         EBP 7FFFF2EC
    EAX 00000000         XDS 0000002B         XES 0000002B
    EAX 00000025 (orig)  EIP 2AC8CC71         XCS 00000023
    EFL 00000207         ESP 7FFFF2C0         XSS 0000002B

Stack: 2608 bytes. ESP 7FFFF2C0, TOP 7FFFFCF0

0x7FFFF2C0: 2AC8C944 000005C1 00000006 2AC735E2 D..*.........5.*
0x7FFFF2D0: 2AC8C92C 2AD721E0 2AAB76F0 00000000 ,..*.!.*.v.*....
0x7FFFF2E0: 7FFFF320 2AC8C920 2AC513F8 7FFFF42C  ... ..*...*,...
0x7FFFF2F0: 2AC8E0BB 00000006 7FFFF320 00000000 ...*.... .......
0x7FFFF300: 2AC8DFF8 2AD721E0 08109308 2AC65AFC ...*.!.*.....Z.*
0x7FFFF310: 00000393 2AC6A49C 2AC621CC 2AC513F8 .......*.!.*...*
0x7FFFF320: 00000020 00000000 00000000 00000000  ...............
0x7FFFF330: 00000000 00000000 00000000 00000000 ................
0x7FFFF340: 00000000 00000000 00000000 00000000 ................
0x7FFFF350: 00000000 00000000 00000000 00000000 ................
0x7FFFF360: 00000000 00000000 00000000 00000000 ................
0x7FFFF370: 00000000 00000000 00000000 00000000 ................
0x7FFFF380: 00000000 00000000 00000000 00000000 ................
0x7FFFF390: 00000000 00000000 00000000 00000000 ................
0x7FFFF3A0: 00000002 7FFFF3F4 2AAB752D 2AC5154C .
... output abbreviated ...
Stack: 128 bytes. ESP 7FFFF830, TOP 7FFFFCD0

Step 7 Enter the following command to configure the switch to use TFTP to send the core dump to a TFTP server.

switch(config)# sys cores tftp:[//servername][/path]

This command enables the automatic copying of core files to a TFTP server. For example, the following command sends the core files to the TFTP server with the IP address 10.1.1.1.

switch(config)# system cores tftp://10.1.1.1/cores

The following conditions apply:

The core files are copied every 4 minutes. This time is not configurable.

The copy of a specific core file can be triggered manually with the following command:
copy core://<module#>/<pid#> tftp://<tftp_ip_address>/<file_name>

The maximum number of times a process can be restarted is part of the HA policy for any process (this parameter is not configurable). If the process restarts more than the maximum number of times, the older core files are overwritten.

The maximum number of core files that can be saved for any process is part of the HA policy for any process (this parameter is not configurable, and it is set to 3).
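
To confirm the setting, you can display the current core transfer configuration. This is a sketch that assumes the show system cores command behaves in this release as it does in others:

switch# show system cores
Cores are transferred to tftp://10.1.1.1/cores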

Step 8 To determine the cause and resolution for the restart condition, call Cisco TAC and ask them to review your core dump.

Working with Unrecoverable System Restarts

An unrecoverable system restart may occur in the following cases:

A critical process fails and is not restartable

A process restarts more times than is allowed by the system configuration

A process restarts more frequently than is allowed by the system configuration

The effect of a process restart is determined by the policy configured for each process. Unrecoverable restarts may cause loss of functionality, restart of the active supervisor, a supervisor switchover, or restart of the switch.

To respond to an unrecoverable restart, perform the steps listed in the "Working with Recoverable Restarts" section.
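
If the restart brought down the whole system, the reset reason recorded by the supervisor is often the quickest clue to the trigger. Assuming the show system reset-reason command is available in your release, it reports why each supervisor last restarted (for example, a reset triggered by the HA policy of a failed process):

switch# show system reset-reason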


Troubleshooting a Failed Supervisor

This section provides a workaround for a failed supervisor under certain conditions. An example situation is used to describe the problem and the workaround.

In this case, the supervisor failed when the standby was reloaded, or when the supervisor was replaced with a new one. The failed supervisor either had its version of code changed, or the running configuration on the active supervisor was not saved with the appropriate boot parameters. In either case, the problem was mismatched code on the active and standby supervisors. One clue that indicated the mismatched code was a "heartbeat" error on the active supervisor. Because of this error, the current Flash images could not be copied from the active supervisor to the standby.

The workaround was to copy the images to CompactFlash, switch consoles, and load the code from CompactFlash onto the second supervisor. The second supervisor was at a "loader" prompt, which indicates missing boot statements. When a dir slot0: command was executed, none of the images appeared. This may have been due to mismatched images on the supervisors or to the supervisor not having current images in Flash. Performing a copy slot0: bootflash: command copied the images anyway. Once the images were loaded on the second supervisor and the boot statements were confirmed and saved on the active supervisor, the supervisor loaded and came up in "standby-ha" mode.
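
A sketch of that recovery sequence follows, assuming hypothetical Release 1.3 image names and a CompactFlash card readable in the slot0: device of each supervisor (move the card to the standby supervisor if necessary). On the active supervisor, copy the images to CompactFlash:

switch# copy bootflash:m9500-sf1ek9-kickstart-mz.1.3.4a.bin slot0:
switch# copy bootflash:m9500-sf1ek9-mz.1.3.4a.bin slot0:

On the standby supervisor console, boot the kickstart image from the loader prompt, then copy the system image to bootflash and load it:

loader> boot slot0:m9500-sf1ek9-kickstart-mz.1.3.4a.bin
switch(boot)# copy slot0:m9500-sf1ek9-mz.1.3.4a.bin bootflash:
switch(boot)# load bootflash:m9500-sf1ek9-mz.1.3.4a.bin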

As a best practice, we recommend the following to avoid ending up in this situation:

1. Make sure both supervisors have their flash loaded with the same versions of kickstart and system images.

2. Make sure that the proper boot statements for Sup1 and Sup2 are set to run the same code (see the example following this list).

3. Once the boot statements are configured on the active supervisor, be sure to perform a copy running-config startup-config.

4. Copy the running configuration to CompactFlash as a safe backup.

5. Always perform a copy running-config startup-config after you modify the running configuration and confirm that the system is operating as desired.

6. Never "init" the switch unless you understand that the switch will lose its entire configuration.

7. Keep backup copies of the running kickstart and system images on CompactFlash.
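
For example, the following sketch sets matching boot statements for both supervisors and saves them. The image names are hypothetical, and the sup-1/sup-2 keywords apply to director-class switches with dual supervisors:

switch# config terminal
switch(config)# boot kickstart bootflash:m9500-sf1ek9-kickstart-mz.1.3.4a.bin sup-1
switch(config)# boot kickstart bootflash:m9500-sf1ek9-kickstart-mz.1.3.4a.bin sup-2
switch(config)# boot system bootflash:m9500-sf1ek9-mz.1.3.4a.bin sup-1
switch(config)# boot system bootflash:m9500-sf1ek9-mz.1.3.4a.bin sup-2
switch(config)# exit
switch# copy running-config startup-config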