Table Of Contents
Troubleshooting Switch System Issues
Recovering the Administrator Password
Troubleshooting System Restarts
Overview
Working with Recoverable Restarts
Working with Unrecoverable System Restarts
Troubleshooting Switch System Issues
This chapter describes how to identify and resolve problems that might occur when accessing or starting up a single Cisco MDS 9000 Family switch. It includes the following sections:
•Recovering the Administrator Password
•Troubleshooting System Restarts
Recovering the Administrator Password
If you forget the administrator password for accessing a Cisco MDS 9000 Family switch, you can recover the password using a local console connection. For the latest instructions on password recovery, go to http://www.cisco.com/warp/public/474/ and click on "MDS 9000 Series Multilayer Directors and Fabric Switches" under Storage Networking Routers.
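The procedure at that URL is the authoritative reference. As a rough outline only (the exact prompts and commands vary by software release, and <new_password> and the system image name are placeholders), recovery involves connecting a terminal to the console port, restarting the switch, interrupting the boot sequence to reach the switch(boot)# prompt, and assigning a new password before loading the system image:
switch(boot)# config terminal
switch(boot-config)# admin-password <new_password>
switch(boot-config)# exit
switch(boot)# load bootflash:<system_image>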
Troubleshooting System Restarts
This section describes the different types of system crashes and how to respond to each type. It includes the following topics:
•Overview
•Working with Recoverable Restarts
•Working with Unrecoverable System Restarts
Overview
There are three different types of system restarts:
•Recoverable—A process restarts and service is not affected.
•Unrecoverable—A process is not restartable, or it has restarted more than the maximum allowed number of times within a fixed period of time (in seconds) and will not be restarted again.
•System Hung/Crashed—No communication of any kind is possible with the switch.
Most system restarts generate a Call Home event, but the condition causing a restart can become severe enough that a Call Home event is not generated. Be sure to configure the Call Home feature properly, follow up on any initial messages regarding system restarts, and fix the problem before it becomes severe. For information about configuring Call Home, refer to the Cisco MDS 9000 Family Configuration Guide or the Cisco MDS 9000 Family Fabric Manager User Guide.
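As a minimal sketch only (the exact Call Home subcommands vary by software release, and the e-mail address is a placeholder; consult the guides above for the supported syntax), a basic Call Home configuration might look like the following:
switch# config terminal
switch(config)# callhome
switch(config-callhome)# email-contact admin@example.com
switch(config-callhome)# enable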
Working with Recoverable Restarts
Every process restart generates a Syslog message and a Call Home event. Even if the event does not affect service, you should identify and resolve the condition immediately, because future occurrences could cause a service interruption.
To respond to a recoverable system restart, follow these steps:
Step 1 Enter the following command to check the Syslog file and see which process restarted and why:
switch# show logging logfile | include error
For information about the meaning of each message, refer to the Cisco MDS 9000 Family System Messages Guide.
The system output looks like the following:
Sep 10 23:31:31 dot-6 % LOG_SYSMGR-3-SERVICE_TERMINATED: Service "sensor" (PID 704) has
finished with error code SYSMGR_EXITCODE_SY.
switch# show logging logfile | include fail
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, ad
dr 0.0.0.0, in_classd=0 flags=1 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, ad
dr 127.0.0.1, in_classd=0 flags=0 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, ad
dr 127.1.1.1, in_classd=0 flags=1 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, ad
dr 172.22.93.88, in_classd=0 flags=1 fails: Address already in use
Jan 27 23:18:59 88 % LOG_PORT-5-IF_DOWN: Interface fc1/13 is down (Link failure
Jan 27 23:18:59 88 % LOG_PORT-5-IF_DOWN: Interface fc1/14 is down (Link failure
Jan 28 00:55:12 88 % LOG_PORT-5-IF_DOWN: Interface fc1/1 is down (Link failure o
Jan 28 00:58:06 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating p
Jan 28 00:58:44 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating p
Jan 28 03:26:38 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating p
Jan 29 19:01:34 88 % LOG_PORT-5-IF_DOWN: Interface fc1/1 is down (Link failure o
Step 2 Enter the following command to identify the processes that are running and the status of each process:
switch# show processes
The following codes are used in the system output for the State (process state):
•D = uninterruptible sleep (usually IO)
•R = runnable (on run queue)
•S = sleeping
•T = traced or stopped
•Z = defunct ("zombie") process
•NR = not-running
•ER = should be running but currently not-running
Note ER is usually the state a process enters when it has been restarted too many times, has been detected as faulty by the system, and has been disabled.
The system output looks like the following (the output has been abbreviated to be more concise):
PID State PC Start_cnt TTY Process
----- ----- -------- ----------- ---- -------------
443 S 2abfd33e 1 - xinetd
446 S 2ac1e33e 1 - sysmgr
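Because ER marks a process that the system has disabled after repeated failures, it can be useful to filter the output for that state directly, using the same | include filter shown in Step 1:
switch# show processes | include ER
Any process listed by this command has exceeded its restart limit and is a likely source of the problem.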
Step 3 Enter the following command to show the processes that have had abnormal exits and whether a stack trace or core dump was produced:
switch# show processes log
The system output looks like the following:
Process PID Normal-exit Stack-trace Core Log-create-time
---------------- ------ ----------- ----------- ------- ---------------
ntp 919 N N N Jan 27 04:08
snsm 972 N Y N Jan 24 20:50
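In this example, the snsm process exited abnormally and produced a stack trace. To examine any entry in this list, pass its PID to the command described in Step 4, for example:
switch# show processes log pid 972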
Step 4 Enter the following command to show detailed information about a specific process that has restarted:
switch# show processes log pid 898
The system output looks like the following:
Description: ide hotswap handler Daemon
Started at Mon Sep 16 14:56:04 2002 (390923 us)
Stopped at Thu Sep 19 14:18:42 2002 (639239 us)
Uptime: 2 days 23 hours 22 minutes 22 seconds
Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGTERM (3)
Exit code: signal 15 (no core)
EBX 00000003 ECX 0804E994 EDX 00000008
ESI 00000005 EDI 7FFFFC9C EBP 7FFFFCAC
EAX 00000008 XDS 0000002B XES 0000002B
EAX 00000003 (orig) EIP 2ABF5EF4 XCS 00000023
EFL 00000246 ESP 7FFFFC5C XSS 0000002B
Stack: 128 bytes. ESP 7FFFFC5C, TOP 7FFFFD10
0x7FFFFC5C: 0804F990 0804C416 00000003 0804E994 ................
0x7FFFFC6C: 00000008 0804BF95 2AC451E0 2AAC24A4 .........Q.*.$.*
0x7FFFFC7C: 7FFFFD14 2AC2C581 0804E6BC 7FFFFCA8 .......*........
0x7FFFFC8C: 7FFFFC94 00000003 00000001 00000003 ................
0x7FFFFC9C: 00000001 00000000 00000068 00000000 ........h.......
0x7FFFFCAC: 7FFFFCE8 2AB4F819 00000001 7FFFFD14 .......*........
0x7FFFFCBC: 7FFFFD1C 0804C470 00000000 7FFFFCE8 ....p...........
0x7FFFFCCC: 2AB4F7E9 2AAC1F00 00000001 08048A2C ...*...*....,...
Step 5 Enter the following command to determine whether the restart occurred recently:
switch# show system uptime
The system output looks like the following:
Start Time: Fri Sep 13 12:38:39 2002
Up Time: 0 days, 1 hours, 16 minutes, 22 seconds
To determine whether the restart is repetitive or a one-time occurrence, compare the length of time that the system has been up with the timestamp of each restart. For example, several restarts of the same process within the current uptime indicate a repetitive failure.
Step 6 Enter the following command to view the core files:
switch# show cores
The system output looks like the following:
Module-num Process-name PID  Core-create-time
---------- ------------ ---- ----------------
5          fspf         1524 Jan 9 03:11
6          fcc          919  Jan 9 03:09
8          acltcam      285  Jan 9 03:09
8          fib          283  Jan 9 03:08
This output shows all the cores presently available for upload from the active supervisor. The Module-num column shows the slot on which the core was generated. In this example, an fspf core was generated on the active supervisor module in slot 5, an fcc core was generated on the standby supervisor module in slot 6, and acltcam and fib cores were generated on the line card in slot 8.
To copy the FSPF core dump in this example to a TFTP server with the IP address 1.1.1.1, enter the following command:
switch# copy core://5/1524 tftp://1.1.1.1/abcd
The following command displays the contents of the file named zone_server_log.889 in the log directory:
switch# show processes log pid 1473
======================================================
Started at Tue Jan 8 17:07:42 1980 (757583 us)
Stopped at Thu Jan 10 06:16:45 1980 (83451 us)
Uptime: 1 days 13 hours 9 minutes 9 seconds
Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
Exit code: signal 6 (core dumped)
EBX 000005C1 ECX 00000006 EDX 2AD721E0
ESI 2AD701A8 EDI 08109308 EBP 7FFFF2EC
EAX 00000000 XDS 0000002B XES 0000002B
EAX 00000025 (orig) EIP 2AC8CC71 XCS 00000023
EFL 00000207 ESP 7FFFF2C0 XSS 0000002B
Stack: 2608 bytes. ESP 7FFFF2C0, TOP 7FFFFCF0
0x7FFFF2C0: 2AC8C944 000005C1 00000006 2AC735E2 D..*.........5.*
0x7FFFF2D0: 2AC8C92C 2AD721E0 2AAB76F0 00000000 ,..*.!.*.v.*....
0x7FFFF2E0: 7FFFF320 2AC8C920 2AC513F8 7FFFF42C ... ..*...*,...
0x7FFFF2F0: 2AC8E0BB 00000006 7FFFF320 00000000 ...*.... .......
0x7FFFF300: 2AC8DFF8 2AD721E0 08109308 2AC65AFC ...*.!.*.....Z.*
0x7FFFF310: 00000393 2AC6A49C 2AC621CC 2AC513F8 .......*.!.*...*
0x7FFFF320: 00000020 00000000 00000000 00000000 ...............
0x7FFFF330: 00000000 00000000 00000000 00000000 ................
0x7FFFF340: 00000000 00000000 00000000 00000000 ................
0x7FFFF350: 00000000 00000000 00000000 00000000 ................
0x7FFFF360: 00000000 00000000 00000000 00000000 ................
0x7FFFF370: 00000000 00000000 00000000 00000000 ................
0x7FFFF380: 00000000 00000000 00000000 00000000 ................
0x7FFFF390: 00000000 00000000 00000000 00000000 ................
0x7FFFF3A0: 00000002 7FFFF3F4 2AAB752D 2AC5154C .
... output abbreviated ...
Stack: 128 bytes. ESP 7FFFF830, TOP 7FFFFCD0
Step 7 Enter the following command to configure the switch to use TFTP to send the core dump to a TFTP server:
switch(config)# system cores tftp:[//servername][/path]
This command enables the automatic copying of core files to a TFTP server. For example, the following command sends the core files to the TFTP server with the IP address 10.1.1.1:
switch(config)# system cores tftp://10.1.1.1/cores
The following conditions apply:
•The core files are copied every 4 minutes. This interval is not configurable.
•You can manually trigger the copy of a specific core file by using the following command (see the example after this list):
copy core://module#/pid# tftp://tftp_ip_address/file_name
•The maximum number of times a process can be restarted is part of the HA policy for each process (this parameter is not configurable). If a process restarts more than the maximum number of times, the older core files are overwritten.
•The maximum number of core files that can be saved for any process is part of the HA policy for each process (this parameter is not configurable and is set to 3).
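As a concrete instance of the manual copy described above, the following commands (assuming the acltcam core from the Step 6 example and the TFTP server at 10.1.1.1, with acltcam_core as a placeholder file name) first verify the automatic copy configuration and then copy the core file by hand:
switch# show system cores
switch# copy core://8/285 tftp://10.1.1.1/acltcam_core
The show system cores command displays the currently configured destination for core file transfers.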
Step 8 To determine the cause and resolution for the restart condition, call Cisco TAC and ask them to review your core dump.
Working with Unrecoverable System Restarts
An unrecoverable system restart may occur in the following cases:
•A critical process fails and is not restartable
•A process restarts more times than is allowed by the system configuration
•A process restarts more frequently than is allowed by the system configuration
The effect of a process restart is determined by the policy configured for each process. Unrecoverable restarts may cause loss of functionality, restart of the active supervisor, a supervisor switchover, or restart of the switch.
To respond to an unrecoverable restart, perform the steps listed in the "Working with Recoverable Restarts" section.