Cisco MDS 9000 Family Troubleshooting Guide, Release 1.3
Troubleshooting Switch System Issues

Table Of Contents

Troubleshooting Switch System Issues

Recovering the Administrator Password

Troubleshooting System Restarts

Overview

Working with a Firmware Upgrade Failure

Working with Recoverable Restarts

Working with Unrecoverable System Restarts

Troubleshooting a Failed Supervisor


Troubleshooting Switch System Issues


This chapter describes how to identify and resolve problems that might occur when accessing or starting up a single Cisco MDS 9000 Family switch. It includes the following sections:

Recovering the Administrator Password

Troubleshooting System Restarts

Troubleshooting a Failed Supervisor

Recovering the Administrator Password

If you forget the administrator password for accessing a Cisco MDS 9000 Family switch, you can recover the password using a local console connection. For the latest instructions on password recovery, go to http://www.cisco.com/warp/public/474/ and click on "MDS 9000 Series Multilayer Directors and Fabric Switches" under Storage Networking Routers.
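
In outline, the recovery procedure resembles the following sketch. This is an illustration only; the exact prompts and commands can vary by release, and the system image name shown is hypothetical, so always verify against the instructions at the link above. Connect to the console port, reboot the switch, and press Ctrl-] during the boot sequence so that only the kickstart image loads. Then reset the password from the resulting boot prompt:

switch(boot)# config terminal
switch(boot-config)# admin-password <new_password>
switch(boot-config)# exit
switch(boot)# load bootflash:m9500-sf1ek9-mz.1.3.4a.bin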

Troubleshooting System Restarts

This section describes the different types of system crashes and how to respond to each type. It includes the following topics:

Overview

Working with a Firmware Upgrade Failure

Working with Recoverable Restarts

Working with Unrecoverable System Restarts

Overview

There are three different types of system restarts:

Recoverable—A process restarts and service is not affected.

Unrecoverable—A process cannot be restarted, or it has restarted more than the maximum allowed number of times within a fixed period of time and is not restarted again.

System hung/crashed—No communication of any kind is possible with the switch.

Most system restarts generate a Call Home event, but the condition causing a restart may become so severe that a Call Home event cannot be generated. Be sure to configure the Call Home feature properly, follow up on any initial messages regarding system restarts, and fix the problem before it becomes severe enough to prevent Call Home messages. For information about configuring Call Home, refer to the Cisco MDS 9000 Family Configuration Guide or the Cisco MDS 9000 Family Fabric Manager User Guide.
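
As an illustration only, a minimal Call Home configuration follows the general pattern below. The command names are an outline and the addresses are placeholders; check the configuration guide for the exact syntax in your release.

switch# config terminal
switch(config)# callhome
switch(config-callhome)# email-contact admin@example.com
switch(config-callhome)# destination-profile full-txt-destination email-addr noc@example.com
switch(config-callhome)# transport email smtp-server 192.0.2.10
switch(config-callhome)# enable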

Working with a Firmware Upgrade Failure

When you perform a hitless upgrade (either by using the install all command or by using Cisco Fabric Manager) and the upgrade fails with no error message displayed, contact technical support to determine the cause of the failure. Do not attempt to reboot the switch.
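
For reference, a hitless upgrade is normally started as shown below; the image names are hypothetical. If this command fails silently, stop at that point and open a support case rather than rebooting.

switch# install all kickstart bootflash:m9500-sf1ek9-kickstart-mz.1.3.4a.bin system bootflash:m9500-sf1ek9-mz.1.3.4a.bin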

Working with Recoverable Restarts

Every process restart generates a syslog message and a Call Home event. Even if the event is not service-affecting, you should identify and resolve the condition immediately, because future occurrences could cause a service interruption.

To respond to a recoverable system restart, follow these steps:


Step 1 Enter the following command to check the syslog file and see which process restarted and why.

switch# sh log logfile | include error

For information about the meaning of each message, refer to the Cisco MDS 9000 Family System Messages Guide.

The system output looks like the following:

Sep 10 23:31:31 dot-6 % LOG_SYSMGR-3-SERVICE_TERMINATED: Service "sensor" (PID 704) has finished with error code SYSMGR_EXITCODE_SY.
switch# show logging logfile | include fail
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, addr 0.0.0.0, in_classd=0 flags=1 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, addr 127.0.0.1, in_classd=0 flags=0 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, addr 127.1.1.1, in_classd=0 flags=1 fails: Address already in use
Jan 27 04:08:42 88 %LOG_DAEMON-3-SYSTEM_MSG: bind() fd 4, family 2, port 123, addr 172.22.93.88, in_classd=0 flags=1 fails: Address already in use
Jan 27 23:18:59 88 % LOG_PORT-5-IF_DOWN: Interface fc1/13 is down (Link failure or not-connected)
Jan 27 23:18:59 88 % LOG_PORT-5-IF_DOWN: Interface fc1/14 is down (Link failure or not-connected)
Jan 28 00:55:12 88 % LOG_PORT-5-IF_DOWN: Interface fc1/1 is down (Link failure or not-connected)
Jan 28 00:58:06 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating port fc1/1 (VSAN 100)
Jan 28 00:58:44 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating port fc1/1 (VSAN 100)
Jan 28 03:26:38 88 % LOG_ZONE-2-ZS_MERGE_FAILED: Zone merge failure, Isolating port fc1/1 (VSAN 100)
Jan 29 19:01:34 88 % LOG_PORT-5-IF_DOWN: Interface fc1/1 is down (Link failure or not-connected)
switch#

Step 2 Enter the following command to identify the processes that are running and the status of each process.

switch# show processes 

The following codes are used in the system output for the State (process state):

D = uninterruptible sleep (usually IO)

R = runnable (on run queue)

S = sleeping

T = traced or stopped

Z = defunct ("zombie") process

NR = not-running

ER = should be running but currently not-running


Note ER is usually the state of a process that has been restarted too many times, has been detected by the system as faulty, and has been disabled.


The system output looks like the following (the output has been abbreviated to be more concise):

PID    State  PC        Start_cnt    TTY   Process
-----  -----  --------  -----------  ----  -------------
    1      S  2ab8e33e            1     -  init
    2      S         0            1     -  keventd
    3      S         0            1     -  ksoftirqd_CPU0
    4      S         0            1     -  kswapd
    5      S         0            1     -  bdflush
    6      S         0            1     -  kupdated
   71      S         0            1     -  kjournald
  136      S         0            1     -  kjournald
  140      S         0            1     -  kjournald
  431      S  2abe333e            1     -  httpd
  443      S  2abfd33e            1     -  xinetd
  446      S  2ac1e33e            1     -  sysmgr
  452      S  2abe91a2            1     -  httpd
  453      S  2abe91a2            1     -  httpd
  456      S  2ac73419            1    S0  vsh
  469      S  2abe91a2            1     -  httpd
  470      S  2abe91a2            1     -  httpd  

Step 3 Enter the following command to show the processes that have had abnormal exits and whether a stack trace or core dump was generated.

switch# show process log
Process           PID     Normal-exit  Stack-trace  Core     Log-create-time
----------------  ------  -----------  -----------  -------  ---------------
ntp               919               N            N        N  Jan 27 04:08
snsm              972               N            Y        N  Jan 24 20:50

Step 4 Enter the following command to show detailed information about a specific process that has restarted:

switch# show processes log pid 898

The system output looks like the following:

Service: idehsd
Description: ide hotswap handler Daemon
Started at Mon Sep 16 14:56:04 2002 (390923 us)
Stopped at Thu Sep 19 14:18:42 2002 (639239 us)
Uptime: 2 days 23 hours 22 minutes 22 seconds
Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGTERM (3)
Exit code: signal 15 (no core)
CWD: /var/sysmgr/work
Virtual Memory:
    CODE      08048000 - 0804D660
    DATA      0804E660 - 0804E824
    BRK       0804E9A0 - 08050000
    STACK     7FFFFD10
Register Set:
    EBX 00000003         ECX 0804E994         EDX 00000008
    ESI 00000005         EDI 7FFFFC9C         EBP 7FFFFCAC
    EAX 00000008         XDS 0000002B         XES 0000002B
    EAX 00000003 (orig)  EIP 2ABF5EF4         XCS 00000023
    EFL 00000246         ESP 7FFFFC5C         XSS 0000002B
Stack: 128 bytes. ESP 7FFFFC5C, TOP 7FFFFD10
0x7FFFFC5C: 0804F990 0804C416 00000003 0804E994 ................
0x7FFFFC6C: 00000008 0804BF95 2AC451E0 2AAC24A4 .........Q.*.$.*
0x7FFFFC7C: 7FFFFD14 2AC2C581 0804E6BC 7FFFFCA8 .......*........
0x7FFFFC8C: 7FFFFC94 00000003 00000001 00000003 ................
0x7FFFFC9C: 00000001 00000000 00000068 00000000 ........h.......
0x7FFFFCAC: 7FFFFCE8 2AB4F819 00000001 7FFFFD14 .......*........
0x7FFFFCBC: 7FFFFD1C 0804C470 00000000 7FFFFCE8 ....p...........
0x7FFFFCCC: 2AB4F7E9 2AAC1F00 00000001 08048A2C ...*...*....,...
PID: 898
SAP: 0
UUID: 0
switch#

Step 5 Enter the following command to determine whether the restart occurred recently:

switch# sh sys uptime 
Start Time: Fri Sep 13 12:38:39 2002
Up Time:    0 days, 1 hours, 16 minutes, 22 seconds

To determine if the restart is repetitive or a one-time occurrence, compare the length of time that the system has been up with the timestamp of each restart.

Step 6 Enter the following command to view the core files:

switch# show cores

The system output looks like the following:

Module-num  Process-name  PID   Core-create-time
----------  ------------  ----  ----------------
5           fspf          1524  Jan 9 03:11
6           fcc           919   Jan 9 03:09
8           acltcam       285   Jan 9 03:09
8           fib           283   Jan 9 03:08

This output shows all the cores presently available for upload from the active supervisor. The Module-num column shows the slot number on which the core was generated. In the example above, an fspf core was generated on the active supervisor module in slot 5, an fcc core was generated on the standby supervisor module in slot 6, and the line card in slot 8 generated acltcam and fib cores.

To copy the FSPF core dump in this example to a TFTP server with the IP address 1.1.1.1, enter the following command:

switch# copy core://5/1524 tftp://1.1.1.1/abcd

The following command displays detailed information about the process with PID 1473, which exited abnormally and produced a core dump:

switch# sh pro log pid 1473
======================================================
Service: ips
Description: IPS Manager

Started at Tue Jan  8 17:07:42 1980 (757583 us)
Stopped at Thu Jan 10 06:16:45 1980 (83451 us)
Uptime: 1 days 13 hours 9 minutes 9 seconds

Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
Exit code: signal 6 (core dumped)
CWD: /var/sysmgr/work

Virtual Memory:

    CODE      08048000 - 080FB060
    DATA      080FC060 - 080FCBA8
    BRK       081795C0 - 081EC000
    STACK     7FFFFCF0
    TOTAL     20952 KB

Register Set:

    EBX 000005C1         ECX 00000006         EDX 2AD721E0
    ESI 2AD701A8         EDI 08109308         EBP 7FFFF2EC
    EAX 00000000         XDS 0000002B         XES 0000002B
    EAX 00000025 (orig)  EIP 2AC8CC71         XCS 00000023
    EFL 00000207         ESP 7FFFF2C0         XSS 0000002B

Stack: 2608 bytes. ESP 7FFFF2C0, TOP 7FFFFCF0

0x7FFFF2C0: 2AC8C944 000005C1 00000006 2AC735E2 D..*.........5.*
0x7FFFF2D0: 2AC8C92C 2AD721E0 2AAB76F0 00000000 ,..*.!.*.v.*....
0x7FFFF2E0: 7FFFF320 2AC8C920 2AC513F8 7FFFF42C  ... ..*...*,...
0x7FFFF2F0: 2AC8E0BB 00000006 7FFFF320 00000000 ...*.... .......
0x7FFFF300: 2AC8DFF8 2AD721E0 08109308 2AC65AFC ...*.!.*.....Z.*
0x7FFFF310: 00000393 2AC6A49C 2AC621CC 2AC513F8 .......*.!.*...*
0x7FFFF320: 00000020 00000000 00000000 00000000  ...............
0x7FFFF330: 00000000 00000000 00000000 00000000 ................
0x7FFFF340: 00000000 00000000 00000000 00000000 ................
0x7FFFF350: 00000000 00000000 00000000 00000000 ................
0x7FFFF360: 00000000 00000000 00000000 00000000 ................
0x7FFFF370: 00000000 00000000 00000000 00000000 ................
0x7FFFF380: 00000000 00000000 00000000 00000000 ................
0x7FFFF390: 00000000 00000000 00000000 00000000 ................
0x7FFFF3A0: 00000002 7FFFF3F4 2AAB752D 2AC5154C .
... output abbreviated ...
Stack: 128 bytes. ESP 7FFFF830, TOP 7FFFFCD0

Step 7 Enter the following command to configure the switch to use TFTP to send the core dump to a TFTP server.

switch(config)# sys cores tftp:[//servername][/path]

This command enables the automatic copying of core files to a TFTP server. For example, the following command sends the core files to the TFTP server with the IP address 10.1.1.1.

switch(config)# system cores tftp://10.1.1.1/cores

The following conditions apply:

The core files are copied every 4 minutes. This time is not configurable.

The copy of a specific core file can be triggered manually with the following command:
copy core://<module#>/<pid#> tftp://<tftp_ip_address>/<file_name>

The maximum number of times a process can be restarted is part of the HA policy for any process (this parameter is not configurable). If the process restarts more than the maximum number of times, the older core files are overwritten.

The maximum number of core files that can be saved for any process is part of the HA policy for any process (this parameter is not configurable, and it is set to 3).
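
To confirm the setting, you can display the current core transfer configuration. This is a sketch that assumes the show system cores command behaves in this release as it does in others:

switch# show system cores
Cores are transferred to tftp://10.1.1.1/cores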

Step 8 To determine the cause and resolution for the restart condition, call Cisco TAC and ask them to review your core dump.

Working with Unrecoverable System Restarts

An unrecoverable system restart may occur in the following cases:

A critical process fails and is not restartable

A process restarts more times than is allowed by the system configuration

A process restarts more frequently than is allowed by the system configuration

The effect of a process restart is determined by the policy configured for each process. Unrecoverable restarts may cause loss of functionality, restart of the active supervisor, a supervisor switchover, or restart of the switch.

To respond to an unrecoverable restart, perform the steps listed in the "Working with Recoverable Restarts" section.
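
If the restart brought down the whole system, the reset reason recorded by the supervisor is often the quickest clue to the trigger. Assuming the show system reset-reason command is available in your release, it reports why each supervisor last restarted (for example, a reset triggered by the HA policy of a failed process):

switch# show system reset-reason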


Troubleshooting a Failed Supervisor

This section provides a workaround for a failed supervisor under certain conditions. An example situation is used to describe the problem and the workaround.

In this case, the supervisor failed when the standby was reloaded, or when the supervisor was replaced with a new one. The failed supervisor either had its version of code changed, or the running configuration on the active supervisor was not saved with the appropriate boot parameters. In either case, the problem was mismatched code on the active and standby supervisors. One clue that indicated the mismatched code was a "heartbeat" error on the active supervisor. Because of this error, the current Flash images could not be copied from the active supervisor to the standby.

The workaround was to copy the images to CompactFlash, switch consoles, and load the code from CompactFlash onto the second supervisor. The second supervisor was at a "loader" prompt, which indicates missing boot statements. When a dir slot0: command was executed, none of the images appeared. This may have been due to mismatched images on the supervisors or to the supervisor not having current images in Flash. Performing a copy slot0: bootflash: command copied the images anyway. Once the images were loaded on the second supervisor and the boot statements were confirmed and saved on the active supervisor, the supervisor loaded and came up in "standby-ha" mode.
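
A sketch of that recovery sequence follows, assuming hypothetical Release 1.3 image names and a CompactFlash card readable in the slot0: device of each supervisor (move the card to the standby supervisor if necessary). On the active supervisor, copy the images to CompactFlash:

switch# copy bootflash:m9500-sf1ek9-kickstart-mz.1.3.4a.bin slot0:
switch# copy bootflash:m9500-sf1ek9-mz.1.3.4a.bin slot0:

On the standby supervisor console, boot the kickstart image from the loader prompt, then copy the system image to bootflash and load it:

loader> boot slot0:m9500-sf1ek9-kickstart-mz.1.3.4a.bin
switch(boot)# copy slot0:m9500-sf1ek9-mz.1.3.4a.bin bootflash:
switch(boot)# load bootflash:m9500-sf1ek9-mz.1.3.4a.bin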

As a best practice, we recommend the following to avoid ending up in this situation:

1. Make sure both supervisors have their flash loaded with the same versions of kickstart and system images.

2. Make sure that the proper boot statements for Sup1 and Sup2 are set to run the same code (see the example following this list).

3. Once the boot statements are configured on the active supervisor, be sure to perform a copy running-config startup-config.

4. Copy the running configuration to CompactFlash as a safe backup.

5. Always perform a copy running-config startup-config after you modify the running configuration and confirm that the system is operating as desired.

6. Never "init" the switch unless you understand that the switch will lose its entire configuration.

7. Keep backup copies of the running kickstart and system images on CompactFlash.
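
For example, the following sketch sets matching boot statements for both supervisors and saves them. The image names are hypothetical, and the sup-1/sup-2 keywords apply to director-class switches with dual supervisors:

switch# config terminal
switch(config)# boot kickstart bootflash:m9500-sf1ek9-kickstart-mz.1.3.4a.bin sup-1
switch(config)# boot kickstart bootflash:m9500-sf1ek9-kickstart-mz.1.3.4a.bin sup-2
switch(config)# boot system bootflash:m9500-sf1ek9-mz.1.3.4a.bin sup-1
switch(config)# boot system bootflash:m9500-sf1ek9-mz.1.3.4a.bin sup-2
switch(config)# exit
switch# copy running-config startup-config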