Cisco Application and Content Networking System (ACNS) Software

Interpret and Troubleshoot Hard Disk Errors for ACNS 4.2 and 5.0 on Cisco Content Engines

Document ID: 69674

Updated: Mar 29, 2006

Introduction

This document describes hard disk errors for Cisco Application and Content Networking System (ACNS) Software Releases 4.2 and 5.0 on Cisco Content Engines (CEs). This document also explains how to interpret and troubleshoot hard disk errors. The procedures in this document help you determine whether a disk drive is operational, and whether the problem is a hardware issue or software issue if the drive does not function properly. When you encounter issues with the hard disk, you must thoroughly troubleshoot the disk drive in order to avoid unnecessary hardware replacement.

Prerequisites

Requirements

There are no specific requirements for this document.

Components Used

The information in this document is based on these software and hardware versions:

  • ACNS 4.2 and 5.0

  • Content Network Engines CE-507-K9, CE-507AV-K9, CE-560-K9, CE-560AV-CDN-K9, CE-590-DC-K9, CE-590-K9 and CE-590-ICDN-K9

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.

Conventions

Refer to Cisco Technical Tips Conventions for more information on document conventions.

Disk Error Reports

ACNS 4.2 and 5.0 report disk drive failure in several ways. There are slight differences in error reporting in releases 4.2 and 5.0 but the overall approach is similar.

Dead Disk Drive

There are several modes of drive failure. In the most extreme mode, the failed drive does not even show up on the Small Computer System Interface (SCSI) bus. When such a failure occurs, the software assumes that the drive is not present. Perform a visual inspection of the box. If you can see that a drive is present, but ACNS indicates that the drive is missing, the drive has clearly failed. For example, issue the show disks command or the show disks details command. If the output of these commands claims that no disk is present, there is a clear failure.

Ensure that the drive has not become loose. Also, check for SCSI cabling problems. If none of these actions resolves the issue, you need to replace the disk.

Caution: Ensure that you turn off the power before you check the cables, or before you re-seat or insert the drive.

Hardware Error

A more common mode of failure is when something goes wrong on a drive, and the drive is unable to read or write one or more sectors. The generic SCSI driver code picks up this failure, and displays a message at the LOG_CRIT level. This error goes to the syslog, which resides by default in /local1/syslog.txt.

On ACNS 5.0 and later, the error message also goes to the Content Distribution Manager (CDM), if a CDM is configured.

Here is the format of the error message:

SCSI I/O error: POSSIBLE BAD DISK -- device 0x%x, sector %d

The message includes the word "possible" as a hedge against other sources of I/O failure, for example, SCSI cabling failure if a storage array is involved, or various Fibre Channel failure conditions. However, this message typically does indicate disk failure. Various other kernel messages that indicate file system corruption also accompany this message.

Note: Look for the string "ext2" in such messages.

In the error message, the device is a hexadecimal major+minor device number of the form 0x8XY or 0x41XY, where X and Y are hexadecimal digits. X indicates the physical drive (from the Linux perspective) and Y indicates the partition on the affected drive. The drive digit is 0-based, and the partition number is 1-based (0 for a partition means the entire drive). For example, 0x802 means disk00, partition 2, and 0x4103 means disk16, partition 3.

This table lists the mapping between device numbers when all disk drives are present:

Device Number       Description
0x800 - 0x80f       Disk00, or partition 1 through 15 on disk00
0x810 - 0x81f       Disk01, or partition 1 through 15 on disk01
...                 ...
0x8f0 - 0x8ff       Disk15, or partition 1 through 15 on disk15
0x4100 - 0x410f     Disk16, or partition 1 through 15 on disk16
0x4110 - 0x411f     Disk17, or partition 1 through 15 on disk17
...                 ...
0x41f0 - 0x41ff     Disk31, or partition 1 through 15 on disk31

Note: Mapping can be offset if one or more disk drives are missing.
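
The device number arithmetic described above can be checked with a short script. This Python sketch is illustrative only and is not part of ACNS. The function name decode_scsi_device is hypothetical, and the sketch assumes that all disk drives are present, so that the mapping in the table is not offset.

```python
def decode_scsi_device(device):
    """Return (disk_index, partition) for a device number from the error message.

    Assumes the 0x8XY / 0x41XY layout described in this document and that
    no drive is missing (a missing drive can offset the mapping).
    """
    major, minor = device >> 8, device & 0xFF
    if major == 0x08:
        base = 0       # disk00 through disk15
    elif major == 0x41:
        base = 16      # disk16 through disk31
    else:
        raise ValueError("unexpected major number 0x%x" % major)
    disk = base + (minor >> 4)   # X digit: physical drive, 0-based
    partition = minor & 0x0F     # Y digit: partition, 0 means the entire drive
    return disk, partition


print(decode_scsi_device(0x802))    # (0, 2)  -> disk00, partition 2
print(decode_scsi_device(0x4103))   # (16, 3) -> disk16, partition 3
```

The two calls reproduce the examples given in the text: 0x802 decodes to disk00, partition 2, and 0x4103 decodes to disk16, partition 3.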

Note: You do not need to know the partition number. The sector number identifies a sector within the affected partition. This number is not critical, but it is reported for informational purposes. Sometimes, the DE or others can use this information to reproduce the failure. To do so, the DE must manually trigger a disk I/O to the relevant area of the disk drive.

If you observe strange behavior, or suspect disk failure for any reason, issue the show disks and the show disks details EXEC commands. You can confirm drive failure if the output of these commands contains messages similar to this:

disk<x> is bad. Check cable or replace it.

In this message, <x> can be 00, 01, or higher. This value indicates the drive which failed. Refer to the product documentation for Cisco ACNS Software to find the physical location of the drive relative to the rest of the box.

The "Check cable or replace it" part of the message applies only to drives on an external storage array. You can ignore this part of the message for internal drives on the bulk of models in the field, namely the CE-507 and CE-560 with only internal drives.

The show disks and the show disks details commands perform only a cursory disk check. Sometimes, the check does not identify all failures. Therefore, in addition to these Command Line Interface (CLI) commands, you must also obtain the syslog output, which resides by default in /local1/syslog.txt. Copy the output to an external system through the copy disk ftp command. Use a file viewer, text editor, or word processor to search through the log file. Search from the end of the file backwards to look at the newest messages first. Look for "POSSIBLE BAD DISK" and similar messages. You can also achieve this through the find EXEC command:

ce#find match "POSSIBLE BAD DISK" syslog.txt
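
If you have copied syslog.txt to an external system, a small script can perform the same search and list the newest messages first. This Python sketch is only an illustration: the regular expression follows the message format quoted in this document, the function name find_bad_disk_events is hypothetical, and the sample lines are invented for the example rather than taken from a real log.

```python
import re

# Follows the documented format: "POSSIBLE BAD DISK -- device 0x%x, sector %d"
BAD_DISK = re.compile(r"POSSIBLE BAD DISK -- device 0x([0-9a-fA-F]+), sector (\d+)")


def find_bad_disk_events(lines):
    """Return (line_number, device, sector) tuples, last log line first."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = BAD_DISK.search(line)
        if match:
            events.append((number, int(match.group(1), 16), int(match.group(2))))
    events.reverse()  # the log is appended to, so later lines are newer
    return events


# Hypothetical sample lines; a real syslog.txt has additional prefix fields.
sample = [
    "SCSI I/O error: POSSIBLE BAD DISK -- device 0x802, sector 1234",
    "an unrelated log line",
    "SCSI I/O error: POSSIBLE BAD DISK -- device 0x4103, sector 99",
]

for line_number, device, sector in find_bad_disk_events(sample):
    print("line %d: device 0x%x, sector %d" % (line_number, device, sector))
```

To scan a real copy of the log, pass the open file object instead of the sample list.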

You can also observe certain drive failures in boot output on the serial console. In ACNS 5.0 and later, this output also goes into the syslog after bootup, and can appear in the syslog.txt file. These messages are similar to the messages in the output of the show disks or show disks details command, for example, the disk<x> is bad message. Look for these messages after the lines that contain the string "BOOT-100", and before the line that contains the string "entering runlevel 200". If no error messages occur between these lines, you can conclude that all file systems are properly mounted. I/O errors can still occur later, typically if the drive failure is limited to a certain sector or range of sectors. Pay attention to the syslog output.
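
The boot-window check can be sketched as a script that runs against a saved console capture or a copy of syslog.txt. This Python example is an assumption-laden illustration: the function name boot_disk_errors is hypothetical, the two marker strings come from this document, and the sample lines (including the exact wording of the runlevel line) are invented.

```python
def boot_disk_errors(lines):
    """Collect disk error lines between 'BOOT-100' and 'entering runlevel 200'."""
    inside = False
    hits = []
    for line in lines:
        if "entering runlevel 200" in line:
            break                      # end of the window of interest
        if "BOOT-100" in line:
            inside = True              # window starts after a BOOT-100 line
            continue
        if inside and ("is bad" in line or "not in standard configuration" in line):
            hits.append(line.strip())
    return hits


# Hypothetical boot capture for illustration only.
boot_log = [
    "BOOT-100: disk apply",
    "disk01 is bad. Check cable or replace it.",
    "entering runlevel 200",
    "disk02 is bad. Check cable or replace it.",
]

print(boot_disk_errors(boot_log))
```

An empty result suggests that the file systems mounted properly at boot, but it does not rule out I/O errors that occur later.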

Software Error on Disk00

Some types of disk-related problems can result in error messages that do not actually indicate disk failure, but rather some other problem. In these cases, hardware replacement or RMA is not necessary. Here is a standard message that indicates a software problem with system-use disk partitions or filesystems:

First disk not in standard configuration.
Run disk recover command and re-install the software.

This message appears on the console during bootup, and also in the syslog if /local/local1 was able to mount. Specific cases where this message appears have different causes, but are generally rare. To resolve this problem, issue the disk recover command, or issue the disk erase-disk00-partitions command followed by the disk recover command.

Determine Whether a Problem is a Hardware Disk Error or a Software Issue

This section provides step-by-step instructions to determine whether a problem is a hardware disk error or a software issue. This section covers SCSI disks only. This section does not cover Redundant Array of Independent Disks (RAID) drives, Fibre Channel drives, or Network Attached Storage (NAS) devices.

Step-by-step Procedure

Step 1

Check whether the CE can boot up.

In some extreme and rare cases, the SCSI disk drive can have problems that cause the ACNS device driver to hang during bootup. You can verify this from the console of the CE. If the "SCSI subsystem driver Revision: 1.00" message appears, and ACNS does not boot, check whether the disk drive or SCSI subsystem is faulty. Remove the disk drives and reboot the system to see if the problem relates to a disk drive. If this action fails, contact Cisco to determine the root cause of the problem.

If the CE boots and you receive the login prompt on the console, proceed to Step 2.

Step 2

Issue the show version command to verify the software version. Note the ACNS version number.

Step 3

Issue the show disks details command, and verify the output. The disk drive must appear as "Normal" if inserted. Here is a sample output for ACNS 4.2:

ACNS42#show disks details
......
disk16: Normal        (h04 c00 i08 l00)    17501MB( 17.1GB)
    disk16/00: MEDIAFS       17500MB( 17.1GB) mounted internally
    FREE:                        0MB(  0.0GB)

Here is a sample output for ACNS 5.x:

ACNS5#show disks details
......
disk14: Normal        (h01 c00 i09 l00 - Ext DAS)        35000MB( 34.2GB)
    disk14/00: CFS           34999MB( 34.2GB)
    FREE:                        1MB(  0.0GB)

Step 4

Check whether any disk drive appears as "Not present". If you are sure that the disk drive is physically present, but the output shows the drive as "Not present", a dead disk drive is possible. Go to Step 9.

Here is a sample output for ACNS 4.2:

ACNS42#show disks details
......
disk01: Not present

Here is a sample output for ACNS 5.x:

ACNS5#show disks details
......
disk05: Not present

Step 5

Check whether any disk drive appears as "Not recognized". "Not recognized" usually indicates that other operating systems, for example, Windows or Linux, used the disk drive earlier. This problem does not occur if you use disk drives that Cisco provides. Obtain a disk drive from Cisco, and go to Step 10.

Here is a sample output for ACNS 4.2:

ACNS42#show disks details
/ruby/bin/ruby_disk: disk [/dev/sdb] has an unknown partition [/dev/sdb1], skipping it
......
disk01: Not recognized

Here is a sample output for ACNS 5.x:

ACNS5#show disks details
/ruby/bin/ruby_disk: disk [/dev/sdi] has an unknown partiton [/dev/sdi2], skipping it
......
disk08: Not recognized

Step 6

Check whether any disk drive appears as "Problematic". This status usually indicates a hardware problem. Error or warning messages can be different. Some errors can indicate that diskXX is bad, while other errors can indicate that disk /dev/sdX: cannot {open|read|write|seek}. Go to Step 9.

Here is a sample output for ACNS 4.2:

ACNS42#show disks details
disk04 is bad. Check cable or replace it.
ruby_disk: Disk /dev/sdg: cannot open: Device not configured
......
disk04: Problematic
......
disk07: Problematic

Here is a sample output for ACNS 5.x:

ACNS5#show disks details
disk01 is bad. Check cable or replace it.
......
disk01: Problematic
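
The status checks in Steps 3 through 6 can be approximated offline against saved show disks details output. This Python sketch is illustrative only: the names STATUS_LINE, NEXT_STEP, and triage are hypothetical, the status strings and step numbers are the ones this document uses, and the sample text is assembled from the sample outputs in these steps.

```python
import re

# Matches top-level lines such as "disk04: Problematic"; partition lines
# such as "disk16/00: MEDIAFS" do not match because of the "/".
STATUS_LINE = re.compile(r"^(disk\d+): (Normal|Not present|Not recognized|Problematic)")

# Next step for each non-normal status, as given in Steps 4, 5, and 6.
NEXT_STEP = {
    "Not present": "Step 9",
    "Not recognized": "Step 10",
    "Problematic": "Step 9",
}


def triage(output):
    """Map each non-normal disk to (status, next step in this procedure)."""
    actions = {}
    for line in output.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match and match.group(2) != "Normal":
            actions[match.group(1)] = (match.group(2), NEXT_STEP[match.group(2)])
    return actions


sample_output = """disk04 is bad. Check cable or replace it.
......
disk04: Problematic
disk05: Not present
disk08: Not recognized
disk16: Normal        (h04 c00 i08 l00)    17501MB( 17.1GB)
    disk16/00: MEDIAFS       17500MB( 17.1GB) mounted internally
"""

for disk, (status, step) in sorted(triage(sample_output).items()):
    print("%s: %s -> go to %s" % (disk, status, step))
```

A "Not present" result is a problem only if a drive is physically installed in that slot, so confirm this by inspection before you act on it.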

Step 7

Check whether the disk drive contains any SCSI errors. Search the syslog.txt file.

Messages also appear on the console or any terminal, depending on the log configuration. If you find the "POSSIBLE BAD DISK" message in syslog.txt, you can conclude that either the disk drive is faulty, or the SCSI connection is bad. Determine the disk number from the device value in the message, and then go to Step 9. Here is the format of the message:

SCSI I/O error: POSSIBLE BAD DISK -- device 0x%x, sector %d

Step 8

Issue the show disks details command, or go through the console startup log to check for a software problem with disk00. For ACNS 5.x, you can find the console startup log also in syslog.txt.

Disk00 has some special file systems that store ACNS software and other state information that are persistent across reloads. The show disks details command must show the portion of disk00 for "System use". If you cannot find the "System use" portion, and you do not find any hardware problem in the previous steps, go to Step 11.

Here is some sample output for a good disk00 on ACNS 4.2:

disk00: Normal        (h00 c00 i00 l00)    17357MB( 17.0GB)
    System use:               5119MB(  5.0GB)
    FREE:                    12237MB( 12.0GB)

Here is some sample output for a good disk00 on ACNS 5.x:

disk00: Normal        (h00 c00 i00 l00 - Int DAS)        69999MB( 68.4GB)
    disk00/04: PHYS-FS       59246MB( 57.9GB) mounted internally
    disk00/04: CDNFS         59246MB( 57.9GB) mounted internally
    disk00/04: MEDIAFS       51893MB( 50.7GB) mounted internally
    System use:              10751MB( 10.5GB)
    FREE:                        1MB(  0.0GB)

Here is some sample output for a bad disk00 on ACNS 4.2:

disk00: Normal          (h00 c00 i00 l00)    17499MB( 17.1GB)
FREE:                    17499MB( 17.1GB)

Here is some sample output for a bad disk00 on ACNS 5.x:

disk00: Normal        (h00 c00 i00 l00 - Int DAS)        17357MB( 17.0GB)
    FREE:                    17357MB( 17.0GB)

Here is the startup message from ACNS 4.2:

BOOT-100: disk apply
*****
Your first disk is not in standard configuration.
You might need to run 'disk recover' from the CLI.
*****

Here is the startup message from ACNS 5.x:

ruby_disk: Your first disk is not in standard configuration.
ruby_disk: Run 'disk recover' from the CLI
/ruby/bin/code100.sh: NOTE: ruby_disk apply returned 6


********************************************
  System software is missing.             
  Check whether first-disk is bad, or        
  use 'disk recover' to recover first-disk. 
********************************************
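
The "System use" check in this step can also be run against saved show disks details output. This Python sketch is an illustration, not an ACNS tool: the function name disk00_has_system_use is hypothetical, and the two sample strings are copied from the good and bad ACNS 4.2 outputs in this step.

```python
import re

# Top-level disk lines look like "disk00: Normal ..."; partition lines such as
# "disk00/04: CDNFS" are not headers because of the "/".
DISK_HEADER = re.compile(r"^disk(\d+):")


def disk00_has_system_use(output):
    """Return True if the disk00 section contains a 'System use:' line."""
    current = None
    for line in output.splitlines():
        text = line.strip()
        header = DISK_HEADER.match(text)
        if header:
            current = header.group(1)      # track which disk section we are in
        elif current == "00" and text.startswith("System use:"):
            return True
    return False


good = """disk00: Normal        (h00 c00 i00 l00)    17357MB( 17.0GB)
    System use:               5119MB(  5.0GB)
    FREE:                    12237MB( 12.0GB)
"""

bad = """disk00: Normal          (h00 c00 i00 l00)    17499MB( 17.1GB)
FREE:                    17499MB( 17.1GB)
"""

print(disk00_has_system_use(good), disk00_has_system_use(bad))
```

A False result, combined with no hardware problem in the previous steps, corresponds to the case where this document directs you to Step 11.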

Step 9

Turn off the power to the CE. If the disk drive is easily accessible, take out the disk drive and re-insert it. Drives are easily accessible on the Robin2 and Lightning hardware families, but not on the Opal or Thunder hardware families. Ensure that the disk drive connection is good. Return to the step you completed before this step, and repeat the test. If the hardware problem persists, contact Cisco Support to replace the disk drive or the CE.

Step 10

Install the replacement disk drive. Go to Step 11 if the replacement disk is disk00. Otherwise, go to Step 14.

Step 11

If disk00 has a software problem, issue the disk recover command to re-manufacture disk00. A warning prompt appears.

Here is some sample output for ACNS 4.2:

ACNS42#disk recover
This will erase everything on disk00. Are you sure? [no]yes
System file systems appear to have been installed.
Please verify your software installation with 'show flash'
and install a new image if necessary.

Here is some sample output for ACNS 5.x:

ACNS5#disk recover
This will erase everything on disk00. Are you sure? [no]yes
System file systems appear to have been installed.
Please verify your software installation with 'show flash'
and install a new image if necessary.

If this step is successful, go to Step 13. Otherwise, proceed with Step 12.

Step 12

The disk recover operation in Step 11 can fail if applications or the swap partition are using part of disk00. In that case, you must use the disk erase command to clear the partitions. This command is similar to the first part of the disk recover command with a force option. A similar warning appears.

Here is some sample output for ACNS 4.2:

ACNS42#disk erase
This will erase everything on disk00. Are you sure? [no]yes
disk00 partition table erased.  Will take effect after reboot.
ACNS42#reload
Proceed with reload?[confirm]
Shutting down all services, will timeout in 15 minutes.

Here is some sample output for ACNS 5.x:

ACNS5#disk erase
This will erase everything on disk00. Are you sure? [no]yes
disk00 partition table erased.  You need to reload the CE now!!!
ACNS5#reload
Proceed with reload?[confirm]
Shutting down all services, will timeout in 15 minutes.

Warning: This operation is destructive. The CE becomes unstable after this step. Reload the CE immediately. Go to Step 11 to issue the disk recover command again after the CE is back online.

Step 13

Install the disk software. Disk00 has been re-manufactured, so the disk portion of the software must be re-installed. Follow the standard software installation procedure. Usually, you can do so through the Content Distribution Manager (CDM) interface, or through the CLI, for example, with the copy ftp install command or the copy http install command.

Here is a sample ACNS 4.2 command:

ACNS42#copy ftp install server path ACNS-4.2.9-K9.bin

Here is a sample ACNS 5.X command:

ACNS5#copy ftp install server path ACNS-5.1.0-K9.bin

After this step, go to Step 14 or Step 15, on the basis of your requirement.

Step 14

If the newly replaced disk drive is not disk00, you can:

  • Issue the disk add command to add a new disk drive.

    OR

  • Issue the disk config command to re-configure all the drives on the CE.

Note: The disk config command erases all contents in SYSFS, CFS, and MEDIAFS. Contents in CDNFS are preserved.

Here is a sample ACNS 4.2 command:

ACNS42#disk config sysfs 5GB ecdnfs remaining
Disk configured successfully.
New configuration will take effect after reload.
Please remove this device from the ECDN CDM (if any) before reboot this device,
as this device's configuration will be stale due to disk repartition.
ACNS42#reload

Here is a sample ACNS 5.x command:

ACNS5#disk config sysfs 10% cfs 2GB cdnfs remaining
Disk configured successfully.
New configuration will take effect after reload.
ACNS5#reload

Step 15

Check whether the CE is back to normal operation. Contact the Cisco Technical Assistance Center (TAC) if the problem persists.

Hardware Replacement

If you require a hardware replacement, you need to open a service request with Cisco TAC. Cisco TAC requires the information in this checklist before a replacement can be processed.

Checklist for Hardware Replacement
Correct product ID, serial number, hardware part number and ACNS version of the failed boxes.
What is being replaced?
Why was the part replaced? Include personal assessment.
Physical set-up (topology) where the current failure occurred.
If console or telnet access is available, please provide the output of these show commands and logs:
  • show tech support (which includes the output of the show running config command)
  • Information in these logs, which you can obtain through FTP:
    • From the CE:
      • /local/local1/syslog.txt
      • /local/local1/errorlog/ There are many error logs in this directory. On the basis of the failure, send the appropriate logs. For example, if there was an issue with distribution, collect dist*.* under this folder.
      • /local/local1/servicelog/ There are many service logs in this directory. On the basis of the service that failed, you must send the appropriate logs. For example, if there was an issue with wmt, collect wmt*.* under this folder. It is a good idea to send cms_ce_start*.* for any service failure.
    • From the CDM: /local/local1/servicelog/ From the CDM, capture the cms communication of the CE with CDM to see if the CE has logged any errors to the CDM. cms_cdm_start*.* is necessary. Consider sending cms*.* from this location.
  • Screen capture at bootup of the system.
Was this device staged at a staging facility before deployment at current location?
Did you observe a similar failure on another device received at the same time?
What changes were made to the system in the last 15 days, including infrastructure changes?
Is the problem intermittent? If yes, were you able to reproduce the problem? State the interval.
Is the problem deterministic? If yes, describe how to re-create the problem.
What activity was in progress on the system at the time of the failure?
Was the software installed or removed?
Was the traffic heavy or light? Or was traffic absent?
Did you make any new configuration changes?
Did you face any environmental issues before the current failure? Here is a list of such failures that you must look for:
  • Power outage
  • Air Conditioning failure
  • Other devices at the same physical location: Do they work fine?
  • The box chassis: Does it get overheated?
  • Mechanical noise

Failure Categories

At this point, if you determine that the problem is definitely a hardware failure that requires replacement, try to classify the failure into one of these categories, and capture the additional information for that failure category:

  1. Cannot boot

    Check whether the system was DOA (Dead On Arrival). If the system was working for some time, but is unable to boot now, answer these questions:

    • Did this machine work earlier?

    • If not, did the machine ever work?

    • If yes, which operational sequence led to the "cannot boot" situation?

    • How long did the machine work before the failure at the site?

    Capture the console output during boot attempt.

  2. Bad Hard Drive

    Check whether a hard drive in the system is faulty. If you identify the issue to be a bad hard drive, answer these questions:

    • How long was this system in operation?

    • What is the usage pattern of this system? (24x7 traffic?)

    • Was there an unusually high traffic before the hard drive failed?

    Capture these outputs:

    • The output that reported the drive to be bad.

    • The logs that report the drive as bad.

    • The show hardware command output.

    • The show tech support command output.

  3. Bad Power Supply

    If the power supply in the unit is faulty, and the system does not power up, answer these questions:

    • Did this system work before?

    • If not, did the system ever work?

  4. Dead on Arrival (DOA)

    If the system arrived in dead state and is unable to power up or boot, check whether this is the first attempt to turn on the system.

  5. Software

    A hardware replacement is unlikely to solve a software problem. However, if you think a hardware replacement is necessary, you must indicate why you think hardware replacement can solve the software problem.

  6. Duplicate

    This category captures the RMA of the second optional disk in the CE-510 and CE-510A. If this issue is a duplicate, answer these questions:

    • How long was this system in operation?

    • Did the system work before the failure at the site?

    • What is the usage pattern of this system? (24x7 traffic?)

  7. Other

    Any other failures not captured so far.

  8. Not Enough Information (NEI)

    Use this category only in the rare circumstance that the information available is not adequate to categorize the issue more specifically.

After Replacement

After hardware replacement, the Cisco TAC follows up with you to obtain this information:

  • What specific corrective actions did you take with the device?

  • What was the result of each action? For example, did a hard reboot result in a particular error message during bootup? Or, did you attach different Ethernet cables to the same port, and try different ports on the switch, without the Ethernet port on the CE ever showing a link light?

  • If you made multiple changes, what solved the problem ultimately?
