Cisco Wide Area File Services Software (WAFS)

Field Notice: FN - 62580 - WAE-511/512-K9 with Dual 250 GB Drives Will Be Slow or Non-responsive Upon Reload - Software Upgrade Recommended


March 16, 2007

NOTICE:

THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.


Products Affected

Products Affected          Comments
WAE 500 - WAE-511-K9       WAFS versions 3.0.7 or 3.0.9 or WAAS versions 4.0.1 or 4.0.3
WAE 500 - WAE-512-K9       WAFS versions 3.0.7 or 3.0.9 or WAAS versions 4.0.1 or 4.0.3

Problem Description

WAE-511-K9 and WAE-512-K9 devices with dual 250 GB drives running WAFS version 3.0.7 or 3.0.9, or WAAS version 4.0.1 or 4.0.3, will be slow or non-responsive upon reload. The problem is attributed to the RAID resynchronization that runs on the dual 250 GB drives after a reload.
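The affected combination described above can be expressed as a small inventory check. The following Python sketch is illustrative only and is not a Cisco tool. The function name is_affected and its parameters are hypothetical, and it assumes the software version is supplied as a plain three-part string such as "4.0.3" (a version string that carries a build suffix would need to be trimmed first).

```python
# Affected models and software versions, as listed in this field notice.
AFFECTED_MODELS = {"WAE-511-K9", "WAE-512-K9"}
AFFECTED_VERSIONS = {
    "WAFS": {"3.0.7", "3.0.9"},
    "WAAS": {"4.0.1", "4.0.3"},
}


def is_affected(model: str, product: str, version: str, dual_250gb: bool) -> bool:
    """Return True if the device matches the affected combination.

    Hypothetical helper: the arguments are assumed to come from your own
    inventory records, not from any Cisco API.
    """
    return (
        dual_250gb
        and model.upper() in AFFECTED_MODELS
        and version in AFFECTED_VERSIONS.get(product.upper(), set())
    )


if __name__ == "__main__":
    print(is_affected("WAE-512-K9", "WAAS", "4.0.3", dual_250gb=True))
    print(is_affected("WAE-512-K9", "WAAS", "4.0.7", dual_250gb=True))
```

A device that is flagged by a check like this is a candidate for the upgrade described in the Workaround/Solution section.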

Background

RAID Synchronization

This issue was seen when the customer powered up a new WAE-511-K9 or WAE-512-K9 with dual 250 GB drives running WAFS versions 3.0.7 or 3.0.9 or WAAS versions 4.0.1 or 4.0.3. The setup CLI was run to configure basic information and the WAE was reloaded. Once the WAE came up, three of the RAID partitions (SYSFS, PRINTSPOOL and WAFSFS) were in rebuilding mode. CPU utilization looked normal, but the disk I/O was consumed by the RAID rebuild. This made the WAE appear to be non-responsive.

Testing run with different disk drive sizes determined that this issue was specific to the WAE-511-K9 and WAE-512-K9 with 2x250GB disk drives installed only. Testing also determined the issue was not consistently reproducible.

During RAID-1 resync, many processes get stuck in the uninterruptible disk wait state (D in ps), resulting in poor response time.
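To illustrate what the D state looks like in practice, the following Python sketch lists the rows of a captured ps listing whose state field begins with D. It assumes a three-column listing (PID, STAT, COMMAND) such as a generic Linux system produces with ps -eo pid,stat,comm. Whether such a listing can be obtained from a WAE is not covered by this notice, and the sample rows are invented for the example.

```python
def uninterruptible_processes(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for rows whose STAT field starts with 'D'."""
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)  # PID, STAT, rest of line as COMMAND
        if len(fields) == 3 and fields[1].startswith("D"):
            stuck.append((int(fields[0]), fields[2]))
    return stuck


# Hypothetical sample listing, not captured from a real device.
SAMPLE = """\
  PID STAT COMMAND
    1 Ss   init
  412 D    kjournald
  977 D+   cp
 1203 S    sshd
"""

if __name__ == "__main__":
    for pid, command in uninterruptible_processes(SAMPLE):
        print(pid, command)
```

A large and persistent set of D-state processes during a resync is consistent with the slow response described above.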

This issue is seen in WAFS and WAAS only. The WAFS/WAAS reload script (ruby_reload.sh) fails to shut down the MD (software RAID) devices at reload, which causes a resync on every reboot.

Caution: Do not reboot your WAE while RAID pairs are rebuilding.

You must make sure that all RAID pairs are done rebuilding before you reboot your WAE device. RAID pairs will rebuild on the next reboot after you enable WAFS/WAAS core or edge services, use the restore factory-default command, replace or add a hard disk drive, delete disk partitions, or reinstall WAFS/WAAS from the bootable recovery CD-ROM.

Use the show disk details EXEC command to view the status of the drives and check if the RAID pairs are in "NORMAL OPERATION" or in "REBUILDING" status.

When you see that RAID is rebuilding, you must let it complete that rebuild process. This rebuild process can take up to five hours. If you reboot while the device is rebuilding, you risk corrupting the file system.
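The status check described above can be sketched as a simple text classification. The Python example below looks only for the two status strings quoted in this notice ("REBUILDING" and "NORMAL OPERATION") in text that is assumed to have been captured from the show disk details command. The function name raid_state is hypothetical, and the sample lines are invented because this notice does not reproduce the real output layout.

```python
def raid_state(show_disk_details_output: str) -> str:
    """Classify captured 'show disk details' text.

    Returns "rebuilding" if any RAID pair reports REBUILDING,
    "normal" if only NORMAL OPERATION is reported, else "unknown".
    """
    text = show_disk_details_output.upper()
    if "REBUILDING" in text:
        return "rebuilding"  # do not reboot yet
    if "NORMAL OPERATION" in text:
        return "normal"
    return "unknown"  # status strings not found; inspect the output manually


# Hypothetical sample text; the real command output is laid out differently.
SAMPLE = "pair A: NORMAL OPERATION\npair B: REBUILDING\n"

if __name__ == "__main__":
    print(raid_state(SAMPLE))
```

Treating a single rebuilding pair as "rebuilding" for the whole device is deliberate, since the caution applies until every pair has finished.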

If you fail to wait for the RAID pairs to complete the rebuild process before you reboot the device, you might see the following symptoms indicating that you have a problem:

The device is offline in the Central Manager GUI.

CMS cannot be loaded.

Error messages say that the file system is "read-only."

The syslog contains errors such as Aborting journal on device md2, Journal commit I/O error, Journal has aborted, and ext3_readdir: bad entry in directory.

Other strange behavior related to disk operations or the inability to perform them.

If you encounter any of these symptoms and have reason to believe that RAID synchronization was occurring during a reboot, contact Cisco TAC for assistance. Your Cisco TAC representative can run a diagnostic script on your WAE and perform any recovery procedures that might be needed. You can also find additional information on the WAAS 4.0.7 Release Notes page.

Problem Symptoms

WAE-511-K9 and WAE-512-K9 devices with dual 250 GB drives running WAFS versions 3.0.7 or 3.0.9 or WAAS versions 4.0.1 or 4.0.3 application software will be slow or non-responsive upon reload.

If you fail to wait for the RAID pairs to complete the rebuild process before you reboot the device, you might see the following symptoms indicating that you have a problem:

The device is offline in the Central Manager GUI.

CMS cannot be loaded.

Error messages say that the file system is "read-only."

The syslog contains errors such as Aborting journal on device md2, Journal commit I/O error, Journal has aborted, and ext3_readdir: bad entry in directory.

Other strange behavior related to disk operations or the inability to perform them.

If you encounter any of these symptoms and have reason to believe that RAID synchronization was occurring during a reboot, contact Cisco TAC for assistance. Your Cisco TAC representative can run a diagnostic script on your WAE and perform any recovery procedures that might be needed.
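If syslog messages from the WAE are collected on a remote server, the error signatures listed above can be searched for with a short script. The Python sketch below is an illustration under that assumption. The function name find_fs_errors is hypothetical, the sample lines are invented, and the first signature is matched without the device name so that devices other than md2 are also caught.

```python
# Error signatures quoted in this field notice.
SIGNATURES = (
    "Aborting journal on device",  # the notice shows "... on device md2"
    "Journal commit I/O error",
    "Journal has aborted",
    "ext3_readdir: bad entry in directory",
)


def find_fs_errors(lines):
    """Return (line_number, signature) for each line that contains a signature."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for signature in SIGNATURES:
            if signature in line:
                hits.append((number, signature))
                break  # report each line once
    return hits


# Hypothetical syslog lines for demonstration only.
SAMPLE = [
    "kernel: Aborting journal on device md2.",
    "kernel: eth0: link up",
    "kernel: ext3_readdir: bad entry in directory #2",
]

if __name__ == "__main__":
    for number, signature in find_fs_errors(SAMPLE):
        print(number, signature)
```

A match is only an indication that the file system may be damaged. Recovery should still go through Cisco TAC as described above.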

Workaround/Solution

RAID Synchronization

Caution: Do not reboot your WAE while RAID pairs are rebuilding.

Before you upgrade your WAE, you must run a script (the WAAS disk check tool) that checks the file system for errors that can result from a RAID synchronization failure. For more information about RAID synchronization, see About RAID Synchronization and File System Errors.

You can obtain the WAAS disk check tool from the Cisco WAAS 4.0 Software Download (registered customers only) page.

You must make sure that all RAID pairs are done rebuilding before you reboot your WAE device. RAID pairs will rebuild on the next reboot after you enable WAFS core or edge services, use the restore factory-default command, replace or add a hard disk drive, delete disk partitions, or reinstall WAAS from the bootable recovery CD-ROM.

Use the show disk details EXEC command to view the status of the drives and check if the RAID pairs are in "NORMAL OPERATION" or in "REBUILDING" status.

When you see that RAID is rebuilding, you must let it complete that rebuild process. This rebuild process can take up to five hours. If you reboot while the device is rebuilding, you risk corrupting the file system.

Upgrade to WAFS 3.0.11, which is currently available on CCO.

Upgrade to WAAS 4.0.7, which is currently available on CCO.

When you run the WAAS disk check tool, you will be logged out of the device. The device automatically reboots after it has completed checking the file system. Because this operation results in a reboot, we recommend that you perform this operation after normal business hours.

Copy the script to your WAE device by using the copy ftp install command.

WAE# copy ftp install disk_check.sh

Run the script from the CLI, as shown in the following example:

WAE# script execute disk_check.sh 
This script will check if there is any file system issue on the attached disks 
Activating the script will result in: 
Stopping all services. This will log you out. 
Perform file system check for few minutes. 
and record the result in the following files: 
/local1/disk_status.txt - result summary 
/local1/disk_check_log.txt - detailed log 
System reboot 
If the system doesn't reboot in 10 minutes, please re-login and check the result files. 
Continue?[yes/no] yes 
Please disk_status.txt after reboot for result summary 
umount: /state: device is busy 
umount: /local/lPAM_unix[26162]: ### pam_unix: pam_sm_close_session (su) session closed 
for user root 
waitpid returns error: No child processes 
No child alive.

After the device reboots and you log in, locate and open the following two files to view the file system status:

disk_status.txt - Lists each file system and shows if it is "OK," or if it contains an error that requires attention.

disk_check_log.txt - Contains a detailed log for each file system checked.

If no repair is needed, then each file system will be listed as "OK," as shown in the following example:

WAE# type disk_status.txt 
Thu Feb 1 00:40:01 UTC 2007 
device /dev/md1 (/swstore) is OK 
device /dev/md0 (/sw) is OK 
device /dev/md2 (/state) is OK 
device /dev/md6 (/local/local1/spool) is OK 
device /dev/md5 (/local/local1) is OK 
device /dev/md4 (/disk00-04) is OK

If any file system contains errors, the disk_status.txt file instructs you to repair it.
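When disk_status.txt has been copied off several devices, the per-device lines can be checked with a short script. The Python sketch below parses lines of the form shown in the example above ("device /dev/md1 (/swstore) is OK") and reports any device line whose status is not "is OK". The wording that the tool writes for a failing file system is not shown in this notice, so the failing line in the test data is an assumption, as is the function name.

```python
import re

# Matches lines such as: device /dev/md1 (/swstore) is OK
DEVICE_LINE = re.compile(r"^device\s+(\S+)\s+\((\S+)\)\s+(.*)$")


def filesystems_needing_attention(status_text: str):
    """Return (device, mount_point, status) for every device line not marked OK."""
    flagged = []
    for line in status_text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if match and match.group(3).strip() != "is OK":
            flagged.append((match.group(1), match.group(2), match.group(3).strip()))
    return flagged


# Taken from the example in this notice.
HEALTHY = """\
Thu Feb 1 00:40:01 UTC 2007
device /dev/md1 (/swstore) is OK
device /dev/md0 (/sw) is OK
device /dev/md2 (/state) is OK
device /dev/md6 (/local/local1/spool) is OK
device /dev/md5 (/local/local1) is OK
device /dev/md4 (/disk00-04) is OK
"""

if __name__ == "__main__":
    print(filesystems_needing_attention(HEALTHY))
```

An empty result corresponds to the "no repair is needed" case. Anything else should be handled by following the instructions in the file itself.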

If an upgrade cannot be performed immediately, the customer should reload the system only after the RAID resync is complete. RAID resync status can be checked in the output of the show disk details command.
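The "wait for resync, then reload" rule can be sketched as a polling loop. The Python example below is a minimal illustration and not a supported procedure. The callable fetch_output is a hypothetical stand-in for whatever mechanism you use to capture show disk details output, and the six-hour default timeout is an assumption that adds a margin to the "up to five hours" figure given in this notice.

```python
import time
from typing import Callable


def wait_for_raid_rebuild(
    fetch_output: Callable[[], str],
    poll_seconds: float = 300.0,
    timeout_seconds: float = 6 * 3600,  # assumed margin over the five-hour figure
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until no RAID pair is rebuilding; return False on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        text = fetch_output().upper()
        # Done only if normal status is reported and nothing is rebuilding.
        if "NORMAL OPERATION" in text and "REBUILDING" not in text:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

A caller would proceed with the reload only when the function returns True. A False result means the rebuild did not finish within the timeout and the device should not be rebooted.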

DDTS

To follow the bug ID link below and see detailed bug information, you must be a registered user and you must be logged in.

DDTS                                       Description
CSCsg26644 (registered customers only)     WAE-512 with Dual drives hangs on reload

Revision History

Revision     Date            Comment
1.0          16-MAR-2007     Initial Public Release

For More Information

If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC) by one of the following methods:

Receive Email Notification For New Field Notices

Product Alert Tool - Set up a profile to receive email updates about reliability, safety, network security, and end-of-sale issues for the Cisco products you specify.