This document describes how to replace disks on Cisco COS (Cloud Object Storage) running on the Content Delivery Engine 470 (CDE470), and the minimum guidelines required to minimize the risk of content loss or corruption.
It is important to realize that each disk bay on the Cisco CDE470 chassis houses 2 disks, not 1.
This means that if one of the two disks in a bay is broken, both disks must be taken offline first to ensure no data corruption happens during the disk replacement.
The Cisco CDE470 chassis holds 72 disks of part number CDE4-HDD-SAS-4T=, providing a total storage capacity of 288 Terabytes.
This image shows the disk layout numbering of the front bays (from the Cloud Object Storage 2.1.1 manual):
In this example, disk ids 34 and 35 are broken. Replacing these 2 disks means offlining 4 disks: to replace disk 34, you need to offline csd33 and csd34; to replace disk 35, you need to offline csd35 and csd36.
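The pairing above can be captured in a small helper. This is a sketch, not part of COS: it assumes the pairing seen in the example (csd33/csd34 share one bay, csd35/csd36 the next), i.e. an odd disk id pairs with the next even id. Verify this against the front-bay layout of your chassis before relying on it.

```shell
#!/bin/sh
# bay_partner: print the id of the disk sharing the bay with the given
# disk id, assuming odd ids pair with the next even id (csd33/csd34,
# csd35/csd36, ...). This pairing is inferred from the example above.
bay_partner() {
    id="$1"
    if [ $((id % 2)) -eq 1 ]; then
        echo $((id + 1))    # odd id: partner is the next (even) id
    else
        echo $((id - 1))    # even id: partner is the previous (odd) id
    fi
}
```

For example, `bay_partner 34` prints 33, and `bay_partner 35` prints 36, matching the pairs named above.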
Offlining a Disk
Offlining means taking the disk out of the active polling loop in the Cisco COS kernel module. This module continuously scans and maintains the disks it stores content on and reads content from. Respect 30 seconds in between offlining 2 different disks.
How to Offline a Disk
The command below uses disk 12 as an example; for the scenario above, use csd34 and csd35 instead:
[root@COS cdd]# echo csd12 > /proc/cds/cdd/remove_device
Result: You will get this feedback:
[root@COS cdd]#
2015 Jul 7 15:55:13 COS Disk device 12 has been removed from the file system
2015 Jul 7 15:55:13 COS System is running with 71 drives
<press enter if the prompt does not return>
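The offline step can be wrapped in a small function so the 30-second wait is never forgotten. This is a sketch assuming the `/proc` interface shown above; `CDD_PROC` and `DISK_WAIT` are knobs added here (not part of COS) so the sketch can be exercised against a scratch directory instead of the live `/proc` path.

```shell
#!/bin/sh
# offline_disk: remove one disk (by numeric id) from the active pool
# via the remove_device interface shown in this guide, then wait.
CDD_PROC="${CDD_PROC:-/proc/cds/cdd}"
DISK_WAIT="${DISK_WAIT:-30}"

offline_disk() {
    echo "csd$1" > "$CDD_PROC/remove_device"
    sleep "$DISK_WAIT"    # the guide asks for 30 s between two offlines
}
```

Offlining a full bay is then two calls, e.g. `offline_disk 33; offline_disk 34`.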
Replace the faulty disk and reseat the dual disk bay. The lower csd<id> sits in the front of the bay, the higher csd<id> in the back.
Reactivating a Replaced Disk
Once the hardware has been replaced, you can present the new disk to the cserver kernel module again so it is taken back into the active disk pool (again an example with disk 12; for the scenario above, use csd34 and csd35). Respect 30 seconds in between reinitializing 2 different disks.
[root@COS cdd]# echo csd12 > /proc/cds/cdd/make_well
Result: You will see feedback like this:
[root@COS cdd]#
2015 Jul 7 16:01:20 COS Found disk device 12
2015 Jul 7 16:01:20 COS System is running with 72 drives
<press enter if the prompt does not return>
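The reactivation step mirrors the offline step. As before, this is only a sketch of the `make_well` interface described above, with `CDD_PROC` and `DISK_WAIT` as hypothetical knobs added for testing against a scratch directory:

```shell
#!/bin/sh
# reactivate_disk: present one disk (by numeric id) to the cserver
# kernel module again via the make_well interface, then wait.
CDD_PROC="${CDD_PROC:-/proc/cds/cdd}"
DISK_WAIT="${DISK_WAIT:-30}"

reactivate_disk() {
    echo "csd$1" > "$CDD_PROC/make_well"
    sleep "$DISK_WAIT"    # 30 s between reinitializing two disks
}
```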
Note: When the Cisco CDE470 chassis is rebooted and the kernel module (cserver) starts while the disk in question was already broken and thus not recognized as a valid disk, no index for that disk is created in the folder structure /proc/cds/cdd/disks/. In that case the kernel never initialized the disk, and as a consequence the tools for offlining or re-enabling disks are not usable for it. The procedure in this case is to just replace the disk (after offlining its neighbor in the same bay).
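Before starting, you can check whether the kernel actually created an index for the disk under the disks/ folder mentioned in the note above. This sketch assumes the entry is named `csd<id>`; the exact entry name is an assumption, so adjust it to what your system shows.

```shell
#!/bin/sh
# disk_indexed: succeed if the kernel created an index entry for the
# disk under /proc/cds/cdd/disks/ (CDD_PROC is overridable for testing).
# If this fails, remove_device/make_well cannot address the disk.
CDD_PROC="${CDD_PROC:-/proc/cds/cdd}"

disk_indexed() {
    [ -e "$CDD_PROC/disks/csd$1" ]
}
```

Usage: `disk_indexed 33 && echo "csd33 is indexed" || echo "csd33 was never initialized"`.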
High Level Steps
In a nutshell, the steps to take for disk 33:
1. Log in on COS as root.
2. echo csd33 > /proc/cds/cdd/remove_device
3. Wait 30s. Note: you should receive feedback for the command in Step 2, as shown earlier.
4. echo csd34 > /proc/cds/cdd/remove_device
5. Wait 30s. Note: you should receive feedback for the command in Step 4.
6. Pull out the dual disk bay, replace the failing disk, and reseat the bay.
7. echo csd33 > /proc/cds/cdd/make_well
8. Wait 30s. Note: you should receive feedback for the command in Step 7.
9. echo csd34 > /proc/cds/cdd/make_well
10. Wait 30s. Note: you should receive feedback for the command in Step 9.
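The high-level steps can be summarized as a dry-run checklist generator: given the two csd ids that share a bay, it prints the exact command sequence an operator would run. This is a hypothetical helper (`print_replacement_plan` is not a COS tool), and the ids you pass in must match the bay pairing on your chassis.

```shell
#!/bin/sh
# print_replacement_plan: print, without executing anything, the full
# command sequence for replacing a disk, given the two csd ids that
# share the bay (e.g. csd33 csd34).
print_replacement_plan() {
    first="$1"; second="$2"
    echo "echo $first > /proc/cds/cdd/remove_device"
    echo "sleep 30"
    echo "echo $second > /proc/cds/cdd/remove_device"
    echo "sleep 30"
    echo "# pull the bay, replace the failing disk, reseat the bay"
    echo "echo $first > /proc/cds/cdd/make_well"
    echo "sleep 30"
    echo "echo $second > /proc/cds/cdd/make_well"
    echo "sleep 30"
}
```

For example, `print_replacement_plan csd33 csd34` prints the sequence for the disk 33 scenario above, which can be reviewed before being run as root on the COS node.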