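Introduction
This document describes a specific hardware problem on the Fabric and Storage Card (FSC) of the Cisco Aggregated Services Router (ASR) 5500.
Problem: Customer Reports a Faulty FSC 16 with a Degraded HD RAID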
******** Show alarm outstanding verbose *******
Severity Object     Timestamp                          Alarm ID
-------- ---------- ---------------------------------- ---------------------
Alarm Details
------------------------------------------------------------------------------
Minor    Card 14    Sunday August 21 07:16:34 A        3610743104839221248
The Fabric & 2x200GB Storage Card in slot 14 is a single point of failure. Another Fabric & 2x200GB Storage Card of the same type is needed.
Minor    Card 15    Sunday August 21 07:16:34 A        3610743104839221249
The Fabric & 2x200GB Storage Card in slot 15 is a single point of failure. Another Fabric & 2x200GB Storage Card of the same type is needed.
Minor    Card 17    Sunday August 21 07:16:34 A        3610743104839221250
The Fabric & 2x200GB Storage Card in slot 17 is a single point of failure. Another Fabric & 2x200GB Storage Card of the same type is needed.
******** show card table all *******
Slot         Card Type                               Oper State     SPOF  Attach
-----------  --------------------------------------  -------------  ----  ------
 1: DPC      Data Processing Card                    Active         No
 2: DPC      Data Processing Card                    Active         No
 3: DPC      Data Processing Card                    Active         No
 4: DPC      Data Processing Card                    Active         No
 5: MMIO     Management & 20x10Gb I/O Card           Active         No
 6: MMIO     Management & 20x10Gb I/O Card           Standby        -
 7: DPC      Data Processing Card                    Active         No
 8: DPC      Data Processing Card                    Active         No
 9: DPC      Data Processing Card                    Active         No
10: DPC      Data Processing Card                    Standby        -
11: SSC      System Status Card                      Active         No
12: SSC      System Status Card                      Active         No
13: FSC      None                                    -              -
14: FSC      Fabric & 2x200GB Storage Card           Active         Yes
15: FSC      Fabric & 2x200GB Storage Card           Active         Yes
16: FSC      Fabric & 2x200GB Storage Card           Active         No
17: FSC      Fabric & 2x200GB Storage Card           Active         Yes
******** show hd raid verbose *******
HD RAID:
  State            : Available (active)
  Degraded         : Yes <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  UUID             : 59a7ebf0:7f798af6:68869614:3210b2c6
  Size             : 1.2TB (1200000073728 bytes)
  Action           : Idle
  Card 16
    State          : Faulty card <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    Description    : FSC16 SAD17160089
    Size           : 400GB (400096755712 bytes)
    Disk hd16a
      State          : In-sync component
      Created        : Fri May 23 23:05:53 2014
      Updated        : Fri May 23 23:05:53 2014
      Events         : 0
      Model          : STEC Z16IZF2E-200UCU E4TC
      Serial Number  : STM000171E75
      Size           : 200GB (200049647616 bytes)
    Disk hd16b
      State          : In-sync component
      Created        : Fri May 23 23:05:53 2014
      Updated        : Fri May 23 23:05:53 2014
      Events         : 0
      Model          : STEC Z16IZF2E-200UCU E4TC
      Serial Number  : STM000171E8B
      Size           : 200GB (200049647616 bytes)
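When many chassis need to be checked, the two fields of interest in the RAID output above (the Degraded flag and the per-card State) can be pulled out with a short script. The following is a minimal sketch in Python, not a Cisco tool: the function name raid_summary is hypothetical, and the regular expressions are inferred only from the sample output shown in this document, so other StarOS releases may format these fields differently.

```python
import re

def raid_summary(raid_text):
    """Return (degraded, {slot: card_state}) from 'show hd raid verbose' text.

    Assumption: field names match the sample output in this document.
    degraded is True/False, or None if the Degraded field is not found.
    """
    degraded = re.search(r"Degraded\s*:\s*(Yes|No)", raid_text)
    cards = {
        int(m.group(1)): m.group(2)
        for m in re.finditer(r"Card (\d+)\s+State\s*:\s*(\S+ card)", raid_text)
    }
    return (degraded.group(1) == "Yes" if degraded else None), cards

# Excerpt taken from the output in this document.
sample = """HD RAID:
  State : Available (active)
  Degraded : Yes
  Card 16
    State : Faulty card
    Description : FSC16 SAD17160089
"""

print(raid_summary(sample))
```

For the excerpt used here, the script reports a degraded array and slot 16 in the "Faulty card" state, which matches the two lines flagged in the CLI output.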
When you begin troubleshooting with the syslog, these errors are reported:
[local-60sec34.188] [hdctrl 132011 critical] [5/0/7135 <hdctrl:0> rl_fsm_mirror.c:5938] [software internal system critical-info syslog] hd16, FSC16 SAD17160089 failed from RAID 59a7ebf0:7f798af6:68869614:3210b2c6, MIO5 SAD1716021N.
[local-60sec34.188] [hdctrl 132016 error] [5/0/7135 <hdctrl:0> rl_fsm_mirror.c:3399] [software internal system critical-info syslog] Error detected on hd16a (STEC Z16IZF2E-200UCU E4TC STM000171E75), FSC16 SAD17160089: ioerr_cnt increased from 25 to 27
[local-60sec34.221] [alarmctrl 65201 info] [5/0/7072 <evlogd:0> alarmctrl.c:192] [software internal system critical-info syslog] Alarm condition: id 321befb92b220000 (Minor): The Fabric & 2x200GB Storage Card in slot 14 is a single point of failure. Another Fabric & 2x200GB Storage Card of the same type is needed.
[local-60sec34.222] [alarmctrl 65201 info] [5/0/7072 <evlogd:0> alarmctrl.c:192] [software internal system critical-info syslog] Alarm condition: id 321befb92b220002 (Minor): The Fabric & 2x200GB Storage Card in slot 17 is a single point of failure. Another Fabric & 2x200GB Storage Card of the same type is needed.
[local-60sec34.222] [alarmctrl 65201 info] [5/0/7072 <evlogd:0> alarmctrl.c:192] [software internal system critical-info syslog] Alarm condition: id 321befb92b220001 (Minor): The Fabric & 2x200GB Storage Card in slot 15 is a single point of failure. Another Fabric & 2x200GB Storage Card of the same type is needed.
You might see this error log:
Aug 21 07:16:34 evlogd: [local-60sec34.188] [hdctrl 132016 error] [5/0/7135 <hdctrl:0> rl_fsm_mirror.c:3399] [software internal system critical-info syslog] Error detected on hd16a (STEC Z16IZF2E-200UCU E4TC STM000171E75), FSC16 SAD17160089: ioerr_cnt increased from 25 to 27
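If the syslog is collected on a server, the ioerr_cnt increase in the hdctrl error above can be extracted automatically. This is a minimal sketch in Python under stated assumptions: the helper name find_ioerr_increases is hypothetical, and the pattern is inferred only from the single sample log line in this document, so it may need adjusting for other message variants.

```python
import re

# Assumption: the message layout matches the hdctrl 132016 sample line above.
IOERR_RE = re.compile(
    r"Error detected on (?P<disk>hd\d+[ab]) .*?"
    r"ioerr_cnt increased from (?P<old>\d+) to (?P<new>\d+)"
)

def find_ioerr_increases(log_text):
    """Return a list of (disk, old_count, new_count) for each match."""
    return [
        (m.group("disk"), int(m.group("old")), int(m.group("new")))
        for m in IOERR_RE.finditer(log_text)
    ]

# The sample line is copied from this document.
sample = (
    "Aug 21 07:16:34 evlogd: [local-60sec34.188] [hdctrl 132016 error] "
    "[5/0/7135 <hdctrl:0> rl_fsm_mirror.c:3399] "
    "[software internal system critical-info syslog] "
    "Error detected on hd16a (STEC Z16IZF2E-200UCU E4TC STM000171E75), "
    "FSC16 SAD17160089: ioerr_cnt increased from 25 to 27"
)

print(find_ioerr_increases(sample))
```

A rising counter on one disk only tells you which drive to look at next; it does not by itself identify the cause.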
In that case, it is always advisable to run a SMART test and check whether the problem is related to high temperature.
[local]# show hd smart hd16a
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-2.6.38-staros-v3-hw-64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: STEC
Product: Z16IZF2E-200UCU
Revision: E4TC
User Capacity: 200,049,647,616 bytes [200 GB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5000a72030079304
Serial number: STM000171E75
Device type: disk
Transport protocol: SAS
Local Time is: Mon Aug 22 16:49:11 2016 AST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
SS Media used endurance indicator: 0%
Current Drive Temperature: 52 C
Drive Trip Temperature: 75 C

[local]SAE-G168A-1# show hd smart hd16b
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-2.6.38-staros-v3-hw-64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: XXXX
Product: Z16IZF2E-200UCU
Revision: E4TC
User Capacity: 200,049,647,616 bytes [200 GB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x5000a7203007932c
Serial number: STM000171E8B
Device type: disk
Transport protocol: SAS
Local Time is: Mon Aug 22 16:49:21 2016 AST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
SS Media used endurance indicator: 0%
Current Drive Temperature: 50 C
Drive Trip Temperature: 75 C
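Because the SMART health status is OK and the drive temperatures are well below the trip temperature, you can conclude that the problem is not related to high temperature.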
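The temperature reasoning above amounts to comparing the current drive temperature with the trip temperature. A minimal sketch in Python follows; the function name temperature_margin is hypothetical, and the field labels are taken from the smartctl output shown in this document. What margin counts as "safe" is a judgment call that this sketch does not make.

```python
import re

def temperature_margin(smart_text):
    """Return (current_c, trip_c, margin_c) from smartctl-style text, or None.

    Assumption: the text contains the 'Current Drive Temperature' and
    'Drive Trip Temperature' fields as printed in this document.
    """
    current = re.search(r"Current Drive Temperature:\s*(\d+)\s*C", smart_text)
    trip = re.search(r"Drive Trip Temperature:\s*(\d+)\s*C", smart_text)
    if not (current and trip):
        return None
    cur_c, trip_c = int(current.group(1)), int(trip.group(1))
    return cur_c, trip_c, trip_c - cur_c

# Values copied from the hd16a output in this document.
sample = """SMART Health Status: OK
Current Drive Temperature: 52 C
Drive Trip Temperature: 75 C
"""

print(temperature_margin(sample))
```

For hd16a this gives a margin of 23 degrees C below the trip temperature, consistent with the conclusion that heat is not the cause.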
Supercapacitor Failure of a Drive on a Single FSC
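Sometimes one or both drives on a single FSC transition to an invalid partition or mirror state.
This can happen if the supercapacitor of a drive on that FSC fails. This failure cannot be recovered, because it is a permanent drive failure: either the drive or the FSC must be replaced.
Contact Cisco TAC for further assistance.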
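If the vendor-specific sense data ASC=0x80 ASCQ=0x0 is seen in the logs for any drive, that drive has a supercapacitor failure.
These logs are seen in the debug console logs of the active Management Input/Output (MIO) card.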
2016-Aug-21+07:16:34.038 card 5-cpu0: [974499.606697] sd 0:0:2:0: [sde] Unhandled sense code^M
2016-Aug-21+07:16:34.139 card 5-cpu0: [974499.611565] sd 0:0:2:0: [sde] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE^M
2016-Aug-21+07:16:34.139 card 5-cpu0: [974499.618877] sd 0:0:2:0: [sde] Sense Key : Data Protect [current] ^M
2016-Aug-21+07:16:34.139 card 5-cpu0: [974499.625214] sd 0:0:2:0: [sde] Add. Sense: Write protected^M
2016-Aug-21+07:16:34.139 card 5-cpu0: [974499.630840] sd 0:0:2:0: [sde] CDB: Write(10): 2a 00 00 00 08 08 00 00 08 00^M
2016-Aug-21+07:16:34.139 card 5-cpu0: [974499.638173] end_request: I/O error, dev sde, sector 2056^M
2016-Aug-21+07:16:34.139 card 5-cpu0: [974499.643558] md: super_written gets error=-5, uptodate=0^M
2016-Aug-21+07:16:34.139 card 5-cpu0: [974499.648872] md/raid:md0: Disk failure on md16, disabling device.^M
2016-Aug-21+07:16:34.139 card 5-cpu0: [974499.648873] md/raid:md0: Operation continuing on 3 devices.^M
2016-Aug-21+07:16:34.238 card 5-cpu0: [974499.712047] sd 0:0:2:0: [sde] Sense Key : Recovered Error [current] ^M
2016-Aug-21+07:16:34.238 card 5-cpu0: [974499.718637] sd 0:0:2:0: [sde] <<vendor>> ASC=0x80 ASCQ=0x0ASC=0x80 ASCQ=0x0^M
2016-Aug-21+07:21:37.112 card 5-cpu0: [974802.605115] sd 0:0:2:0: [sde] Sense Key : Recovered Error [current] ^M
2016-Aug-21+07:21:37.112 card 5-cpu0: [974802.611712] sd 0:0:2:0: [sde] <<vendor>> ASC=0x80 ASCQ=0x0ASC=0x80 ASCQ=0x0^M
2016-Aug-21+07:26:37.099 card 5-cpu0: [975102.612464] sd 0:0:2:0: [sde] Sense Key : Recovered Error [current] ^M
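Check for the specific error in these outputs:
- Debug console output of the active MIO card
- debug hdctrl mapping
- debug hdctrl history
Together, these confirm the supercapacitor failure.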
**** debug console card <Active MIO card> cpu 0 tail 10000 only *****
[sde] <<vendor>> ASC=0x80 ASCQ=0x0ASC=0x80 ASCQ=0x0

******** debug hdctrl history *******
Thursday March 31 02:29:03 EDT 2016
Primary HDCTRL:
2016-Mar-31+02:13:38.632 move fsm=hd15#6 src=DISK_CHECK dst=DISK_FAILED arg=disk I/O error # hdctrl/hdctrl_fsm_disk.c : 4354 @ disk_fsm_enter()
This output shows the mapping of the disks to device names:
******** debug hdctrl mapping *******
Local card (5):
Disk       Device Number SCSI    Size
---------- ------ ------ ------- ------
hd14a      sdb    8:16   0:0:0:0 186 GB
hd15a      sdd    8:48   0:0:1:0 186 GB
hd16a      sde    8:64   0:0:2:0 186 GB
hd17a      sdg    8:96   0:0:3:0 186 GB
hd14b      sda    8:0    1:0:0:0 186 GB
hd15b      sdc    8:32   1:0:1:0 186 GB
hd16b      sdf    8:80   1:0:2:0 186 GB
hd17b      sdh    8:112  1:0:3:0 186 GB
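Because the console logs name Linux block devices (for example sde) while the RAID output names disks (for example hd16a), it can help to turn the mapping table above into a lookup. The sketch below, in Python, assumes one table row per line with whitespace-separated columns in the order shown in this document; parse_hdctrl_mapping is a hypothetical helper name.

```python
def parse_hdctrl_mapping(text):
    """Return {device: disk}, e.g. {"sde": "hd16a"}, from mapping table rows.

    Assumption: data rows look like 'hd16a sde 8:64 0:0:2:0 186 GB'.
    Header, separator, and banner lines are skipped.
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if (len(fields) >= 4
                and fields[0].startswith("hd")
                and fields[1].startswith("sd")):
            mapping[fields[1]] = fields[0]
    return mapping

# Rows copied from the mapping output in this document.
sample = """Local card (5):
Disk Device Number SCSI Size
---------- ------ ------ ------- ------
hd14a sdb 8:16 0:0:0:0 186 GB
hd15a sdd 8:48 0:0:1:0 186 GB
hd16a sde 8:64 0:0:2:0 186 GB
hd17a sdg 8:96 0:0:3:0 186 GB
"""

print(parse_hdctrl_mapping(sample)["sde"])
```

The lookup is keyed by device name because that is the identifier that appears in the kernel messages on the MIO console.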
The logs below indicate a failure of the supercapacitor on the solid-state drive hd16a of the FSC in slot 16.
hd16a <==>sde
-------------
2016-Aug-21+07:21:37.112 card 5-cpu0: [974802.611712] sd 0:0:2:0: [sde] <<vendor>> ASC=0x80 ASCQ=0x0ASC=0x80 ASCQ=0x0^M
2016-Aug-21+07:26:37.099 card 5-cpu0: [975102.612464] sd 0:0:2:0: [sde] Sense Key : Recovered Error [current] ^M
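The check described in this section (find the vendor-specific sense data in the console log, then translate the device name to a disk) can be sketched as a short Python script. This is an illustration under stated assumptions, not a Cisco utility: the name suspect_disks is hypothetical, the pattern is inferred from the log lines in this document, and the device-to-disk dictionary is assumed to come from the debug hdctrl mapping output.

```python
import re

# Assumption: the signature line looks like the <<vendor>> lines shown above.
SENSE_RE = re.compile(r"\[(?P<dev>sd[a-z]+)\] <<vendor>> ASC=0x80 ASCQ=0x0")

def suspect_disks(console_text, dev_to_disk):
    """Return sorted disk names whose device logged ASC=0x80 ASCQ=0x0.

    Devices missing from dev_to_disk are reported by device name.
    """
    devices = {m.group("dev") for m in SENSE_RE.finditer(console_text)}
    return sorted(dev_to_disk.get(dev, dev) for dev in devices)

# Log line and mapping entries copied from this document.
console_sample = (
    "2016-Aug-21+07:21:37.112 card 5-cpu0: [974802.611712] sd 0:0:2:0: "
    "[sde] <<vendor>> ASC=0x80 ASCQ=0x0ASC=0x80 ASCQ=0x0^M\n"
)
mapping_sample = {"sde": "hd16a", "sdf": "hd16b"}

print(suspect_disks(console_sample, mapping_sample))
```

A match only narrows down which drive to raise with Cisco TAC; the replacement decision (drive or FSC) still follows the guidance earlier in this document.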