This document describes the next steps to remediate the following faults:
"Code" : "F0321",
"Description" : "Controller <id> is unhealthy because: Data Layer Partially Degraded Leadership",
"Dn" : "topology/pod-<POD-ID>/node-<NODE-ID>/av/node-<NODE-ID>/fault-F0321",
"Code" : "F0321",
"Description" : "Controller 3 is unhealthy because: Data Layer Partially Diverged"
"Dn" : "topology/pod-<POD-ID>/node-<NODE-ID>/av/node-<NODE-ID>/fault-F0321",
"Code" : "F0325",
"Description" : "Connectivity has been lost to the leader for some data subset(s) of a service on <node >, the service may have unexpectedly restarted or failed",
"Dn" : "topology/pod-<POD-ID>/node-<NODE-ID>/av/node-<NODE-ID>/fault-F0325",
"Code" : "F0323",
"Description" : "Lost connectivity to leader for some data subset(s) of Access <Service> on <controller >",
"Dn" : "topology/pod-<POD-ID>/node-<NODE-ID>/av/node-<NODE-ID>/fault-F0323",
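If you have an Intersight-connected ACI fabric, a service request is generated on your behalf to indicate that instances of this fault were found in that fabric.
This fault is raised when the APIC cluster is not healthy. "Data Layer Partially Diverged" is reported when any shard or replica is broken (shown as "\" in the acidiag rvread output). The fault is also raised when a replica or database is missing entirely from an APIC, which is shown as "X". All underlying issues must be resolved to restore cluster health.
If the fabric is in production, do not attempt any intrusive steps (such as shutdown, reload, or decommission) to resolve cluster issues. Collect the TS files and upload them to the TAC case to get the exact steps for recovering the APIC cluster.
The acidiag cluster command performs multiple checks, including connectivity to the APICs. Every test is expected to return OK. If any check returns something other than OK, investigate the cause.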
######## Sample output on a healthy cluster ########
apic1# acidiag cluster
Admin password:
Running...
Checking Wiring and UUID: OK
Checking AD Processes: Running
Checking All Apics in Commission State: OK
Checking All Apics in Active State: OK
Checking Fabric Nodes: OK
Checking Apic Fully-Fit: OK
Checking Shard Convergence: OK
Checking Leadership Degration: Optimal leader for all shards
Ping OOB IPs:
APIC-1: 10.197.204.149 - OK
APIC-2: 10.197.204.150 - OK
APIC-3: 10.197.204.151 - OK
Ping Infra IPs:
APIC-1: 10.0.0.1 - OK
APIC-2: 10.0.0.2 - OK
APIC-3: 10.0.0.3 - OK
Checking APIC Versions: Same (5.2(4d))
Checking SSL: OK
Full file system(s): None
Done!
######## Sample output on an unhealthy cluster ########
apic1# acidiag cluster
Admin password:
Running...
Checking Wiring and UUID: switch(302) reports apic(3) has wireIssue: unapproved-ctrlr
Checking AD Processes: Running
Checking All Apics in Commission State: OK
Checking All Apics in Active State: OK
Checking Fabric Nodes: OK
Checking Apic Fully-Fit: OK
Checking Shard Convergence: OK
Checking Leadership Degration: Non optimal leader for shards : 3:1,3:2,3:4,3:5,3:7,3:8,3:10,3:11,3:13,3:14,3:16,3:17,3:19,3:20,3:22,3:23,3:25,3:26,3:28,3:29,3:31,3:32,6:1,6:2,6:4,6:5,6:7,6:8,6:10,6:11,6:13,6:14,6:16,6:17,6:19,6:20,6:22,6:23,6:25,6:26,6:28,6:29,6:31,6:32,9:1,9:2,9:4,9:5,9:7,9:8,9:10,9:11,9:13,9:14,9:16,9:17,9:19,9:20,9:22,9:23,9:25,9:26,9:28,9:29,9:31,9:32,10:1,10:2,10:4,10:5,10:7,10:8,10:10,10:11,10:13,10:14,10:16,10:17,10:19,10:20,10:22,10:23,10:25,10:26,10:28,10:29,10:31,10:32,11:1,11:2,11:4,11:5,11:7,11:8,11:10,11:11,11:13,11:14,11:16,11:17,11:19,11:20,11:22,11:23,11:25,11:26,11:28,11:29,11:31,11:32,14:1,14:2,14:4,14:5,14:7,14:8,14:10,14:11,14:13,14:14,14:16,14:17,14:19,14:20,14:22,14:23,14:25,14:26,14:28,14:29,14:31,14:32,16:1,16:2,16:4,16:5,16:7,16:8,16:10,16:11,16:13,16:14,16:16,16:17,16:19,16:20,16:22,16:23,16:25,16:26,16:28,16:29,16:31,16:32,22:1,22:2,22:4,22:5,22:7,22:8,22:10,22:11,22:13,22:14,22:16,22:17,22:19,22:20,22:22,22:23,22:25,22:26,22:28,22:29,22:31,22:32,23:1,23:2,23:4,23:5,23:7,23:8,23:10,23:11,23:13,23:14,23:16,23:17,23:19,23:20,23:22,23:23,23:25,23:26,23:28,23:29,23:31,23:32,33:1,34:1,34:2,34:4,34:5,34:7,34:8,34:10,34:11,34:13,34:14,34:16,34:17,34:19,34:20,34:22,34:23,34:25,34:26,34:28,34:29,34:31,34:32,35:1,35:2,35:4,35:5,35:7,35:8,35:10,35:11,35:13,35:14,35:16,35:17,35:19,35:20,35:22,35:23,35:25,35:26,35:28,35:29,35:31,35:32,36:1,39:1,39:2,39:4,39:5,39:7,39:8,39:10,39:11,39:13,39:14,39:16,39:17,39:19,39:20,39:22,39:23,39:25,39:26,39:28,39:29,39:31,39:32
Ping OOB IPs:
APIC-1: 10.197.204.184 - OK
APIC-2: 10.197.204.185 - OK
APIC-3: 10.197.204.186 - OK
Ping Infra IPs:
APIC-1: 10.0.0.1 - OK
APIC-2: 10.0.0.2 - OK
APIC-3: 10.0.0.3 - OK
Checking APIC Versions: Same (5.2(3e))
Checking SSL: OK
Full file system(s): None
Done!
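The comparison between the healthy and unhealthy samples above can be sketched in Python. This is a minimal illustration, not a Cisco tool: the function name `find_cluster_issues` and the set of "healthy" result strings are assumptions derived only from the two sample outputs shown here, and other software versions may print different text.

```python
import re

# Results that the healthy sample prints for checks that do not literally say "OK".
# Assumption: taken from the sample output above, not from a product specification.
HEALTHY_VALUES = {"OK", "Running", "None", "Optimal leader for all shards"}


def find_cluster_issues(output: str) -> list[tuple[str, str]]:
    """Return (check, result) pairs whose result does not look healthy."""
    issues = []
    for raw in output.splitlines():
        line = raw.strip()
        # Ping results look like "APIC-1: 10.0.0.1 - OK".
        ping = re.match(r"^(APIC-\d+): (\S+) - (.+)$", line)
        if ping:
            if ping.group(3) != "OK":
                issues.append((f"Ping {ping.group(1)} {ping.group(2)}", ping.group(3)))
            continue
        # Check results look like "Checking <name>: <result>".
        check = re.match(r"^(Checking [^:]+|Full file system\(s\)): (.+)$", line)
        if not check:
            continue
        name, result = check.group(1), check.group(2).strip()
        # "Same (<version>)" is what the version check prints when all APICs match.
        if result in HEALTHY_VALUES or result.startswith("Same"):
            continue
        issues.append((name, result))
    return issues


if __name__ == "__main__":
    sample = (
        "Checking Wiring and UUID: switch(302) reports apic(3) has wireIssue: unapproved-ctrlr\n"
        "Checking AD Processes: Running\n"
        "Checking Shard Convergence: OK\n"
    )
    for name, result in find_cluster_issues(sample):
        print(f"{name} -> {result}")
```

Saving the command output to a file and passing its contents to the function gives a short list of the checks that need investigation, which is easier to read than the full shard list in the unhealthy sample.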
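Ensure that the APIC SSDs are healthy and that none of these faults are present in the ACI fabric: F2730, F2731, and F2732. The example below shows APIC CLI output for these faults; they can also be verified in the GUI (System > Faults).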
##### Example:
# faultRecord
ack : no
cause : equipment-wearout
changeSet : available:unspecified, blocks:unspecified, capUtilized:0, device:Solid State Device, fileSystem:/dev/sdb, firmwareVersion:Dxxxxxxx, mediaWearout:1, model:INTEL SSDSC2BB120G4, mount:/dev/sdb, name:/dev/sdb, operSt:ok, serial:ABCDxxxxxxxxxxxXYZ, used:unspecified
childAction :
code : F2730
created : 2022-01-10T03:13:08.026+00:00
delegated : no
descr : Storage unit /dev/sdb on Node 3 with hostname apic1.cisco.com mounted at /dev/sdb has 1% life remaining
dn : topology/pod-2/node-3/sys/ch/p-[/dev/sdb]-f-[/dev/sdb]/fault-F2730
domain : infra
highestSeverity : warning
lastTransition : 2022-01-10T03:13:08.026+00:00
lc : raised
occur : 1
origSeverity : warning
prevSeverity : warning
rule : eqpt-storage-wearout-warning
severity : warning
status :
subject : equipment-wearout
type : operational
# faultRecord
ack : no
cause : equipment-wearout
changeSet : available:unspecified, blocks:unspecified, capUtilized:0, device:Solid State Device, fileSystem:/dev/sdb, firmwareVersion:Dxxxxxxx, mediaWearout:1, model:INTEL SSDSC2BB120G4, mount:/dev/sdb, name:/dev/sdb, operSt:ok, serial:ABCDxxxxxxxxxxxXYZ, used:unspecified
childAction :
code : F2731
created : 2022-01-10T03:13:08.026+00:00
delegated : no
descr : Storage unit /dev/sdb on Node 3 mounted at /dev/sdb has 1% life remaining
dn : topology/pod-2/node-3/sys/ch/p-[/dev/sdb]-f-[/dev/sdb]/fault-F2731
domain : infra
highestSeverity : major
lastTransition : 2022-01-10T03:13:08.026+00:00
lc : raised
occur : 1
origSeverity : major
prevSeverity : major
rule : eqpt-storage-wearout-major
severity : major
status :
subject : equipment-wearout
type : operational
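When the fault records have been saved as text, the SSD wear-out codes can be picked out with a short script. The sketch below assumes the "key : value" layout and the "# faultRecord" header shown in the example above; the helper names `parse_records` and `ssd_faults` are hypothetical.

```python
# SSD wear-out fault codes named in this document.
SSD_FAULT_CODES = {"F2730", "F2731", "F2732"}


def parse_records(text: str) -> list[dict[str, str]]:
    """Split a 'key : value' object dump into one dict per '# <class>' header."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# "):
            # A header such as "# faultRecord" starts a new object.
            current = {"class": line[2:].strip()}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps and DNs stay intact.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


def ssd_faults(records: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only the records whose code is one of the SSD wear-out faults."""
    return [r for r in records if r.get("code") in SSD_FAULT_CODES]


if __name__ == "__main__":
    dump = "# faultRecord\ncode : F2731\nseverity : major\ndescr : example only\n"
    for fault in ssd_faults(parse_records(dump)):
        print(fault["code"], fault["severity"], fault["descr"])
```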
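Check whether all DME processes are running.
Run ps -ef | egrep "svc|nginx.bin|dhcp" (ps -aux lists the same processes with a different column layout).
The expected output is as follows: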
apic1# ps -ef | egrep "svc|nginx.bin|dhcp"
root 3063 1 5 22:08 ? 00:04:40 /mgmt//bin/nginx.bin -p /data//nginx/
root 8889 1 7 21:53 ? 00:06:43 /mgmt//bin/svc_ifc_appliancedirector.bin --x
ifc 8891 1 1 21:53 ? 00:01:29 /mgmt//bin/svc_ifc_policydist.bin --x
root 8893 1 2 21:53 ? 00:02:28 /mgmt//bin/svc_ifc_bootmgr.bin --x
ifc 8894 1 1 21:53 ? 00:01:41 /mgmt//bin/svc_ifc_vmmmgr.bin --x
ifc 8895 1 2 21:53 ? 00:02:14 /mgmt//bin/svc_ifc_topomgr.bin --x
ifc 8901 1 2 21:53 ? 00:02:22 /mgmt//bin/svc_ifc_observer.bin --x
root 8903 1 1 21:53 ? 00:01:40 /mgmt//bin/svc_ifc_plgnhandler.bin --x
ifc 8914 1 1 21:53 ? 00:01:34 /mgmt//bin/svc_ifc_domainmgr.bin --x
ifc 8915 1 2 21:53 ? 00:02:04 /mgmt//bin/svc_ifc_dbgr.bin --x
ifc 8917 1 1 21:53 ? 00:01:34 /mgmt//bin/svc_ifc_edmgr.bin --x
ifc 8918 1 1 21:53 ? 00:01:22 /mgmt//bin/svc_ifc_vtap.bin --x
ifc 8922 1 2 21:53 ? 00:02:09 /mgmt//bin/svc_ifc_eventmgr.bin --x
ifc 8925 1 3 21:53 ? 00:03:15 /mgmt//bin/svc_ifc_reader.bin --x
ifc 8929 1 1 21:53 ? 00:01:34 /mgmt//bin/svc_ifc_idmgr.bin --x
ifc 8930 1 1 21:53 ? 00:01:26 /mgmt//bin/svc_ifc_licensemgr.bin --x
ifc 8937 1 3 21:53 ? 00:03:18 /mgmt//bin/svc_ifc_policymgr.bin --x
ifc 8941 1 1 21:53 ? 00:01:34 /mgmt//bin/svc_ifc_scripthandler.bin --x
root 11157 1 1 21:54 ? 00:01:29 /mgmt//bin/dhcpd.bin -f -4 -cf /data//dhcp/dhcpd.conf -lf /data//dhcp/dhcpd.lease -pf /var/run//dhcpd.pid --no-pid bond0.3902
root 11170 1 4 21:54 ? 00:04:15 /mgmt//bin/svc_ifc_ae.bin --x
admin 17094 16553 0 23:27 pts/0 00:00:00 grep -E svc|nginx.bin|dhcp
Check for fault code F1419, which is raised for a failed DME.
apic1# show faults code F1419 history
ID : 4294971876
Description : Service policymgr failed on apic bgl-aci02-apic1 of fabric
POD02 with a hostname bgl-aci02-apic1
Severity : major
DN : subj-[topology/pod-1/node-1/sys/proc/proc-
policymgr]/fr-4294971876
Created : 2022-03-21T18:29:20.570+12:00
Code : F1419
Type : operational
Cause : service-failed
Change Set : id (Old: 5152, New: 0), maxMemAlloc (Old: 1150246912, New:
0), operState (Old: up, New: down)
Action : creation
Domain : infra
Life Cycle : soaking
Count Fault Occurred : 1
Acknowledgement Status : no
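If connectivity is lost between APICs, one possible cause is a wiring issue. The acidiag cluster command also shows wiring issues present on the links. These are all the possible wiring issues:
ctrlr-uuid-mismatch - APIC UUID mismatch (duplicate APIC ID)
fabric-domain-mismatch - Adjacent node belongs to a different fabric
wiring-mismatch - Invalid connection (leaf to leaf, spine to non-leaf, leaf fabric port to non-spine, and so on)
adjacency-not-detected - No LLDP adjacency on the fabric port
infra-vlan-mismatch - Infra VLAN mismatch between the leaf and the APIC
pod-id-mismatch - Pod ID mismatch between the APIC and the leaf
unapproved-ctrlr - SSL handshake between the APIC and the connected leaf is not complete
unapproved-serialnumber - A node was detected that is not in the APIC database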
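The wiring issue list can be turned into a small lookup that explains the "Checking Wiring and UUID" line from acidiag cluster. This is a sketch only: the line format is assumed from the single unhealthy sample in this document ("switch(302) reports apic(3) has wireIssue: unapproved-ctrlr"), the descriptions are copied from the list above, and `explain_wiring_line` is a hypothetical helper name.

```python
import re

# Wiring issue names and meanings, as listed in this document.
WIRING_ISSUES = {
    "ctrlr-uuid-mismatch": "APIC UUID mismatch (duplicate APIC ID)",
    "fabric-domain-mismatch": "Adjacent node belongs to a different fabric",
    "wiring-mismatch": "Invalid connection (leaf to leaf, spine to non-leaf, "
                       "leaf fabric port to non-spine, and so on)",
    "adjacency-not-detected": "No LLDP adjacency on the fabric port",
    "infra-vlan-mismatch": "Infra VLAN mismatch between the leaf and the APIC",
    "pod-id-mismatch": "Pod ID mismatch between the APIC and the leaf",
    "unapproved-ctrlr": "SSL handshake between the APIC and the connected leaf "
                        "is not complete",
    "unapproved-serialnumber": "A node was detected that is not in the APIC database",
}

# Assumed line format, based on the one sample in this document.
WIRE_ISSUE_RE = re.compile(r"switch\((\d+)\) reports apic\((\d+)\) has wireIssue: (\S+)")


def explain_wiring_line(line: str):
    """Return switch ID, APIC ID, issue name, and meaning, or None if no issue is reported."""
    match = WIRE_ISSUE_RE.search(line)
    if match is None:
        return None
    issue = match.group(3)
    return {
        "switch": int(match.group(1)),
        "apic": int(match.group(2)),
        "issue": issue,
        "meaning": WIRING_ISSUES.get(issue, "unknown wiring issue"),
    }


if __name__ == "__main__":
    line = ("Checking Wiring and UUID: switch(302) reports apic(3) "
            "has wireIssue: unapproved-ctrlr")
    print(explain_wiring_line(line))
```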
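If the output of the DME process status check does not match the expected output, try to start the missing DME with "acidiag start <DME>". For example, if svc_ifc_eventmgr is missing, try "acidiag start eventmgr".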
apic1# ps -aux | egrep "svc|nginx.bin|dhcp"
root 5112 7.3 0.4 1033952 323180 ? Ssl Mar10 3073:27 /mgmt//bin/svc_ifc_appliancedirector.bin --x
ifc 5117 1.7 0.6 1062664 439876 ? Ssl Mar10 720:52 /mgmt//bin/svc_ifc_topomgr.bin --x
ifc 5118 2.1 2.2 2164512 1468200 ? Ssl Mar10 884:11 /mgmt//bin/svc_ifc_policymgr.bin --x
ifc 5119 1.5 0.3 1115984 256904 ? Ssl Mar10 664:51 /mgmt//bin/svc_ifc_licensemgr.bin --x
ifc 5120 1.5 0.5 1088252 356760 ? Ssl Mar10 666:26 /mgmt//bin/svc_ifc_edmgr.bin --x
root 5121 1.6 0.6 1125948 423392 ? Ssl Mar10 698:11 /mgmt//bin/svc_ifc_bootmgr.bin --x
ifc 5123 2.3 1.2 1474388 800564 ? Ssl Mar10 994:15 /mgmt//bin/svc_ifc_eventmgr.bin --x
ifc 5126 1.5 8.2 6032524 5363184 ? Ssl Mar10 635:58 /mgmt//bin/svc_ifc_reader.bin --x
root 5130 4.6 0.6 1092480 439580 ? Ssl Mar10 1927:08 /mgmt//bin/svc_ifc_ae.bin --x
ifc 5132 1.6 0.8 1312136 567420 ? Ssl Mar10 689:43 /mgmt//bin/svc_ifc_vmmmgr.bin --x
ifc 5133 1.5 0.5 1064176 346760 ? Ssl Mar10 659:03 /mgmt//bin/svc_ifc_domainmgr.bin --x
ifc 5135 1.8 1.6 1736876 1099924 ? Ssl Mar10 770:39 /mgmt//bin/svc_ifc_observer.bin --x
root 5141 1.5 0.7 1092948 458156 ? Ssl Mar10 663:41 /mgmt//bin/svc_ifc_plgnhandler.bin --x
ifc 5146 2.0 0.6 1037676 397236 ? Ssl Mar10 857:43 /mgmt//bin/svc_ifc_idmgr.bin --x
ifc 5148 1.3 0.3 650596 222336 ? Ssl Mar10 580:25 /mgmt//bin/svc_ifc_vtap.bin --x
ifc 5160 1.6 0.6 1098280 453492 ? Ssl Mar10 669:17 /mgmt//bin/svc_ifc_scripthandler.bin --x
root 7089 1.4 0.4 856360 315016 ? Ssl Mar10 592:04 /mgmt//bin/dhcpd.bin -f -4 -cf /data//dhcp/dhcpd.conf -lf /data//dhcp/dhcpd.lease -pf /var/run//dhcpd.pid --no-pid bond0.3903
admin 29834 0.0 0.0 112800 1780 pts/1 S+ 17:22 0:00 grep -E svc|nginx.bin|dhcp
ifc 30432 1.4 0.6 894088 405968 ? Ssl Mar17 473:45 /mgmt//bin/svc_ifc_policydist.bin --x
root 31215 2.8 5.2 4503880 3397276 ? Ssl Apr05 124:08 /mgmt//bin/nginx.bin -p /data//nginx/
In the output above, svc_ifc_dbgr.bin is missing when compared with the expected output in the DME process status check. Start the process with "acidiag start dbgr".
apic1# acidiag start dbgr
apic1# ps -aux | egrep "svc|nginx.bin|dhcp"
root 5112 7.3 0.4 1033952 323240 ? Ssl Mar10 3073:43 /mgmt//bin/svc_ifc_appliancedirector.bin --x
ifc 5117 1.7 0.6 1062664 439876 ? Ssl Mar10 720:56 /mgmt//bin/svc_ifc_topomgr.bin --x
ifc 5118 2.1 2.2 2164512 1468200 ? Ssl Mar10 884:16 /mgmt//bin/svc_ifc_policymgr.bin --x
ifc 5119 1.5 0.3 1115984 256904 ? Ssl Mar10 664:55 /mgmt//bin/svc_ifc_licensemgr.bin --x
ifc 5120 1.5 0.5 1088252 356760 ? Ssl Mar10 666:30 /mgmt//bin/svc_ifc_edmgr.bin --x
root 5121 1.6 0.6 1125948 423392 ? Ssl Mar10 698:15 /mgmt//bin/svc_ifc_bootmgr.bin --x
ifc 5123 2.3 1.2 1474388 800784 ? Ssl Mar10 994:21 /mgmt//bin/svc_ifc_eventmgr.bin --x
ifc 5126 1.5 8.2 6032524 5363184 ? Ssl Mar10 636:01 /mgmt//bin/svc_ifc_reader.bin --x
root 5130 4.6 0.6 1092480 439580 ? Ssl Mar10 1927:18 /mgmt//bin/svc_ifc_ae.bin --x
ifc 5132 1.6 0.8 1312136 567420 ? Ssl Mar10 689:46 /mgmt//bin/svc_ifc_vmmmgr.bin --x
ifc 5133 1.5 0.5 1064176 346760 ? Ssl Mar10 659:07 /mgmt//bin/svc_ifc_domainmgr.bin --x
ifc 5135 1.8 1.6 1736876 1099924 ? Ssl Mar10 770:43 /mgmt//bin/svc_ifc_observer.bin --x
root 5141 1.5 0.7 1092948 458156 ? Ssl Mar10 663:45 /mgmt//bin/svc_ifc_plgnhandler.bin --x
ifc 5146 2.0 0.6 1037676 397236 ? Ssl Mar10 857:48 /mgmt//bin/svc_ifc_idmgr.bin --x
ifc 5148 1.3 0.3 650596 222336 ? Ssl Mar10 580:28 /mgmt//bin/svc_ifc_vtap.bin --x
ifc 5160 1.6 0.6 1098280 453492 ? Ssl Mar10 669:21 /mgmt//bin/svc_ifc_scripthandler.bin --x
root 7089 1.4 0.4 856360 315016 ? Ssl Mar10 592:07 /mgmt//bin/dhcpd.bin -f -4 -cf /data//dhcp/dhcpd.conf -lf /data//dhcp/dhcpd.lease -pf /var/run//dhcpd.pid --no-pid bond0.3903
ifc 7609 126 0.5 987404 362824 ? Ssl 17:25 0:02 /mgmt//bin/svc_ifc_dbgr.bin --x <=====
admin 7762 0.0 0.0 112800 1668 pts/1 S+ 17:26 0:00 grep -E svc|nginx.bin|dhcp
ifc 30432 1.4 0.6 894088 405968 ? Ssl Mar17 473:48 /mgmt//bin/svc_ifc_policydist.bin --x
root 31215 2.8 5.2 4503880 3397252 ? Ssl Apr05 124:13 /mgmt//bin/nginx.bin -p /data//nginx/
After "acidiag start dbgr" is run, the process is running again. If the process does not start, contact TAC for further troubleshooting.
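Spotting a missing DME by eye in a long process list is error-prone, so the comparison can be sketched in Python. The code assumes the svc_ifc_<name>.bin naming seen in the outputs above, and it assumes that the short name passed to acidiag start is the text between "svc_ifc_" and ".bin", which this document confirms only for eventmgr and dbgr. The function names are hypothetical.

```python
import re


def dme_names(ps_output: str) -> set[str]:
    """Collect short DME names from svc_ifc_<name>.bin entries in ps output."""
    return set(re.findall(r"/svc_ifc_(\w+)\.bin\b", ps_output))


def missing_dmes(expected_ps: str, observed_ps: str) -> list[str]:
    """Return DME names present in the expected output but absent from the observed one."""
    return sorted(dme_names(expected_ps) - dme_names(observed_ps))


if __name__ == "__main__":
    # Shortened stand-ins for the expected and observed ps outputs.
    expected = (
        "ifc 8915 1 2 21:53 ? 00:02:04 /mgmt//bin/svc_ifc_dbgr.bin --x\n"
        "ifc 8922 1 2 21:53 ? 00:02:09 /mgmt//bin/svc_ifc_eventmgr.bin --x\n"
    )
    observed = "ifc 5123 2.3 1.2 1474388 800564 ? Ssl Mar10 994:15 /mgmt//bin/svc_ifc_eventmgr.bin --x\n"
    for name in missing_dmes(expected, observed):
        # Print the command to review; this sketch does not run anything on the APIC.
        print(f"acidiag start {name}")
```

The script only prints candidate commands. Whether to run them on a production fabric remains a decision for the operator or TAC.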
Run show core to check for core files. If any are present, upload them to the service request.
apic1# show core
Node Module Creation-Time File-Size Service Process Original-Location Exit-Code Death-Reason Last-Heartbeat
---- ------ ------------- --------- ------------ ------- ------------------ --------- ------------ --------------
Ctrlr-Id Creation-Time File-Size Service Process Original-Location Exit-Code
-------- --------------------- --------- ------------ ------- ---------------------------------------- ---------
1 2021-10-05T21:19:55.0 204534444 eventmgr 22453 /dmecores/svc_ifc_eventmgr.bin_log.22453 134
00-07:00 .tar.gz
For core collection, see https://www.cisco.com/c/en/us/support/docs/cloud-systems-management/application-policy-infrastructure-controller-apic/214520-guide-to-collect-tech-support-and-tac-re.html
Collect the APIC TS logs and upload them to the service request for further troubleshooting. See https://www.cisco.com/c/en/us/support/docs/cloud-systems-management/application-policy-infrastructure-controller-apic/214520-guide-to-collect-tech-support-and-tac-re.html
Revision | Publish Date | Comments |
---|---|---|
1.0 | 06-Apr-2022 | Initial Release |