此产品的文档集力求使用非歧视性语言。在本文档集中,非歧视性语言是指不隐含针对年龄、残障、性别、种族身份、族群身份、性取向、社会经济地位和交叉性的歧视的语言。由于产品软件的用户界面中使用的硬编码语言、基于 RFP 文档使用的语言或引用的第三方产品使用的语言,文档中可能无法确保完全使用非歧视性语言。 深入了解思科如何使用包容性语言。
思科采用人工翻译与机器翻译相结合的方式将此文档翻译成不同语言,希望全球的用户都能通过各自的语言得到支持性的内容。 请注意:即使是最好的机器翻译,其准确度也不及专业翻译人员的水平。 Cisco Systems, Inc. 对于翻译的准确性不承担任何责任,并建议您总是参考英文原始文档(已提供链接)。
本文档介绍如何排除运行16.x版本(也称为Polaris)的新Cisco IOS®-XE平台上主要由于中断而引起的高CPU使用问题。 此外,本文档介绍了该平台上为排除此类故障而必不可少的几个新命令。
了解Cisco IOS®-XE的构建方式非常重要。借助Cisco IOS®-XE,思科已迁移至Linux内核,所有子系统已分解为多个进程。以前在Cisco IOS®内的所有子系统(如模块驱动程序、高可用性(HA)等)现在都作为Linux操作系统(OS)内的软件进程运行。 Cisco IOS®本身作为守护程序在Linux OS(IOSd)中运行。 Cisco IOS®-XE不仅保留了传统Cisco IOS®的相同外观和感觉,还保留了其操作、支持和管理。
以下是一些有用的定义:
数据平面和控制平面之间通信路径的高级图:
本部分提供系统化工作流程,以分类交换机上的高CPU问题。请注意,它包含撰写本节时的选定流程。
本部分中的故障排除和验证过程可广泛用于因中断而导致的高CPU使用率。
使用show process cpu命令可显示IOSd守护程序内部的当前进程状态。添加输出时,请修改 | exclude 0.00,它将过滤当前空闲的进程。
从此输出中有两条重要信息:
Switch# show processes cpu sort | ex 0.00
CPU utilization for five seconds: 91%/30%; one minute: 30%; five minutes: 8%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
37 14645 325 45061 59.53% 18.86% 4.38% 0 ARP Input
137 2288 115 19895 1.20% 0.14% 0.07% 0 Per-minute Jobs
373 2626 35334 74 0.15% 0.11% 0.09% 0 MMA DB TIMER
218 3123 69739 44 0.07% 0.09% 0.12% 0 IP ARP Retry Age
404 2656 35333 75 0.07% 0.09% 0.09% 0 MMA DP TIMER
show processes cpu platform sorted命令用于显示来自Linux内核的进程利用率的情况。从输出中可以看到,FED进程很高,这是由于传送到IOSd进程的ARP请求:
Switch# show processes cpu platform sorted CPU utilization for five seconds: 38%, one minute: 38%, five minutes: 40% Core 0: CPU utilization for five seconds: 39%, one minute: 37%, five minutes: 39% Core 1: CPU utilization for five seconds: 41%, one minute: 38%, five minutes: 40% Core 2: CPU utilization for five seconds: 30%, one minute: 38%, five minutes: 40% Core 3: CPU utilization for five seconds: 37%, one minute: 39%, five minutes: 41% Pid PPid 5Sec 1Min 5Min Status Size Name -------------------------------------------------------------------------------- 22701 22439 89% 88% 88% R 2187444224 linux_iosd-imag 11626 11064 46% 47% 48% S 2476175360 fed main event 4585 2 7% 9% 9% S 0 lsmpi-xmit 4586 2 3% 6% 6% S 0 lsmpi-rx
从步骤1中,您可以断定IOSd/ARP进程运行高,但是是从数据平面引入的流量的受害者。需要进一步调查FED进程为什么将流量传送到CPU以及此流量来自何处。
show platform software fed switch active punt cause summary概括了punt原因的概述。在多次运行此命令时递增的任意数字:
Switch#show platform software fed switch active punt cause summary Statistics for all causes Cause Cause Info Rcvd Dropped ------------------------------------------------------------------------------ 7 ARP request or response 18444227 0 11 For-us data 16 0 21 RP<->QFP keepalive 3367 0 24 Glean adjacency 2 0 55 For-us control 6787 0 60 IP subnet or broadcast packet 14 0 96 Layer2 control protocols 3548 0 ------------------------------------------------------------------------------
从FED发送到控制平面的数据包使用分割队列结构来保证高优先级控制流量。它不会像ARP那样丢失在优先级较低的流量之后。使用show platform software fed switch active cpu-interface可以查看这些队列的高级概述。运行此命令多次后,可以发现Forus Resolution(Forus — 表示发往CPU的流量)队列会快速增加。
Switch#show platform software fed switch active cpu-interface queue retrieved dropped invalid hol-block ------------------------------------------------------------------------- Routing Protocol 8182 0 0 0 L2 Protocol 161 0 0 0 sw forwarding 2 0 0 0 broadcast 14 0 0 0 icmp gen 0 0 0 0 icmp redirect 0 0 0 0 logging 0 0 0 0 rpf-fail 0 0 0 0 DOT1X authentication 0 0 0 0 Forus Traffic 16 0 0 0 Forus Resolution 24097779 0 0 0 Inter FED 0 0 0 0 L2 LVX control 0 0 0 0 EWLC control 0 0 0 0 EWLC data 0 0 0 0 L2 LVX data 0 0 0 0 Learning cache 0 0 0 0 Topology control 4117 0 0 0 Proto snooping 0 0 0 0 DHCP snooping 0 0 0 0 Transit Traffic 0 0 0 0 Multi End station 0 0 0 0 Webauth 0 0 0 0 Crypto control 0 0 0 0 Exception 0 0 0 0 General Punt 0 0 0 0 NFL sampled data 0 0 0 0 Low latency 0 0 0 0 EGR exception 0 0 0 0 FSS 0 0 0 0 Multicast data 0 0 0 0 Gold packet 0 0 0 0
使用show platform software fed switch active punt cpuq all可提供这些队列的更详细视图。队列5负责ARP,并按照预期在命令的多次运行中递增。show plat soft fed sw active inject cpuq clear命令可用于清除计数器,以便更轻松地读取。
Switch#show platform software fed switch active punt cpuq all <snip> CPU Q Id : 5 CPU Q Name : CPU_Q_FORUS_ADDR_RESOLUTION Packets received from ASIC : 21018219 Send to IOSd total attempts : 21018219 Send to IOSd failed count : 0 RX suspend count : 0 RX unsuspend count : 0 RX unsuspend send count : 0 RX unsuspend send failed count : 0 RX consumed count : 0 RX dropped count : 0 RX non-active dropped count : 0 RX conversion failure dropped : 0 RX INTACK count : 1050215 RX packets dq'd after intack : 90 Active RxQ event : 3677400 RX spurious interrupt : 1050016 <snip>
从这里,有几个选择。ARP是广播流量,因此您可以查找广播流量速率异常高的接口(对于排除第2层环路故障也非常有用)。 可能需要多次运行此命令,以确定主动增加的接口。
Switch#show interfaces counters Port InOctets InUcastPkts InMcastPkts InBcastPkts Gi1/0/1 1041141009678 9 0 16267828358 Gi1/0/2 1254 11 0 1 Gi1/0/3 0 0 0 0 Gi1/0/4 0 0 0 0
另一个选项是使用嵌入式数据包捕获(EPC)工具来收集在控制平面看到的数据包的示例。
Switch#monitor capture cpuCap control-plane in match any file location flash:cpuCap.pcap Switch#show monitor capture cpuCap Status Information for Capture cpuCap Target Type: Interface: Control Plane, Direction: IN Status : Inactive Filter Details: Capture all packets Buffer Details: Buffer Type: LINEAR (default) File Details: Associated file name: flash:cpuCap.pcap Limit Details: Number of Packets to capture: 0 (no limit) Packet Capture duration: 0 (no limit) Packet Size to capture: 0 (no limit) Packet sampling rate: 0 (no sampling)
此命令配置交换机上的内部捕获,以捕获传送到控制平面的所有流量。此流量将保存到闪存上的文件。这是一个普通的wireshark pcap文件,可以从交换机导出,然后在wireshark中打开以供进一步分析。
启动捕获,让它运行几秒钟并停止捕获:
Switch#monitor capture cpuCap start Enabling Control plane capture may seriously impact system performance. Do you want to continue? [yes/no]: yes Started capture point : cpuCap *Jun 14 17:57:43.172: %BUFCAP-6-ENABLE: Capture Point cpuCap enabled. Switch#monitor capture cpuCap stop Capture statistics collected at software: Capture duration - 59 seconds Packets received - 215950 Packets dropped - 0 Packets oversized - 0 Bytes dropped in asic - 0 Stopped capture point : cpuCap Switch# *Jun 14 17:58:37.884: %BUFCAP-6-DISABLE: Capture Point cpuCap disabled.
也可以查看交换机上的捕获文件:
Switch#show monitor capture file flash:cpuCap.pcap Starting the packet display ........ Press Ctrl + Shift + 6 to exit 1 0.000000 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2 2 0.000054 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2 3 0.000082 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2 4 0.000109 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2 5 0.000136 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2 6 0.000162 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2 7 0.000188 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2 8 0.000214 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2 9 0.000241 Xerox_d7:67:a1 -> Broadcast ARP 60 Who has 192.168.1.24? Tell 192.168.1.2
从此输出中,很明显,192.168.1.2主机是导致交换机CPU高的常量ARP的源。使用show ip arp和show mac address-table address 命令跟踪主机,并将其从网络中删除或为ARP编址。还可以使用capture view命令show monitor capture file flash:cpuCap.pcap detail上的detail选项获取捕获的每个数据包的完整详细信息。有关Catalyst交换机上的数据包捕获的详细信息,请参阅本指南。
默认情况下,最新一代Catalyst交换机受控制平面策略(CoPP)保护。CoPP用于保护CPU免受恶意攻击和配置错误,这些攻击和配置错误可能会破坏交换机功能,以维护生成树和路由协议等关键功能。这些保护可能导致交换机的CPU和接口计数器仅略高,但流量在通过交换机时被丢弃。请注意设备在正常运行时的基准CPU利用率。提高CPU利用率不一定是问题,这取决于设备上启用的功能,但是当此利用率增加而不更改配置时,这可能是问题的征兆。
请考虑这种情况。网关交换机外的主机报告下载速度慢,对互联网的ping丢失。交换机的常规运行状况检查显示,从网关交换机发出的接口上没有错误或任何ping丢失。
当您检查CPU时,它会显示由于中断而略高的数字。
Switch#show processes cpu sorted | ex 0.00 CPU utilization for five seconds: 8%/7%; one minute: 8%; five minutes: 8% PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process 122 913359 1990893 458 0.39% 1.29% 1.57% 0 IOSXE-RP Punt Se 147 5823 16416 354 0.07% 0.05% 0.06% 0 PLFM-MGR IPC pro 404 13237 183032 72 0.07% 0.08% 0.07% 0 MMA DP TIMER
当您检查CPU接口时,您会看到ICMP重定向计数器主动递增。
Switch#show platform software fed switch active cpu-interface queue retrieved dropped invalid hol-block ------------------------------------------------------------------------- Routing Protocol 12175 0 0 0 L2 Protocol 236 0 0 0 sw forwarding 714673 0 0 0 broadcast 2 0 0 0 icmp gen 0 0 0 0 icmp redirect 2662788 0 0 0 logging 7 0 0 0 rpf-fail 0 0 0 0 DOT1X authentication 0 0 0 0 Forus Traffic 21776434 0 0 0 Forus Resolution 724021 0 0 0 Inter FED 0 0 0 0 L2 LVX control 0 0 0 0 EWLC control 0 0 0 0 EWLC data 0 0 0 0 L2 LVX data 0 0 0 0 Learning cache 0 0 0 0 Topology control 6122 0 0 0 Proto snooping 0 0 0 0 DHCP snooping 0 0 0 0 Transit Traffic 0 0 0 0
在FED中未观察到丢弃,如果检查CoPP,则可在ICMP重定向队列中观察丢弃。
Switch#show platform hardware fed switch 1 qos queue stats internal cpu policer CPU Queue Statistics ============================================================================================ (default) (set) Queue QId PlcIdx Queue Name Enabled Rate Rate Drop(Bytes) ----------------------------------------------------------------------------- 0 11 DOT1X Auth Yes 1000 1000 0 1 1 L2 Control Yes 2000 2000 0 2 14 Forus traffic Yes 4000 4000 0 3 0 ICMP GEN Yes 600 600 0 4 2 Routing Control Yes 5400 5400 0 5 14 Forus Address resolution Yes 4000 4000 0 6 0 ICMP Redirect Yes 600 600 463538463 7 16 Inter FED Traffic Yes 2000 2000 0 8 4 L2 LVX Cont Pack Yes 1000 1000 0 <snip>
CoPP实质上是设备控制平面上的QoS策略。CoPP的工作方式与交换机上的任何其他QoS相同:当特定流量的队列耗尽时,使用该队列的流量将被丢弃。从这些输出中,您知道流量是由于ICMP重定向而进行软件交换的,并且您知道此流量是由于ICMP重定向队列的速率限制而被丢弃的。您可以在控制平面上执行捕获,以验证到达控制平面的数据包是否来自用户。
为了查看每个类使用的匹配逻辑,您有一个CLI来帮助识别到达特定队列的数据包类型。例如,如果您想知道将命中system-cpp-routing-control类的内容:
Switch#show platform software qos copp policy-info
Default rates of all classmaps are displayed:
policy-map system-cpp-policy
class system-cpp-police-routing-control
police rate 5400 pps
Switch#show platform software qos copp class-info
ACL representable classmap filters are displayed:
class-map match-any system-cpp-police-routing-control
description Routing control and Low Latency
match access-group name system-cpp-mac-match-routing-control
match access-group name system-cpp-ipv4-match-routing-control
match access-group name system-cpp-ipv6-match-routing-control
match access-group name system-cpp-ipv4-match-low-latency
match access-group name system-cpp-ipv6-match-low-latency
mac access-list extended system-cpp-mac-match-routing-control
permit any host 0180.C200.0014
permit any host 0900.2B00.0004
ip access-list extended system-cpp-ipv4-match-routing-control
permit udp any any eq rip
<...snip...>
ipv6 access-list system-cpp-ipv6-match-routing-control
permit ipv6 any FF02::1:FF00:0/104
permit ipv6 any host FF01::1
<...snip...>
ip access-list extended system-cpp-ipv4-match-low-latency
permit udp any any eq 3784
permit udp any any eq 3785
ipv6 access-list system-cpp-ipv6-match-low-latency
permit udp any any eq 3784
permit udp any any eq 3785
<...snip...>
Switch#monitor capture cpuSPan control-plane in match any file location flash:cpuCap.pcap Control-plane direction IN is already attached to the capture Switch#monitor capture cpuSpan start Enabling Control plane capture may seriously impact system performance. Do you want to continue? [yes/no]: yes Started capture point : cpuSpan Switch# *Jun 15 17:28:52.841: %BUFCAP-6-ENABLE: Capture Point cpuSpan enabled. Switch#monitor capture cpuSpan stop Capture statistics collected at software: Capture duration - 12 seconds Packets received - 5751 Packets dropped - 0 Packets oversized - 0 Bytes dropped in asic - 0 Stopped capture point : cpuSpan Switch# *Jun 15 17:29:02.415: %BUFCAP-6-DISABLE: Capture Point cpuSpan disabled. Switch#show monitor capture file flash:cpuCap.pcap detailed Starting the packet display ........ Press Ctrl + Shift + 6 to exit Frame 1: 60 bytes on wire (480 bits), 60 bytes captured (480 bits) on interface 0
<snip>
Ethernet II, Src: OmronTat_2c:a1:52 (00:00:0a:2c:a1:52), Dst: Cisco_8f:cb:47 (00:42:5a:8f:cb:47)
<snip>
Internet Protocol Version 4, Src: 192.168.1.10, Dst: 8.8.8.8
<snip>
当此主机ping 8.8.8.8时,会将ping发送到网关MAC地址,因为目的地址在VLAN之外。网关交换机检测到下一跳在同一VLAN中,并将目的MAC地址重写到防火墙并转发数据包。此过程可能发生在硬件中,但此硬件转发的例外是IP重定向过程。当交换机收到ping命令时,交换机检测到它正在同一VLAN上路由流量,并将流量传送到CPU,以便向主机生成重定向数据包。此重定向消息是通知主机到目的地的路径更优。在本例中,第2层下一跳是设计的,并且是预期的,因此需要配置交换机,使其不发送重定向消息并在硬件中转发数据包。在VLAN接口上禁用重定向时,会执行此操作。
interface Vlan1 ip address 192.168.1.1 255.255.255.0 no ip redirects end
当IP重定向关闭时,交换机会重写MAC地址并在硬件中转发。
如果交换机上的高CPU间歇性地运行,则可以在交换机上设置脚本,以便在发生高CPU事件时自动运行这些命令。这通过使用Cisco IOS®嵌入式事件管理器(EEM)完成。
entry-val用于确定在脚本触发之前CPU的高度。脚本监控5秒的CPU平均SNMP OID。两个文件写入闪存,tac-cpu-<timestamp>.txt包含命令输出,tac-cpu-<timestamp>.pcap包含CPU入口捕获。然后,可以稍后查看这些文件。
config t
no event manager applet high-cpu authorization bypass
event manager applet high-cpu authorization bypass
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type next entry-op gt entry-val 80 poll-interval 1 ratelimit 300 maxrun 180
action 0.01 syslog msg "High CPU detected, gathering system information."
action 0.02 cli command "enable"
action 0.03 cli command "term exec prompt timestamp"
action 0.04 cli command "term length 0"
action 0.05 cli command "show clock"
action 0.06 regex "([0-9]|[0-9][0-9]):([0-9]|[0-9][0-9]):([0-9]|[0-9][0-9])" $_cli_result match match1
action 0.07 string replace "$match" 2 2 "."
action 0.08 string replace "$_string_result" 5 5 "."
action 0.09 set time $_string_result
action 1.01 cli command "show proc cpu sort | append flash:tac-cpu-$time.txt"
action 1.02 cli command "show proc cpu hist | append flash:tac-cpu-$time.txt"
action 1.03 cli command "show proc cpu platform sorted | append flash:tac-cpu-$time.txt"
action 1.04 cli command "show interface | append flash:tac-cpu-$time.txt"
action 1.05 cli command "show interface stats | append flash:tac-cpu-$time.txt"
action 1.06 cli command "show log | append flash:tac-cpu-$time.txt"
action 1.07 cli command "show ip traffic | append flash:tac-cpu-$time.txt"
action 1.08 cli command "show users | append flash:tac-cpu-$time.txt"
action 1.09 cli command "show platform software fed switch active punt cause summary | append flash:tac-cpu-$time.txt"
action 1.10 cli command "show platform software fed switch active cpu-interface | append flash:tac-cpu-$time.txt"
action 1.11 cli command "show platform software fed switch active punt cpuq all | append flash:tac-cpu-$time.txt"
action 2.08 cli command "no monitor capture tac_cpu"
action 2.09 cli command "monitor capture tac_cpu control-plane in match any file location flash:tac-cpu-$time.pcap"
action 2.10 cli command "monitor capture tac_cpu start" pattern "yes"
action 2.11 cli command "yes"
action 2.12 wait 10
action 2.13 cli command "monitor capture tac_cpu stop"
action 3.01 cli command "term default length"
action 3.02 cli command "terminal no exec prompt timestamp"
action 3.03 cli command "no monitor capture tac_cpu"