System Monitoring Configuration Guide for Cisco 8000 Series Routers, IOS XR Release 24.1.x, 24.2.x, 24.3.x, 24.4.x
Cisco 8000 Series Routers support the Online Diagnostics feature, which enables you to run tests that verify hardware functionality while the router is connected to a live network. When a problem is detected, the diagnostic test results help isolate the location of the problem, enabling you to take appropriate measures and resolve the issue faster.
Table 1. Feature History Table

Feature Name: Online diagnostics for NPU, NPU slices, and fabric cards
Release: Release 24.4.1
Description: Introduced in this release on: Fixed Systems (8200 [ASIC: P100], 8700 [ASIC: P100, K100]) (select variants only*); Modular Systems (8800 [LC ASIC: P100]) (select variants only*)
*This feature is supported on:
8212-48FH-M
8711-32FH-M
8712-MOD-M
88-LC1-36EH
88-LC1-12TH24FH-E
88-LC1-52Y8H-EM

Feature Name: Online diagnostics for NPU slices and fabric cards
Release: Release 24.2.11
Description: Introduced in this release on: Fixed Systems (8200); Centralized Systems (8600); Modular Systems (8800 [LC ASIC: Q100, Q200]). You can now use the online diagnostics functionality to test the health of fabric cards and all the slices in an NPU. This feature helps you detect fabric-level and slice-level failures.

Feature Name: Online diagnostics for NPU
Release: Release 7.5.2 / Release 7.3.5
Description: You can now use the online diagnostics feature to verify that the router NPUs are operational. NPU failure logs are captured in the system log output. You can also generate tech-support information that is useful to Cisco Technical Support representatives when troubleshooting a router.
The diagnostic tests check different hardware components in the system and verify the data paths and control signals. The online diagnostic tests use the CPU to send packets to the Network Processing Unit (NPU) through the punt switch. If a failure is detected, an NP Datalog is automatically generated to help diagnose the problem.
The default interval for the NPU loopback test is one minute, and the default threshold is 3.
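Because NPU failure logs are captured in the system log, you can check for loopback test failures directly from the CLI. A minimal sketch follows; the "NPU" filter string is an illustrative pattern, not a documented message format:

RP/0/RP0/CPU0:Router# show logging | include NPU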
Online diagnostic tests are categorized by how they are executed:
Dynamic diagnostics: These tests are enabled when the system starts and the system data path is operational. While the system is in use and connected to a live network, they run in the background as non-disruptive tests.
On-demand diagnostics: These tests are run as needed using a diagnostic start command from the command-line interface (CLI). They are useful when a hardware fault is suspected.
You can use these diagnostic tests to determine hardware status and to troubleshoot hardware issues.
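For example, the data plane health check described later in this guide is an on-demand diagnostic. A typical invocation and status check look like this (output omitted here; full transcripts appear in the examples below):

RP/0/RP0/CPU0:Router# monitor dataplane-health
RP/0/RP0/CPU0:Router# show dataplane-health status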
Online diagnostics for NPU slices
From Release 24.2.11, the online diagnostics functionality is extended to test all the slices in an NPU, enabling you to detect slice-level failures.
The default rate at which test packets are transmitted to each NPU slice is increased to 60 packets per minute.
The default interval is one second for the per-slice test, and the default threshold is 3.
From Release 24.2.11, you can also test the health of fabric cards (FCs) using the online diagnostics functionality. Test packets are transmitted to fabric cards at the default rate of 50 packets per minute.
The default interval for the fabric test is 30 seconds, and the default threshold is 6.
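As a rough worked example, assuming the threshold counts consecutive missed test packets (this guide does not state the exact semantics): with the one-second per-slice interval and a threshold of 3, a failed slice would be flagged after about 3 seconds, while with the 30-second fabric interval and a threshold of 6, a failed fabric path would take about 3 minutes to flag.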
The data plane health check utility is a monitoring tool that helps you determine the health of data plane components, including:
fabric cards
NPUs
This utility can detect fabric memory corruption and packet loss that may occur due to broken internal links.
Table 2. Feature History Table

Feature Name: Monitor data plane health
Release: Release 24.4.1
Description: Introduced in this release on: Fixed Systems (8200 [ASIC: P100], 8700 [ASIC: P100]) (select variants only*); Modular Systems (8800 [LC ASIC: P100]) (select variants only*)
*This feature is supported on:
8212-48FH-M
8711-32FH-M
88-LC1-36EH
88-LC1-12TH24FH-E
88-LC1-52Y8H-EM

Feature Name: Monitor data plane health
Release: Release 24.2.11
Description: Introduced in this release on: Fixed Systems (8200); Modular Systems (8800 [LC ASIC: Q100, Q200]). You can now detect fabric memory corruption and packet loss by checking the health of data plane components, including fabric and NPUs, on a distributed system using the on-demand diagnostic utility. This functionality introduces the following commands:
monitor dataplane-health starts the diagnosis. The detailed error report helps you identify the faulty card.
show dataplane-health status checks the status of a data plane health test. It reports whether the test is still running or has completed, along with a summary of the results.
Note
Do not use the data plane health check utility on a router that carries live traffic, as this utility affects the system performance.
Use cases for data plane health check
You can use the data plane health check utility in the following scenarios:
Before router deployment – After installing the FC or LC on the router, you can run the utility to check for issues, and then
proceed to router provisioning.
After router deployment – If traffic loss is observed, but the packet drop analysis does not provide a root cause, then isolate
the router and run the utility to check for issues.
Limitations for data plane health check
Avoid using the show controllers npu debugshell CLI command.
Avoid system reload (or LC reload) when the Data Plane Health Check utility is being executed.
The monitor dataplane-health module fabric command is supported only on distributed routers.
You must archive the report file before subsequent runs, as this file is overwritten on re-execution of the command (see the example after this list).
The previous log file is archived as /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.bak.
The test report is available at /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.
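A minimal sketch of archiving the report before a re-run, using the standard copy command; the destination filename is arbitrary:

RP/0/RP0/CPU0:Router# copy harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt harddisk:/dph_mon/dph_report_backup.txt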
Monitor and verify data plane health
This example shows how to execute the data plane health check utility for the fabric module.
RP/0/RP0/CPU0:Router# monitor dataplane-health
Wed Feb 28 15:08:15.659 EST
RP/0/RP0/CPU0:Feb 28 15:08:15.687 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-STARTED : Dataplane health monitoring started. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result.
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
Previous log file is archived as /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.bak
Please save the archive with a different file name as needed
########################################################################################
Module:fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 1044 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
OR
Use "show dataplane-health status" regularly to check for completion
RP/0/RP0/CPU0:Feb 28 15:08:15.687 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-STARTED : Dataplane health monitoring started. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result.
RP/0/RP0/CPU0:Feb 28 15:28:54.965 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-COMPLETED : Dataplane health monitoring completed. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result.
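After the COMPLETED message appears, you can read the report directly from the CLI. A minimal sketch, assuming the standard more file-viewing command is available on your release:

RP/0/RP0/CPU0:Router# more harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt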
Warning
Run the test on a router in a non-production environment as this test impacts the system performance.
These examples show how to check the status of a data plane health test:
Example 1: Output with data plane health check in progress
RP/0/RP0/CPU0:Router# show dataplane-health status
Mon Jan 29 22:39:48.336 UTC
Dataplane health monitoring in progress..
Example 2: Output with successful data plane health check
RP/0/RP0/CPU0:Router# show dataplane-health status
Mon Jan 29 23:10:21.564 UTC
Dataplane health monitoring completed
Summary of results (Module: fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
LC NP Slice GOOD LOSS CORRUPT ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
0 0 0 2526099 0 0 0
1 2526856 0 0 0
2 2526529 0 0 0
1 0 2526590 0 0 0
1 2526918 0 0 0
2 2526421 0 0 0
2 0 2526665 0 0 0
1 2525818 0 0 0
2 2526286 0 0 0
-------------------------------------------------------------------------------
1 0 0 2526754 0 0 0
1 2526328 0 0 0
2 2526695 0 0 0
1 0 2525892 0 0 0
1 2526988 0 0 0
2 2526215 0 0 0
**********************************************************************************
DATAPATH CHECK IS CLEAN (mode: fabric).
**********************************************************************************
If the data plane health monitor does not report any issues, the system is functioning correctly, and there is no need to proceed with further data plane health verification.
If any failure is detected during the data plane health check, you must proceed with additional verification.
Additional Troubleshooting
The following sample output illustrates failures detected in the datapath.
RP/0/RP0/CPU0:Router# monitor dataplane-health
Fri Sep 29 12:53:51.595 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Details of the test results are logged in harddisk:/dataplane_health_detail_report.txt
Estimated time for completion: 783 seconds
Ensure that the terminal/vty session timeout is greater than 783 seconds
Testing in progress (suggest not to break the tests)
.................................................................................................................................
Processing further to find failed fabric elements. This will take more time...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Datapath test on all requested LC/NPU/slice completed
Summary of results:
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
-------------------------------------------------------------------------------
LC NP Slice GOOD LOSS CORRUPT ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
1 0 0 476214 0 637 0
1 0 0 0 0
2 0 0 0 0
1 0 475860 0 670 0
1 0 0 0 0
2 0 0 0 0
-------------------------------------------------------------------------------
2 0 0 2383553 0 0 0
1 2383747 0 0 0
2 2383616 0 0 0
1 0 2383280 0 0 0
1 2383737 0 0 0
2 2383343 0 0 0
2 0 2383937 0 0 0
1 2383913 0 0 0
2 2384017 0 0 0
**********************************************************************************
Corruption detected:(LC1/0 <-> FC7/0) (LC1/1 <-> FC7/0)
**********************************************************************************
FAILURES DETECTED IN DATAPATH.
Please run "monitor dataplane-health module no-fabric"
Please check harddisk:/dataplane_health_detail_report.txt
**********************************************************************************
RP/0/RP0/CPU0:Router# monitor dataplane-health
Fri Feb 16 08:50:58.115 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
########################################################################################
Module:fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 522 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
OR
Use "show dataplane-health status" regularly to check for completion
RP/0/RP0/CPU0:Router# show dataplane-health status
Fri Feb 16 09:04:12.156 UTC
Dataplane health monitoring completed
Summary of results (Module: fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
LC NP Slice GOOD LOSS CORRUPT ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
0 0 0 85 10977 0 0
1 0 0 0 0
2 0 0 0 0
1 0 1735 8865 0 0
1 0 0 0 0
2 0 0 0 0
**********************************************************************************
**********************************************************************************
FAILURES DETECTED IN DATAPATH (mode: fabric).
Please run "monitor dataplane-health module no-fabric" to check if the issue is on the LCs or FCs
Please check /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
**********************************************************************************
The sample output indicates that failures are detected in the datapath. To isolate the issue further, verify whether the packet corruption is caused by a fabric card or a line card. Execute the following command to run the health check on the line cards, excluding the fabric cards.
RP/0/RP0/CPU0:Router# monitor dataplane-health module no-fabric
Fri Sep 29 14:09:28.506 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Details of the test results are logged in harddisk:/dataplane_health_detail_report.txt
Estimated time for completion: 783 seconds
Ensure that the terminal/vty session timeout is greater than 783 seconds
Testing in progress (suggest not to break the tests)
.....................................................................................................................................................................................................................
Datapath test on all requested LC/NPU/slice completed
Summary of results:
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
-------------------------------------------------------------------------------
LC NP Slice GOOD LOSS CORRUPT ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
1 0 0 2383412 0 0 0
1 2383031 0 0 0
2 2383484 0 0 0
1 0 2383883 0 0 0
1 2383973 0 0 0
2 2383349 0 0 0
-------------------------------------------------------------------------------
2 0 0 2383160 0 0 0
1 2384196 0 0 0
2 2383879 0 0 0
1 0 2383135 0 0 0
1 2383196 0 0 0
2 2383668 0 0 0
2 0 2383414 0 0 0
1 2384360 0 0 0
2 2383732 0 0 0
-------------------------------------------------------------------------------
6 0 0 2383933 0 0 0
1 2384205 0 0 0
2 2383746 0 0 0
1 0 2383215 0 0 0
1 2383578 0 0 0
2 2382921 0 0 0
**********************************************************************************
DATAPATH CHECK IS CLEAN.
**********************************************************************************
RP/0/RP0/CPU0:Router# monitor dataplane-health module no-fabric
Fri Feb 16 09:08:39.412 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_no_fabric_mode_report.txt
########################################################################################
Module:no-fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 522 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
OR
Use "show dataplane-health status" regularly to check for completion
RP/0/RP0/CPU0:Router# show dataplane-health status
Fri Feb 16 09:11:01.540 UTC
Dataplane health monitoring completed
Summary of results (Module: no-fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
LC NP Slice GOOD LOSS CORRUPT ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
0 0 0 4836 0 0 0
1 4908 0 0 0
2 5634 0 0 0
1 0 4196 0 0 0
1 7638 0 0 0
2 7362 0 0 0
**********************************************************************************
DATAPATH CHECK IS CLEAN..
**********************************************************************************
The above sample output indicates that there are no errors or corruption on the line cards; hence, the fabric card must be faulty.
If there is any corruption detected after running the monitor dataplane-health module no-fabric command, then contact the Cisco Technical Assistance Center (TAC).
If packet loss is detected after running the monitor dataplane-health module command, perform the following steps for further verification:
Run the following command:
RP/0/RP0/CPU0:Router# show controllers npu driver location all
Fri Sep 29 13:11:16.738 EDT
==============================================
NPU Driver Information
==============================================
Driver Version: 1
SDK Version: 1.55.0.41
Functional role: Active, Rack: 8808, Type: lcc, Node: 0/5/CPU0
Driver ready : Yes
NPU first started : Fri Sep 29 08:11:58 2023
Fabric Mode: FABRIC/8FC
NPU Power profile: Medium
Driver Scope: Node
Respawn count : 1
Availablity masks :
card: 0x1, asic: 0x7, exp asic: 0x7
Weight distribution:
Unicast: 80, Multicast: 20
+----------------------------------------------------------------+
| Process | Connection | Registration | Connection | DLL |
| /Lib | status | status | requests | registration|
+----------------------------------------------------------------+
| FSDB | Active | Active | 1| n/a |
| FGID | Inactive | Inactive | 0| n/a |
| AEL | n/a | n/a | n/a| Yes |
| SM | n/a | n/a | n/a| Yes |
+----------------------------------------------------------------+
Asics :
HP - HotPlug event, PON - Power On reset
HR - Hard Reset, WB - Warm Boot
+------------------------------------------------------------------------------+
| Asic inst. | fap|HP|Slice|Asic|Admin|Oper | Asic state | Last |PON|HR | FW |
| (R/S/A) | id | |state|type|state|state| | init |(#)|(#)| Rev |
+------------------------------------------------------------------------------+
| 0/5/0 | 20| 1| UP |npu | UP | UP |NRML |HPON | 1| 0|0x0000|
| 0/5/1 | 21| 1| UP |npu | UP | UP |NRML |PON | 1| 0|0x0000|
| 0/5/2 | 22| 1| UP |npu | UP | UP |NRML |PON | 1| 0|0x0000|
+------------------------------------------------------------------------------+
......
.......
The HPON flag indicates an error. Collect logs and contact Cisco TAC.
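To scan every node for the flag at once, you can filter the driver output; the filter string is just a convenience shown here as an illustration:

RP/0/RP0/CPU0:Router# show controllers npu driver location all | include HPON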
Troubleshooting Flowchart
If an HPON error is detected, follow the steps provided in the following troubleshooting flowchart:
Figure 1. Troubleshooting Flowchart
Link Debugging
Identify Reachability Issues
When the datapath monitoring tool reports packet loss, the status of all ASICs is normal, and no hard resets have been performed, perform the following troubleshooting steps:
To identify issues in the connection between the line card and fabric cards, check the fabric links using the show controllers fabric fsdb-pla rack 0 command:
RP/0/RP0/CPU0:Router# show controllers fabric fsdb-pla rack 0
Fri Feb 2 05:59:07.624 UTC
Description:
planes : p0-p7
plane mask : Asic #0-3
Asic value 1: destination reachable via asic
.: destination unreachable via asic
x: asic not connected to LC (for S3)
-: plane not configured (for S2) or asic missing
Rack: 0, Stage: s123
=============================
Destination p0 p1 p2 p3 p4 p5 p6 p7 Reach-mask Oper Up
Address mask mask mask mask mask mask mask mask links/asic links/asic
Fapid(R/S/A) 0123 0123 0123 0123 0123 0123 0123 0123 Mn/Mx Total Mn/Mx Total
----------------------------------------------------------------------------------------------
0(0/0/0) 11 11 -- -- -- -- -- -- 6/6 22 12/12 48
1(0/1/0) 11 11 -- -- -- 11 -- -- 4/6 22 10/12 48
4(0/7/0) .... -- .. -- .. -- .. 0/0 0 12/12 48
The above sample output indicates that the line cards in slots 0 and 1 do not have any connectivity issues with the fabric cards. However, the ".." state for the line card in slot 7 indicates that links are down between that line card and fabric card pair.
In such situations, perform the following troubleshooting steps:
Remove the line card from the particular slot (for example, slot 7) on the front panel.
Check the particular backplane connector (for example, connector 4) for any bent pins.
Check the FC connector (for example, FC3 connector 5) for any damage.
If there are no bent pins or FC connector damage, reinsert the line card and run the show controllers fabric fsdb-pla rack 0 command again. Check that the status displays all "11" and not "..".
If you notice bent pins on the line card connector, capture it in a photo, and open a Cisco Support (TAC) case for Return
Material Authorization (RMA).
If you notice any damage on the FC connector, capture it in a photo, and open a Cisco Support (TAC) case for RMA.
If some links are down even after the visual inspection of fabric links, collect logs and contact Cisco TAC.
Note
In some scenarios, the connectivity between line card ASICs and fabric card ASICs can be UP, but a few links between the cards can still be down. It is therefore essential to verify the link status.
To check the status of links between the line card and fabric cards, examine the Reach-mask links/asic Mn/Mx (min/max) column in the output of the show controllers fabric fsdb-pla rack 0 command. From the previous sample output, you can infer that for LC1, two of the six links are down due to link connectivity issues.
Note
If the min and max values in the output under links/asic Mn/Mx are not equal, some links between the LC ASIC and the FC ASIC are down.
Run the following command to get more details about the link status:
RP/0/RP0/CPU0:Router# show controllers npu link-info rx 0 255 topo instance all location all | ex EN/DN | ex NC
Fri Feb 2 06:59:18.003 UTC
Node ID: 0/2/CPU0
-----------------------------------------------------------------------------
Link ID Log Link Asic EN/ Far-End Far-End
Link Speed Stage Oper Link (FSDB) Link (HW)
(Gbps) Status
-----------------------------------------------------------------------------
0/2/0/16 - 50.00 FIA EN/DN ............ 0/FC2/7/157 NC
0/2/0/17 - 50.00 FIA EN/DN ............ 0/FC2/8/156 NC
Node ID: 0/RP0/CPU0
-----------------------------------------------------------------------------
Link ID Log Link Asic EN/ Far-End Far-End
Link Speed Stage Oper Link (FSDB) Link (HW)
(Gbps) Status
-----------------------------------------------------------------------------
0/FC2/7/20 - 50.00 FIA EN/DN ............ 0/2/0/178 NC
0/FC2/8/33 - 50.00 FIA EN/DN ............ 0/2/0/166 0/2/0/188
In this example, links between LC2 and FC2 are down.
Identify UCE Drops of Fabric Plane
If all links are UP, monitor the Uncorrectable Errors (UCE). A UCE indicates packet corruption between a line card and a fabric card.
To identify packet corruption, run the following command:
RP/0/RP0/CPU0:Router# show controllers fabric plane all statistics
Fri Feb 2 07:21:13.447 UTC
Flags: E-D - Exceeded display width.
Check detail option.
In Out CE UCE PE
Plane Packets Packets Packets Packets Packets
--------------------------------------------------------------------------------
0 835649707985 733726217071 0 0 0
1 835649703742 733726213241 0 0 0
2 835649542985 733763942830 0 43921 0
Note
You may notice a few errors initially at card insertion. Run this command multiple times at 30-second intervals (see the repetition sketch below). If the UCE packets column count keeps increasing, there are links on that plane (plane number = FC slot number) with UCE errors.
The sample output indicates that there are Uncorrectable Errors (UCE) on the FE ASICs of the fabric card in slot 2 (FC2).
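A sketch of the repeated check; this is the same command from this section, rerun after a pause, with no new syntax:

RP/0/RP0/CPU0:Router# show controllers fabric plane all statistics
(wait about 30 seconds, then rerun and compare the UCE Packets column)
RP/0/RP0/CPU0:Router# show controllers fabric plane all statistics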
To identify the line card and fabric cards between which the errors are detected, use the show controllers npu stats link all instance all location all command.
RP/0/RP0/CPU0:Router# show controllers npu stats link all instance all location all
Fri Feb 2 07:38:04.699 UTC
Node ID: 0/0/CPU0
In Data Out Data CE UCE CRC
Frames frames Frames Frames Errors
-------------------------------------------------------------------------
0/0/0/0-1 0 0 0 0 0
0/0/0/2-3 0 0 0 0 0
0/0/0/4-5 0 0 0 0 0
0/0/0/6-7 0 0 0 0 89
0/0/0/8-9 0 0 0 0 0
Errors are detected on fabric links 6 and 7.
If the monitor dataplane-health command still displays packet loss even though link debugging detects no issues, collect all logs and contact Cisco TAC before draining and reloading the router.
Collect Logs of Fabric and Line Cards and Contact Cisco TAC
Run the following show commands to collect logs for the impacted LC and all FCs.
show tech fabric link-include
show tech ofa
show tech interface
show tech optics
In addition, run the following debugshell commands and collect logs for all NPUs on all LCs and all FEs on all FCs; a filled-in example follows this list.
show controllers npu debugshell <np_num/unit_num> "script print_get_counters" location 0/x/CPU0
show controllers npu debugshell <np_num> "script sf_oq_debug" location 0/x/CPU0
show controllers npu debugshell <np_num> "script sf_fabric_debug" location 0/x/CPU0
show controller npu stats counters-all instance all loc 0/x/CPU0
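For example, to collect counters for NPU 0 on the LC in slot 1 (both values are illustrative placeholders for <np_num> and 0/x/CPU0):

RP/0/RP0/CPU0:Router# show controllers npu debugshell 0 "script print_get_counters" location 0/1/CPU0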
Note
To collect logs for an FC card, obtain the np_num from the fap_id column of the show controllers npu driver loc 0/RP0/CPU0 command output (0/RP0/CPU0 is the location of the FC).
Note
Starting from Cisco IOS XR Release 24.4.1, the show controllers npu stats counters-all command is deprecated and will not be supported in future releases. Use the show controllers npu debug-utils get-counters instance <> loc <> command instead.