Online Diagnostics for NPU

Cisco 8000 Series Routers support the Online Diagnostics feature that enables you to run tests to verify the hardware functionality when connected to a live network. When a problem is detected, diagnostic test results help in isolating the location of the problem, enabling you to take appropriate measures to resolve the issue in less time.

Table 1. Feature History Table

Feature Name

Release

Description

Online diagnostics for NPU, NPU slices, and fabric cards

Release 24.4.1

Introduced in this release on: Fixed Systems (8200 [ASIC: P100], 8700 [ASIC: P100, K100])(select variants only*); Modular Systems (8800 [LC ASIC: P100])(select variants only*)

*This feature is supported on:

  • 8212-48FH-M

  • 8711-32FH-M

  • 8712-MOD-M

  • 88-LC1-36EH

  • 88-LC1-12TH24FH-E

  • 88-LC1-52Y8H-EM

Online diagnostics for NPU slices and fabric cards

Release 24.2.11

Introduced in this release on: Fixed Systems (8200); Centralized Systems (8600); Modular Systems (8800 [LC ASIC: Q100, Q200])

You can now use the online diagnostics functionality to test the health of fabric cards and all the slices in an NPU. This feature helps you detect fabric-level and slice-level failures.

Online diagnostics for NPU

Release 7.5.2/Release 7.3.5

You can now use the online diagnostic feature to verify if the router NPUs are operational. NPU failure logs are captured in the system log output.

You can also generate tech support information that is useful for Cisco Technical Support representatives when troubleshooting a router.

This feature introduces the following commands:

Online Diagnostics for NPU

The diagnostic tests check different hardware components in a system and verify the data paths and control signals. The online diagnostics tests use the CPU to send packets to the Network Processing Unit (NPU) through the Punt switch. If a failure is detected, an NP Datalog is automatically generated to help diagnose the problem.

The default interval for the NPU loopback test is one minute, and the default threshold is 3.

The following is a sample system log output:

LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: 8000_online_diag[123]: %DIAG-DIAG-3-GOLDXR_FAIL : SFNPULoopback: Online diagnostic packet drops detected on NPU 0, slice 0. Please collect "show tech-support online-diags". 
LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: 8000_online_diag[123]: %DIAG-DIAG-3-GOLDXR_FAIL : Use "show diagnostic result location <location> detail" to monitor online diagnostic results.

Online diagnostic tests are categorized based on how they are executed:

  • Dynamic diagnostics: These tests are enabled when the system boots and the system datapath is operational. While the system is in use and connected to a live network, they run in the background and are non-disruptive.

  • On-demand diagnostics: Tests that you run as needed using a diagnostic start command from the command-line interface (CLI). These tests are useful when a hardware fault is suspected.

You can use these diagnostic tests to determine the hardware status and troubleshoot hardware issues.
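
For example, if a failure syslog is raised for a line card (the location 0/6/CPU0 below is illustrative), you can review the diagnostic results and collect tech-support data with the commands that the syslog messages reference:

RP/0/RP0/CPU0:Router# show diagnostic result location 0/6/CPU0 detail
RP/0/RP0/CPU0:Router# show tech-support online-diags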

Online diagnostics for NPU slices

From Release 24.2.11, the online diagnostics functionality is extended to test all the slices in an NPU, enabling you to detect slice level failures. The default rate at which the test packets are transmitted to each NPU slice is increased to 60 packets per minute.

The default interval is one second for the per-slice test, and the default threshold is 3.

The following is a sample system log output:

LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: 8000_online_diag[271]: %DIAG-DIAG-3-GOLDXR_FAIL : SFNPUSlice: Online diagnostic packet drops detected on NPU 0, slice 0. Please collect "show tech-support online-diags".
LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: 8000_online_diag[271]: %DIAG-DIAG-3-GOLDXR_FAIL : Use "show diagnostic result location <location> detail" to monitor online diagnostic results.

Online diagnostics for fabric cards

From Release 24.2.11, you can also test the health of fabric cards (FC) using the online diagnostics functionality. The test packets are transmitted to fabric cards at the default rate of 50 packets per minute.

The default interval for the fabric test is 30 seconds, and the default threshold is 6.

The following is a sample system log output:

LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: npu_drvr[283]: %FABRIC-NPU_DRVR-4-FABRIC_ONLINE_DIAG_FAIL : Online diagnostic packet drops detected on NPU 0, slice 0. Please collect "show tech-support online-diags".  
LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: npu_drvr[283]: %FABRIC-NPU_DRVR-4-FABRIC_ONLINE_DIAG_FAIL : Use "show diagnostic result location <location> detail" to monitor online diagnostic results.

Data plane health check utility

Data plane health check utility is a monitoring tool that helps you determine the health of the data plane components, including:

  • fabric cards

  • NPUs

  • lookup engines

This utility can detect fabric memory corruption and packet loss that may happen due to broken internal links.

Table 2. Feature History Table

Feature Name

Release

Description

Lookup engine test in data plane health check utility

Release 25.4.1

Introduced in this release on: Fixed Systems (8200 [ASIC: Q100, Q200]); Centralized Systems (8600 [ASIC: Q200]); Modular Systems (8800 [LC ASIC: Q100, Q200])

Hardware defects can corrupt bits during route prefix lookups, causing packets to be mis-routed. This feature adds a new lookup engine test mode to the data plane health check utility to detect such mis-routed packets.

This test requires a router reload upon completion.

This feature introduces these changes:

CLI: Two new keywords, lookup and all-modules, are added to the monitor dataplane-health module command.

Monitor data plane health

Release 25.3.1

Introduced in this release on: Fixed Systems (8700 [ASIC: K100], 8010 [ASIC: A100])(select variants only*)

*This feature is now supported on:

  • 8011-4G24Y4H-I

  • 8712-MOD-M

Monitor data plane health

Release 24.4.1

Introduced in this release on: Fixed Systems (8200 [ASIC: P100], 8700 [ASIC: P100])(select variants only*); Modular Systems (8800 [LC ASIC: P100])(select variants only*)

*This feature is supported on:

  • 8212-48FH-M

  • 8711-32FH-M

  • 88-LC1-36EH

  • 88-LC1-12TH24FH-E

  • 88-LC1-52Y8H-EM

Monitor data plane health

Release 24.2.11

Introduced in this release on: Fixed Systems (8200); Modular Systems (8800 [LC ASIC: Q100, Q200])

You can now easily detect fabric memory corruption and packet loss by checking the health of data plane components including fabric and NPUs on a distributed system using our on-demand diagnostic utility.

This functionality introduces the following commands:

You can start the diagnosis using the CLI command monitor dataplane-health. The detailed error report helps you identify the faulty card.

Starting from IOS XR Release 25.4.1, the data plane health check utility introduces a lookup engine test mode to detect packets routed to wrong interfaces, using the monitor dataplane-health module lookup command. The test analyzes the packet statistics in your network and reports a hardware failure for dropped packets, indicating traffic loss caused by corruption in the route prefix lookup. A router reload is required after this test completes. To run the health check for all modules, that is, the fabric path, NPU path, and lookup engine, use the monitor dataplane-health module all-modules command.

You can also use the show dataplane-health status command to check the status of a data plane health test. It reports whether the test is still running or has completed, along with a summary of the results.


Note


Do not use the data plane health check utility on a router that carries live traffic, as this utility affects the system performance.


Use cases for data plane health check

You can use the data plane health check utility in the following scenarios:

  • Before router deployment – After installing the FC or LC on the router, you can run the utility to check for issues, and then proceed to router provisioning (see the sketch after this list).

  • After router deployment – If traffic loss is observed, but the packet drop analysis does not provide a root cause, then isolate the router and run the utility to check for issues.
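
In either scenario, a minimal check sketch using the commands documented in this chapter is:

RP/0/RP0/CPU0:Router# monitor dataplane-health
RP/0/RP0/CPU0:Router# show dataplane-health status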

Limitations for data plane health check

  • Avoid using the show controllers npu debugshell CLI command.

  • Avoid system reload (or LC reload) when the Data Plane Health Check utility is being executed.

  • The monitor dataplane-health module fabric command is supported only on distributed routers.

  • You must archive the report file before subsequent runs, as the file is overwritten each time the command is re-executed; see the copy sketch after this list.

    The test report is available at /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.

    The previous log file is archived as /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.bak.
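
For example, here is a minimal sketch of saving the archived report under a different name before the next run (the destination file name is illustrative):

RP/0/RP0/CPU0:Router# copy harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.bak harddisk:/dph_mon/dph_fabric_report_prev.txt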

Monitor and verify data plane health

This example shows how to execute the data plane health check utility for the fabric module.

RP/0/RP0/CPU0:Router# monitor dataplane-health 
Wed Feb 28 15:08:15.659 EST
RP/0/RP0/CPU0:Feb 28 15:08:15.687 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-STARTED : Dataplane health monitoring started. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result. 
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
Previous log file is archived as /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.bak
Please save the archive with a different file name as needed
########################################################################################
Module:fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 1044 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
			OR
Use "show dataplane-health status" regularly to check for completion

RP/0/RP0/CPU0:Feb 28 15:08:15.687 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-STARTED : Dataplane health monitoring started. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result. 

RP/0/RP0/CPU0:Feb 28 15:28:54.965 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-COMPLETED : Dataplane health monitoring completed. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result. 

This example shows how to run the fabric, NPU, and lookup engine tests together to check the health of these modules. The command output includes a warning that the router must be reloaded after the test completes.


Warning


Run the test on a router in a non-production environment as this test impacts the system performance.



RP/0/RP0/CPU0:Router# monitor dataplane-health module all-modules
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
THIS COMMAND INCLUDES THE TESTING OF THE LOOKUP ENGINE MODULE.
TESTS ON THE LOOKUP ENGINE REQUIRE A RELOAD AFTER TEST COMPLETION.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
Previous log file is archived as /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.bak
Please save the archive with a different file name as needed
########################################################################################
Module:fabric, lookup
Patterns used: 0xf0,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 492 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
            OR
Use "show dataplane-health status" regularly to check for completion
RP/0/RP0/CPU0:Router#

Note


Make sure to reload the router once the test is completed.


This example shows how to run the lookup engine test. The command output includes a warning that the router must be reloaded after the test completes.


Warning


Run the test on a router in a non-production environment as this test impacts the system performance.


RP/0/RP0/CPU0:ios# monitor dataplane-health module lookup 
Fri Oct 24 13:09:20.906 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
THIS COMMAND INCLUDES THE TESTING OF THE LOOKUP ENGINE MODULE.
TESTS ON THE LOOKUP ENGINE REQUIRE A RELOAD AFTER TEST COMPLETION.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_lookup_mode_report.txt
Previous log file is archived as /harddisk:/dph_mon/dataplane_health_lookup_mode_report.txt.bak
Please save the archive with a different file name as needed
########################################################################################
Module:lookup
Best effort time for test completion: 32 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
            OR
Use "show dataplane-health status" regularly to check for completion

Note


Make sure to reload the router once the test is completed.


These examples show how to check the status of a data plane health test:

Example 1: Output with data plane health check in progress

RP/0/RP0/CPU0:Router# show dataplane-health status 
Mon Jan 29 22:39:48.336 UTC
Dataplane health monitoring in progress..

Example 2: Output with successful data plane health check

RP/0/RP0/CPU0:Router# show dataplane-health status 
Mon Jan 29 23:10:21.564 UTC
Dataplane health monitoring completed
Summary of results (Module: fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   0     0         0     2526099           0           0           0
                   1     2526856           0           0           0
                   2     2526529           0           0           0
         1         0     2526590           0           0           0
                   1     2526918           0           0           0
                   2     2526421           0           0           0
         2         0     2526665           0           0           0
                   1     2525818           0           0           0
                   2     2526286           0           0           0
-------------------------------------------------------------------------------
   1     0         0     2526754           0           0           0
                   1     2526328           0           0           0
                   2     2526695           0           0           0
         1         0     2525892           0           0           0
                   1     2526988           0           0           0
                   2     2526215           0           0           0
**********************************************************************************
DATAPATH CHECK IS CLEAN (mode: fabric).
**********************************************************************************

Example 3: Output with failures

RP/0/RP0/CPU0:Router# show dataplane-health status 
Wed Jan 10 21:37:01.601 UTC
Dataplane health monitoring completed
Summary of results (Module: fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   1     0         0     2526253           0           0           0
                   1     2527136           0           0           0
                   2     2526235           0           0           0
         1         0     2527166           0           0           0
                   1     2527217           0           0           0
                   2     2526424           0           0           0
-------------------------------------------------------------------------------
   2     0         0     2526733           0           0           0
                   1     2526948           0           0           0
                   2     2526554           0           0           0
         1         0     2526294           0           0           0
                   1     2526220           0           0           0
                   2     2526085           0           0           0
-------------------------------------------------------------------------------
   3     0         0     2525876           0           0           0
                   1     2526642           0           0           0
                   2     2525957           0           0           0
         1         0     2526491           0           0           0
                   1     2526263           0           0           0
                   2     2526200           0           0           0
         2         0     2526804           0           0           0
                   1     2526135           0           0           0
                   2     2526328           0           0           0
-------------------------------------------------------------------------------
   4     0         0      493934           0       11501           0
                   1           0           0           0           0
                   2           0           0           0           0
         1         0      493605           0       11591           0
                   1           0           0           0           0
                   2           0           0           0           0
-------------------------------------------------------------------------------
   5     0         0      505389           0          30           0
                   1           0           0           0           0
                   2           0           0           0           0
         1         0      505358           0          23           0
                   1           0           0           0           0
                   2           0           0           0           0
-------------------------------------------------------------------------------
   6     0         0     2526307           0           0           0
                   1     2525905           0           0           0
                   2     2526142           0           0           0
         1         0     2526755           0           0           0
                   1     2526603           0           0           0
                   2     2526607           0           0           0
**********************************************************************************
Corruption detected:(LC4/0 <-> FC2/0) (LC4/1 <-> FC2/0) (LC5/0 <-> FC3/0) (LC5/1 <-> FC3/0)
 
**********************************************************************************
FAILURES DETECTED IN DATAPATH for fabric mode.
Please run "monitor dataplane-health module no-fabric"
Please check /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
**********************************************************************************

Example 4: Output for health check with all-modules option

RP/0/RP0/CPU0:Router# show dataplane-health status 
Dataplane health monitoring completed

Datapath test on all requested LC/NPU/slice completed
Summary of results (Module: fabric/lookup):
####################################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
LE-FAULTS: Lookup Engine test results, 0 for no failures or bit mask of faulty 
           engines
###################################################################################
-----------------------------------------------------------------------------------
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR      LE-FAULTS
-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------
 LC4     0         0     1663101           0           0           0        0x0000
                   1     1661542           0           0           0        0x0200  
                   2     1662258           0           0           0        0x0000
         1         0     1667547           0           0           0        0x0000
                   1     1661658           0           0           0        0x0000
                   2     1661153           0           0           0        0x0000
         2         0     1661087           0           0           0        0x0000
                   1     1661474           0           0           0        0x0000
                   2     1660756           0           0           0        0x0000
         3         0     1661936           0           0           0        0x0000
                   1     1661197           0           0           0        0x0000
                   2     1661644           0           0           0        0x0000
****************************************************************************#******
DATAPATH CHECK IS CLEAN, LOOKUP ENGINE ERROR(s) DETECTED (mode: fabric/lookup).
***********************************************************************************

Example 5: Output for health check with lookup option

RP/0/RP0/CPU0:Router# show dataplane-health status 
Dataplane health monitoring completed

Datapath test on all requested LC/NPU/slice completed
Summary of results (Module: lookup):
###################################################################################
Output summary legend:
LE-FAULTS: Lookup Engine test results, 0 for no failure or bit mask of faulty lookup
           engines
##################################################################################
----------------------------------------------------------------------------------
  LC    NP     Slice    LE-FAULTS
---------------------------------
---------------------------------
   4     0         0       0x0000
                   1       0x0200
                   2       0x0000
         1         0       0x0000
                   1       0x0000
                   2       0x0000
         2         0       0x0000
                   1       0x0000
                   2       0x0000
         3         0       0x0000
                   1       0x0000
                   2       0x0000

****************************************************************************
LOOKUP ENGINE ERROR(s) DETECTED (mode: lookup).
****************************************************************************

Note


If the data plane health monitor does not report any issues, it indicates that the system is functioning correctly, and there is no need to proceed with further data plane health verification.


If any failure is detected during the data plane health check, you must proceed with additional verification.

Additional Troubleshooting

The following sample output illustrates failures detected in the datapath.

RP/0/RP0/CPU0:Router# monitor dataplane-health   

Fri Sep 29 12:53:51.595 UTC 

THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED. 

Details of the test results are logged in harddisk:/dataplane_health_detail_report.txt 

Estimated time for completion: 783 seconds 

Ensure that the terminal/vty session timeout is greater than 783 seconds 

Testing in progress (suggest not to break the tests) 

................................................................................................................................. 

Processing further to find failed fabric elements. This will take more time... 

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 

Datapath test on all requested LC/NPU/slice completed 

Summary of results: 

############################################################################ 

Output summary legend: 

ERROR: Tests were not run for this slice due to some errors 

GOOD: Tests were successful for this slice 

LOSS: Packet loss was observed for this slice 

CORRUPT: Packet corruption was observed for this slice 

############################################################################ 

------------------------------------------------------------------------------- 

  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR 

------------------------------------------------------------------------------- 

------------------------------------------------------------------------------- 

   1     0         0      476214           0         637           0 

                   1           0           0           0           0 

                   2           0           0           0           0 

         1         0      475860           0         670           0 

                   1           0           0           0           0 

                   2           0           0           0           0 

------------------------------------------------------------------------------- 

   2     0         0     2383553           0           0           0 

                   1     2383747           0           0           0 

                   2     2383616           0           0           0 

         1         0     2383280           0           0           0 

                   1     2383737           0           0           0 

                   2     2383343           0           0           0 

         2         0     2383937           0           0           0 

                   1     2383913           0           0           0 

                   2     2384017           0           0           0 

********************************************************************************** 

Corruption detected:(LC1/0 <-> FC7/0) (LC1/1 <-> FC7/0) ********************************************************************************** 

FAILURES DETECTED IN DATAPATH. 

Please run "monitor dataplane-health module no-fabric" 

Please check harddisk:/dataplane_health_detail_report.txt 

********************************************************************************** 
RP/0/RP0/CPU0:Router# monitor dataplane-health 
Fri Feb 16 08:50:58.115 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
########################################################################################
Module:fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 522 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
                        OR
Use "show dataplane-health status" regularly to check for completion
RP/0/RP0/CPU0:Router# show dataplane-health status 
Fri Feb 16 09:04:12.156 UTC
Dataplane health monitoring completed
Summary of results (Module: fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   0     0         0          85       10977           0           0
                   1           0           0           0           0
                   2           0           0           0           0
         1         0        1735        8865           0           0
                   1           0           0           0           0
                   2           0           0           0           0
**********************************************************************************


**********************************************************************************
FAILURES DETECTED IN DATAPATH (mode: fabric).
Please run "monitor dataplane-health module no-fabric" to check if the issue is on the LCs or FCs
Please check /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
**********************************************************************************

The sample output indicates that failures are detected in the datapath. To further isolate the issue, verify whether the packet corruption is caused by a fabric card or a line card. Run the following command to perform the health check on the line cards, excluding the fabric cards.

RP/0/RP0/CPU0:Router# monitor dataplane-health module no-fabric   

Fri Sep 29 14:09:28.506 UTC 

THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED. 

Details of the test results are logged in harddisk:/dataplane_health_detail_report.txt 

Estimated time for completion: 783 seconds 

Ensure that the terminal/vty session timeout is greater than 783 seconds 

Testing in progress (suggest not to break the tests) 

..................................................................................................................................................................................................................... 

Datapath test on all requested LC/NPU/slice completed 

Summary of results: 

############################################################################ 

Output summary legend: 

ERROR: Tests were not run for this slice due to some errors 

GOOD: Tests were successful for this slice 

LOSS: Packet loss was observed for this slice 

CORRUPT: Packet corruption was observed for this slice 

############################################################################ 

------------------------------------------------------------------------------- 

  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR 

------------------------------------------------------------------------------- 

------------------------------------------------------------------------------- 

   1     0         0     2383412           0           0           0 

                   1     2383031           0           0           0 

                   2     2383484           0           0           0 

         1         0     2383883           0           0           0 

                   1     2383973           0           0           0 

                   2     2383349           0           0           0 

------------------------------------------------------------------------------- 

   2     0         0     2383160           0           0           0 

                   1     2384196           0           0           0 

                   2     2383879           0           0           0 

         1         0     2383135           0           0           0 

                   1     2383196           0           0           0 

                   2     2383668           0           0           0 

         2         0     2383414           0           0           0 

                   1     2384360           0           0           0 

                   2     2383732           0           0           0 

------------------------------------------------------------------------------- 

   6     0         0     2383933           0           0           0 

                   1     2384205           0           0           0 

                   2     2383746           0           0           0 

         1         0     2383215           0           0           0 

                   1     2383578           0           0           0 

                   2     2382921           0           0           0 

********************************************************************************** 

 DATAPATH CHECK IS CLEAN. 

**********************************************************************************
RP/0/RP0/CPU0:Router# monitor dataplane-health module no-fabric 
Fri Feb 16 09:08:39.412 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_no_fabric_mode_report.txt
########################################################################################
Module:no-fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 522 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
                        OR
Use "show dataplane-health status" regularly to check for completion
RP/0/RP0/CPU0:Router# show dataplane-health status 
Fri Feb 16 09:11:01.540 UTC
Dataplane health monitoring completed
Summary of results (Module: no-fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   0     0         0        4836           0           0           0
                   1        4908           0           0           0
                   2        5634           0           0           0
         1         0        4196           0           0           0
                   1        7638           0           0           0
                   2        7362           0           0           0
**********************************************************************************
DATAPATH CHECK IS CLEAN..

**********************************************************************************

The above sample output indicates that there are no errors or corruption on the line cards, and hence the fabric card must be faulty.

If there is any corruption detected after running the monitor dataplane-health module no-fabric command, then contact the Cisco Technical Assistance Center (TAC).

If packet loss is detected after running the monitor dataplane-health module command, perform the following steps for further verification:

  • Run the following command:

    RP/0/RP0/CPU0:Router# show controllers npu driver location all 
    
    Fri Sep 29 13:11:16.738 EDT 
    
    ============================================== 
    
    NPU Driver Information 
    
    ============================================== 
    
    Driver Version: 1 
    
    SDK Version: 1.55.0.41 
    
     
    
    Functional role: Active,     Rack: 8808, Type: lcc, Node: 0/5/CPU0 
    
    Driver ready      : Yes 
    
    NPU first started : Fri Sep 29 08:11:58 2023 
    
    Fabric Mode: FABRIC/8FC 
    
    NPU Power profile: Medium 
    
    Driver Scope: Node 
    
    Respawn count     : 1 
    
    Availablity masks : 
    
            card: 0x1,     asic: 0x7,    exp asic: 0x7 
    
    Weight distribution: 
    
            Unicast: 80,      Multicast: 20 
    
    +----------------------------------------------------------------+ 
    
    | Process | Connection | Registration | Connection | DLL         | 
    
    | /Lib    | status     | status       | requests   | registration| 
    
    +----------------------------------------------------------------+ 
    
    | FSDB    | Active     | Active       |           1|  n/a        | 
    
    | FGID    | Inactive   | Inactive     |           0|  n/a        | 
    
    | AEL     | n/a        | n/a          |         n/a|  Yes        | 
    
    | SM      | n/a        | n/a          |         n/a|  Yes        | 
    
    +----------------------------------------------------------------+ 
    
     
    
    Asics : 
    
    HP - HotPlug event, PON - Power On reset 
    
    HR - Hard Reset,    WB  - Warm Boot 
    
    +------------------------------------------------------------------------------+ 
    
    | Asic inst. | fap|HP|Slice|Asic|Admin|Oper | Asic state | Last |PON|HR |  FW  | 
    
    |  (R/S/A)   | id |  |state|type|state|state|            | init |(#)|(#)|  Rev | 
    
    +------------------------------------------------------------------------------+ 
    
    | 0/5/0      |  20| 1| UP  |npu | UP  | UP  |NRML        |HPON   |  1|  0|0x0000| 
    
    | 0/5/1      |  21| 1| UP  |npu | UP  | UP  |NRML        |PON   |  1|  0|0x0000| 
    
    | 0/5/2      |  22| 1| UP  |npu | UP  | UP  |NRML        |PON   |  1|  0|0x0000| 
    
    +------------------------------------------------------------------------------+ 
    ......
    .......
     
     

    The HPON flag indicates an error. Collect logs and contact Cisco TAC.

Troubleshooting Flowchart

If an HPON error is detected, follow the steps provided in the following troubleshooting flowchart:

Figure 1. Troubleshooting Flowchart

Link Debugging

Identify Reachability Issues

If the datapath monitoring tool reports packet loss, the status of all ASICs is normal, and you have not performed hard resets, perform the following troubleshooting steps:

  • To identify issues in the connection between the line card and fabric cards, check the fabric links using the show controllers fabric fsdb-pla rack 0 command:

    RP/0/RP0/CPU0:Router# show controllers fabric fsdb-pla rack 0 
    Fri Feb  2 05:59:07.624 UTC
    Description:
       planes       : p0-p7
       plane mask   : Asic #0-3
       Asic value  1: destination reachable via asic
                   .: destination unreachable via asic
                   x: asic not connected to LC (for S3)
                   -: plane not configured (for S2) or asic missing
    Rack: 0, Stage: s123
    =============================
    Destination   p0     p1     p2     p3     p4     p5     p6     p7     Reach-mask  Oper Up
    Address       mask   mask   mask   mask   mask   mask   mask   mask   links/asic  links/asic
    Fapid(R/S/A)  0123   0123   0123   0123   0123   0123   0123   0123   Mn/Mx Total Mn/Mx  Total
    ----------------------------------------------------------------------------------------------
    0(0/0/0)      11     11     --     --     --     --     --     --      6/6    22  12/12     48
    1(0/1/0)      11     11     --     --     --     11     --     --      4/6    22  10/12     48
    4(0/7/0)      ..     ..     --     ..     --     ..     --     ..      0/0    0   12/12     48
    

    The above sample output indicates that the line cards in slots 0 and 1 do not have any connectivity issues with the fabric cards. However, the ".." state for the line card in slot 7 indicates that the links between that line card and fabric card pair are down.

    In such situations, perform the following troubleshooting steps:

    1. Remove the line card from the particular slot (for example, slot 7) on the front panel.

    2. Check the particular backplane connector (for example, connector 4) for any bent pins.

    3. Check the FC connector (for example, FC3 connector 5) for any damage.

      • If there are no bent pins or FC connector damage, reinsert the line card and run the show controllers fabric fsdb-pla rack 0 command again. Check whether the status displays all "11" and not "..".

      • If you notice bent pins on the line card connector, capture it in a photo, and open a Cisco Support (TAC) case for Return Material Authorization (RMA).

      • If you notice any damage on the FC connector, capture it in a photo, and open a Cisco Support (TAC) case for RMA.

  • If some links are down even after the visual inspection of fabric links, collect logs and contact Cisco TAC.


    Note


    In some scenarios, the connectivity between line card ASICs and fabric card ASICs can be UP, but a few links between the cards can still be down. So, it is essential to verify the link status.


  • To check the status of links between the line card and fabric cards, examine the Reach-mask links/asic min/max column in the output of the show controllers fabric fsdb-pla rack 0 command. In the previous sample output, you can infer that for LC1, two of the six links are down due to link connectivity issues.

    If some links are down even after the visual inspection of fabric links, collect logs and contact Cisco TAC.


    Note



    If the min and max values in the output under links/asic min/max are not equal, some of the links between the LC ASIC and FC ASIC are down.


  • Run the following command to get more details about the links status:

    RP/0/RP0/CPU0:Router# show controllers npu link-info rx 0 255 topo instance all location all | ex EN/DN | ex NC 
    Fri Feb  2 06:59:18.003 UTC
    
    Node ID: 0/2/CPU0
    -----------------------------------------------------------------------------
    Link ID    Log  Link   Asic  EN/                    Far-End         Far-End
               Link Speed  Stage Oper                   Link (FSDB)     Link (HW)
                    (Gbps)       Status
    -----------------------------------------------------------------------------
    0/2/0/16      - 50.00 FIA    EN/DN  ............    0/FC2/7/157     NC
    0/2/0/17      - 50.00 FIA    EN/DN  ............    0/FC2/8/156     NC
    
    Node ID: 0/RP0/CPU0
    -----------------------------------------------------------------------------
    Link ID    Log  Link   Asic  EN/                    Far-End         Far-End
               Link Speed  Stage Oper                   Link (FSDB)     Link (HW)
                    (Gbps)       Status
    -----------------------------------------------------------------------------
    0/FC2/7/20      - 50.00 FIA    EN/DN  ............    0/2/0/178     NC
    0/FC2/8/33      - 50.00 FIA    EN/DN  ............    0/2/0/166     0/2/0/188
    
    

    In this example, links between LC2 and FC2 are down.

Identify UCE Drops of Fabric Plane

If all links are UP, then monitor for Uncorrectable Errors (UCE). A UCE indicates packet corruption between a line card and a fabric card.

  • To identify packet corruption, run the following command:

    
    RP/0/RP0/CPU0:Router# show controllers fabric plane all statistics 
    Fri Feb  2 07:21:13.447 UTC
    
        Flags: E-D  - Exceeded display width.
                      Check detail option.
    
                         In                   Out        CE         UCE        PE
    Plane             Packets               Packets   Packets     Packets   Packets
    --------------------------------------------------------------------------------
     0                835649707985         733726217071      0         0         0
     1                835649703742         733726213241      0         0         0
     2                835649542985         733763942830      0      43921        0
     

    Note



    You may notice a few errors initially at card insertion. So, run this command multiple times at 30-second intervals. If the UCE packets column count keeps increasing, there are links on that plane (plane number = FC slot number) with UCE errors.


    The sample output indicates that there are Uncorrectable Errors (UCE) on the FE ASICs of the fabric card in slot 2 (FC2).

  • To identify the line card and fabric cards between which the errors are detected, use the show controllers npu stats link all instance all location all command.

    RP/0/RP0/CPU0:Router# show controllers npu stats link all instance all location all 
    Fri Feb  2 07:38:04.699 UTC
    
    Node ID: 0/0/CPU0
    
                           In Data      Out Data        CE       UCE      CRC
                            Frames        frames    Frames    Frames   Errors
    -------------------------------------------------------------------------
    0/0/0/0-1                        0            0        0        0        0
    0/0/0/2-3                        0            0        0        0        0
    0/0/0/4-5                        0            0        0        0        0
    0/0/0/6-7                        0            0        0        0        89
    0/0/0/8-9                        0            0        0        0        0
    

    Errors are detected on fabric links 6 and 7.

  • If the monitor dataplane-health command still displays packet loss despite no issues being detected after link debugging, collect all logs and contact Cisco TAC before draining and reloading the router.

Collect Logs of Fabric and Line Cards and Contact Cisco TAC

Run the following show commands to collect logs for the impacted LC and all FCs.

  • show tech fabric link-include

  • show tech ofa

  • show tech interface

  • show tech optics

In addition, run the following debugshell commands and collect logs for all NPUs on all LCs and all FEs on all FCs.

show controllers npu debugshell <np_num/unit_num> "script print_get_counters" location 0/x/CPU0  
show controllers npu debugshell <np_num> "script sf_oq_debug" location 0/x/CPU0  
show controllers npu debugshell <np_num> "script sf_fabric_debug" location 0/x/CPU0  
show controller npu stats counters-all instance all loc 0/x/CPU0 

Note



To collect logs for an FC card, obtain the np_num from the fap_id column of the show controllers npu driver loc 0/RP0/CPU0 command output (0/RP0/CPU0 is the location of the FC).
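
For example, here is a minimal sketch assuming the fap_id column for the FC at 0/RP0/CPU0 reports 100 (an illustrative value):

show controllers npu driver loc 0/RP0/CPU0
show controllers npu debugshell 100 "script sf_fabric_debug" location 0/RP0/CPU0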




Packet Tracer

Packet Tracer is a debugging framework that is used to trace custom flows through the router.

Table 3. Feature History Table

Feature Name

Release

Description

Packet Tracer

Release 25.4.1

Introduced in this release on: Fixed Systems (8010 [ASIC: A100])

*This feature is supported on:

  • 8011-12G12X4Y-A

  • 8011-12G12X4Y-D

Packet Tracer

Release 25.3.1

Introduced in this release on: Fixed Systems (8200 [ASIC: P100], 8700 [ASIC: P100]); Modular Systems (8800 [LC ASIC: P100])

We now extend Packet Tracer support to P100-based line cards and fixed chassis.

*This feature is now supported on:

  • 88-LC1-36EH

  • 88-LC1-12TH24FH-E

  • 88-LC1-52Y8H-EM

  • 8212-48FH-M

  • 8711-32FH-M

Packet Tracer

Release 25.1.1

Introduced in this release on: Fixed Systems (8700 [ASIC: K100], 8010 [ASIC: A100])

Packet Tracer is a framework that enables you to trace custom flows through the router for service validation or troubleshooting.

This feature utilizes the existing XR packet tracing infrastructure to facilitate debugging of packet flows at the ASIC and hardware levels.

CLI:

This feature introduces the following commands.

How packet tracer works

Packet tracing is accomplished using the packet tracer framework, which facilitates the tracking and analysis of packet flow. The framework's user interface is accessed through packet tracer commands.

When packet tracing is enabled on an interface, the Network Processor (NP) evaluates incoming packets to see if they meet specified conditions. If a packet meets the defined conditions, a flag is set in its internal header. This flag enables the tracing of the packet as it moves through all elements in the data path within the router. In Cisco IOS XR Release 25.1.1, the packet tracer captures a maximum of one packet on ingress and one packet on egress. To capture the next packet, you need to stop the current tracing session and start a new one.
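
The following is a minimal illustrative session that strings together the commands described later in this section; the condition values are placeholders:

router# clear packet-trace conditions all
router# packet-trace condition interface all
router# packet-trace condition 1 offset 0 value 0xaabbcc mask 0xffffff
router# packet-trace start
router# show packet-trace status
router# show packet-trace result
router# packet-trace stop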

The figure below illustrates the configuration flow for Packet Tracer.

Figure 2. Packet Tracer configuration flow

Guidelines for configuring packet tracer

Packet tracer conditions

Packet tracer conditions consist of the following two entities:

  • Physical interface(s) on which the packets are expected to be received.

    For example:

    packet-trace condition interface hu0/5/0/6

    Cisco 8000 Series Routers do not support per-interface filters. Instead, you must use:

    packet-trace condition interface all

    Note


    When defining conditions, only physical or bundle interfaces are allowed. If you are tracing on sub-interfaces, you must consider the dot1q or Q-in-Q encapsulation when specifying the Offset/Value/Mask definition.


  • Offset/Value/Mask triplets that define a flow of interest.

    Defining a flow as a set of Offset/Value/Mask triplets allows the packet tracer framework to remain entirely protocol-agnostic. These triplets can represent any segment of any header within the protocol stack. You can configure conditions at any offset within the first 128 bytes of a packet, using a value and a mask of up to 4 bytes each. An illustrative condition example follows the note below.


    Note


    Offset 0 is the start of the Ethernet frame.
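
    For example, here is a minimal sketch of a condition that matches a specific IPv4 destination address, assuming an untagged Ethernet frame (14-byte Ethernet header, so the IPv4 destination address begins at offset 30); the address 192.0.2.1 (0xc0000201) is illustrative:

    packet-trace condition interface all
    packet-trace condition 1 offset 30 value 0xc0000201 mask 0xffffffff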


Configure packet tracer

Before you begin

To configure packet tracer and start tracing, you must first define the conditions. Use the Condition Generator Web User Interface tool (https://github.com/xr-packet-tracer) or any other tool of your choice, to define conditions.

For more information about defining these conditions and using the condition generator tool, see https://xrdocs.io/asr9k//tutorials/xr-embedded-packet-tracer/.

Procedure


Step 1

Clear packet tracing.

Use the clear packet-trace conditions all command to clear all buffered packet-trace conditions.

router# clear packet-trace conditions all

Step 2

Define conditions as per your network structure.

Use the Condition Generator Web User Interface tool (https://github.com/xr-packet-tracer) or any other tool of your choice, to define conditions.

Example:

Below are a few examples of conditions that have been generated.
condition interface all
condition 1 offset 0 value 0xaabbcc mask 0xffffff
condition 2 offset 3 value 0xdd mask 0xff

Note

 
  • You must also use the interface all condition along with the defined conditions, as this feature is enabled on all NPUs rather than on a specific interface.

Step 3

Configure the defined conditions.

Example:

Below is an example of configuring packet trace for the defined conditions.
router# packet-trace condition interface all
router# packet-trace condition 1 offset 0 value 0xaabbcc mask 0xffffff
router# packet-trace condition 2 offset 3 value 0xdd mask 0xff

Step 4

Check the status of tracing.

Verify that the conditions are applied and configured correctly.

Example:

Use the show packet-trace status command to check the status of tracing.
router# show packet-trace status
Tue Sep 11 18:29:07.249 UTC
------------------------------------------------------------
Packet Trace Master Process:
Buffered Conditions:
1 offset 0  value 0xaabbcc mask 0xffffff
2 offset 3  value 0xdd     mask 0xff
Status: Inactive 

Note

 

To view more detailed status, such as processes that are registered with the packet tracer framework on every card in the router, use the show packet-trace status detail command.

The output displays the status as either Active or Inactive. The status will remain Inactive until tracing starts.

Step 5

Start packet tracing.

Use the packet-trace start command to start packet tracing.

router# packet-trace start

Step 6

Check the status of tracing.

Example:

Use the show packet-trace status command to check the status of tracing.
router# show packet-trace status

The status should now display as Active.

Tue Sep 11 18:32:08.250 UTC
------------------------------------------------------------
Packet Trace Master Process:
Buffered Conditions:
1 offset 0 value 0xaabbcc mask 0xffffff
2 offset 3 value 0xdd mask 0xff
Status: Active

Step 7

View the packet tracing results.

Example:

Use the show packet-trace result command to view the results of packet tracing.
router# show packet-trace result

The following output is displayed.

Tue Sep 11 18:26:37.595 UTC
T: D - Drop counter; P - Pass counter
Location     | Source       | Counter              | T | Last-Attribute         | Count
------------   ------------   ----------------- ------------------------     ------------
0/RP0/CPU0     NP0            INGRESS                P                             1
0/RP0/CPU0     NP0            EGRESS                 P                             1
The count value of 1 indicates that the incoming packet has matched the configured filter in both the ingress and egress pipelines.

Step 8

(Optional) View packet trace counters and their descriptions.

Use the show packet-trace descriptions command to view all counters registered with the packet tracer framework along with their descriptions.

router# show packet-trace descriptions

Step 9

(Optional) View decoded packet trace data.

Use the show packet-trace decode command to view decoded packet trace information.

Example:

To view raw binary packet trace data, use the show packet-trace decode raw command.
router# show packet-trace decode raw instance 0 location 0/RP0/CPU0

The following output is displayed.

Tue Sep 11 18:29:23.717 UTC
Decoding packet trace results for NPU id 0 on node id 0x2000
=== Trace decode results for NP0 ===
{
"packet_traces": [
{
"slice-id": 0,
"termination-input": [
{
"raw": "00aabbccddeeff01940000020108004500002e00000000403d4e7fc800000a6400000a000102030405060708090a0b0c0d0e0f10111213141516171819000000000194"
}
],
"termination-output": [
{
"raw": "00aabbccddeeff01940000020108004500002e00000000403d4e7fc800000a6400000a000102030405060708090a0b0c0d0e0f10111213141516171819000000000194"
}
],
"raw": "00000061e0013a0987fd80000028077fffff7fffff7fffff0085c89ffff4000000"
}
],
"ifg1_tm-pd-output": [
{
"raw": "00001d8fd6ff6c100000001cff8000677c7b5e5bfc1c6bd7000024000000000000"
}
],
"ifg0_rxpp-output-npu-header": [
{
"raw": "00001d8fd6ff6c100000001cff8000677c7b5e5bfc1c6bd7000024000000000000"
}
],
"ifg1_rxpp-output-npu-header": [
{
"raw": "00c1b76c4f3dfff77fffb67e5000000000000000000000000000000000000000000000000000000000000000000201"
}
],
"rxpp-termination-macro-stack": [
{
"raw": "000000000000000000"
}
],

},
{
"slice-id": 1,
"termination-input": [
{
"raw": "0"
}
],
"termination-output": [
{
"raw": "0"
}
],

"ifg0_rxpp-output-npu-header": [
{
"raw": "0"
}
],
"rxppp-forwarding-macro-keys": [
{}
],
"rxppp-forwarding-macro-results": [
{}
],
"transmit-input": [
{
"raw": "0"
}
],
"transmit-pre-npe": [
{
"raw": "0"
}
],
"transmit-post-npe": [
{
"raw": "0"
}
],
"transmit-post-ene": [
{
"raw": "0"
}
],
"txpp-macro-stack": [
{
"raw": "0"
}
],
"txpp-npe-macro-keys": [
{}
],
"txpp-npe-macro-results": [
{}
]
}
]
}

Step 10

Stop packet tracing.

Use the packet-trace stop command to stop packet tracing.

router# packet-trace stop