Online Diagnostics for NPU

Cisco 8000 Series Routers support the Online Diagnostics feature, which enables you to run tests that verify hardware functionality while the router is connected to a live network. When a problem is detected, the diagnostic test results help you isolate the location of the problem so that you can resolve the issue faster.

Table 1. Feature History Table

Feature Name: Online diagnostics for NPU, NPU slices, and fabric cards

Release: Release 24.4.1

Description: Introduced in this release on: Fixed Systems (8200 [ASIC: P100], 8700 [ASIC: P100, K100]) (select variants only*); Modular Systems (8800 [LC ASIC: P100]) (select variants only*)

*This feature is supported on:

  • 8212-48FH-M

  • 8711-32FH-M

  • 8712-MOD-M

  • 88-LC1-36EH

  • 88-LC1-12TH24FH-E

  • 88-LC1-52Y8H-EM

Feature Name: Online diagnostics for NPU slices and fabric cards

Release: Release 24.2.11

Description: Introduced in this release on: Fixed Systems (8200); Centralized Systems (8600); Modular Systems (8800 [LC ASIC: Q100, Q200])

You can now use the online diagnostics functionality to test the health of fabric cards and all the slices in an NPU, helping you detect fabric- and slice-level failures.

Feature Name: Online diagnostics for NPU

Release: Release 7.5.2/Release 7.3.5

Description: You can now use the online diagnostics feature to verify that the router NPUs are operational. NPU failure logs are captured in the system log output.

You can also generate tech-support information that is useful to Cisco Technical Support representatives when troubleshooting a router.

This feature introduces new commands for running and monitoring online diagnostics.

Online Diagnostics for NPU

The diagnostic tests check different hardware components in a system and verify the data paths and control signals. The online diagnostics tests use the CPU to send packets to the Network Processing Unit (NPU) through the Punt switch. If a failure is detected, an NP Datalog is automatically generated to help diagnose the problem.

The default interval for the NPU loopback test is one minute, and the default threshold is 3.

The following is a sample system log output:

LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: 8000_online_diag[123]: %DIAG-DIAG-3-GOLDXR_FAIL : SFNPULoopback: Online diagnostic packet drops detected on NPU 0, slice 0. Please collect "show tech-support online-diags". 
LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: 8000_online_diag[123]: %DIAG-DIAG-3-GOLDXR_FAIL : Use "show diagnostic result location <location> detail" to monitor online diagnostic results.
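When such a failure is logged, you can inspect the results and collect the requested diagnostics from the CLI, as suggested in the system log message. The location used below is illustrative; substitute the location reported in the log:

RP/0/RP0/CPU0:Router# show diagnostic result location 0/6/CPU0 detail
RP/0/RP0/CPU0:Router# show tech-support online-diags

The first command displays the diagnostic results for the specified node; the second collects the tech-support information referenced in the failure message.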

Online diagnostic tests are categorized based on how they are executed:

Dynamic diagnostics: These tests are enabled when the system starts and the system datapath is operational. When the system is in use and connected to a live network, these tests run non-disruptively in the background.

On-demand diagnostics: These tests are run as needed using a diagnostic start command from the command-line interface (CLI). They are useful when a hardware fault is suspected.

You can use these diagnostic tests to determine hardware status and troubleshoot hardware issues.
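An on-demand test is started from the CLI with the diagnostic start command. The exact test names and IDs vary by platform and release; the location and test keyword below are illustrative:

RP/0/RP0/CPU0:Router# diagnostic start location 0/6/CPU0 test all

You can then review the outcome with the show diagnostic result location 0/6/CPU0 detail command.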

Online diagnostics for NPU slices

From Release 24.2.11, the online diagnostics functionality is extended to test all the slices in an NPU, enabling you to detect slice-level failures. Test packets are transmitted to each NPU slice at a default rate of 60 packets per minute.

The default interval is one second for the per-slice test, and the default threshold is 3.

The following is a sample system log output:

LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: 8000_online_diag[271]: %DIAG-DIAG-3-GOLDXR_FAIL : SFNPUSlice: Online diagnostic packet drops detected on NPU 0, slice 0. Please collect "show tech-support online-diags".
LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: 8000_online_diag[271]: %DIAG-DIAG-3-GOLDXR_FAIL : Use "show diagnostic result location <location> detail" to monitor online diagnostic results.

Online diagnostics for fabric cards

From Release 24.2.11, you can also test the health of fabric cards (FC) using the online diagnostics functionality. The test packets are transmitted to fabric cards at the default rate of 50 packets per minute.

The default interval for the fabric test is 30 seconds, and the default threshold is 6.

The following is a sample system log output:

LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: npu_drvr[283]: %FABRIC-NPU_DRVR-4-FABRIC_ONLINE_DIAG_FAIL : Online diagnostic packet drops detected on NPU 0, slice 0. Please collect "show tech-support online-diags".  
LC/0/6/CPU0:Oct 30 18:52:37.737 UTC: npu_drvr[283]: %FABRIC-NPU_DRVR-4-FABRIC_ONLINE_DIAG_FAIL : Use "show diagnostic result location <location> detail" to monitor online diagnostic results.

Data plane health check utility

The data plane health check utility is a monitoring tool that helps you determine the health of data plane components, including:

  • fabric cards

  • NPUs

This utility can detect fabric memory corruption and packet loss that may happen due to broken internal links.

Table 2. Feature History Table

Feature Name: Monitor data plane health

Release: Release 24.4.1

Description: Introduced in this release on: Fixed Systems (8200 [ASIC: P100], 8700 [ASIC: P100]) (select variants only*); Modular Systems (8800 [LC ASIC: P100]) (select variants only*)

*This feature is supported on:

  • 8212-48FH-M

  • 8711-32FH-M

  • 88-LC1-36EH

  • 88-LC1-12TH24FH-E

  • 88-LC1-52Y8H-EM

Feature Name: Monitor data plane health

Release: Release 24.2.11

Description: Introduced in this release on: Fixed Systems (8200); Modular Systems (8800 [LC ASIC: Q100, Q200])

You can now easily detect fabric memory corruption and packet loss by checking the health of data plane components, including the fabric and NPUs, on a distributed system using the on-demand diagnostic utility.

This functionality introduces the monitor dataplane-health and show dataplane-health status commands.

You can start the diagnosis using the monitor dataplane-health CLI command. The detailed error report helps you identify the faulty card.

You can also use the show dataplane-health status command to check the status of a data plane health test. It indicates whether the test is still running or has completed, along with a summary of the results.


Note


Do not use the data plane health check utility on a router that carries live traffic, as this utility affects the system performance.


Use cases for data plane health check

You can use the data plane health check utility in the following scenarios:

  • Before router deployment – After installing the FC or LC on the router, you can run the utility to check for issues, and then proceed to router provisioning.

  • After router deployment – If traffic loss is observed, but the packet drop analysis does not provide a root cause, then isolate the router and run the utility to check for issues.

Limitations for data plane health check

  • Avoid using the show controllers npu debugshell CLI command while the data plane health check utility is running.

  • Avoid system reload (or LC reload) when the Data Plane Health Check utility is being executed.

  • The monitor dataplane-health module fabric command is supported only on distributed routers.

  • You must archive the report file before subsequent runs, as this file is overwritten on re-execution of the command.

    The previous log file is archived as /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.bak.

    The test report is available at /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.
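Because the report file is overwritten on each run, you can archive it using the copy command before re-executing the test. The destination filename below is illustrative:

RP/0/RP0/CPU0:Router# copy harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt harddisk:/dph_mon/dph_fabric_report_archive.txt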

Monitor and verify data plane health

This example shows how to execute the data plane health check utility for fabric module.

RP/0/RP0/CPU0:Router# monitor dataplane-health 
Wed Feb 28 15:08:15.659 EST
RP/0/RP0/CPU0:Feb 28 15:08:15.687 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-STARTED : Dataplane health monitoring started. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result. 
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
Previous log file is archived as /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt.bak
Please save the archive with a different file name as needed
########################################################################################
Module:fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 1044 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
			OR
Use "show dataplane-health status" regularly to check for completion

RP/0/RP0/CPU0:Feb 28 15:08:15.687 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-STARTED : Dataplane health monitoring started. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result. 

RP/0/RP0/CPU0:Feb 28 15:28:54.965 EST: dph_mon_bg[337]: %PLATFORM-DPH_MONITOR-6-COMPLETED : Dataplane health monitoring completed. Please check harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt for details of the result. 


Warning


Run the test on a router in a non-production environment as this test impacts the system performance.


These examples show how to check the status of a data plane health test:

Example 1: Output with data plane health check in progress

RP/0/RP0/CPU0:Router# show dataplane-health status 
Mon Jan 29 22:39:48.336 UTC
Dataplane health monitoring in progress..

Example 2: Output with successful data plane health check

RP/0/RP0/CPU0:Router# show dataplane-health status 
Mon Jan 29 23:10:21.564 UTC
Dataplane health monitoring completed
Summary of results (Module: fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   0     0         0     2526099           0           0           0
                   1     2526856           0           0           0
                   2     2526529           0           0           0
         1         0     2526590           0           0           0
                   1     2526918           0           0           0
                   2     2526421           0           0           0
         2         0     2526665           0           0           0
                   1     2525818           0           0           0
                   2     2526286           0           0           0
-------------------------------------------------------------------------------
   1     0         0     2526754           0           0           0
                   1     2526328           0           0           0
                   2     2526695           0           0           0
         1         0     2525892           0           0           0
                   1     2526988           0           0           0
                   2     2526215           0           0           0
**********************************************************************************
DATAPATH CHECK IS CLEAN (mode: fabric).
**********************************************************************************

Example 3: Output with failures

RP/0/RP0/CPU0:Router# show dataplane-health status 
Wed Jan 10 21:37:01.601 UTC
Dataplane health monitoring completed
Summary of results (Module: fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   1     0         0     2526253           0           0           0
                   1     2527136           0           0           0
                   2     2526235           0           0           0
         1         0     2527166           0           0           0
                   1     2527217           0           0           0
                   2     2526424           0           0           0
-------------------------------------------------------------------------------
   2     0         0     2526733           0           0           0
                   1     2526948           0           0           0
                   2     2526554           0           0           0
         1         0     2526294           0           0           0
                   1     2526220           0           0           0
                   2     2526085           0           0           0
-------------------------------------------------------------------------------
   3     0         0     2525876           0           0           0
                   1     2526642           0           0           0
                   2     2525957           0           0           0
         1         0     2526491           0           0           0
                   1     2526263           0           0           0
                   2     2526200           0           0           0
         2         0     2526804           0           0           0
                   1     2526135           0           0           0
                   2     2526328           0           0           0
-------------------------------------------------------------------------------
   4     0         0      493934           0       11501           0
                   1           0           0           0           0
                   2           0           0           0           0
         1         0      493605           0       11591           0
                   1           0           0           0           0
                   2           0           0           0           0
-------------------------------------------------------------------------------
   5     0         0      505389           0          30           0
                   1           0           0           0           0
                   2           0           0           0           0
         1         0      505358           0          23           0
                   1           0           0           0           0
                   2           0           0           0           0
-------------------------------------------------------------------------------
   6     0         0     2526307           0           0           0
                   1     2525905           0           0           0
                   2     2526142           0           0           0
         1         0     2526755           0           0           0
                   1     2526603           0           0           0
                   2     2526607           0           0           0
**********************************************************************************
Corruption detected:(LC4/0 <-> FC2/0) (LC4/1 <-> FC2/0) (LC5/0 <-> FC3/0) (LC5/1 <-> FC3/0)
 
**********************************************************************************
FAILURES DETECTED IN DATAPATH for fabric mode.
Please run "monitor dataplane-health module no-fabric"
Please check /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
**********************************************************************************

Note


If the data plane health monitor does not report any issues, the system is functioning correctly, and there is no need to proceed with further data plane health verification.


If any failure is detected during the data plane health check, you must proceed with additional verification.

Additional Troubleshooting

The following sample output illustrates failures detected in the datapath.

RP/0/RP0/CPU0:Router# monitor dataplane-health 
Fri Sep 29 12:53:51.595 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Details of the test results are logged in harddisk:/dataplane_health_detail_report.txt
Estimated time for completion: 783 seconds
Ensure that the terminal/vty session timeout is greater than 783 seconds
Testing in progress (suggest not to break the tests)
.................................................................................................................................
Processing further to find failed fabric elements. This will take more time...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Datapath test on all requested LC/NPU/slice completed
Summary of results:
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
-------------------------------------------------------------------------------
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   1     0         0      476214           0         637           0
                   1           0           0           0           0
                   2           0           0           0           0
         1         0      475860           0         670           0
                   1           0           0           0           0
                   2           0           0           0           0
-------------------------------------------------------------------------------
   2     0         0     2383553           0           0           0
                   1     2383747           0           0           0
                   2     2383616           0           0           0
         1         0     2383280           0           0           0
                   1     2383737           0           0           0
                   2     2383343           0           0           0
         2         0     2383937           0           0           0
                   1     2383913           0           0           0
                   2     2384017           0           0           0
**********************************************************************************
Corruption detected:(LC1/0 <-> FC7/0) (LC1/1 <-> FC7/0)
**********************************************************************************
FAILURES DETECTED IN DATAPATH.
Please run "monitor dataplane-health module no-fabric"
Please check harddisk:/dataplane_health_detail_report.txt
**********************************************************************************
RP/0/RP0/CPU0:Router# monitor dataplane-health 
Fri Feb 16 08:50:58.115 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
########################################################################################
Module:fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 522 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
                        OR
Use "show dataplane-health status" regularly to check for completion
RP/0/RP0/CPU0:Router# show dataplane-health status 
Fri Feb 16 09:04:12.156 UTC
Dataplane health monitoring completed
Summary of results (Module: fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   0     0         0          85       10977           0           0
                   1           0           0           0           0
                   2           0           0           0           0
         1         0        1735        8865           0           0
                   1           0           0           0           0
                   2           0           0           0           0
**********************************************************************************


**********************************************************************************
FAILURES DETECTED IN DATAPATH (mode: fabric).
Please run "monitor dataplane-health module no-fabric" to check if the issue is on the LCs or FCs
Please check /harddisk:/dph_mon/dataplane_health_fabric_mode_report.txt
**********************************************************************************

The sample output indicates that failures are detected in the datapath. To further isolate the issue, verify whether the packet corruption is caused by a fabric card or a line card. Execute the following command to run a health check on the line cards, excluding the fabric cards.

RP/0/RP0/CPU0:Router# monitor dataplane-health module no-fabric 
Fri Sep 29 14:09:28.506 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Details of the test results are logged in harddisk:/dataplane_health_detail_report.txt
Estimated time for completion: 783 seconds
Ensure that the terminal/vty session timeout is greater than 783 seconds
Testing in progress (suggest not to break the tests)
.....................................................................................................................................................................................................................
Datapath test on all requested LC/NPU/slice completed
Summary of results:
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
-------------------------------------------------------------------------------
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   1     0         0     2383412           0           0           0
                   1     2383031           0           0           0
                   2     2383484           0           0           0
         1         0     2383883           0           0           0
                   1     2383973           0           0           0
                   2     2383349           0           0           0
-------------------------------------------------------------------------------
   2     0         0     2383160           0           0           0
                   1     2384196           0           0           0
                   2     2383879           0           0           0
         1         0     2383135           0           0           0
                   1     2383196           0           0           0
                   2     2383668           0           0           0
         2         0     2383414           0           0           0
                   1     2384360           0           0           0
                   2     2383732           0           0           0
-------------------------------------------------------------------------------
   6     0         0     2383933           0           0           0
                   1     2384205           0           0           0
                   2     2383746           0           0           0
         1         0     2383215           0           0           0
                   1     2383578           0           0           0
                   2     2382921           0           0           0
**********************************************************************************
 DATAPATH CHECK IS CLEAN.
**********************************************************************************
RP/0/RP0/CPU0:Router# monitor dataplane-health module no-fabric 
Fri Feb 16 09:08:39.412 UTC
THIS COMMAND IMPACTS SYSTEM PERFORMANCE AND SHOULD IDEALLY BE RUN ON A ROUTER THAT IS ISOLATED.
Progressive details of the test logged in /harddisk:/dph_mon/dataplane_health_no_fabric_mode_report.txt
########################################################################################
Module:no-fabric
Patterns used: 0xf0,0x0f,0x00,0x55,0xff,
Duration per pattern: 10 seconds
Pause time between each slice/pattern test: 2 seconds
Best effort time for test completion: 522 seconds
Depending on the overall dataplane health state, it may take additional time to complete
########################################################################################
Dataplane health monitoring will run in the background
Wait for completion log message from the process "dph_mon_bg"
                        OR
Use "show dataplane-health status" regularly to check for completion
RP/0/RP0/CPU0:Router# show dataplane-health status 
Fri Feb 16 09:11:01.540 UTC
Dataplane health monitoring completed
Summary of results (Module: no-fabric):
############################################################################
Output summary legend:
ERROR: Tests were not run for this slice due to some errors
GOOD: Tests were successful for this slice
LOSS: Packet loss was observed for this slice
CORRUPT: Packet corruption was observed for this slice
############################################################################
  LC    NP     Slice        GOOD        LOSS     CORRUPT       ERROR
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
   0     0         0        4836           0           0           0
                   1        4908           0           0           0
                   2        5634           0           0           0
         1         0        4196           0           0           0
                   1        7638           0           0           0
                   2        7362           0           0           0
**********************************************************************************
DATAPATH CHECK IS CLEAN..

**********************************************************************************

The above sample output indicates that there are no errors or corruption on the line cards, and hence the fabric card must be faulty.

If there is any corruption detected after running the monitor dataplane-health module no-fabric command, then contact the Cisco Technical Assistance Center (TAC).

If packet loss is detected after running the monitor dataplane-health module command, perform the following steps for further verification:

  • Run the following command:

    RP/0/RP0/CPU0:Router# show controllers npu driver location all 
    Fri Sep 29 13:11:16.738 EDT
    ==============================================
    NPU Driver Information
    ==============================================
    Driver Version: 1
    SDK Version: 1.55.0.41

    Functional role: Active,     Rack: 8808, Type: lcc, Node: 0/5/CPU0
    Driver ready      : Yes
    NPU first started : Fri Sep 29 08:11:58 2023
    Fabric Mode: FABRIC/8FC
    NPU Power profile: Medium
    Driver Scope: Node
    Respawn count     : 1
    Availablity masks :
            card: 0x1,     asic: 0x7,    exp asic: 0x7
    Weight distribution:
            Unicast: 80,      Multicast: 20
    +----------------------------------------------------------------+
    | Process | Connection | Registration | Connection | DLL         |
    | /Lib    | status     | status       | requests   | registration|
    +----------------------------------------------------------------+
    | FSDB    | Active     | Active       |           1|  n/a        |
    | FGID    | Inactive   | Inactive     |           0|  n/a        |
    | AEL     | n/a        | n/a          |         n/a|  Yes        |
    | SM      | n/a        | n/a          |         n/a|  Yes        |
    +----------------------------------------------------------------+

    Asics :
    HP - HotPlug event, PON - Power On reset
    HR - Hard Reset,    WB  - Warm Boot
    +------------------------------------------------------------------------------+
    | Asic inst. | fap|HP|Slice|Asic|Admin|Oper | Asic state | Last |PON|HR |  FW  | 
    |  (R/S/A)   | id |  |state|type|state|state|            | init |(#)|(#)|  Rev | 
    +------------------------------------------------------------------------------+
    | 0/5/0      |  20| 1| UP  |npu | UP  | UP  |NRML        |HPON   |  1|  0|0x0000| 
    | 0/5/1      |  21| 1| UP  |npu | UP  | UP  |NRML        |PON    |  1|  0|0x0000| 
    | 0/5/2      |  22| 1| UP  |npu | UP  | UP  |NRML        |PON    |  1|  0|0x0000| 
    +------------------------------------------------------------------------------+
    ......
    .......

    The HPON flag indicates an error. Collect logs, and contact Cisco TAC.
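    To quickly scan a large driver output for the HPON flag, you can filter the command output using the standard IOS XR pipe modifier:

    RP/0/RP0/CPU0:Router# show controllers npu driver location all | include HPON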

Troubleshooting Flowchart

If an HPON error is detected, follow the steps in the following troubleshooting flowchart:

Figure 1. Troubleshooting Flowchart

Link Debugging

Identify Reachability Issues

If the datapath monitoring tool reports packet loss, the status of all ASICs is normal, and you have not performed any hard resets, perform the following troubleshooting steps:

  • To identify issues in the connection between the line card and fabric cards, check the fabric links using the show controllers fabric fsdb-pla rack 0 command:

    RP/0/RP0/CPU0:Router# show controllers fabric fsdb-pla rack 0 
    Fri Feb  2 05:59:07.624 UTC
    Description:
       planes       : p0-p7
       plane mask   : Asic #0-3
       Asic value  1: destination reachable via asic
                   .: destination unreachable via asic
                   x: asic not connected to LC (for S3)
                   -: plane not configured (for S2) or asic missing
    Rack: 0, Stage: s123
    =============================
    Destination   p0     p1     p2     p3     p4     p5     p6     p7     Reach-mask  Oper Up
    Address       mask   mask   mask   mask   mask   mask   mask   mask   links/asic  links/asic
    Fapid(R/S/A)  0123   0123   0123   0123   0123   0123   0123   0123   Mn/Mx Total Mn/Mx  Total
    ----------------------------------------------------------------------------------------------
    0(0/0/0)      11     11     --     --     --     --     --     --      6/6    22  12/12     48
    1(0/1/0)      11     11     --     --     --     11     --     --      4/6    22  10/12     48
    4(0/7/0)      ..     ..     --     ..     --     ..     --     ..      0/0    0   12/12     48
    

    This sample output indicates that the line cards in slots 0 and 1 do not have connectivity issues with the fabric cards. However, the ".." state for the line card in slot 7 indicates that links are down between that line card and fabric card pair.

    In such situations, perform the following troubleshooting steps:

    1. Remove the line card from the particular slot (for example, slot 7) on the front panel.

    2. Check the particular backplane connector (for example, connector 4) for any bent pins.

    3. Check the FC connector (for example, FC3 connector 5) for any damage.

      • If there are no bent pins or FC connector damage, reinsert the line card, and run the show controllers fabric fsdb-pla rack 0 command again. Check whether the status displays all "11" and not "..".

      • If you notice bent pins on the line card connector, capture it in a photo, and open a Cisco Support (TAC) case for Return Material Authorization (RMA).

      • If you notice any damage on the FC connector, capture it in a photo, and open a Cisco Support (TAC) case for RMA.

  • If some links are down even after the visual inspection of fabric links, collect logs and contact Cisco TAC.


    Note


    In some scenarios, the connectivity between line card ASICs and fabric card ASICs can be UP, but a few links between the cards can still be down. Therefore, it is essential to verify the link status.


  • To check the status of links between the line card and fabric cards, examine the Reach-mask links/asic Mn/Mx column in the output of the show controllers fabric fsdb-pla rack 0 command. In the previous sample output, you can infer that for LC1, two of the six links are down due to link connectivity issues.

    If some links are down even after the visual inspection of fabric links, collect logs and contact Cisco TAC.


    Note



    If the min and max values in the output under links/asic Mn/Mx are not equal, some links between the LC ASIC and FC ASIC are down.


  • Run the following command to get more details about the links status:

    RP/0/RP0/CPU0:Router# show controllers npu link-info rx 0 255 topo instance all location all | ex EN/DN | ex NC 
    Fri Feb  2 06:59:18.003 UTC
    
    Node ID: 0/2/CPU0
    -----------------------------------------------------------------------------
    Link ID    Log  Link   Asic  EN/                    Far-End         Far-End
               Link Speed  Stage Oper                   Link (FSDB)     Link (HW)
                    (Gbps)       Status
    -----------------------------------------------------------------------------
    0/2/0/16      - 50.00 FIA    EN/DN  ............    0/FC2/7/157     NC
    0/2/0/17      - 50.00 FIA    EN/DN  ............    0/FC2/8/156     NC
    
    Node ID: 0/RP0/CPU0
    -----------------------------------------------------------------------------
    Link ID    Log  Link   Asic  EN/                    Far-End         Far-End
               Link Speed  Stage Oper                   Link (FSDB)     Link (HW)
                    (Gbps)       Status
    -----------------------------------------------------------------------------
    0/FC2/7/20      - 50.00 FIA    EN/DN  ............    0/2/0/178     NC
    0/FC2/8/33      - 50.00 FIA    EN/DN  ............    0/2/0/166     0/2/0/188
    
    

    In this example, links between LC2 and FC2 are down.
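The reachability and link-count checks described above can be scripted. The following is a minimal sketch that parses fsdb-pla output of the shape shown earlier; the function name and parsing logic are illustrative assumptions, not part of the CLI:

```python
import re

def find_fabric_problems(fsdb_pla_output):
    """Flag destinations that are unreachable on some plane ('.' in a mask)
    or whose per-asic link count shows a min/max mismatch (links down)."""
    problems = []
    for line in fsdb_pla_output.splitlines():
        # Data rows look like: "4(0/7/0)  ..  ..  --  ..  ...  0/0  0  12/12  48"
        m = re.match(r"\s*\d+\((\d+/\d+/\d+)\)\s+(.*)", line)
        if not m:
            continue
        addr, fields = m.group(1), m.group(2).split()
        masks, counters = fields[:8], fields[8:]   # p0-p7 masks, then counters
        if any("." in mask for mask in masks):
            problems.append((addr, "destination unreachable on some planes"))
        if counters:
            mn, mx = counters[0].split("/")        # Reach-mask links/asic Mn/Mx
            if mn != mx:
                problems.append((addr, "links down: min/max link count mismatch"))
    return problems

sample = """\
0(0/0/0)  11  11  --  --  --  --  --  --  6/6   22  12/12  48
1(0/1/0)  11  11  --  --  --  11  --  --  4/6   22  10/12  48
4(0/7/0)  ..  ..  --  ..  --  ..  --  ..  0/0   0   12/12  48
"""
for addr, issue in find_fabric_problems(sample):
    print(addr, "->", issue)
```

Run against the sample rows above, the sketch flags 0/1/0 for a min/max mismatch and 0/7/0 as unreachable, matching the manual analysis of the output.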

Identify UCE Drops of Fabric Plane

If all links are UP, monitor the Uncorrectable Errors (UCE). A UCE indicates packet corruption between a line card and a fabric card.

  • To identify packet corruption, run the following command:

    
    RP/0/RP0/CPU0:Router# show controllers fabric plane all statistics 
    Fri Feb  2 07:21:13.447 UTC
    
        Flags: E-D  - Exceeded display width.
                      Check detail option.
    
                         In                   Out        CE         UCE        PE
    Plane             Packets               Packets   Packets     Packets   Packets
    --------------------------------------------------------------------------------
     0                835649707985         733726217071      0         0         0
     1                835649703742         733726213241      0         0         0
     2                835649542985         733763942830      0      43921        0
     

    Note



    You may notice a few errors initially at card insertion, so run this command multiple times with a 30-second interval. If the count in the UCE Packets column keeps increasing, there are links on that plane (plane number = FC slot number) with UCE errors.


    The sample output indicates that there are Uncorrectable Errors (UCE) on the FE ASICs of the fabric card in slot 2 (FC2).

  • To identify the line card and fabric cards between which the errors are detected, use the show controllers npu stats link all instance all location all command.

    RP/0/RP0/CPU0:Router# show controllers npu stats link all instance all location all 
    Fri Feb  2 07:38:04.699 UTC
    
    Node ID: 0/0/CPU0
    
                           In Data      Out Data        CE       UCE      CRC
                            Frames        frames    Frames    Frames   Errors
    -------------------------------------------------------------------------
    0/0/0/0-1                        0            0        0        0        0
    0/0/0/2-3                        0            0        0        0        0
    0/0/0/4-5                        0            0        0        0        0
    0/0/0/6-7                        0            0        0        0        89
    0/0/0/8-9                        0            0        0        0        0
    

    Errors are detected on fabric links 6 and 7 (the CRC Errors column shows a nonzero count).

  • If the monitor dataplane-health command still displays packet loss despite no issues being detected after link debugging, collect all logs and contact Cisco TAC before draining and reloading the router.
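The repeated sampling that the note recommends can be automated. Here is a minimal sketch, assuming the plane-statistics layout shown above; the parser and helper names are illustrative:

```python
import re

def parse_uce_counts(stats_output):
    """Map plane number -> UCE packet count from
    'show controllers fabric plane all statistics' style output."""
    counts = {}
    for line in stats_output.splitlines():
        # Data rows carry six counters: plane, In, Out, CE, UCE, PE packets.
        m = re.match(r"\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$", line)
        if m:
            counts[int(m.group(1))] = int(m.group(5))  # group 5 = UCE column
    return counts

def rising_uce(before, after):
    """Planes whose UCE count grew between two samples taken ~30 s apart."""
    return [p for p in after if after[p] > before.get(p, 0)]

first  = parse_uce_counts(" 2  835649542985  733763942830  0  43921  0")
second = parse_uce_counts(" 2  835649545120  733763945107  0  44103  0")
print(rising_uce(first, second))   # a growing count points at FC slot 2
```

A plane that appears in the result across several sampling intervals is accumulating UCEs, which is the condition the note describes; a one-time nonzero count at card insertion is not by itself a failure.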

Collect Logs of Fabric and Line Cards and Contact Cisco TAC

Run the following show commands to collect logs for the impacted LC and all FCs:

  • show tech fabric link-include

  • show tech ofa

  • show tech interface

  • show tech optics

In addition, run the following debugshell commands and collect logs for all NPUs on all LCs, and all FEs on all FCs.

show controllers npu debugshell <np_num/unit_num> "script print_get_counters" location 0/x/CPU0  
show controllers npu debugshell <np_num> "script sf_oq_debug" location 0/x/CPU0  
show controllers npu debugshell <np_num> "script sf_fabric_debug" location 0/x/CPU0  
show controllers npu stats counters-all instance all loc 0/x/CPU0 

Note



To collect logs for an FC card, obtain the np_num from the fap_id column of the show controllers npu driver loc 0/RP0/CPU0 command output (0/RP0/CPU0 is the location of the FC).
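Because these commands must be repeated for every NPU on every LC (and every FE on every FC), a small command generator can reduce typing errors. This sketch only builds the command strings; the NPU numbers and slots you pass in depend on your chassis inventory, and the helper itself is illustrative, not an official tool:

```python
def debugshell_commands(np_nums, slots):
    """Expand the debugshell/log-collection command templates above for
    each NPU number on each card location (e.g. line card slots)."""
    cmds = []
    for slot in slots:
        loc = f"0/{slot}/CPU0"
        for np in np_nums:
            for script in ("print_get_counters", "sf_oq_debug", "sf_fabric_debug"):
                cmds.append(
                    f'show controllers npu debugshell {np} "script {script}" location {loc}'
                )
        # One stats sweep per location covers all instances at once.
        cmds.append(f"show controllers npu stats counters-all instance all location {loc}")
    return cmds

# Example: NPUs 0 and 1 on the line card in slot 2 (values are placeholders).
for cmd in debugshell_commands(np_nums=[0, 1], slots=[2]):
    print(cmd)
```

For fabric cards, pass the FC location and the np_num values taken from the fap_id column, as described in the note above.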



Note

