Cisco MDS 9000 Series High Availability Configuration Guide, Release 9.x

Feature History for Fabric Module Error Monitoring


Feature Name	Releases	Feature Information
Fabric Module Error Monitoring	9.4(5)	The Fabric 1 HIGH_NULL_POE_DROP_CNT and Fabric 3 null fpoe port error counters were added.
Fabric Module Error Monitoring	9.3(1)	This feature was introduced.

About Fabric Module Error Monitoring

The fabric modules of modular Cisco MDS platforms are commonly called Xbars. There are two versions of these fabric modules: Fabric 1 and Fabric 3. Frames that are received by an FC port with a CRC error are dropped and not forwarded further. As frames move from component to component and from module to module, including through the fabric modules, errors may occur. Frame CRCs are checked at several places along the switching path. Once a frame error is detected, the frame is dropped as soon as possible.

The existing Internal CRC Detection and Isolation feature can detect and take corrective action when these internal CRC errors occur. However, fabric modules can experience other errors that are not true CRC errors. The Fabric Module Error Monitoring feature, introduced in Cisco NX-OS 9.3(1), complements the Internal CRC Detection and Isolation feature, and is designed to detect and take corrective action in the presence of these errors. This feature allows automated monitoring and handling of errors in Fabric 1 and Fabric 3 modules that might cause I/O problems in the fabric.

Fabric Module Error Monitoring is controlled by the xbarErrorMonitor CLI command. The command uses the MDS scheduler feature to check for internal errors. It creates a scheduler job named xbarErrorMonitor_job with an error checking script and a scheduler schedule named XbarErrorMonitor_Schedule. The scheduler periodically executes the script, which collects show hardware internal errors information for the configurable set of counters on each fabric module. After the configurable sleep time, it collects the show hardware internal errors information again and calculates the change in the counter values. If any counter delta is equal to or higher than the configured threshold, the configured action is executed. The log-only action logs syslog messages; the module is left in service and continues to switch traffic. The log-and-out-of-service action logs the same syslog messages and additionally puts the affected module out of service immediately, stopping the suspect device from affecting further traffic. This action provides real-time operational remediation until the module can be inspected for the root cause. If there is only one fabric module left in service, it will not be powered down.
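The check cycle described above can be sketched in Python as follows. This is only an illustration of the snapshot, sleep, snapshot, compare logic, not the script that NX-OS installs. The function names, the dictionary layout of a counter snapshot, and the read_counters callback are all assumptions made for this example.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical snapshot layout: {fabric module number: {counter name: value}}
Snapshot = Dict[int, Dict[str, int]]
Breach = Tuple[int, str, int, int, int]


def find_breaches(before: Snapshot, after: Snapshot, threshold: int) -> List[Breach]:
    """Return (module, counter, before, after, delta) for every counter whose
    increase is equal to or higher than the threshold."""
    breaches = []
    for module, counters in sorted(after.items()):
        for name, new_value in sorted(counters.items()):
            old_value = before.get(module, {}).get(name, 0)
            delta = new_value - old_value
            if delta >= threshold:
                breaches.append((module, name, old_value, new_value, delta))
    return breaches


def run_check(
    read_counters: Callable[[], Snapshot],
    sleep_time: float,
    threshold: int,
    action: str,
    in_service: Iterable[int],
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Model one scheduler run and return the modules that would be taken
    out of service. The last in-service fabric module is never removed."""
    before = read_counters()
    sleep(sleep_time)
    after = read_counters()
    remaining = set(in_service)
    removed = []
    for module, name, old, new, delta in find_breaches(before, after, threshold):
        # Both actions log the breach.
        print(f"xbar {module}: counter {name} (Before: {old}, After: {new}, Delta {delta})")
        if (
            action == "log-and-out-of-service"
            and module in remaining
            and len(remaining) > 1
        ):
            remaining.discard(module)
            removed.append(module)
    return removed
```

Passing the sleep function as a parameter keeps the sketch testable without waiting for the real sleep time.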

Fabric Module Error Monitoring generates the following type of syslog message detailing the affected module, switching ASIC, error counter, and its value:

%USER-2-SYSTEM_MSG: xbarErrorMonitor: counter threshold exceeded for xbar 1 for counter packets dropped destined to port. (Before: 0, After: 128, Delta 128).
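For log collectors that post-process this message, the fields can be extracted with a regular expression. The following Python sketch is derived only from the single sample message above; the pattern, the function name, and the returned field names are assumptions, and other releases may format the message differently.

```python
import re
from typing import Dict, Optional, Union

# Pattern built from the sample message only (an assumption, not a published format).
SYSLOG_PATTERN = re.compile(
    r"xbarErrorMonitor: counter threshold exceeded for xbar (?P<xbar>\d+) "
    r"for counter (?P<counter>.+?)\. "
    r"\(Before: (?P<before>\d+), After: (?P<after>\d+), Delta:? (?P<delta>\d+)\)"
)


def parse_xbar_syslog(line: str) -> Optional[Dict[str, Union[int, str]]]:
    """Return the module, counter name, and counter values from one syslog
    line, or None if the line is not a threshold message."""
    match = SYSLOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "xbar": int(match["xbar"]),
        "counter": match["counter"],
        "before": int(match["before"]),
        "after": int(match["after"]),
        "delta": int(match["delta"]),
    }
```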

For information about xbarErrorMonitor command default values, refer to the Default Values section.

Contact the Cisco Technical Assistance Center (TAC) for diagnostic assistance and possible module replacement if internal CRC errors are detected.

The following error counters can be monitored by Fabric Module Error Monitoring:


Module	Counter	Description
Fabric 1 Module	INTERNAL_ERROR_CNT	Errors related to fabric link, input and output buffer full, and timeout events on fabric module.
	HIGH_XT_DROP_CNT	Packets dropped due to fabric module packet switching timeout.
	SAC_XTIMEOUT_INTR_HI	Packet timeouts due to fabric module egress port buffer full.
	HIGH_NULL_POE_DROP_CNT	Packets dropped with empty fabric module egress port address.
Fabric 3 Module	packets dropped destined to port	Packets dropped due to fabric module egress port buffer full.
	packets drop on receive port	Packets dropped on fabric module ingress port.
	double bit ecc error	Packets dropped due to double bit ECC error in fabric module port buffers.
	null fpoe port	Packets dropped with empty egress port address.

Note

These counters can be displayed (if they are non-zero) using the show hardware internal errors command.

Guidelines and Limitations of Fabric Module Error Monitoring

This feature is supported only on Cisco MDS 9700 series switches.
Starting from Cisco NX-OS 9.3(1), this feature is enabled automatically.
Before Cisco NX-OS 9.4(5), the xbarErrorMonitor command has the following behavior:
- Any non-default parameters must be specified each time monitoring is enabled. For example:
```
switch# xbarErrorMonitor -si 180 enable
switch# xbarErrorMonitor -a log-and-out-of-service enable
```
- If the log-and-out-of-service option is specified, the scheduling interval is set to the default because the -si argument is ignored.
Starting with Cisco NX-OS 9.4(5), non-default parameters do not need to be specified each time monitoring is enabled. The values displayed with the show option are used. Also, the -si argument is no longer ignored.
The XbarErrorMonitoring_Job scheduler job and XbarErrorMonitor_Schedule scheduler schedule should not be deleted from the configuration or Fabric Module Error Monitoring will cease to function.
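The difference in parameter handling before and after Cisco NX-OS 9.4(5) can be modeled with a short Python sketch. It is a reading of the guidelines above, not the command's implementation. The function name, the dictionary keys, and the persistent flag (True for 9.4(5) and higher) are assumptions made for illustration.

```python
from typing import Dict, Union

Settings = Dict[str, Union[int, str]]


def resolve_settings(
    persistent: bool,
    defaults: Settings,
    stored: Settings,
    requested: Settings,
) -> Settings:
    """Return the settings in effect after an enable.

    persistent=True models 9.4(5) and higher: values shown by the show
    option are reused and only the requested parameters change.
    persistent=False models earlier releases: anything not requested
    falls back to its default.
    """
    if persistent:
        effective = dict(stored)
        effective.update(requested)
        return effective

    effective = dict(defaults)
    effective.update(requested)
    # Earlier releases ignore -si when log-and-out-of-service is specified.
    if requested.get("action") == "log-and-out-of-service":
        effective["scheduler_interval"] = defaults["scheduler_interval"]
    return effective
```

For example, with a default interval of 120, requesting both an interval of 180 and log-and-out-of-service yields 120 in the earlier-release model and 180 in the 9.4(5) model.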

Default Values

Default values are used whenever the parameter is not specified on the enable command line.

Table 1. Fabric Module Error Monitoring default values:
Parameter	Default value
Action	log-only
Counters Monitored	NX-OS 9.3(1) - 9.4(4): Fabric 1 Module: INTERNAL_ERROR_CNT, HIGH_XT_DROP_CNT, SAC_XTIMEOUT_INTR_HI; Fabric 3 Module: packets dropped destined to port, packets drop on receive port, double bit ecc error
Counters Monitored	NX-OS 9.4(5) and higher: Fabric 1 Module: INTERNAL_ERROR_CNT, HIGH_XT_DROP_CNT, SAC_XTIMEOUT_INTR_HI, HIGH_NULL_POE_DROP_CNT; Fabric 3 Module: packets dropped destined to port, packets drop on receive port, double bit ecc error, null fpoe port
Counter Threshold	NX-OS 9.3(1) - 9.4(4): 50; NX-OS 9.4(5) and higher: 10
Scheduler Interval	120 seconds
Sleep Time	30 seconds
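The release-dependent defaults in Table 1 can be expressed as a lookup, which may be useful when auditing switches that run mixed releases. The following Python sketch encodes only the values in the table. The function names, the dictionary keys, and the release string parsing are assumptions made for this example.

```python
import re
from typing import Dict, Tuple


def parse_release(release: str) -> Tuple[int, int, int]:
    """Parse a release string such as '9.4(5)' into (9, 4, 5).
    A trailing letter, as in '9.4(1a)', is ignored."""
    match = re.fullmatch(r"(\d+)\.(\d+)\((\d+)[a-z]*\)", release)
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, maintenance = (int(part) for part in match.groups())
    return (major, minor, maintenance)


def default_settings(release: str) -> Dict[str, object]:
    """Return the Table 1 defaults that apply to the given release."""
    version = parse_release(release)
    if version < (9, 3, 1):
        raise ValueError("feature was introduced in 9.3(1)")

    fabric1 = ["INTERNAL_ERROR_CNT", "HIGH_XT_DROP_CNT", "SAC_XTIMEOUT_INTR_HI"]
    fabric3 = [
        "packets dropped destined to port",
        "packets drop on receive port",
        "double bit ecc error",
    ]
    threshold = 50
    if version >= (9, 4, 5):
        # 9.4(5) adds one counter per fabric type and lowers the threshold.
        fabric1.append("HIGH_NULL_POE_DROP_CNT")
        fabric3.append("null fpoe port")
        threshold = 10

    return {
        "action": "log-only",
        "counters": {"Fabric 1": fabric1, "Fabric 3": fabric3},
        "counter_threshold": threshold,
        "scheduler_interval": 120,  # seconds
        "sleep_time": 30,  # seconds
    }
```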

Configuring Fabric Module Error Monitoring

Procedure

Step 1	switch# xbarErrorMonitor -h	Displays command usage and configurable runtime parameters.
Step 2	switch# xbarErrorMonitor show	Displays the status of Fabric Module Error Monitoring.
Step 3	switch# xbarErrorMonitor disable	(Optional) Disables the Fabric Module Error Monitoring feature.
Step 4	switch# xbarErrorMonitor enable	Enables Fabric Module Error Monitoring with default parameters specific to the version of NX-OS. To modify runtime parameters, specify them as part of this step. Fabric Module Error Monitoring must be disabled before runtime parameters can be modified.

Alerting for Fabric Module Error Monitoring

Callhome

In log-only mode, threshold breaches will not trigger CallHome alerts.

In log-and-out-of-service mode, CallHome will raise an alert when the Fabric Module is taken out of service.

NDFC

Configure an alarm in NDFC to log alerts when a Fabric Module error occurs.

The alarm severity must match the syslog severity.

The raise and clear regular expressions (regex) must match the whole syslog. Using an identifier variable at the end of the regex will match all text to the end of the syslog.

For clearing, there is no situation where these alarms should be automatically cleared. Each alarm should be manually cleared and investigated. Also, there is no clear syslog message for Fabric Module Error Monitoring events. Use an impossible regex in the alarm's clear regex configuration.

Ensure that the NDFC server IP is configured on the switch as a remote syslog server so the messages reach NDFC.

Table 2. Example NDFC Alarm Parameters
Parameter	Value
severity	critical
identifier	ID1-ID2
raise regex	USER-2-SYSTEM_MSG xbarErrorMonitor: counter threshold exceeded for xbar $(ID1) for counter $(ID2)
clear regex	USER-2-SYSTEM_MSG: place holder regex for $(ID1) and $(ID2)
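The matching rules above can be tried offline before configuring the alarm. The following Python sketch approximates the raise expression with named groups standing in for $(ID1) and $(ID2), and uses a pattern that can never match as the clear expression. How NDFC expands identifier variables and which regex dialect it uses are not described here, so the Python patterns are assumptions written by hand for illustration only.

```python
import re

# Hand-written approximation of the raise regex: ID1 captures the xbar
# number and ID2, placed last, captures everything to the end of the syslog.
RAISE_PATTERN = re.compile(
    r".*USER-2-SYSTEM_MSG: xbarErrorMonitor: counter threshold exceeded "
    r"for xbar (?P<ID1>\S+) for counter (?P<ID2>.+)"
)

# An empty negative lookahead fails at every position, so this pattern
# never matches and the alarm is never cleared automatically.
NEVER_PATTERN = re.compile(r"(?!)")


def alarm_identifier(syslog: str):
    """Return the ID1-ID2 identifier if the whole syslog matches, else None."""
    match = RAISE_PATTERN.fullmatch(syslog)
    if match is None:
        return None
    return f"{match['ID1']}-{match['ID2']}"
```

Because ID2 runs to the end of the message, the identifier includes the counter values, so each event produces a distinct alarm in this model.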

For more information about configuring alarms in NDFC, refer to the Event Analytics chapter.

Configuration Examples

The following example shows how to display the Fabric Module Error Monitoring status and operating values:

switch# xbarErrorMonitor show
xbarErrorMonitor 1.2
Status: Enabled 
Schedular Interval: 120
Sleep Time: 30
Counter Threshold: 50 
Action: log-only 
Counters Monitored:
    packets dropped destined to port
    packets drop on receive port
    double bit ecc error
    null fpoe port

The following example shows how to disable Fabric Module Error Monitoring:

switch# xbarErrorMonitor disable 
xbarErrorMonitor being disabled. Please wait...
xbarErrorMonitor has been disabled.
Please check the status by running 'xbarErrorMonitor show'
It is recommended to execute 'copy running-config startup-config' to save the configuration.

The following example shows how to enable Fabric Module Error Monitoring and modify runtime parameters:

switch# xbarErrorMonitor --counter-threshold 30 --action log-and-out-of-service enable 
xbarErrorMonitor getting enabled. Please wait...
xbarErrorMonitor has been enabled
Please check the status by running 'xbarErrorMonitor show'
It is recommended to execute 'copy running-config startup-config' to save the configuration.
