Cisco UCS Manager B-Series Troubleshooting Guide
Overview of Troubleshooting in Cisco UCS Manager

Troubleshooting Information in Cisco UCS Manager GUI

Cisco UCS Manager GUI provides several tabs and other areas that you can use to find troubleshooting information for a Cisco UCS domain. For example, you can view faults and events for specific objects or for all objects in the system.

The Admin tab in the Navigation pane provides access to faults, events, core files, and other information that can help you troubleshoot issues.

If you select Faults, Events and Audit Log in the Filter field on the Admin tab, Cisco UCS Manager GUI limits the tree browser so that you can access only the following:

  • The faults for all components in the system
  • The events for all components in the system
  • The audit log for the system
  • Any core files created by the fabric interconnects in the system
  • The fault collection and core file export settings

Note


Fault thresholds might need to be modified. See the “Statistics Threshold Policy” section in the Cisco UCS Manager GUI Configuration Guide for the release of Cisco UCS Manager that you are using.


Troubleshooting Information in Cisco UCS Manager CLI

The Cisco UCS Manager CLI includes several show commands that you can execute to find troubleshooting information for a Cisco UCS domain. These show commands are scope-aware, which means that if you enter the show fault command from the top scope, it displays all faults in the system. However, if you scope to a specific object, the show fault command displays faults that are related to that object only.
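
The scope-aware behavior can be pictured as a prefix filter over the objects that faults are attached to. The following Python sketch is illustrative only and does not call Cisco UCS Manager: the Fault class, the show_fault function, the object paths, and the sample faults are all hypothetical names and data chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    severity: str
    affected_object: str  # hypothetical object path, e.g. "sys/chassis-1/blade-2"
    description: str


def show_fault(faults, scope=None):
    """Return every fault at the top scope, or only the faults under `scope`."""
    if scope is None:
        return list(faults)
    prefix = scope.rstrip("/")
    return [
        f for f in faults
        if f.affected_object == prefix
        or f.affected_object.startswith(prefix + "/")
    ]


# Hypothetical sample data, not real Cisco UCS Manager output.
FAULTS = [
    Fault("major", "sys/chassis-1/blade-2", "Server inoperable"),
    Fault("minor", "sys/chassis-1/psu-1", "Power supply issue"),
    Fault("warning", "sys/chassis-2/blade-1", "Thermal threshold crossed"),
]

print(len(show_fault(FAULTS)))                   # top scope: all three sample faults
print(len(show_fault(FAULTS, "sys/chassis-1")))  # scoped: the two faults under chassis 1
```

Matching on the prefix plus a trailing separator keeps a scope such as sys/chassis-1 from also matching sys/chassis-10.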


Note


Fault thresholds might need to be modified. See the “Statistics Threshold Policy” section in the Cisco UCS Manager CLI Configuration Guide for the release of Cisco UCS Manager that you are using.


Additional Troubleshooting Documentation

Additional troubleshooting information is available in the following documents:

Faults

In Cisco UCS, a fault is a mutable object that is managed by Cisco UCS Manager. Each fault represents a failure in the Cisco UCS domain or an alarm threshold that has been raised. During the lifecycle of a fault, it can change from one state or severity to another.

Each fault includes information about the operational state of the affected object at the time the fault was raised. If the fault is transitional and the failure is resolved, the object transitions to a functional state.

A fault remains in Cisco UCS Manager until the fault is cleared and deleted according to the settings in the fault collection policy.

You can view all faults in a Cisco UCS domain from either the Cisco UCS Manager CLI or the Cisco UCS Manager GUI. You can also configure the fault collection policy to determine how a Cisco UCS domain collects and retains faults.


Note


All Cisco UCS faults are included in MIBs and can be trapped by SNMP.


Fault Severities

A fault raised in a Cisco UCS domain can transition through more than one severity during its lifecycle. The following table describes the fault severities that you may encounter.

Severity | Description
Critical | Service-affecting condition that requires immediate corrective action. For example, this severity could indicate that the managed object is out of service and its capability must be restored.
Major | Service-affecting condition that requires urgent corrective action. For example, this severity could indicate a severe degradation in the capability of the managed object and that its full capability must be restored.
Minor | Nonservice-affecting fault condition that requires corrective action to prevent a more serious fault from occurring. For example, this severity could indicate that the detected alarm condition is not degrading the capacity of the managed object.
Warning | Potential or impending service-affecting fault that has no significant effects in the system. You should take action to further diagnose, if necessary, and correct the problem to prevent it from becoming a more serious service-affecting fault.
Condition | Informational message about a condition, possibly independently insignificant.
Info | Basic notification or informational message, possibly independently insignificant.

Fault States

A fault raised in a Cisco UCS domain transitions through more than one state during its lifecycle. The following table describes the possible fault states in alphabetical order.

State | Description
Cleared | Condition that has been resolved and cleared.
Flapping | Fault that was raised, cleared, and raised again within a short time interval, known as the flap interval.
Soaking | Fault that was raised and cleared within a short time interval, known as the flap interval. Because this state may be a flapping condition, the fault severity remains at its original active value, but this state indicates the condition that raised the fault has cleared.

Fault Types

A fault raised in a Cisco UCS domain can be one of the types described in the following table.

Type | Description
fsm | FSM task has failed to complete successfully, or Cisco UCS Manager is retrying one of the stages of the FSM.
equipment | Cisco UCS Manager has detected that a physical component is inoperable or has another functional issue.
server | Cisco UCS Manager cannot complete a server task, such as associating a service profile with a server.
configuration | Cisco UCS Manager cannot successfully configure a component.
environment | Cisco UCS Manager has detected a power problem, thermal problem, voltage problem, or loss of CMOS settings.
management | Cisco UCS Manager has detected a serious management issue, such as one of the following: critical services could not be started; the primary fabric interconnect could not be identified; components in the Cisco UCS domain include incompatible firmware versions.
connectivity | Cisco UCS Manager has detected a connectivity problem, such as an unreachable adapter.
network | Cisco UCS Manager has detected a network issue, such as a link down.
operational | Cisco UCS Manager has detected an operational problem, such as a log capacity issue or a failed server discovery.

Fault Properties

Cisco UCS Manager provides detailed information about each fault raised in a Cisco UCS domain. The following table describes the fault properties that you can view in Cisco UCS Manager CLI or Cisco UCS Manager GUI.

Property Name | Description
Severity | Current severity level of the fault, which can be any of the severities described in Fault Severities.
Last Transition | Day and time on which the severity for the fault last changed. If the severity has not changed since the fault was raised, this property displays the original creation date.
Affected Object | Component that is affected by the condition that raised the fault.
Description | Description of the fault.
ID | Unique identifier assigned to the fault.
Type | Type of fault that has been raised, which can be any of the types described in Fault Types.
Cause | Unique identifier associated with the condition that caused the fault.
Created at | Day and time when the fault occurred.
Code | Unique identifier assigned to the fault.
Number of Occurrences | Number of times the event that raised the fault occurred.
Original Severity | Severity assigned to the fault the first time it occurred.
Previous Severity | Previous severity level. This property is only used if the severity of a fault changes during its lifecycle.
Highest Severity | Highest severity encountered for this issue.
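
How the severity-related properties relate to one another can be sketched in a few lines of Python. This is a minimal model, not Cisco UCS Manager code: the FaultRecord class and its method names are hypothetical, and the ranking assumes that the Fault Severities table lists severities from most to least severe.

```python
# Assumed ranking: a higher number means a more severe fault. The order is
# taken from the Fault Severities table, read as most to least severe.
SEVERITY_RANK = {
    "info": 0, "condition": 1, "warning": 2,
    "minor": 3, "major": 4, "critical": 5,
}


class FaultRecord:
    """Hypothetical model of the severity-related fault properties."""

    def __init__(self, severity):
        self.severity = severity           # Severity (current)
        self.original_severity = severity  # Original Severity
        self.previous_severity = None      # Previous Severity, unset until a change
        self.highest_severity = severity   # Highest Severity
        self.occurrences = 1               # Number of Occurrences

    def change_severity(self, new_severity):
        new_rank = SEVERITY_RANK[new_severity]  # KeyError for unknown names
        if new_severity == self.severity:
            return
        self.previous_severity = self.severity
        self.severity = new_severity
        if new_rank > SEVERITY_RANK[self.highest_severity]:
            self.highest_severity = new_severity

    def reoccur(self):
        """Record that the raising event happened again."""
        self.occurrences += 1


fault = FaultRecord("minor")
fault.change_severity("major")
fault.change_severity("warning")
fault.reoccur()
print(fault.severity, fault.previous_severity,
      fault.original_severity, fault.highest_severity, fault.occurrences)
```

After these calls, the sample fault has a current severity of warning, a previous severity of major, an original severity of minor, a highest severity of major, and two occurrences.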

Lifecycle of Faults

Faults in Cisco UCS are stateful. Only one instance of a given fault can exist on each object. If the same fault occurs a second time, Cisco UCS increases the number of occurrences by one.

A fault has the following lifecycle:

  1. A condition occurs in the system and Cisco UCS Manager raises a fault. This is the active state.
  2. When the fault is alleviated, it enters a flapping or soaking interval that is designed to prevent flapping. Flapping occurs when a fault is raised and cleared several times in rapid succession. During the flapping interval, the fault retains its severity for the length of time specified in the fault collection policy.
  3. If the condition reoccurs during the flapping interval, the fault returns to the active state. If the condition does not reoccur during the flapping interval, the fault is cleared.
  4. The cleared fault enters the retention interval. This interval ensures that the fault reaches the attention of an administrator even if the condition that caused the fault has been alleviated, and that the fault is not deleted prematurely. The retention interval retains the cleared fault for the length of time specified in the fault collection policy.
  5. If the condition reoccurs during the retention interval, the fault returns to the active state. If the condition does not reoccur, the fault is deleted.
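
The five steps above can be sketched as a small state machine. The following Python model is an illustration under stated assumptions, not Cisco UCS Manager code: the class and method names are hypothetical, time is passed in explicitly as a number of seconds, and the interval values in the usage lines are arbitrary placeholders rather than policy defaults. The SOAKING state stands for the flapping interval in step 2.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    SOAKING = "soaking"    # inside the flapping interval
    CLEARED = "cleared"    # inside the retention interval
    DELETED = "deleted"


class FaultLifecycle:
    """Hypothetical model of one fault moving through its lifecycle."""

    def __init__(self, flap_interval, retention_interval, now=0):
        self.flap_interval = flap_interval
        self.retention_interval = retention_interval
        self.state = State.ACTIVE      # step 1: the fault is raised
        self.occurrences = 1
        self._since = now              # time at which the current state began

    def _enter(self, state, when):
        self.state = state
        self._since = when

    def tick(self, now):
        """Apply any interval that has fully elapsed by `now`."""
        if self.state is State.SOAKING and now - self._since >= self.flap_interval:
            self._enter(State.CLEARED, self._since + self.flap_interval)       # step 4
        if self.state is State.CLEARED and now - self._since >= self.retention_interval:
            self._enter(State.DELETED, self._since + self.retention_interval)  # step 5

    def condition_alleviated(self, now):
        if self.state is State.ACTIVE:
            self._enter(State.SOAKING, now)    # step 2

    def condition_reoccurred(self, now):
        self.tick(now)
        # A deleted fault is not revived here; a new fault would be raised instead.
        if self.state in (State.SOAKING, State.CLEARED):
            self.occurrences += 1
            self._enter(State.ACTIVE, now)     # steps 3 and 5


fault = FaultLifecycle(flap_interval=10, retention_interval=100, now=0)
fault.condition_alleviated(now=5)    # flapping interval starts
fault.condition_reoccurred(now=8)    # reoccurs in time: active again
fault.condition_alleviated(now=20)
fault.tick(now=31)                   # flapping interval elapsed: cleared
fault.tick(now=130)                  # retention interval elapsed: deleted
print(fault.state.value, fault.occurrences)
```

Passing the clock in as an argument keeps the sketch deterministic; a real implementation would more likely rely on timers.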

Faults in Cisco UCS Manager GUI

If you want to view faults for a single object in the system, navigate to that object in the Cisco UCS Manager GUI and click the Faults tab in the Work pane. If you want to view faults for all objects in the system, navigate to the Faults node on the Admin tab under Faults, Events and Audit Log.

You can also view a summary of all faults that have occurred in a Cisco UCS domain in the Fault Summary area in the upper left of the Cisco UCS Manager GUI.

Each fault severity is represented by a different icon. The number below each icon indicates how many faults of that severity have occurred in the system. If you click an icon, the Cisco UCS Manager GUI opens the Faults tab in the Work pane and displays the details of all faults with that severity.
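
The counts shown in the Fault Summary area amount to a tally of faults by severity, and clicking an icon amounts to filtering on one severity. A minimal Python sketch of both operations follows; the function names and sample faults are hypothetical, and the code does not query Cisco UCS Manager.

```python
from collections import Counter

# Hypothetical sample data: (severity, description) pairs.
SAMPLE_FAULTS = [
    ("critical", "Fabric interconnect inoperable"),
    ("major", "Server discovery failed"),
    ("major", "Link down"),
    ("warning", "Log capacity is low"),
]


def fault_summary(faults):
    """Count the faults of each severity."""
    return Counter(severity for severity, _ in faults)


def faults_with_severity(faults, severity):
    """Return the faults that have the given severity."""
    return [fault for fault in faults if fault[0] == severity]


summary = fault_summary(SAMPLE_FAULTS)
print(summary["major"])   # 2
print(summary["minor"])   # 0, because a Counter returns 0 for a missing key
print(faults_with_severity(SAMPLE_FAULTS, "critical"))
```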

Faults in Cisco UCS Manager CLI

If you want to view the faults for all objects in the system, enter the show fault command from the top-level scope. If you want to view the faults for a specific object, scope to that object and then enter the show fault command.

If you want to view all available details about a fault, enter the show fault detail command.

Fault Collection Policy

The fault collection policy controls the lifecycle of a fault in the Cisco UCS domain, including the length of time that each fault remains in the flapping and retention intervals.


Tip


For information on how to configure the fault collection policy, see the Cisco UCS Manager configuration guides, which are accessible through the Cisco UCS B-Series Servers Documentation Roadmap.


Events

In Cisco UCS, an event is an immutable object that is managed by Cisco UCS Manager. Each event represents a nonpersistent condition in the Cisco UCS domain. After Cisco UCS Manager creates and logs an event, the event does not change. For example, if you power on a server, Cisco UCS Manager creates and logs an event for the beginning and the end of that request.

You can view events for a single object, or you can view all events in a Cisco UCS domain, from either the Cisco UCS Manager CLI or the Cisco UCS Manager GUI. Events remain in Cisco UCS until the event log fills up. When the log is full, Cisco UCS Manager purges the log and all events in it.
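
The purge behavior described above differs from a rolling log, where only the oldest entries are dropped. The following Python sketch models it under assumptions: the EventLog class, the Event fields, the object path, and the capacity value are hypothetical, and the sketch assumes that the purge happens when a new event arrives at a full log, which the text does not specify.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """Immutable record: fields cannot be reassigned after creation."""
    affected_object: str
    description: str


class EventLog:
    def __init__(self, capacity):
        self.capacity = capacity   # hypothetical size limit, counted in events
        self.events = []

    def log(self, event):
        # Assumption: a full log is emptied completely before the next write.
        if len(self.events) >= self.capacity:
            self.events.clear()
        self.events.append(event)


log = EventLog(capacity=3)
for step in ("power-on requested", "power-on started",
             "power-on completed", "boot started"):
    log.log(Event("sys/chassis-1/blade-1", step))

print(len(log.events))   # 1: the fourth event purged the first three
```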

Properties of Events

Cisco UCS Manager provides detailed information about each event created and logged in a Cisco UCS domain. The following table describes the event properties that you can view in Cisco UCS Manager CLI or Cisco UCS Manager GUI.

Table 1 Event Properties

Property Name | Description
Affected Object | Component that created the event.
Description | Description of the event.
Cause | Unique identifier associated with the event.
Created at | Date and time when the event was created.
User | Type of user that created the event, such as one of the following: admin, internal, blank.
Code | Unique identifier assigned to the event.

Events in the Cisco UCS Manager GUI

If you want to view events for a single object in the system, navigate to that object in the Cisco UCS Manager GUI and click the Events tab in the Work pane. If you want to view events for all objects in the system, navigate to the Events node on the Admin tab under Faults, Events and Audit Log.

Events in the Cisco UCS Manager CLI

If you want to view events for all objects in the system, enter the show event command from the top-level scope. If you want to view events for a specific object, scope to that object and then enter the show event command.

If you want to view all available details about an event, enter the show event detail command.

Core Files

Critical failures in Cisco UCS Manager and some of the Cisco UCS components, such as a fabric interconnect or an I/O module, can cause the system to create a core file. Each core file contains a large amount of data about the system and the component at the time of the failure.

Cisco UCS Manager manages the core files from all of the components. You can configure Cisco UCS Manager to export a copy of a core file to a location on an external TFTP server as soon as that core file is created.

Core Files in the Cisco UCS Manager GUI

You can find out if a component in the Cisco UCS domain generated a core file by navigating to the Core Files node on the Admin tab under the Faults, Events and Audit Log node.

Core Files in the Cisco UCS Manager CLI

You can find out if a component in the Cisco UCS domain generated a core file by entering the following commands:

  1. scope monitoring
  2. scope sysdebug
  3. show cores

Core File Exporter

If you enable the Core File Exporter, Cisco UCS Manager exports each core file, as soon as it is created, to a specified location on the network through TFTP. The exported file is a tar file with the contents of the core file.


Tip


For information on how to enable the exporter, see the Cisco UCS Manager configuration guides, which are accessible through the Cisco UCS B-Series Servers Documentation Roadmap.


Audit Log

The audit log records actions performed by users in Cisco UCS Manager, including direct and indirect actions. Each entry in the audit log represents a single, non-persistent action. For example, if a user logs in, logs out, or creates, modifies, or deletes an object such as a service profile, Cisco UCS Manager adds an entry to the audit log for that action.

You can view the audit log entries in the Cisco UCS Manager CLI, Cisco UCS Manager GUI, or in a technical support file that you output from Cisco UCS Manager.

Audit Log Entry Properties

Cisco UCS Manager provides detailed information about each entry in the audit log. The following table describes the audit log entry properties that you can view in the Cisco UCS Manager GUI or the Cisco UCS Manager CLI.

Table 2 Audit Log Entry Properties

Property Name | Description
ID | Unique identifier associated with the audit log message.
Affected Object | Component affected by the user action.
Severity | Current severity level of the user action associated with the audit log message. These severities are also used for faults, as described in Fault Severities.
Trigger | User role associated with the user that raised the message.
User | Type of user that created the event, as follows: admin, internal, blank.
Indication | Action indicated by the audit log message, which can be one of the following: creation (a component was added to the system) or modification (an existing component was changed).
Description | Description of the user action.

Audit Log in the Cisco UCS Manager GUI

In the Cisco UCS Manager GUI, you can view the audit log on the Audit Log node on the Admin tab under the Faults, Events and Audit Log node.

Audit Log in the Cisco UCS Manager CLI

In the Cisco UCS Manager CLI, you can view the audit log by entering the following commands:

  1. scope security
  2. show audit-logs

System Event Log

The system event log (SEL) resides on the CIMC in NVRAM. It records most server-related events, such as overvoltage and undervoltage events, temperature events, fan events, and BIOS events. The SEL is mainly used for troubleshooting purposes.


Tip


For more information about the SEL, including how to view the SEL for each server and configure the SEL policy, see the Cisco UCS Manager configuration guides, which are accessible through the Cisco UCS B-Series Servers Documentation Roadmap.


SEL File

The SEL file is approximately 40 KB. No further events can be recorded when the SEL file is full. It must be cleared before additional events can be recorded.

SEL Policy

You can use the SEL policy to back up the SEL to a remote server and optionally clear the SEL after a backup operation occurs. Backup operations can be triggered based on specific actions, or they can occur at regular intervals. You can also manually back up or clear the SEL.

Cisco UCS Manager automatically generates the SEL backup file according to the settings in the SEL policy. The filename format is sel-SystemName-ChassisID-ServerID-ServerSerialNumber-Timestamp.

For example, a filename could be sel-UCS-A-ch01-serv01-QCI12522939-20091121160736.
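
When backup files accumulate on the remote server, it can help to split a filename back into its fields. The following Python sketch is one way to do that, with assumptions stated plainly: the function and class names are hypothetical, the timestamp layout (YYYYMMDDhhmmss) is inferred from the example filename only, and the parser assumes that only the system name may contain hyphens.

```python
from datetime import datetime
from typing import NamedTuple


class SelBackupName(NamedTuple):
    system_name: str
    chassis_id: str
    server_id: str
    serial_number: str
    timestamp: datetime


def parse_sel_backup_name(filename):
    """Split sel-SystemName-ChassisID-ServerID-ServerSerialNumber-Timestamp."""
    prefix = "sel-"
    if not filename.startswith(prefix):
        raise ValueError(f"not a SEL backup filename: {filename!r}")
    # Split from the right so that a hyphen in the system name (as in
    # "UCS-A") stays inside the first field.
    parts = filename[len(prefix):].rsplit("-", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected number of fields: {filename!r}")
    system, chassis, server, serial, stamp = parts
    # Assumed timestamp layout, inferred from the example filename.
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return SelBackupName(system, chassis, server, serial, when)


parsed = parse_sel_backup_name("sel-UCS-A-ch01-serv01-QCI12522939-20091121160736")
print(parsed.system_name, parsed.chassis_id, parsed.server_id)
print(parsed.timestamp.isoformat())
```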

Syslog

The syslog provides a central point for collecting and processing system logs that you can use to troubleshoot and audit a Cisco UCS domain. Cisco UCS Manager relies on the Cisco NX-OS syslog mechanism and API, and on the syslog feature of the primary fabric interconnect to collect and process the syslog entries.

Cisco UCS Manager manages and configures the syslog collectors for a Cisco UCS domain and deploys the configuration to the fabric interconnect or fabric interconnects. This configuration affects all syslog entries generated in a Cisco UCS domain by Cisco NX-OS or by Cisco UCS Manager.

You can configure Cisco UCS Manager to do one or more of the following with the syslog and syslog entries:

  • Display the syslog entries in the console or on the monitor
  • Store the syslog entries in a file
  • Forward the syslog entries to up to three external log collectors where the syslog for the Cisco UCS domain is stored

Syslog Entry Format

Each syslog entry generated by a Cisco UCS component is formatted as follows:

Year month date hh:mm:ss hostname %facility-severity-MNEMONIC description

For example: 2007 Nov 1 14:07:58 excal-113 %MODULE-5-MOD_OK: Module 1 is online
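
The format line above maps naturally onto a regular expression. The following Python sketch parses the example entry into named fields; it is a minimal illustration, not a complete syslog parser. The function name and the returned dictionary keys are hypothetical, and the pattern assumes a single-digit numeric severity and an optional colon after the mnemonic, as seen in the example.

```python
import re

# Field names follow the format line:
# Year month date hh:mm:ss hostname %facility-severity-MNEMONIC description
SYSLOG_ENTRY = re.compile(
    r"^(?P<year>\d{4}) (?P<month>[A-Za-z]{3}) +(?P<day>\d{1,2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<hostname>\S+) "
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+):? "
    r"(?P<description>.*)$"
)


def parse_syslog_entry(line):
    """Return the fields of one syslog entry as a dictionary."""
    match = SYSLOG_ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized syslog entry: {line!r}")
    fields = match.groupdict()
    return {
        "timestamp": "{year} {month} {day} {time}".format(**fields),
        "hostname": fields["hostname"],
        "facility": fields["facility"],
        "severity": int(fields["severity"]),
        "mnemonic": fields["mnemonic"],
        "description": fields["description"],
    }


entry = parse_syslog_entry(
    "2007 Nov 1 14:07:58 excal-113 %MODULE-5-MOD_OK: Module 1 is online"
)
print(entry["hostname"], entry["facility"], entry["severity"], entry["mnemonic"])
print(entry["description"])
```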

Syslog Entry Severities

A syslog entry is assigned a Cisco UCS severity by Cisco UCS Manager. The following table shows how the Cisco UCS severities map to the syslog severities.

Table 3 Syslog Entry Severities in Cisco UCS

Cisco UCS Severity | Syslog Severity
CRIT | CRIT
MAJOR | ERR
MINOR | WARNING
WARNING | NOTICE
INFO | INFO
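
The mapping in Table 3 is a straightforward lookup. As a hedged illustration, the following Python sketch encodes the table as a dictionary; the names UCS_TO_SYSLOG and syslog_severity are hypothetical and exist only for this example.

```python
# Mapping copied from Table 3: Cisco UCS severity -> syslog severity.
UCS_TO_SYSLOG = {
    "CRIT": "CRIT",
    "MAJOR": "ERR",
    "MINOR": "WARNING",
    "WARNING": "NOTICE",
    "INFO": "INFO",
}


def syslog_severity(ucs_severity):
    """Translate a Cisco UCS severity name into its syslog severity."""
    key = ucs_severity.strip().upper()
    if key not in UCS_TO_SYSLOG:
        raise ValueError(f"no syslog mapping for severity {ucs_severity!r}")
    return UCS_TO_SYSLOG[key]


print(syslog_severity("MAJOR"))    # ERR
print(syslog_severity("warning"))  # NOTICE
```

Note that each severity other than CRIT and INFO maps to a syslog severity one level lower in name, so a Cisco UCS WARNING appears as NOTICE in the syslog.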

Syslog Entry Parameters

The following table describes the information contained in each syslog entry.

Table 4 Syslog Message Content

Name | Description
Facility | Logging facility that generated and sent the syslog entry. The facilities are broad categories that are represented by integers. The facility can be one of the following standard Linux facilities: local0, local1, local2, local3, local4, local5, local6, local7.
Severity | Severity of the event, alert, or issue that caused the syslog entry to be generated. The severity can be one of the following: emergencies, critical, alerts, errors, warnings, information, notifications, debugging.
Hostname | Hostname included in the syslog entry, which depends on the component where the entry originated, as follows: the fabric interconnect, Cisco UCS Manager, or the hostname of the Cisco UCS domain; for all other components, the hostname associated with the virtual interface (VIF).
Timestamp | Date and time when the syslog entry was generated.
Message | Description of the event, alert, or issue that caused the syslog entry to be generated.

Syslog Services

The following Cisco UCS components use the Cisco NX-OS syslog services to generate syslog entries for system information and alerts:

  • I/O module—All syslog entries are sent by syslogd to the fabric interconnect to which it is connected.
  • CIMC—All syslog entries are sent to the primary fabric interconnect in a cluster configuration.
  • Adapter—All syslog entries are sent by NIC-Tools/Syslog to both fabric interconnects.
  • Cisco UCS Manager—Self-generated syslog entries are logged according to the syslog configuration.