System Management Configuration Guide, Cisco DCNM for SAN, Release 6.x
Monitoring System Processes and Logs

Table Of Contents

Monitoring System Processes and Logs

Information About System Processes and Logs

Saving Cores

Saving the Last Core to Bootflash

First and Last Core

Online System Health Management

Loopback Test Configuration Frequency

Loopback Test Configuration Frame Length

Hardware Failure Action

Performing Test Run Requirements

Tests for a Specified Module

Clearing Previous Error Reports

Interpreting the Current Status

On-Board Failure Logging

Default Settings

Clearing the Core Directory

Configuring System Health

Performing Internal Loopback Tests

Performing External Loopback Tests

Verifying System Processes and Logs Configuration

Displaying System Processes

Displaying System Status

Displaying Core Status

Additional References

MIBs


Monitoring System Processes and Logs


This chapter provides details on monitoring the health of the switch and includes the following sections:

Information About System Processes and Logs

Default Settings

Clearing the Core Directory

Configuring System Health

Verifying System Processes and Logs Configuration

Additional References

Information About System Processes and Logs

This section includes the following topics:

Saving Cores

Saving the Last Core to Bootflash

First and Last Core

Online System Health Management

Loopback Test Configuration Frequency

Loopback Test Configuration Frame Length

Hardware Failure Action

Performing Test Run Requirements

Tests for a Specified Module

Clearing Previous Error Reports

Interpreting the Current Status

On-Board Failure Logging

Saving Cores

You can save cores (from the active supervisor module, the standby supervisor module, or any switching module) to an external CompactFlash (slot 0) or to a TFTP server in one of two ways:

On demand—Copies a single file based on the provided process ID.

Periodically—Copies core files periodically as configured by the user.

A new scheme overwrites any previously issued scheme. For example, if you perform another core log copy task, the cores are periodically saved to the new location or file.
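The replacement behavior can be pictured with a small model. This is a minimal sketch, not switch software: the class names, field names, and destination strings are hypothetical and only illustrate that at most one copy scheme is active at a time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CoreCopyScheme:
    """Hypothetical record of where core files are copied."""
    destination: str                  # illustrative value such as "slot0:" or a TFTP URL
    periodic: bool                    # True = periodic copy, False = on-demand copy
    process_id: Optional[int] = None  # only meaningful for on-demand copies


class CoreCopyConfig:
    """Holds at most one active scheme; a new scheme replaces the old one."""

    def __init__(self) -> None:
        self.active: Optional[CoreCopyScheme] = None

    def apply(self, scheme: CoreCopyScheme) -> Optional[CoreCopyScheme]:
        """Install `scheme` and return the scheme it replaced, if any."""
        if not scheme.periodic and scheme.process_id is None:
            raise ValueError("an on-demand copy needs a process ID")
        previous, self.active = self.active, scheme
        return previous


config = CoreCopyConfig()
config.apply(CoreCopyScheme("slot0:", periodic=True))
# Issuing a second scheme overwrites the first one.
replaced = config.apply(CoreCopyScheme("tftp://192.0.2.10/cores", periodic=True))
```

After the second call, only the TFTP destination remains active, which mirrors the statement that cores are then saved to the new location.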

Saving the Last Core to Bootflash

The last core dump is automatically saved to bootflash in the /mnt/pss/ partition before the switchover or reboot occurs. Three minutes after the supervisor module reboots, the saved last core is restored from the flash partition (/mnt/pss) back to its original RAM location. This restoration is a background process and is not visible to the user.


Tip The timestamp on the restored last core file displays the time when the supervisor booted up, not the time when the last core was actually dumped. To obtain the exact time of the last core dump, check the corresponding log file with the same PID.


First and Last Core

The first and last core feature makes the best use of limited system resources by retaining only the most important core files. Generally, the first core and the most recently generated core contain the information needed for debugging, so the feature tries to retain both.

If the core files are generated from an active supervisor module, the number of core files for the service is defined in the service.conf file. There is no upper limit on the total number of core files in the active supervisor module.
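The retention idea can be sketched as follows. This is an illustration under stated assumptions, not the switch's implementation: the dictionary keys `service` and `created` are hypothetical, and the sketch keeps exactly two files per service, whereas the actual number is defined in service.conf.

```python
from collections import defaultdict
from typing import Dict, List


def retain_first_and_last(cores: List[dict]) -> Dict[str, List[dict]]:
    """Keep the oldest and the newest core for each service.

    Each core is a dict with the hypothetical keys "service" and "created".
    """
    by_service = defaultdict(list)
    for core in cores:
        by_service[core["service"]].append(core)

    kept = {}
    for service, items in by_service.items():
        items.sort(key=lambda core: core["created"])
        # A single core is both the first and the last one.
        kept[service] = [items[0]] if len(items) == 1 else [items[0], items[-1]]
    return kept
```

Cores generated between the first and the last one are the ones dropped when space is limited.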

Online System Health Management

The Online Health Management System (OHMS) (system health) is a hardware fault detection and recovery feature. It ensures the general health of switching, services, and supervisor modules in any switch in the Cisco MDS 9000 Family.

The OHMS monitors system hardware in the following ways:

The OHMS component running on the active supervisor maintains control over all other OHMS components running on the other modules in the switch.

The system health application running in the standby supervisor module only monitors the standby supervisor module, if that module is available in the HA standby mode.

The OHMS application launches a daemon process in all modules and runs multiple tests on each module to test individual module components. The tests run at preconfigured intervals, cover all major fault points, and isolate any failing component in the MDS switch.

On detecting a fault, the system health application attempts the following recovery actions:

Performs additional testing to isolate the faulty component.

Attempts to reconfigure the component by retrieving its configuration information from persistent storage.

If unable to recover, sends Call Home notifications, system messages, and exception logs, then shuts down the failed module or component (such as an interface) and discontinues testing it.

Sends Call Home and system messages and exception logs as soon as it detects a failure.

Shuts down the failing module or component (such as an interface).

Isolates failed ports from further testing.

Reports the failure to the appropriate software component.

Switches to the standby supervisor module, if an error is detected on the active supervisor module and a standby supervisor module exists in the Cisco MDS switch. After the switchover, the new active supervisor module restarts the active supervisor tests.

Reloads the switch if a standby supervisor module does not exist in the switch.

Provides CLI support to view, test, and obtain test run statistics or change the system health test configuration on the switch.

Performs tests to focus on the problem area.

Each module is configured to run the test relevant to that module. You can change the default parameters of the test in each module as required.
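The supervisor-related recovery choice listed above (switch over when a standby supervisor exists, otherwise reload) can be summarized in a few lines. This is a minimal sketch; the function name and the returned strings are hypothetical labels, not OHMS output.

```python
def supervisor_recovery_action(fault_on_active: bool, standby_present: bool) -> str:
    """Return the recovery step for a fault on the active supervisor.

    The returned strings are illustrative labels only.
    """
    if not fault_on_active:
        return "none"
    # With a standby supervisor the switch fails over; without one it reloads.
    return "switchover" if standby_present else "reload"
```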

Loopback Test Configuration Frequency

Loopback tests are designed to identify hardware errors in the data path in the module(s) and the control path in the supervisors. One loopback frame is sent to each module at a preconfigured frequency—it passes through each configured interface and returns to the supervisor module.

The loopback tests can be run at frequencies ranging from 5 seconds (default) to 255 seconds. If you do not configure the loopback frequency value, the default frequency of 5 seconds is used for all modules in the switch. Loopback test frequencies can be altered for each module.
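The frequency rule can be expressed as a small validation helper. This is a sketch for illustration: the function and constant names are hypothetical, and only the 5 to 255 second range and the 5 second default come from the text above.

```python
from typing import Optional

DEFAULT_FREQUENCY_S = 5   # default stated in this section
MIN_FREQUENCY_S = 5
MAX_FREQUENCY_S = 255


def effective_loopback_frequency(configured: Optional[int] = None) -> int:
    """Return the loopback frequency in seconds for one module."""
    if configured is None:
        return DEFAULT_FREQUENCY_S
    if not MIN_FREQUENCY_S <= configured <= MAX_FREQUENCY_S:
        raise ValueError(
            f"loopback frequency must be {MIN_FREQUENCY_S}-{MAX_FREQUENCY_S} seconds"
        )
    return configured
```

Because frequencies can be altered per module, a caller would invoke this once for each module's configured value.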

Loopback Test Configuration Frame Length

Loopback tests are designed to identify hardware errors in the data path in the module(s) and the control path in the supervisors. One loopback frame is sent to each module at a preconfigured size—it passes through each configured interface and returns to the supervisor module.

The loopback tests can be run with frame sizes ranging from 0 bytes to 128 bytes. If you do not configure the loopback frame length value, the switch generates random frame lengths for all modules in the switch (auto mode). Loopback test frame lengths can be altered for each module.
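The frame length selection can be sketched in the same way. The names are hypothetical, and the sketch assumes that auto mode draws random lengths from the same 0 to 128 byte range, which the text does not state explicitly.

```python
import random
from typing import Optional

MIN_FRAME_BYTES = 0
MAX_FRAME_BYTES = 128


def loopback_frame_length(configured: Optional[int] = None,
                          rng: Optional[random.Random] = None) -> int:
    """Return the loopback frame length in bytes for one module."""
    if configured is None:
        # Auto mode: pick a random length (range is an assumption, see above).
        rng = rng or random.Random()
        return rng.randint(MIN_FRAME_BYTES, MAX_FRAME_BYTES)
    if not MIN_FRAME_BYTES <= configured <= MAX_FRAME_BYTES:
        raise ValueError(
            f"frame length must be {MIN_FRAME_BYTES}-{MAX_FRAME_BYTES} bytes"
        )
    return configured
```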

Hardware Failure Action

The failure-action command controls whether the Cisco NX-OS software takes any action when a hardware failure is detected while running the tests.

By default, this feature is enabled in all switches in the Cisco MDS 9000 Family: when a failure is detected, action is taken and the failed component is isolated from further testing.

Failure action is controlled at individual test levels (per module), at the module level (for all tests), or for the entire switch.

Performing Test Run Requirements

Enabling a test does not guarantee that the test will run.

Tests on a specific interface or module only run if you enable system health for all of the following items:

The entire switch

The required module

The required interface


Tip The test will not run if system health is disabled at any of these levels. If system health is disabled for a test, the test status shows up as disabled.



Tip If the specific module or interface is enabled to run tests, but is not running the tests due to system health being disabled, then tests show up as enabled (not running).
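One way to read the requirements and the two tips above is shown below. This is an interpretation, not documented behavior: the function name and the "running" label are hypothetical, and the sketch assumes that the module and interface settings together decide whether a test is enabled while the switch-wide setting decides whether an enabled test actually runs.

```python
def health_test_status(switch_enabled: bool,
                       module_enabled: bool,
                       interface_enabled: bool) -> str:
    """Return a status label for a test on one interface."""
    target_enabled = module_enabled and interface_enabled
    if not target_enabled:
        return "disabled"
    if not switch_enabled:
        # Enabled on the module and interface, but system health is off switch-wide.
        return "enabled (not running)"
    return "running"
```

A test runs only when all three arguments are true, which matches the list of required items.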


Tests for a Specified Module

The system health feature in the NX-OS software performs tests in the following areas:

Active supervisor's in-band connectivity to the fabric.

Standby supervisor's arbiter availability.

Bootflash connectivity and accessibility on all modules.

EOBC connectivity and accessibility on all modules.

Data path integrity for each interface on all modules.

Management port's connectivity.

User-driven test for external connectivity verification; the port is shut down during the test (Fibre Channel ports only).

User-driven test for internal connectivity verification (Fibre Channel and iSCSI ports).

Clearing Previous Error Reports

You can clear the error history for Fibre Channel interfaces, iSCSI interfaces, an entire module, or one particular test for an entire module. By clearing the history, you are directing the software to retest all failed components that were previously excluded from tests.

If you previously disabled the failure-action option for a period of time (for example, one week) to prevent OHMS from taking any action when a failure is encountered, and you are now ready to start receiving these errors again, you must clear the system health error status for each test.


Tip The management port test cannot be run on a standby supervisor module.


Interpreting the Current Status

The status of each module or test depends on the current configured state of the OHMS test in that particular module (see Table 6-1).

On-Board Failure Logging

The Generation 2 Fibre Channel switching modules provide the facility to log failure data to persistent storage, which can be retrieved and displayed for analysis. This on-board failure logging (OBFL) feature stores failure and environmental information in nonvolatile memory on the module. The information will help in post-mortem analysis of failed cards.

OBFL data is stored in the existing CompactFlash on the module. OBFL uses the persistent logging (PLOG) facility available in the module firmware to store data in the CompactFlash. It also provides the mechanism to retrieve the stored data.

The data stored by the OBFL facility includes the following:

Time of initial power-on

Slot number of the card in the chassis

Initial temperature of the card

Firmware, BIOS, FPGA, and ASIC versions

Serial number of the card

Stack trace for crashes

CPU hog information

Memory leak information

Software error messages

Hardware exception logs

Environmental history

OBFL specific history information

ASIC interrupt and error statistics history

ASIC register dumps

Default Settings

Table 6-1 lists the default system health and log settings.

Table 6-1 Default System Health and Log Settings

Parameters                Default
Kernel core generation    One module
System health             Enabled
Loopback frequency        5 seconds
Failure action            Enabled


Clearing the Core Directory

Prerequisites

Ensure that SSH2 is enabled on this switch.

Detailed Steps

To clear the cores on a switch, follow these steps:


Step 1 Click Clear to clear the cores.

The software keeps the last few cores, per service and per slot, and clears all other core files present on the active supervisor module.

Step 2 Click Close to close the dialog box.


Configuring System Health

The Online Health Management System (OHMS) (system health) is a hardware fault detection and recovery feature. It ensures the general health of switching, services, and supervisor modules in any switch in the Cisco MDS 9000 Family.

This section includes the following topics:

Performing Internal Loopback Tests

Performing External Loopback Tests

Performing Internal Loopback Tests

You can run manual loopback tests to identify hardware errors in the data path in the switching or services modules, and the control path in the supervisor modules. Internal loopback tests send and receive FC2 frames to and from the same ports and provide the round-trip time taken in microseconds. These tests are available for Fibre Channel, IPS, and iSCSI interfaces.

Choose Interface > Diagnostics > Internal to perform an internal loopback test from Device Manager.

Performing External Loopback Tests

You can run manual loopback tests to identify hardware errors in the data path in the switching or services modules, and the control path in the supervisor modules. External loopback tests send and receive FC2 frames to and from the same port or between two ports.

You need to connect a cable (or a plug) to loop the Rx port to the Tx port before running the test. If you are testing to and from the same port, you need a special loop cable. If you are testing to and from different ports, you can use a regular cable. This test is only available for Fibre Channel interfaces.

Choose Interface > Diagnostics > External to perform an external loopback test from Device Manager.
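The cabling rule for external loopback tests can be captured in a short helper. This is a sketch only: the function name, the `"fc"` type string, and port names such as `fc1/1` are hypothetical.

```python
def external_loopback_cable(interface_type: str,
                            source_port: str,
                            dest_port: str) -> str:
    """Return the kind of cable needed before running an external loopback test."""
    if interface_type != "fc":
        raise ValueError(
            "external loopback tests are only available for Fibre Channel interfaces"
        )
    # Looping a port back to itself needs the special loop cable (or plug).
    return "loop cable" if source_port == dest_port else "regular cable"
```

For example, testing `fc1/1` against itself calls for a loop cable, while testing `fc1/1` against `fc1/2` works with a regular cable.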

Verifying System Processes and Logs Configuration

This section includes the following topics:

Displaying System Processes

Displaying System Status

Displaying Core Status

Displaying System Processes

To obtain general information about all processes, follow these steps:


Step 1 Choose Admin > Running Processes.

You see the Running Processes dialog box.

Where:

ProcessId = Process ID

Name = Name of the process

MemAllocated = Sum of all the dynamically allocated memory that this process has received from the system, including memory that may have been returned

CPU Time (ms) = CPU time the process has used, in microseconds

Step 2 Click Close to close the dialog box.


Displaying System Status

To display system status from Device Manager, follow these steps:


Step 1 Choose Physical > System.

You see the System dialog box.

Step 2 Click Close to close the dialog box.


Displaying Core Status

To display cores on a switch, follow these steps:


Note Ensure that SSH2 is enabled on this switch.



Step 1 Choose Admin > Show Cores.

You see the Show Cores dialog box.

Module-num shows the slot number on which the core was generated.

Step 2 Click Close to close the dialog box.


Additional References

For additional information related to implementing system processes and logs, see the following section:

MIBs

MIBs

MIBs                    MIBs Link
CISCO-SYSTEM-EXT-MIB    To locate and download MIBs, go to the following URL:
CISCO-SYSTEM-MIB        http://www.cisco.com/en/US/products/ps5989/prod_technical_reference_list.html