Monitoring System Processes and Logs
This chapter provides details on monitoring the health of the switch and includes the following sections.
Information About System Processes and Logs
This section includes the following topics:
Saving Cores
You can save cores (from the active supervisor module, the standby supervisor module, or any switching module) to an external CompactFlash (slot 0) or to a TFTP server in one of two ways:
- On demand—Copies a single file based on the provided process ID.
- Periodically—Copies core files periodically as configured by the user.
A new scheme overwrites any previously issued scheme. For example, if you perform another core log copy task, the cores are periodically saved to the new location or file.
Saving the Last Core to Bootflash
The last core dump is automatically saved to bootflash in the /mnt/pss/ partition before the switchover or reboot occurs. Three minutes after the supervisor module reboots, the saved last core is restored from the flash partition (/mnt/pss) back to its original RAM location. This restoration is a background process and is not visible to the user.
Tip | The timestamp on the restored last core file shows the time when the supervisor booted up, not when the last core was actually dumped. To obtain the exact time of the last core dump, check the corresponding log file with the same PID. |
To view the last core information, enter the show cores command in EXEC mode.
To view the time of the actual last core dump, enter the show process log command in EXEC mode.
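The PID matching described in the tip can be sketched as follows. This is only an illustration: the function name `actual_dump_time`, the dictionary keys, and the sample values are hypothetical and do not reflect the real output format of the show cores or show process log commands.

```python
def actual_dump_time(core_pid, process_log):
    """Look up the real dump time for a core by matching its PID.

    `process_log` is a hypothetical list of dicts with "pid" and "time"
    keys, standing in for entries read from the process log.
    """
    for entry in process_log:
        if entry["pid"] == core_pid:
            return entry["time"]
    # No log entry with this PID was found.
    return None


# Hypothetical sample data, not taken from a real switch.
log_entries = [
    {"pid": 1473, "time": "Jan 27 04:08:00"},
    {"pid": 1524, "time": "Jan 29 09:15:42"},
]
print(actual_dump_time(1524, log_entries))
```

The restored core file itself carries the boot time, so the lookup deliberately ignores the file timestamp and relies on the PID alone.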
First and Last Core
The first and last core feature makes the best use of limited system resources by retaining the most important core files. Generally, the first core and the most recently generated core contain the information needed for debugging, so the feature tries to retain both.
If the core files are generated from an active supervisor module, the number of core files for the service is defined in the service.conf file. There is no upper limit on the total number of core files in the active supervisor module.
To display the core files saved in the system, use the show cores command.
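The retention idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the NX-OS implementation: the function name `retain_first_and_last`, the per-service limit `max_cores`, and the core names are hypothetical, and filling the remaining slots with the newest cores is an assumption of this sketch.

```python
def retain_first_and_last(cores, max_cores):
    """Return the cores to keep: the oldest one plus the most recent ones.

    `cores` is a list ordered from oldest to newest. `max_cores` is a
    hypothetical per-service limit (an assumption for this sketch).
    """
    if max_cores < 1:
        return []
    if len(cores) <= max_cores:
        return list(cores)
    if max_cores == 1:
        # Only one slot: this sketch keeps the most recent core.
        return [cores[-1]]
    # Keep the first core, then fill the remaining slots with the newest cores.
    return [cores[0]] + cores[-(max_cores - 1):]


kept = retain_first_and_last(["core.1", "core.2", "core.3", "core.4"], 2)
print(kept)  # ['core.1', 'core.4']
```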
Online System Health Management
The Online Health Management System (OHMS) (system health) is a hardware fault detection and recovery feature. It ensures the general health of switching, services, and supervisor modules in any switch in the Cisco MDS 9000 Family.
The OHMS monitors system hardware in the following ways:
- The OHMS component running on the active supervisor maintains control over all other OHMS components running on the other modules in the switch.
- The system health application running in the standby supervisor module only monitors the standby supervisor module, if that module is available in the HA standby mode.
The OHMS application launches a daemon process in all modules and runs multiple tests on each module to test individual module components. The tests run at preconfigured intervals, cover all major fault points, and isolate any failing component in the MDS switch.
On detecting a fault, the system health application attempts the following recovery actions:
- Performs additional testing to isolate the faulty component.
- Attempts to reconfigure the component by retrieving its configuration information from persistent storage.
- If unable to recover, sends Call Home notifications, system messages, and exception logs, and then shuts down and discontinues testing the failed module or component (such as an interface).
- Sends Call Home and system messages and exception logs as soon as it detects a failure.
- Shuts down the failing module or component (such as an interface).
- Isolates failed ports from further testing.
- Reports the failure to the appropriate software component.
- Switches to the standby supervisor module, if an error is detected on the active supervisor module and a standby supervisor module exists in the Cisco MDS switch. After the switchover, the new active supervisor module restarts the active supervisor tests.
- Reloads the switch if a standby supervisor module does not exist in the switch.
- Provides CLI support to view, test, and obtain test run statistics or change the system health test configuration on the switch.
- Performs tests to focus on the problem area.
Each module is configured to run the test relevant to that module. You can change the default parameters of the test in each module as required.
Loopback Test Configuration Frequency
Loopback tests are designed to identify hardware errors in the data path in the modules and the control path in the supervisors. One loopback frame is sent to each module at a preconfigured frequency. The frame passes through each configured interface and returns to the supervisor module.
The loopback tests can be run at frequencies ranging from 5 seconds (default) to 255 seconds. If you do not configure the loopback frequency value, the default frequency of 5 seconds is used for all modules in the switch. Loopback test frequencies can be altered for each module.
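The per-module default and range described above can be sketched as follows. The 5-second default and the 5 to 255 second range come from the text; the function name `effective_frequency`, the dictionary layout, and the sample module numbers are hypothetical.

```python
DEFAULT_FREQUENCY_S = 5   # default stated in the text
MIN_FREQUENCY_S = 5
MAX_FREQUENCY_S = 255


def effective_frequency(per_module, module):
    """Return the loopback frequency in seconds for `module`.

    `per_module` maps module numbers to configured values; a module
    without an entry falls back to the 5-second default.
    """
    value = per_module.get(module, DEFAULT_FREQUENCY_S)
    if not MIN_FREQUENCY_S <= value <= MAX_FREQUENCY_S:
        raise ValueError(
            f"frequency must be {MIN_FREQUENCY_S}-{MAX_FREQUENCY_S} seconds, got {value}"
        )
    return value


configured = {8: 50}  # hypothetical: module 8 set to 50 seconds
print(effective_frequency(configured, 8))  # 50
print(effective_frequency(configured, 2))  # 5
```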
Loopback Test Configuration Frame Length
Loopback tests are designed to identify hardware errors in the data path in the modules and the control path in the supervisors. One loopback frame is sent to each module at a preconfigured size. The frame passes through each configured interface and returns to the supervisor module.
The loopback tests can be run with frame sizes ranging from 0 bytes to 128 bytes. If you do not configure the loopback frame length value, the switch generates random frame lengths for all modules in the switch (auto mode). Loopback test frame lengths can be altered for each module.
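The difference between a configured frame length and auto mode can be sketched as follows. The 0 to 128 byte range comes from the text; the function name `next_frame_length` is hypothetical, and drawing lengths uniformly at random is an assumption, since the text does not say how the switch picks random lengths.

```python
import random

MIN_FRAME_LENGTH = 0
MAX_FRAME_LENGTH = 128


def next_frame_length(configured=None, rng=random):
    """Return the length in bytes of the next loopback frame.

    With no configured value the sketch mimics auto mode by drawing a
    random length; the uniform distribution is an assumption.
    """
    if configured is None:
        # randint includes both endpoints.
        return rng.randint(MIN_FRAME_LENGTH, MAX_FRAME_LENGTH)
    if not MIN_FRAME_LENGTH <= configured <= MAX_FRAME_LENGTH:
        raise ValueError(
            f"frame length must be {MIN_FRAME_LENGTH}-{MAX_FRAME_LENGTH} bytes, got {configured}"
        )
    return configured


print(next_frame_length(64))  # 64
print(next_frame_length())    # a random value between 0 and 128
```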
Hardware Failure Action
The failure-action command controls whether the Cisco NX-OS software takes any action when a hardware failure is detected while running the tests.
By default, this feature is enabled in all switches in the Cisco MDS 9000 Family: if a failure is detected, action is taken and the failed component is isolated from further testing.
Failure action is controlled at individual test levels (per module), at the module level (for all tests), or for the entire switch.
Performing Test Run Requirements
Enabling a test does not guarantee that the test will run.
Tests on a specific interface or module only run if you enable system health for all of the following items:
- The entire switch
- The required module
- The required interface
Tip | The test does not run if system health is disabled at any of these levels. If system health is disabled for running tests, the test status shows up as disabled. |
Tip | If a specific module or interface is enabled to run tests but is not running them because system health is disabled, the tests show up as enabled (not running). |
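The run requirements can be sketched as a small decision function. This is one reading of the two tips above rather than documented behavior: the function name `health_test_status` and the exact mapping from flag combinations to status labels are assumptions.

```python
def health_test_status(switch_enabled, module_enabled, interface_enabled):
    """Return the status label for a test on one interface.

    One reading of the tips above: the test runs only when all three
    levels are enabled; if the module and interface are enabled but
    switch-wide system health is off, it is "enabled (not running)".
    """
    if switch_enabled and module_enabled and interface_enabled:
        return "running"
    if module_enabled and interface_enabled:
        return "enabled (not running)"
    return "disabled"


for flags in [(True, True, True), (False, True, True), (True, False, True)]:
    print(flags, "->", health_test_status(*flags))
```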
Tests for a Specified Module
The system health feature in the NX-OS software performs tests in the following areas:
- Active supervisor’s in-band connectivity to the fabric.
- Standby supervisor’s arbiter availability.
- Bootflash connectivity and accessibility on all modules.
- EOBC connectivity and accessibility on all modules.
- Data path integrity for each interface on all modules.
- Management port’s connectivity.
- User-driven test for external connectivity verification; the port is shut down during the test (Fibre Channel ports only).
- User-driven test for internal connectivity verification (Fibre Channel and iSCSI ports).
Clearing Previous Error Reports
You can clear the error history for Fibre Channel interfaces, iSCSI interfaces, an entire module, or one particular test for an entire module. By clearing the history, you are directing the software to retest all failed components that were previously excluded from tests.
If you previously disabled the failure-action option for a period of time (for example, one week) to prevent OHMS from taking any action when a failure is encountered, and you are now ready to start receiving these errors again, you must clear the system health error status for each test.
Tip | The management port test cannot be run on a standby supervisor module. |
Interpreting the Current Status
The status of each module or test depends on the current configured state of the OHMS test in that particular module (see Table 1).
Status | Description |
---|---|
Enabled | You have currently enabled the test in this module and the test is not running. |
Disabled | You have currently disabled the test in this module. |
Running | You have enabled the test and the test is currently running in this module. |
Failing | This state is displayed if a failure is imminent for the test running in this module. The test can still recover in this state. |
Failed | The test has failed in this module and the state cannot be recovered. |
Stopped | The test has been internally stopped in this module by the Cisco NX-OS software. |
Internal failure | The test encountered an internal failure in this module. For example, the system health application is not able to open a socket as part of the test procedure. |
Diags failed | The startup diagnostics have failed for this module or interface. |
On demand | The system health external-loopback or the system health internal-loopback tests are currently running in this module. Only these two commands can be issued on demand. |
Suspended | Only encountered in the MDS 9100 Series due to one oversubscribed port moving to an E or TE port mode. If one oversubscribed port moves to this mode, the other three oversubscribed ports in the group are suspended. |
The status of each test in each module appears in the output of any of the show system health commands.
On-Board Failure Logging
The Generation 2 Fibre Channel switching modules provide the facility to log failure data to persistent storage, which can be retrieved and displayed for analysis. This on-board failure logging (OBFL) feature stores failure and environmental information in nonvolatile memory on the module. The information will help in post-mortem analysis of failed cards.
OBFL data is stored in the existing CompactFlash on the module. OBFL uses the persistent logging (PLOG) facility available in the module firmware to store data in the CompactFlash. It also provides the mechanism to retrieve the stored data.
The data stored by the OBFL facility includes the following:
- Time of initial power-on
- Slot number of the card in the chassis
- Initial temperature of the card
- Firmware, BIOS, FPGA, and ASIC versions
- Serial number of the card
- Stack trace for crashes
- CPU hog information
- Memory leak information
- Software error messages
- Hardware exception logs
- Environmental history
- OBFL specific history information
- ASIC interrupt and error statistics history
- ASIC register dumps
Default Settings
Table 1 lists the default system health and log settings.
Parameters | Default |
---|---|
Kernel core generation | One module |
System health | Enabled |
Loopback frequency | 5 seconds |
Failure action | Enabled |
Core and Log Files
This section includes the following topics:
Clearing the Core Directory
Use the clear cores command to clean out the core directory. The software clears all the core files and other cores present on the active supervisor module.
switch# clear cores
Before you begin
Ensure that SSH2 is enabled on this switch.
To clear the cores on a switch, follow these steps:
Procedure
Step 1 | Click Clear to clear the cores. The software keeps the last few cores, per service and per slot, and clears all the core files and other cores present on the active supervisor module. |
Step 2 | Click Close to close the dialog box. |
Configuring System Health
The Online Health Management System (OHMS) (system health) is a hardware fault detection and recovery feature. It ensures the general health of switching, services, and supervisor modules in any switch in the Cisco MDS 9000 Family.
This section includes the following topics:
Task Flow for Configuring System Health
Follow these steps to configure system health:
Procedure
Step 1 | Enable System Health Initiation. |
Step 2 | Configure Loopback Test Configuration Frequency. |
Step 3 | Configure Loopback Test Configuration Frame Length. |
Step 4 | Configure Hardware Failure Action. |
Step 5 | Perform Test Run Requirements. |
Step 6 | Clear Previous Error Reports. |
Step 7 | Perform Internal Loopback Tests. |
Step 8 | Perform External Loopback Tests. |
Step 9 | Perform Serdes Loopbacks. |
Performing Internal Loopback Tests
You can run manual loopback tests to identify hardware errors in the data path in the switching or services modules, and the control path in the supervisor modules. Internal loopback tests send and receive FC2 frames to and from the same ports and provide the round-trip time taken in microseconds. These tests are available for Fibre Channel, IPS, and iSCSI interfaces.
Note | If the test fails to complete successfully, the software analyzes the failure and prints the following error: External loopback test on interface fc 7/2 failed. Failure reason: Failed to loopback, analysis complete Failed device ID 3 on module 1 |
Choose Interface > Diagnostics > Internal to perform an internal loopback test from Device Manager.
Performing External Loopback Tests
You can run manual loopback tests to identify hardware errors in the data path in the switching or services modules, and the control path in the supervisor modules. External loopback tests send and receive FC2 frames to and from the same port or between two ports.
You need to connect a cable (or a plug) to loop the Rx port to the Tx port before running the test. If you are testing to and from the same port, you need a special loop cable. If you are testing to and from different ports, you can use a regular cable. This test is only available for Fibre Channel interfaces.
Note | If the test fails to complete successfully, the software analyzes the failure and prints the following error: External loopback test on interface fc 7/2 failed. Failure reason: Failed to loopback, analysis complete Failed device ID 3 on module 1 |
Choose Interface > Diagnostics > External to perform an external loopback test from Device Manager.
Performing Serdes Loopbacks
Serializer/Deserializer (serdes) loopback tests the hardware for a port. These tests are available for Fibre Channel interfaces.
Note | If the test fails to complete successfully, the software analyzes the failure and prints the following error: External loopback test on interface fc 3/1 failed. Failure reason: Failed to loopback, analysis complete Failed device ID 3 on module 3. |
Configuring On-Board Failure Logging
The Generation 2 Fibre Channel switching modules provide the facility to log failure data to persistent storage, which can be retrieved and displayed for analysis. This on-board failure logging (OBFL) feature stores failure and environmental information in nonvolatile memory on the module. The information will help in post-mortem analysis of failed cards.
Verifying System Processes and Logs Configuration
To display the system processes and logs configuration information, perform one of the following tasks:
Command | Purpose |
---|---|
show processes | Displays system processes |
show system | Displays system-related status information |
show system cores | Displays the currently configured scheme for copying cores |
show system health | Displays system health status information |
show system health loopback frame-length | Verifies the loopback frame length configuration |
show logging onboard status | Displays the configuration status of OBFL |
For detailed information about the fields in the output from these commands, refer to the Cisco MDS 9000 Family Command Reference.
This section includes the following topics:
Displaying System Processes
To obtain general information about all processes, follow these steps:
Procedure
Step 1 | Choose Admin > Running Processes. You see the Running Processes dialog box. |
Step 2 | Click Close to close the dialog box. |
Displaying System Status
The show system reset-reason command displays the following information:
- In a Cisco MDS 9513 Director, the last four reset-reason codes for the supervisor module in slot 7 and slot 8 are displayed. If either supervisor module is absent, the reset-reason codes for that supervisor module are not displayed.
- In a Cisco MDS 9506 or Cisco MDS 9509 switch, the last four reset-reason codes for the supervisor module in slot 5 and slot 6 are displayed. If either supervisor module is absent, the reset-reason codes for that supervisor module are not displayed.
- In a Cisco MDS 9200 Series switch, the last four reset-reason codes for the supervisor module in slot 1 are displayed.
- The show system reset-reason module number command displays the last four reset-reason codes for a specific module in a given slot. If a module is absent, then the reset-reason codes for that module are not displayed.
The clear system reset-reason command clears the stored reset-reason information:
- In a Cisco MDS 9500 Series switch, this command clears the reset-reason information stored in NVRAM in the active and standby supervisor modules.
- In a Cisco MDS 9200 Series switch, this command clears the reset-reason information stored in NVRAM in the active supervisor module.
The show system resources command displays system-related CPU and memory statistics:
- Load average: Displays the number of running processes. The average reflects the system load over the past 1, 5, and 15 minutes.
- Processes: Displays the number of processes in the system, and how many are actually running when the command is issued.
- CPU states: Displays the CPU usage percentage in user mode, kernel mode, and idle time in the last one second.
- Memory usage: Displays the total memory, used memory, free memory, memory used for buffers, and memory used for cache in KB. Buffers and cache are also included in the used memory statistics.
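Because buffers and cache are counted inside the used memory figure, the memory held by everything else has to be derived by subtraction. A minimal sketch of that arithmetic follows; the function name `memory_excluding_buffers_and_cache` and the sample values are hypothetical and not taken from a real switch.

```python
def memory_excluding_buffers_and_cache(used_kb, buffers_kb, cache_kb):
    """Return used memory in KB after subtracting buffers and cache.

    The reported "used" figure already includes buffers and cache,
    so they are subtracted to see what remains.
    """
    return used_kb - buffers_kb - cache_kb


# Hypothetical values in KB.
print(memory_excluding_buffers_and_cache(313424, 3620, 22278))  # 287526
```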
To display system status from Device Manager, follow these steps:
Procedure
Step 1 | Choose Physical > System. You see the System dialog box. |
Step 2 | Click Close to close the dialog box. |
Displaying Core Status
To display cores on a switch, follow these steps:
Note | Ensure that SSH2 is enabled on this switch. |
Procedure
Step 1 | Choose Admin > Show > Cores. You see the Show Cores dialog box. Module-num shows the slot number on which the core was generated. |
Step 2 | Click Close to close the dialog box. |
Additional References
For additional information related to implementing system processes and logs, see the following section:
MIBs
MIBs | MIBs Link |
---|---|
 | To locate and download MIBs, go to the following URL: http://www.cisco.com/en/US/products/ps5989/prod_technical_reference_list.html |
Feature History for System Processes and Logs
Table 1 lists the release history for this feature. Only features that were introduced or modified in Release 3.x or a later release appear in the table.
Feature Name | Releases | Feature Information |
---|---|---|
Common Information Model | 3.3(1a) | Added commands for displaying Common Information Model. |
On-line system health maintenance (OHMS) enhancements | 3.0(1) | Includes the following OHMS enhancements: |
On-board failure logging (OBFL) | 3.0(1) | Describes OBFL, how to configure it for Generation 2 modules, and how to display the log information. |