Monitoring Hardware Status


This chapter describes how to use the command line interface (CLI) show commands to monitor system status and performance. These commands have related keywords that you can use to get information on all aspects of the system, from current software configuration to call activity and status.
The selection of keywords described in this chapter provides useful and in-depth information for monitoring the hardware. For additional information on these and other show command keywords, refer to the Command Line Interface Reference.
SNMP Notifications
In addition to the CLI, the system supports Simple Network Management Protocol (SNMP) notifications that indicate status and alarm conditions. Refer to the SNMP MIB Reference Guide for a detailed list.
 
 
 
Overview
This section provides routine maintenance procedures for the ASR 5000. Additional information is in the manuals for your specific application; refer to those manuals for details.
This guide is for personnel in technical support who deal with customer issues on a daily basis. This is not a troubleshooting guide, but it does provide some information that would be useful for troubleshooting.
Chassis Access and Command Line Interface
To perform the software maintenance procedures in this chapter, you need access to a chassis. Use any client that supports telnet, ssh2, or a serial cable connection to the ASR 5000. Ideally, your username and password have been configured with Administrative access rights. Some procedures can be performed with Operator rights.
In theory, for telnet or ssh access, you can specify the IP address of any interface on the chassis. Typically, you would specify the management IP address configured in the local context, which is tied to the redundant ports on the active Switch Processor I/O (SPIO) card.
Use the following settings for console access:
Save a history of the CLI commands and output from your maintenance sessions. You may need to enable this capability in your client. It is also good practice to timestamp all commands that you run. Run the timestamps command in Exec mode.
The commands in this guide are Exec mode and Config mode commands. In some situations you need to enter Config mode to change the configuration of the chassis. You need Administrative privilege to change a configuration on a live chassis.
See the “Command Line Interface Overview” chapter in the System Administration Guide for a discussion of chassis access and CLI usage.
Password and Credential Updates
The ASR 5000 has a password aging facility that forces password changes after a specified time. Use this feature or keep track of password expirations with some other method.
See the “Software Management Operations” chapter in the System Administration Guide, specifically the section “Managing Local-User Administrative Accounts”, for more information.
System Time
It is critical to maintain accurate time for the chassis to function properly. For example, the A11 protocol, which is used for call establishment, depends on the time difference between the PCF and the PDSN being within a certain range. You can increase this range with the timestamp-tolerance parameter in spi remote-address of the pdsn service configuration, but it is better not to risk the time differential drifting beyond the range. For troubleshooting, accurate timestamps across chassis are also critical for matching traces from those chassis.
The best solution is to enable NTP (Network Time Protocol) in the local context. Proper maintenance requires that you check for SNMP traps that are generated when NTP fails to synchronize. If you do not use NTP, check the time on a weekly basis and adjust it if necessary, and specify the timestamp-tolerance to account for unexpected time drift.
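The weekly drift check can be pictured with a short Python sketch. It is an illustration only, not system code: the function name within_tolerance, the sample clock readings, and the 60-second and 30-second tolerances are assumptions made for the example. Use the tolerance actually configured for your PDSN service.

```python
from datetime import datetime, timedelta


def within_tolerance(local_time: datetime, peer_time: datetime,
                     tolerance_seconds: float) -> bool:
    """Return True if the two clocks differ by no more than the tolerance."""
    drift = abs((local_time - peer_time).total_seconds())
    return drift <= tolerance_seconds


# Hypothetical readings taken from two nodes at the same moment.
pdsn_clock = datetime(2024, 1, 1, 12, 0, 0)
pcf_clock = pdsn_clock + timedelta(seconds=45)

print(within_tolerance(pdsn_clock, pcf_clock, tolerance_seconds=60))
print(within_tolerance(pdsn_clock, pcf_clock, tolerance_seconds=30))
```

With a 45-second difference, the first check passes and the second fails, which is the situation where the clock should be adjusted or NTP enabled.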
Flash File System
Keep the file system on the flash as clean as possible. Over time, various types of files, such as minicore crashes, possibly full core crashes (if specified in the configuration), old configurations, and old build images, can accumulate on the flash. On a monthly basis, check the file system and delete unnecessary files. Note that builds and full crashes tend to be the largest files.
Use the show boot command to check the boot system priority and to see a list of the configuration files and associated builds that have been specified, from the most recent entry back through up to nine earlier priorities. If the system is approaching a priority value of 1, delete priorities that will never be needed and restart numbering at a high value (such as 50).
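The renumbering idea can be sketched in Python as a planning aid. This is a hedged illustration, not a chassis procedure. It assumes that lower priority numbers belong to the more recent boot entries, and that you want to keep a few of the most recent entries in the same relative order. The function names, the floor of 5, and the sample priorities are all hypothetical.

```python
def needs_renumbering(priorities, floor=5):
    """True when the lowest priority in use is at or below the chosen floor."""
    return min(priorities) <= floor


def renumber(priorities, keep, start=50):
    """Map the `keep` lowest-numbered priorities onto a range ending at `start`.

    Assumption: lower numbers are the more recent entries, so relative
    order is preserved and the oldest kept entry receives `start`.
    """
    kept = sorted(priorities)[:keep]
    offset = start - kept[-1]
    return {old: old + offset for old in kept}


in_use = [2, 3, 4, 5, 6, 7]   # hypothetical priorities taken from show boot output
if needs_renumbering(in_use):
    print(renumber(in_use, keep=3))
```

The result is only a mapping from old to new priority values; applying it on the chassis is still a manual configuration step.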
For lab chassis, if multiple people work on the chassis, it is good practice to put each person’s configurations and personal files in separate folders.
If you make any changes to the active SMC’s file system, make sure that the physical file systems of the primary and standby SMCs are synchronized. Use the command card <spc | smc> synchronize filesystem.
See “Software Management Operations” in the System Administration Guide, specifically the section “Maintaining the Local File System”, for more information.
Logs and SNMP Traps
The ASR 5000 maintains logs and traps that show a history of the activities on the chassis. The SNMP traps are useful for quickly determining the health of the chassis and the history of what has occurred. They are fairly easy to read and normally contain only the more important events. The trap entries are a short version of the full traps that are actually generated.
Logs can be verbose and may be difficult or impossible to read for someone who is not completely familiar with the underlying architecture of the chassis. In some cases, only Cisco Systems will be able to interpret the logs.
SNMP Traps
Use the show snmp trap history command to view SNMP traps. Through v7.1, up to 400 traps can be stored on the chassis as first in first out (FIFO). Anything beyond 400 is dropped. Starting with v8.0, the capacity has been increased to store up to 5000 traps. This longer history is useful in scenarios where issues are not reported or noticed until many days after they have occurred, or when an issue has happened many times over an extended period.
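The effect of a fixed-size FIFO history can be sketched in Python. This is a model for illustration only and says nothing about how the chassis implements its trap store. It assumes that FIFO means the oldest trap is discarded when a new one arrives at capacity. The trap names are placeholders.

```python
from collections import deque


def make_trap_history(capacity: int) -> deque:
    """A bounded history: appending at capacity silently drops the oldest entry."""
    return deque(maxlen=capacity)


history = make_trap_history(400)
for n in range(1, 451):          # 450 hypothetical traps arrive
    history.append(f"trap-{n}")

print(len(history))   # 400
print(history[0])     # trap-51 (the 50 oldest entries are gone)
print(history[-1])    # trap-450
```

Under this model a larger capacity, such as 5000, simply pushes back the point at which early events disappear from the history.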
You can clear the SNMP trap history, but because the history is capped (at 400 or 5000 traps, depending on the software version) and is displayed as FIFO, there is no real benefit. As a result, this is not considered part of maintenance. To clear traps, use the command clear snmp trap history.
To allow more space for traps that are more important to capture, you can suppress less important traps. Extend the timeframe used by the traps with the command snmp trap suppress. To restrict the number of times a certain trap is sent over a specified period of time, use the command snmp notif-threshold.
You can configure the chassis to send SNMP traps to an SNMP server application and/or a Cisco Systems EMS server. In the local context, use commands such as snmp target and snmp community. Maintain the trap server so that it can store traps for an extended period of time. One month is an absolute minimum; three or six months, if space allows, is recommended. Make sure that the server can handle the trap volume it receives, often from multiple sources besides the ASR 5000, and that it is configured to delete traps older than a specified time frame.
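A retention policy on the trap server can be sketched in Python as follows. This is a minimal illustration, not a feature of any particular trap server. The record layout (a dict with a "received" timestamp and a "name"), the function name prune_old_traps, and the 90-day window (roughly three months) are assumptions.

```python
from datetime import datetime, timedelta


def prune_old_traps(traps, now, retention_days):
    """Keep only the traps received within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [t for t in traps if t["received"] >= cutoff]


now = datetime(2024, 6, 30)
stored = [
    {"name": "PortLinkDown", "received": datetime(2024, 1, 15)},  # older than 90 days
    {"name": "CardOffline", "received": datetime(2024, 5, 20)},
    {"name": "TaskFailed", "received": datetime(2024, 6, 29)},
]

kept = prune_old_traps(stored, now, retention_days=90)
print([t["name"] for t in kept])
```

Running a job like this on a schedule keeps the store bounded while preserving the recommended history.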
Do not ignore significant traps, which include AAA Unreachable, ManagerFailure or TaskFailed (pointing to potential failures), BGPPeerSessionDown, SRPConfigOutOfSync, CardSPOFAlarm, PortLinkDown, and CardOffline.
Refer to “Configuring Management Settings” in the System Administration Guide and to the SNMP MIB Reference Guide for more information.
System Logging
Logs are stored on the chassis in memory as FIFO. To view all system logs, enter the command show logs. To limit output to a certain level of logging, use the level qualifier. The range is from all logs to critical logs only. The qualifiers are: debug, trace, info, unusual, warning, error, critical.
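The behavior of the level qualifier can be modeled with a short Python sketch. It assumes that the qualifiers are listed above from least to most severe and that selecting a level shows entries at that level and above. The log entries are invented for the example, and the function name at_or_above is hypothetical.

```python
# Assumed ordering: least severe first, as listed in the text.
LEVELS = ["debug", "trace", "info", "unusual", "warning", "error", "critical"]


def at_or_above(entries, minimum_level):
    """Return the (level, message) pairs at or above the given level."""
    threshold = LEVELS.index(minimum_level)
    return [e for e in entries if LEVELS.index(e[0]) >= threshold]


# Hypothetical log entries.
entries = [
    ("info", "session established"),
    ("warning", "retry limit approaching"),
    ("error", "task restarted"),
    ("critical", "card failure"),
]

print(at_or_above(entries, "error"))
```

Under these assumptions, selecting debug returns every entry and selecting critical returns only the last one.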
Set the logging level in global config mode with the command:
logging filter runtime facility <facility> level <level> (critical-info | no-critical-info)
By default, the level for all facilities is error.
When troubleshooting a specific problem, Cisco Systems may recommend turning on various levels of logging for various facilities. Configure each facility separately. Normally, only one or two facilities need to have logging enabled beyond the default error level. View the current settings in Exec mode with the command show logging.
Logs can fill up the system quickly if they are enabled beyond the error level for any facility. When you are finished troubleshooting, whether that takes hours, days, or weeks, do not forget to disable the additional logging for the respective facilities. Otherwise, while you are troubleshooting another issue, unnecessary logs could fill the buffer space and overwrite the logs you need for a particular time frame.
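A simple audit for facilities left at a verbose level can be sketched in Python. This is a hedged illustration. It assumes you have already transcribed the per-facility levels from show logging into a dictionary, since parsing the command output is not shown. It also assumes the levels run from debug (most verbose) to critical. The facility names are placeholders.

```python
# Assumed ordering: most verbose first.
LEVELS = ["debug", "trace", "info", "unusual", "warning", "error", "critical"]
DEFAULT_LEVEL = "error"


def facilities_to_reset(current_levels):
    """Return the facilities logging more verbosely than the default level."""
    default_rank = LEVELS.index(DEFAULT_LEVEL)
    return sorted(name for name, level in current_levels.items()
                  if LEVELS.index(level) < default_rank)


# Hypothetical settings transcribed by hand from show logging.
current = {
    "facility-a": "debug",
    "facility-b": "error",
    "facility-c": "info",
    "facility-d": "critical",
}

print(facilities_to_reset(current))
```

Each facility in the result is a candidate for returning to the default level once troubleshooting is complete.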
To minimize the amount of logged data, the system allows you to restrict the generation of a specific event ID or a range of event IDs to those that are most useful. Use the command logging disable eventid to save buffer space.
You can also save logs to a file locally or remotely with the command save logs. If you save them locally, delete the local file after you have copied it to a remote location.
As with SNMP traps, the safest way to address logging buffer size constraints is to configure the chassis to send logging data to a syslog server application so that no data is lost. In the local context, enter the command logging syslog. Note that it is still important to remove unnecessary filters as discussed above so that the on-chassis buffer stays useful: viewing the logs directly on the ASR 5000 makes troubleshooting faster because it avoids the delay of retrieving logs from a syslog server. Maintain the server so that it can store logs for an extended period of time. One month is an absolute minimum; three months, if space allows, is recommended. Make sure that the server can handle the volume it receives, which often includes sources in addition to the ASR 5000, and that it is configured to delete files older than a specified time frame.
Refer to “Configuring and Viewing System Logs” in the System Administration Guide for more information.
Full Core and Mini-core Dumps
In the event of a failure, the system is designed to recover, restart processes, re-establish connections and communication, and possibly switch to a redundant packet processing card and/or SMC. Nonetheless, the failure information may be very useful to Cisco Systems Support and Engineering, who can determine whether this is a known or new problem, and whether it can be reproduced so that a fix can be included in a future build. Always report new failure information to Cisco Systems so that it can be evaluated and corrected.
To view a list of failures, enter the command show crash list. To see the details of the failure, enter show crash number x. Through v6.0, the system can store the history list of up to 30 failures. However, the list is not FIFO, and new failures are not saved to the list, which remains static until it is cleared. For these older versions, check the failure list monthly to see if the list has grown. It is important to stay apprised of failure occurrences through the SNMP trap history and report unknown issues to Cisco Systems. It is not unusual for the same type of failure to occur more than once. Knowing the frequency can be valuable for troubleshooting and tracking. The actual failure data may not be as useful because you have already reported it.
On a monthly basis, clear the failure list with the command clear crash list. This command also clears the actual failure data, so save anything you need first. Run show support details to save the failure list and all the data for all failures, or run the individual commands if you want to save the data separately.
In addition to the failure list and associated data, there is an abridged version of failure information known as a mini-core that is stored locally in the /flash/crash directory. Run the command dir /flash/crash to see all the files with their dates and times. Match the files with the failure list described above. These can be valuable troubleshooting tools for Cisco Systems Technical Support, though normally the full failure information is the most useful. On a monthly basis, clear out the mini-core files, keeping only the files needed for troubleshooting specific failure issues.
Full failure information is too large to store in memory, but it may be useful to save it for analysis by Cisco Systems Engineering. Specify a local or remote storage location by entering the local context config command crash enable url. It is recommended that you store the data remotely to avoid running out of space on the flash. If you decide to store locally, check for failures on a weekly basis. If there are failure log files on the flash, save them to another location if they are needed for troubleshooting, then delete them from the flash.
See “Configuring and Viewing System Logs” in the System Administration Guide, specifically the section “Configuring and Viewing Software Crash Logging Parameters”, for more information.
Alerts and Alarms
Use thresholds to monitor the system and alert personnel of potentially bad conditions. The system sends alerts at the end of every polling period when a threshold value set for a given condition is exceeded. Alarms are triggered once when a high threshold value is met, and then cleared when a low threshold is met. To view all alarms, enter the show alarm all command. To clear the alarm, enter clear alarm.
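The difference between the two models can be illustrated with a small Python sketch. This is a simplified model, not the system's implementation. It assumes one sample per polling period, that an alert fires when the value is strictly above the threshold, and that an alarm is raised at or above the high threshold and cleared at or below the low threshold. The sample values and thresholds are invented.

```python
def alert_periods(samples, threshold):
    """Alert model: one alert for every polling period above the threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def alarm_events(samples, high, low):
    """Alarm model: raise once at the high threshold, clear at the low one."""
    events = []
    active = False
    for i, value in enumerate(samples):
        if not active and value >= high:
            active = True
            events.append((i, "raise"))
        elif active and value <= low:
            active = False
            events.append((i, "clear"))
    return events


samples = [40, 85, 90, 70, 88, 45, 30]   # hypothetical utilization per period

print(alert_periods(samples, threshold=80))     # [1, 2, 4]
print(alarm_events(samples, high=80, low=50))   # [(1, 'raise'), (5, 'clear')]
```

In this model the alert mode reports three separate periods, while the alarm mode reports a single condition that stays active until the value falls to the low threshold.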
Whether you choose alarm or alert mode, the system generates both SNMP traps and logs. Typically, system monitoring depends more on SNMP traps than on logs. The traps are sent to a trap server monitored by personnel who respond to the issue.
You can monitor all of the current alarms with the show alarm command, which gives a quick snapshot of all issues. With traps and logs, you must review a list and determine what is currently an issue and what is not. Both have their place, but you should not depend solely on the CLI-based alarm system. Even if you proactively run the show alarm command on a regular basis, for example every 15 minutes, to monitor system health, a lot can happen in a short period of time. Use trap notifications and bulkstats data, discussed below, as the primary mechanism to determine that a problem has occurred.
Preventative maintenance involves setting up sensible thresholds, adjusting thresholds that may trigger unnecessarily, and, most importantly, not ignoring reported traps and alarms.
The seriousness of various traps and the time available to react to them vary significantly. A trap server that can highlight or categorize severity is a critical component of a complete PDSN solution. Personnel need to know when to respond with urgency to situations that could quickly become detrimental.
See “Configuring Thresholding” in the System Administration Guide, for more information.
Counters, Hardware Health Check, Status, and Bulkstats
The ASR 5000 maintains many counters for gathering statistics and for troubleshooting. In general, you should leave the counters alone and let them increase over time. Unlike show commands that give the current state (for example, the current number of calls), they account for the entire history since the chassis booted. See the online help for a list of choices. A partial list of counters to choose from is:
 
At times you may want to clear the counters so that you can quickly troubleshoot issues that require comparing many counters at a glance. The commands that clear counters start with clear.
To check the integrity of the hardware and associated software, there are a number of commands that you should run on a regular basis, or when a trap or log notifies you that a card, port, fan, or CPU is malfunctioning. The value of running these commands proactively is that if you overlooked a trap, you will catch a potential issue. A partial list of commands to choose from is:
Note that the only component you need to check proactively within a specific timeframe (six months) is the air filter. Refer to “Replacing the Chassis’ Air Filter” in this guide for replacement procedures.
There are many commands for checking the health status of, for example, various services, processes, CPU load, and call volume. A partial list of basic commands is:
For inter-chassis session recovery (ICSR):
A bulk statistics feature allows you to push a very large array of statistical data to a remote server. Cisco Systems supplies an EMS server for quickly and easily viewing this information. Implementing Bulkstats, especially with Cisco Systems’ EMS, can assist with maintenance. You can obtain detailed information about the chassis’ condition, particularly over extended periods of time, efficiently, accurately, and graphically.
See “Configuring Bulk Statistics” in the System Administration Guide for more information.
Licensing and IP Pool Utilization
In terms of problem prevention, it is very important that licensing values on the system are high enough to handle the call volume during peak time. Another licensing value consideration is a failover scenario where the chassis may need to handle all the traffic from another chassis. Configure the threshold for license utilization with threshold license-remaining-sessions. Related to this is IP pool utilization, where you can monitor the amount of pool resources for a specific pool.
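The capacity reasoning can be sketched with a few lines of Python. This is an illustrative calculation only. The function names and all session counts are hypothetical, and the failover check assumes the worst case in which both chassis peak at the same time.

```python
def remaining_sessions(licensed: int, in_use: int) -> int:
    """Sessions still available under the license."""
    return licensed - in_use


def covers_failover(licensed: int, own_peak: int, peer_peak: int) -> bool:
    """True if the license can absorb this chassis' peak plus a failed peer's peak."""
    return licensed >= own_peak + peer_peak


licensed = 500_000    # hypothetical licensed session count
own_peak = 280_000    # hypothetical peak on this chassis
peer_peak = 260_000   # hypothetical peak on the chassis it backs up

print(remaining_sessions(licensed, own_peak))          # 220000
print(covers_failover(licensed, own_peak, peer_peak))  # False
```

In this example the chassis has ample headroom for its own peak but would run short by 40,000 sessions if it had to take over the peer's traffic.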
Make sure that you use the licenses you have purchased and apply them to the chassis in a timely manner. If you delay, you are at risk of eventually running out of licenses.
After you apply the licenses, remember to save the configuration to the current config file (lowest boot priority) and to synchronize the SMCs so that both are updated. Otherwise, if a switchover occurs, the old limits will be in effect on the card that takes over.
See the “Software Management Operations” chapter in the System Administration Guide, specifically the section “Managing License Keys”, for more information.
Software Upgrades
One of the best ways to maintain a robust system is to make sure the latest released software version or build is active. Cisco Systems continually improves the quality of the software in new releases. It is always possible that a new release introduces new bugs, but in general, it is better to run the latest versions. Also, bug fixes may not always be applied to older versions. You may decide that internal lab testing is required before deploying live, but do not let that requirement cause you to move to new versions only infrequently.
Whenever you move a new build onto the chassis, check its integrity with the command show version /flash/<filename>.
See “Software Management Operations” in the System Administration Guide, specifically the section “Software Upgrades”, for more information.
Quick Reference
This section contains a quick reference of the frequency with which to perform various maintenance operations on the ASR 5000 chassis. Note that this is only a guide, and that the frequency of some tasks will vary according to various factors not controlled by Cisco Systems. These include, but are not limited to:
 
Constant Attention
 
Daily
 
Weekly
Check the clock if NTP is not enabled.
Monthly
 
6 Months
Change the air filter.
No Specific Time Frame
 
 
 

Cisco Systems Inc.
Tel: 408-526-4000
Fax: 408-527-0883