Monitor Process Health

These sections explain memory and CPU monitoring from the perspective of the Cisco IOS process and the overall control plane:

Guidelines to monitor system processes

System process monitoring is a set of guidelines that

  • require processes to provide monitoring and notification of their status or health to ensure correct operation,

  • generate a syslog error message and trigger either a process restart or device reboot when a process is stuck, has crashed, or fails, and

  • help detect potential problems early, establish baselines for normal system load, and prevent outages.

These are the advantages of regular monitoring:

  • Line cards that have been operating for several years can run low on memory, which can cause major outages. Monitoring memory usage helps you identify such issues on the line cards and prevent outages.

  • Regular monitoring establishes a baseline for a normal system load. You can use this information as a basis for comparison when you upgrade hardware or software, to see if the upgrade has affected resource usage.

Cisco IOS process resources

You can view CPU utilization statistics on active processes by using the show process cpu command. To see the amount of memory being used in these processes, use the show memory command. These commands provide a representation of memory and CPU utilization only for the Cisco IOS process. They do not include information about resources used by the entire platform.

Command Examples

For example, when the show memory command is used in a system with 8 GB RAM running a single Cisco IOS process, the following memory usage is displayed:

Router#show memory
      Tracekey : 1#08d3ff66f05826cb63fb2b7325fcbed0
     
                         Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
      Processor  7FB733EC4048  3853903068   193512428  3660390640   707918492  3145727908
      reserve P  7FB733EC40A0      102404          92      102312      102312      102312
       lsmpi_io  7FB7320C11A8     6295128     6294304         824         824         412
      Dynamic heap limit(MB) 3000      Use(MB) 0
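If you collect show memory output for baselining, a short script can turn the pool rows into numbers. The following Python sketch is illustrative only: the function names and the regular expression are assumptions written for this example, and the sample rows are copied from the output above. Column layout can differ between releases, so treat the pattern as a starting point.

```python
import re

# Pool rows copied from the show memory example (header included to show it is skipped).
SAMPLE = """\
Head    Total(b)     Used(b)     Free(b)   Lowest(b)  Largest(b)
Processor  7FB733EC4048   3853903068   193512428   3660390640   707918492   3145727908
reserve P  7FB733EC40A0      102404          92      102312      102312      102312
lsmpi_io  7FB7320C11A8     6295128     6294304         824         824         412
"""

# Pool name (may contain a space), hex head address, then five byte counters.
ROW = re.compile(
    r"^\s*(.+?)\s+([0-9A-Fa-f]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def parse_show_memory(text):
    """Return {pool name: counters in bytes} for every row that matches ROW."""
    pools = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, Tracekey, and heap-limit lines do not match
        total, used, free, lowest, largest = (int(g) for g in match.groups()[2:])
        pools[match.group(1)] = {
            "total": total,
            "used": used,
            "free": free,
            "lowest": lowest,
            "largest": largest,
        }
    return pools


def used_pct(pool):
    """Percentage of the pool that is currently in use."""
    return 100.0 * pool["used"] / pool["total"]


if __name__ == "__main__":
    for name, pool in parse_show_memory(SAMPLE).items():
        print(f"{name}: {used_pct(pool):.1f}% used")
```

For the Processor pool in the example, Used(b) plus Free(b) equals Total(b), and about 5 percent of the pool is in use.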

The show process cpu command displays average CPU utilization for the Cisco IOS process, overall and for each process within it:

Router#show process cpu
      CPU utilization for five seconds: 1%/0%; one minute: 1%; five minutes: 1%
      PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
      1           1          14         71  0.00%  0.00%  0.00%   0 Chunk Manager
      2         127         872        145  0.00%  0.00%  0.00%   0 Load Meter
      3           0           1          0  0.00%  0.00%  0.00%   0 Policy bind Proc
      4           0           1          0  0.00%  0.00%  0.00%   0 Retransmission o
      5           0           1          0  0.00%  0.00%  0.00%   0 IPC ISSU Dispatc
      6          11          13        846  0.00%  0.00%  0.00%   0 RF Slave Main Th
      7           0           1          0  0.00%  0.00%  0.00%   0 EDDRI_MAIN
      8           0           1          0  0.00%  0.00%  0.00%   0 RO Notify Timers
      9        1092         597       1829  0.00%  0.01%  0.00%   0 Check heaps
      10           8          73        109  0.00%  0.00%  0.00%   0 Pool Manager
      11           0           1          0  0.00%  0.00%  0.00%   0 DiscardQ Backgro
      12           0           2          0  0.00%  0.00%  0.00%   0 Timers
      13           0          32          0  0.00%  0.00%  0.00%   0 WATCH_AFS
      14           0           1          0  0.00%  0.00%  0.00%   0 MEMLEAK PROCESS
      15        1227       40758         30  0.00%  0.02%  0.00%   0 ARP Input
      16          41        4568          8  0.00%  0.00%  0.00%   0 ARP Background
      17           0           2          0  0.00%  0.00%  0.00%   0 ATM Idle Timer
      18           0           1          0  0.00%  0.00%  0.00%   0 ATM ASYNC PROC
      19           0           1          0  0.00%  0.00%  0.00%   0 CEF MIB API
      20           0           1          0  0.00%  0.00%  0.00%   0 AAA_SERVER_DEADT
      21           0           1          0  0.00%  0.00%  0.00%   0 Policy Manager
      22           0           2          0  0.00%  0.00%  0.00%   0 DDR Timers
      23          60          23       2608  0.00%  0.00%  0.00%   0 Entity MIB API
      24          43          45        955  0.00%  0.00%  0.00%   0 PrstVbl
      25           0           2          0  0.00%  0.00%  0.00%   0 Serial Backgroun
      26           0           1          0  0.00%  0.00%  0.00%   0 RMI RM Notify Wa
      27           0           2          0  0.00%  0.00%  0.00%   0 ATM AutoVC Perio
      28           0           2          0  0.00%  0.00%  0.00%   0 ATM VC Auto Crea
      29          30        2181         13  0.00%  0.00%  0.00%   0 IOSXE heartbeat
      30           1           9        111  0.00%  0.00%  0.00%   0 Btrace time base
      31           5         182         27  0.00%  0.00%  0.00%   0 DB Lock Manager
      32          16        4356          3  0.00%  0.00%  0.00%   0 GraphIt
      33           0           1          0  0.00%  0.00%  0.00%   0 DB Notification
      34           0           1          0  0.00%  0.00%  0.00%   0 IPC Apps Task
      35           0           1          0  0.00%  0.00%  0.00%   0 ifIndex Receive
      36           4         873          4  0.00%  0.00%  0.00%   0 IPC Event Notifi
      37          49        4259         11  0.00%  0.00%  0.00%   0 IPC Mcast Pendin
      38           0           1          0  0.00%  0.00%  0.00%   0 Platform appsess
      39           2          73         27  0.00%  0.00%  0.00%   0 IPC Dynamic Cach
      40           5         873          5  0.00%  0.00%  0.00%   0 IPC Service NonC
      41           0           1          0  0.00%  0.00%  0.00%   0 IPC Zone Manager
      42          38        4259          8  0.00%  0.00%  0.00%   0 IPC Periodic Tim
      43          18        4259          4  0.00%  0.00%  0.00%   0 IPC Deferred Por
      44           0           1          0  0.00%  0.00%  0.00%   0 IPC Process leve
      45           0           1          0  0.00%  0.00%  0.00%   0 IPC Seat Manager
      46           3         250         12  0.00%  0.00%  0.00%   0 IPC Check Queue
      47           0           1          0  0.00%  0.00%  0.00%   0 IPC Seat RX Cont
      48           0           1          0  0.00%  0.00%  0.00%   0 IPC Seat TX Cont
      49          22         437         50  0.00%  0.00%  0.00%   0 IPC Keep Alive M
      50          25         873         28  0.00%  0.00%  0.00%   0 IPC Loadometer
      51           0           1          0  0.00%  0.00%  0.00%   0 IPC Session Deta
      52           0           1          0  0.00%  0.00%  0.00%   0 SENSOR-MGR event
      53           2         437          4  0.00%  0.00%  0.00%   0 Compute SRP rate
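To track these values over time, you can parse the summary line and the per-process rows. The Python sketch below is a minimal example under stated assumptions: the function names are hypothetical, the sample contains only a few rows copied from the output above, and the pattern assumes the column order shown there. In the summary line, the first five-second figure is the total utilization and the second is the share spent at interrupt level.

```python
import re

# Summary line plus a few process rows copied from the example output.
SAMPLE = """\
CPU utilization for five seconds: 1%/0%; one minute: 1%; five minutes: 1%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
  9        1092         597       1829  0.00%  0.01%  0.00%   0 Check heaps
 15        1227       40758         30  0.00%  0.02%  0.00%   0 ARP Input
 23          60          23       2608  0.00%  0.00%  0.00%   0 Entity MIB API
"""

SUMMARY = re.compile(
    r"five seconds: (\d+)%/(\d+)%; one minute: (\d+)%; five minutes: (\d+)%"
)
# PID, Runtime(ms), Invoked, uSecs, 5Sec, 1Min, 5Min, TTY, Process name.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%"
    r"\s+(\d+)\s+(.+?)\s*$"
)


def parse_summary(text):
    """Extract the utilization averages from the first line of the output."""
    match = SUMMARY.search(text)
    if match is None:
        raise ValueError("CPU utilization summary line not found")
    total, interrupt, one_min, five_min = (int(g) for g in match.groups())
    return {
        "five_sec_total": total,
        "five_sec_interrupt": interrupt,
        "one_min": one_min,
        "five_min": five_min,
    }


def top_by_runtime(text, count=3):
    """Return (runtime in ms, process name) pairs, largest runtime first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((int(match.group(2)), match.group(9)))
    return sorted(rows, reverse=True)[:count]


if __name__ == "__main__":
    print(parse_summary(SAMPLE))
    print(top_by_runtime(SAMPLE, 2))
```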
   

Guidelines to monitor overall control plane resources

Control plane memory and CPU utilization on each control processor help you monitor overall control plane resources. You can use the show platform resources command to monitor the overall system health and resource usage for the IOS XE platforms. Also, you can use the show platform software status control-processor brief command (summary view) or the show platform software status control-processor command (detailed view) to view control plane memory and CPU utilization information.

All control processors should show a status of Healthy. Other possible status values are Warning and Critical. Warning indicates that the device is operational, but that the operating level should be reviewed. Critical implies that the device is nearing failure.

If you see a Warning or Critical status, take these actions:

  • Reduce the static and dynamic loads on the system by reducing the number of elements in the configuration or by limiting the capacity for dynamic services.

  • Reduce the number of routes and adjacencies, limit the number of ACLs and other rules, reduce the number of VLANs, and so on.

These sections describe the fields in the show platform software status control-processor command output.

Load average

Load average represents the process queue or process contention for CPU resources. For example, on a single-core processor, an instantaneous load of 7 would mean that seven processes are ready to run, one of which is currently running. On a dual-core processor, a load of 7 would mean that seven processes are ready to run, two of which are currently running.
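The arithmetic in this example can be written as a small helper. This Python sketch is only an illustration of the reasoning above; the function name is hypothetical and the model is deliberately simple (each core runs at most one process at a time).

```python
def running_and_waiting(load, cores):
    """Split an instantaneous load into running and waiting processes.

    load  -- number of processes that are ready to run
    cores -- number of CPU cores available to run them
    """
    running = min(load, cores)      # at most one process per core
    waiting = max(0, load - cores)  # the rest queue for CPU time
    return running, waiting


if __name__ == "__main__":
    print(running_and_waiting(7, 1))  # single-core example from the text
    print(running_and_waiting(7, 2))  # dual-core example from the text
```

With a load of 7, a single-core processor has one process running and six waiting, and a dual-core processor has two running and five waiting. A load at or below the core count means no process is waiting for CPU time.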

Memory utilization

Memory utilization is represented by these fields:

  • Total—Total line card memory

  • Used—Consumed memory

  • Free—Available memory

  • Committed—Virtual memory committed to processes
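The percentages that the command prints next to these fields follow directly from the raw values. The Python sketch below reproduces them from the figures in the sample show platform software status control-processor output in this section. The variable and function names are chosen for this example only.

```python
# Values in kB, taken from the sample control-processor output in this section.
total_kb = 7768456
used_kb = 2572568
free_kb = 5195888
committed_kb = 3112968


def pct(part, whole):
    """Return part as a whole-number percentage of whole."""
    return round(100 * part / whole)


if __name__ == "__main__":
    print("Used:", pct(used_kb, total_kb), "%")
    print("Free:", pct(free_kb, total_kb), "%")
    print("Committed:", pct(committed_kb, total_kb), "%")
    # Used and Free together account for all of Total.
    print("Used + Free == Total:", used_kb + free_kb == total_kb)
```

The results (33, 67, and 40 percent) match the percentages shown in the sample output. Committed is reported against its own limit in that output (under 90%) because it measures virtual memory promised to processes rather than memory currently in use.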

CPU utilization

CPU utilization is an indication of the percentage of time the CPU is busy, and is represented by these fields:

  • CPU—Allocated processor

  • User—Non-Linux kernel processes

  • System—Linux kernel process

  • Nice—Low-priority processes

  • Idle—Percentage of time the CPU was inactive

  • IRQ—Interrupts

  • SIRQ—System Interrupts

  • IOwait—Percentage of time CPU was waiting for I/O

Example: show platform software status control-processor Command

These are some examples of using the show platform software status control-processor command:

Router# show platform software status control-processor
        RP0: online, statistics updated 3 seconds ago
        RP0: online, statistics updated 5 seconds ago
        Load Average: healthy
        1-Min: 1.35, status: healthy, under 9.30
        5-Min: 1.06, status: healthy, under 9.30
        15-Min: 1.02, status: healthy, under 9.30
        Memory (kb): healthy
        Total: 7768456
        Used: 2572568 (33%), status: healthy
        Free: 5195888 (67%)
        Committed: 3112968 (40%), under 90%
        Per-core Statistics
        CPU0: CPU Utilization (percentage of time spent)
        User:  3.00, System:  2.40, Nice:  0.00, Idle: 94.60
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU1: CPU Utilization (percentage of time spent)
        User:  0.00, System:  0.00, Nice:  0.00, Idle:100.00
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU2: CPU Utilization (percentage of time spent)
        User:  0.00, System:  0.00, Nice:  0.00, Idle:100.00
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU3: CPU Utilization (percentage of time spent)
        User:  0.00, System:  0.00, Nice:  0.00, Idle:100.00
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU4: CPU Utilization (percentage of time spent)
        User:  7.30, System:  1.70, Nice:  0.00, Idle: 91.00
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU5: CPU Utilization (percentage of time spent)
        User:  3.30, System:  1.50, Nice:  0.00, Idle: 95.20
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU6: CPU Utilization (percentage of time spent)
        User: 17.91, System: 11.81, Nice:  0.00, Idle: 70.27
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU7: CPU Utilization (percentage of time spent)
        User: 11.91, System: 13.31, Nice:  0.00, Idle: 74.77
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU8: CPU Utilization (percentage of time spent)
        User:  2.70, System:  2.00, Nice:  0.00, Idle: 95.30
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU9: CPU Utilization (percentage of time spent)
        User:  0.00, System:  0.00, Nice:  0.00, Idle:100.00
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU10: CPU Utilization (percentage of time spent)
        User:  0.00, System:  0.00, Nice:  0.00, Idle:100.00
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
        CPU11: CPU Utilization (percentage of time spent)
        User:  0.00, System:  0.00, Nice:  0.00, Idle:100.00
        IRQ:  0.00, SIRQ:  0.00, IOwait:  0.00
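A script can read the load average lines from this detailed output and report how far each value is from its limit. The Python sketch below is a minimal example. It assumes that the figure after "under" is the limit below which the status stays healthy, which is an interpretation of the sample output rather than a documented definition, and the function name is hypothetical.

```python
import re

# Load average lines copied from the detailed example output.
SAMPLE = """\
        1-Min: 1.35, status: healthy, under 9.30
        5-Min: 1.06, status: healthy, under 9.30
        15-Min: 1.02, status: healthy, under 9.30
"""

LINE = re.compile(r"(\d+)-Min:\s*([\d.]+), status: (\w+), under ([\d.]+)")


def load_headroom(text):
    """Return {minutes: load, status, limit, and headroom in percent}."""
    result = {}
    for minutes, value, status, limit in LINE.findall(text):
        load = float(value)
        ceiling = float(limit)  # assumed to be the healthy limit
        result[int(minutes)] = {
            "load": load,
            "status": status,
            "limit": ceiling,
            "headroom_pct": round(100 * (1 - load / ceiling), 1),
        }
    return result


if __name__ == "__main__":
    for minutes, info in sorted(load_headroom(SAMPLE).items()):
        print(f"{minutes}-Min load {info['load']} ({info['status']}), "
              f"{info['headroom_pct']}% below {info['limit']}")
```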
        
        
        
      
Router# show platform software status control-processor brief
        Load Average
        Slot  Status  1-Min  5-Min 15-Min
        RP0 Healthy   1.14   1.07   1.02
        
        Memory (kB)
        Slot  Status    Total     Used (Pct)     Free (Pct) Committed (Pct)
        RP0 Healthy  7768456  2573416 (33%)  5195040 (67%)   3115096 (40%)
        
        CPU Utilization
        Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
        RP0    0   2.80   1.80   0.00  95.39   0.00   0.00   0.00
               1   0.00   0.00   0.00 100.00   0.00   0.00   0.00
               2   0.00   0.00   0.00 100.00   0.00   0.00   0.00
               3   0.00   0.00   0.00 100.00   0.00   0.00   0.00
               4   6.80   1.80   0.00  91.39   0.00   0.00   0.00
               5   3.20   1.60   0.00  95.19   0.00   0.00   0.00
               6  16.30  12.60   0.00  71.10   0.00   0.00   0.00
               7  12.40  13.70   0.00  73.90   0.00   0.00   0.00
               8   2.40   2.40   0.00  95.19   0.00   0.00   0.00
               9   0.00   0.00   0.00 100.00   0.00   0.00   0.00
              10   0.00   0.00   0.00 100.00   0.00   0.00   0.00
              11   0.00   0.00   0.00 100.00   0.00   0.00   0.00
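The per-core rows in the brief output are easy to scan with a script, for example to find the busiest core. The Python sketch below is illustrative: the function names are assumptions, the sample is a subset of the rows above, and "busy" is taken to mean 100 minus the Idle percentage.

```python
# Header and a subset of CPU rows copied from the brief example output.
SAMPLE = """\
Slot  CPU   User System   Nice   Idle    IRQ   SIRQ IOwait
RP0    0   2.80   1.80   0.00  95.39   0.00   0.00   0.00
1   0.00   0.00   0.00 100.00   0.00   0.00   0.00
6  16.30  12.60   0.00  71.10   0.00   0.00   0.00
7  12.40  13.70   0.00  73.90   0.00   0.00   0.00
"""

FIELDS = ("user", "system", "nice", "idle", "irq", "sirq", "iowait")


def parse_cpu_rows(text):
    """Return {core number: {field: percentage}} for each CPU row."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Only the first row of a slot carries the slot name (for example RP0).
        if parts and not parts[0].isdigit():
            parts = parts[1:]
        if len(parts) != 1 + len(FIELDS):
            continue
        try:
            core = int(parts[0])
            values = [float(p) for p in parts[1:]]
        except ValueError:
            continue  # header row or other non-numeric line
        cores[core] = dict(zip(FIELDS, values))
    return cores


def busiest(cores):
    """Return (core number, busy percentage) for the least idle core."""
    core = min(cores, key=lambda c: cores[c]["idle"])
    return core, round(100 - cores[core]["idle"], 2)


if __name__ == "__main__":
    print(busiest(parse_cpu_rows(SAMPLE)))
```

In the sample data, core 6 is the busiest at 28.9 percent, which is still well within a healthy range.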
        
        
      

Monitor hardware using alarms

Device design and monitor hardware

The router sends alarm notifications when problems are detected, allowing you to monitor the network remotely. You do not need to use show commands to poll devices on a routine basis; however, you can perform onsite monitoring if you choose.

Guidelines to monitor bootflash disk

Bootflash Disk Space for Core Dumps

The bootflash disk must have enough free space to store two core dumps. This condition is monitored, and if the bootflash disk is too small to store two core dumps, a syslog alarm is generated, as shown in this example:

Aug 22 13:40:41.038 R0/0: %FLASH_CHECK-3-DISK_QUOTA: Flash disk quota exceeded
              [free space is 7084440 kB] - Please clean up files on bootflash.
         

Bootflash Disk Size vs. Physical Memory

The bootflash disk must be at least as large as the physical memory installed on the device. If this condition is not met, a syslog alarm is generated, as shown in this example:

%IOSXEBOOT-2-FLASH_SIZE_CHECK: (rp/0): Flash capacity (8 GB) is insufficient for fault analysis based on
              installed memory of RP (16 GB)
              %IOSXEBOOT-2-FLASH_SIZE_CHECK: (rp/0): Please increase the size of installed flash to at least 16 GB (same as
              physical memory size)
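If you collect syslog messages centrally, you can extract the two sizes from this message and compute the shortfall. The Python sketch below is a minimal example: the function name and pattern are assumptions based on the message text shown above, with the wrapped lines joined into one string.

```python
import re

# The first message from the example above, joined onto a single line.
MESSAGE = (
    "%IOSXEBOOT-2-FLASH_SIZE_CHECK: (rp/0): Flash capacity (8 GB) is "
    "insufficient for fault analysis based on installed memory of RP (16 GB)"
)

PATTERN = re.compile(
    r"Flash capacity \((\d+) GB\).*installed memory of RP \((\d+) GB\)"
)


def flash_shortfall_gb(message):
    """Return how many GB of flash are missing, or None if no match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    flash_gb, memory_gb = (int(g) for g in match.groups())
    # Flash must be at least as large as physical memory.
    return max(0, memory_gb - flash_gb)


if __name__ == "__main__":
    print(flash_shortfall_gb(MESSAGE))
```

For the example message, the device has 8 GB of flash and 16 GB of memory, so 8 GB more flash is needed to meet the guideline.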
         

Approaches to monitor hardware alarms

Audible or visual alarms

About audible and visual alarms

An external element can be connected to a power supply using the DB-25 alarm connector on the power supply. The external element is a DC light bulb for a visual alarm and a bell for an audible alarm.

If an alarm illuminates the CRIT, MIN, or MAJ LED on the faceplate of the device, and a visual or audible alarm is wired, the alarm also activates an alarm relay in the power supply DB-25 connector, and either the bell rings or the light bulb flashes.

Clear an audible alarm

To clear an audible alarm, perform one of these tasks:

  • Press the Audible Cut Off button on the faceplate.

  • Enter the clear facility-alarm command.

Clear a visual alarm

To clear a visual alarm, you must resolve the alarm condition. The clear facility-alarm command does not clear an alarm LED on the faceplate or turn off the DC light bulb. For example, if a critical alarm LED is illuminated because an active module was removed without a graceful deactivation, the only way to resolve that alarm is to replace the module.

View the console or syslog for alarm messages

The network administrator can monitor alarms by reviewing the messages sent to the system console or to a system message log (syslog).

Enable the logging alarm Command

The logging alarm command must be enabled for the system to send alarm messages to a logging device, such as the console or a syslog. This command is not enabled by default.

You can specify the severity level of the alarms to be logged. All alarms at and above the specified threshold generate alarm messages. For example, this command sends only critical alarm messages to logging devices:

Router(config)# logging alarm critical

If alarm severity is not specified, alarm messages for all severity levels are sent to logging devices.
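The threshold rule for the logging alarm command can be modeled in a few lines. The Python sketch below is an illustration only. It assumes four severity keywords (informational, minor, major, critical) ranked in that order; check the command reference for your release for the exact keywords. The function name is hypothetical.

```python
# Assumed severity order, lowest to highest.
SEVERITY_RANK = {"informational": 0, "minor": 1, "major": 2, "critical": 3}


def is_logged(alarm_severity, threshold=None):
    """Return True if an alarm of this severity would be logged.

    threshold=None models the case where no severity is specified,
    so alarms of every severity are sent to logging devices.
    """
    if threshold is None:
        return True
    return SEVERITY_RANK[alarm_severity] >= SEVERITY_RANK[threshold]


if __name__ == "__main__":
    for severity in SEVERITY_RANK:
        print(severity, is_logged(severity, "major"))
```

With a threshold of major, major and critical alarms are logged and minor and informational alarms are not. With a threshold of critical, only critical alarms are logged.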

Examples of alarm messages

These examples show alarm messages that are sent to the console when a module is removed before a graceful deactivation is performed. The alarm is cleared when the module is reinserted.

Module removed

*Aug 22 13:27:33.774: %C-SM-X-16G4M2X: Module removed from subslot 1/1, interfaces disabled
*Aug 22 13:27:33.775: %SPA_OIR-6-OFFLINECARD: Module (SPA-4XT-SERIAL) offline in subslot 1/1

Module reinserted

*Aug 22 13:32:29.447: %CC-SM-X-16G4M2X: Module inserted in subslot 1/1
*Aug 22 13:32:34.916: %SPA_OIR-6-ONLINECARD: Module (SPA-4XT-SERIAL) online in subslot 1/1
*Aug 22 13:32:35.523: %LINK-3-UPDOWN: SIP1/1: Interface EOBC1/1, changed state to up

Alarms

To view alarms, use the show facility-alarm status command. The example shows a critical alarm for the power supply:

Router#show facility-alarm status
        System Totals  Critical: 1  Major: 0  Minor: 0

        Source                     Time                   Severity      Description [Index]
        ------                     ------                 --------      -------------------
        Power Supply Bay 1         Jul 08 2020 11:51:34   CRITICAL      Power Supply/FAN Module Missing [0]
        POE Bay 0                  Jul 08 2020 11:51:34   INFO          Power Over Ethernet Module Missing [0]
        POE Bay 1                  Jul 08 2020 11:51:34   INFO          Power Over Ethernet Module Missing [0]
        xcvr container 0/0/4       Jul 08 2020 11:51:47   INFO          Transceiver Missing - Link Down [1]
        TenGigabitEthernet0/1/0    Jul 08 2020 11:52:24   INFO          Physical Port Administrative State Down [2]
        GigabitEthernet1/0/0       Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        GigabitEthernet1/0/1       Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        GigabitEthernet1/0/2       Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        GigabitEthernet1/0/3       Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        GigabitEthernet1/0/4       Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        GigabitEthernet1/0/5       Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        GigabitEthernet1/0/6       Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        GigabitEthernet1/0/7       Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        TwoGigabitEthernet1/0/17   Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        TwoGigabitEthernet1/0/18   Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
        TwoGigabitEthernet1/0/19   Jul 08 2020 11:56:35   INFO          Physical Port Administrative State Down [2]
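When the alarm table is long, a script can pick out the rows that matter. The Python sketch below is a minimal example. It assumes that columns are separated by two or more spaces, as in the output above, and that the Severity column uses the strings CRITICAL, MAJOR, MINOR, and INFO (only CRITICAL and INFO appear in this example). The function name is hypothetical.

```python
import re

# Header and a few rows copied from the show facility-alarm status example.
SAMPLE = """\
Source                     Time                   Severity      Description [Index]
------                     ------                 --------      -------------------
Power Supply Bay 1         Jul 08 2020 11:51:34   CRITICAL      Power Supply/FAN Module Missing [0]
POE Bay 0                  Jul 08 2020 11:51:34   INFO          Power Over Ethernet Module Missing [0]
xcvr container 0/0/4       Jul 08 2020 11:51:47   INFO          Transceiver Missing - Link Down [1]
"""

# MAJOR and MINOR are assumed spellings; they do not appear in the sample.
SEVERITIES = {"CRITICAL", "MAJOR", "MINOR", "INFO"}


def parse_alarms(text):
    """Return one dict per alarm row; header and separator rows are skipped."""
    alarms = []
    for line in text.splitlines():
        # Single spaces occur inside fields, so split on runs of 2+ spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) == 4 and parts[2] in SEVERITIES:
            source, time, severity, description = parts
            alarms.append({
                "source": source,
                "time": time,
                "severity": severity,
                "description": description,
            })
    return alarms


if __name__ == "__main__":
    for alarm in parse_alarms(SAMPLE):
        if alarm["severity"] == "CRITICAL":
            print(alarm["source"], "-", alarm["description"])
```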

To view critical alarms, use the show facility-alarm status critical command, as shown in the example:

Router#show facility-alarm status critical
        System Totals  Critical: 1  Major: 0  Minor: 0

        Source                     Time                   Severity      Description [Index]
        ------                     ------                 --------      -------------------
        Power Supply Bay 1         Jul 08 2020 11:51:34   CRITICAL      Power Supply/FAN Module Missing [0]

To view the operational state of the major hardware components on the device, use the show platform diag command.

Router#show platform diag
        Chassis type: C8300-1N1S-4T2X
       
        Slot: 0, C8300-1N1S-4T2X
        Running state               : ok
        Internal state              : online
        Internal operational state  : ok
        Physical insert detect time : 00:00:24 (01:29:20 ago)
        Software declared up time   : 00:01:01 (01:28:44 ago)
        CPLD version                : 20011540
        Firmware version            : 17.3(1r)
       
        Sub-slot: 0/0, 4x1G-2xSFP+
        Operational status          : ok
        Internal state              : inserted
        Physical insert detect time : 00:01:14 (01:28:30 ago)
        Logical insert detect time  : 00:01:14 (01:28:30 ago)
       
        Sub-slot: 0/1, C-NIM-1X
        Operational status          : ok
        Internal state              : inserted
        Physical insert detect time : 00:01:14 (01:28:31 ago)
        Logical insert detect time  : 00:01:14 (01:28:31 ago)
       
        Slot: 1, C8300-1N1S-4T2X
        Running state               : ok
        Internal state              : online
        Internal operational state  : ok
        Physical insert detect time : 00:00:24 (01:29:20 ago)
        Software declared up time   : 00:01:02 (01:28:43 ago)
        CPLD version                : 20011540
        Firmware version            : 17.3(1r)
       
        Sub-slot: 1/0, C-SM-X-16G4M2X
        Operational status          : ok
        Internal state              : inserted
        Physical insert detect time : 00:01:14 (01:28:30 ago)
        Logical insert detect time  : 00:01:14 (01:28:30 ago)
       
        Slot: R0, C8300-1N1S-4T2X
        Running state               : ok, active
     
Review and analyze alarm messages

You can write scripts to analyze alarm messages sent to the console or syslog. These scripts facilitate the review of alarm messages by providing reports on events such as alarms, security alerts, and interface status.
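As one possible starting point for such a script, the Python sketch below counts messages by severity and by facility. It relies on the standard %FACILITY-SEVERITY-MNEMONIC tag in Cisco system messages, where severity ranges from 0 (most severe) to 7. The function name is hypothetical, and the sample lines are taken from the examples in this document. Messages without that tag are ignored.

```python
import re
from collections import Counter

# Sample messages taken from the examples in this document.
SAMPLE = [
    "*Aug 22 13:27:33.775: %SPA_OIR-6-OFFLINECARD: Module (SPA-4XT-SERIAL) offline in subslot 1/1",
    "*Aug 22 13:32:34.916: %SPA_OIR-6-ONLINECARD: Module (SPA-4XT-SERIAL) online in subslot 1/1",
    "*Aug 22 13:32:35.523: %LINK-3-UPDOWN: SIP1/1: Interface EOBC1/1, changed state to up",
    "Aug 22 13:40:41.038 R0/0: %FLASH_CHECK-3-DISK_QUOTA: Flash disk quota exceeded",
]

# %FACILITY-SEVERITY-MNEMONIC: with a single-digit severity from 0 to 7.
TAG = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+):")


def summarize(lines):
    """Return (count by severity, count by facility) for tagged messages."""
    by_severity = Counter()
    by_facility = Counter()
    for line in lines:
        match = TAG.search(line)
        if match is None:
            continue  # no standard tag on this line
        facility, severity, _mnemonic = match.groups()
        by_severity[int(severity)] += 1
        by_facility[facility] += 1
    return by_severity, by_facility


if __name__ == "__main__":
    severities, facilities = summarize(SAMPLE)
    for level in sorted(severities):
        print(f"severity {level}: {severities[level]} message(s)")
    print(facilities.most_common())
```

A report built this way can be extended to flag any message at severity 3 or lower, or to group interface state changes by interface name.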

Syslog messages can also be accessed through Simple Network Management Protocol (SNMP) using the history table defined in the CISCO-SYSLOG-MIB.

SNMP alarms

SNMP is an application-layer protocol that provides a standardized framework and a common language for monitoring and managing devices in a network. Of the approaches to alarm monitoring described here, SNMP is the best suited to monitoring more than one device in enterprise and service provider setups.

SNMP provides notification of faults, alarms, and conditions that might affect services. It allows a network administrator to access device information through a network management system (NMS) instead of reviewing logs, polling devices, or reviewing log reports.

What is SNMP?

The Simple Network Management Protocol (SNMP) is an application-layer protocol that:

  • provides a standardized framework and a common language for network device communication,

  • is used for monitoring and managing devices in a network, and

  • is the best approach to monitor multiple devices in enterprise and service provider setups.

MIBs for SNMP Alarm Notification

To use SNMP to get alarm notification, use these MIBs:

  • ENTITY-MIB, RFC 4133 (required for the CISCO-ENTITY-ALARM-MIB and CISCO-ENTITY-SENSOR-MIB to work)

  • CISCO-ENTITY-ALARM-MIB

  • CISCO-ENTITY-SENSOR-MIB (for transceiver environmental alarm information, which is not provided through the CISCO-ENTITY-ALARM-MIB)