Implementing Performance Management

Performance management (PM) on the Cisco IOS XR Software provides a framework to perform these tasks:

  • Collect and export PM statistics to a TFTP server for data storage and retrieval

  • Monitor the system using extensible markup language (XML) queries

  • Configure threshold conditions that generate system logging messages when a threshold condition is matched.

The PM system collects data that is useful for graphing or charting system resource utilization, for capacity planning, for traffic engineering, and for trend analysis.

Prerequisites for Implementing Performance Management

Before implementing performance management in your network operations center (NOC), ensure that these prerequisites are met:

  • You must be in a user group associated with a task group that includes the proper task IDs. The command reference guides include the task IDs required for each command. If you suspect user group assignment is preventing you from using a command, contact your AAA administrator for assistance.

  • You must have connectivity with a TFTP server.

Information About Implementing Performance Management

PM Functional Overview

The Performance Management (PM) framework consists of two major components:

  • PM statistics server

  • PM statistics collectors

PM Statistics Server

The PM statistics server is the front end for statistics collections, entity instance monitoring collections, and threshold monitoring. All PM statistics collections and threshold conditions configured through the command-line interface (CLI) or through XML schemas are processed by the PM statistics server and distributed among the PM statistics collectors.

PM Statistics Collector

The PM statistics collector collects statistics from entity instances and stores that data in memory. The memory contents are checkpointed so that information is available across process restarts. In addition, the PM statistics collector is responsible for exporting operational data to the XML agent and to the TFTP server.

Figure 1 (PM Component Communications) illustrates the relationship between the components that constitute the PM system.

Figure 1. PM Component Communications

PM Benefits

The PM system provides these benefits:

  • Configurable data collection policies

  • Efficient transfer of statistical data in the binary format via TFTP

  • Entity instance monitoring support

  • Threshold monitoring support

  • Data persistency across process restarts and processor failovers

PM Statistics Collection Overview

A PM statistics collection first gathers statistics from all the attributes associated with all the instances of an entity in the PM system. It then exports the statistical data in the binary file format to a TFTP server. For example, a Multiprotocol Label Switching (MPLS) Label Distribution Protocol (LDP) statistics collection gathers statistical data from all the attributes associated with all MPLS LDP sessions on the router.

This table lists the entities and the associated instances in the PM system.

Table 1. Entity Classes and Associated Instances

Entity Classes                 Instance
BGP                            Neighbors or Peers
Interface Basic Counters       Interfaces
Interface Data Rates           Interfaces
Interface Generic Counters     Interfaces
MPLS LDP                       LDP Sessions
Node CPU                       Nodes
Node Memory                    Nodes
Node Process                   Processes
OSPFv2                         Processes
OSPFv3                         Processes


Note


For a list of all attributes associated with the entities that constitute the PM system, see Table 2 (Attributes and Values).



Note


Depending on the interface type, an interface supports either the interface generic counters or the interface basic counters. Interfaces that support the interface basic counters do not support the interface data rates.


How to Implement Performance Management

Configuring an External TFTP Server or Local Disk for PM Statistics Collection

You can export PM statistical data to an external TFTP server or dump the data to the local file system. The local and TFTP destinations are mutually exclusive; you can configure only one of them at a time.

Configuration Examples

This example configures an external TFTP server for PM statistics collection.

RP/0/RP0/CPU0:Router# configure
RP/0/RP0/CPU0:Router(config)# performance-mgmt resources tftp-server 10.3.40.161 directory mypmdata/datafiles
RP/0/RP0/CPU0:Router(config)# commit

This example configures a local disk for PM statistics collection.

RP/0/RP0/CPU0:Router# configure
RP/0/RP0/CPU0:Router(config)# performance-mgmt resources dump local
RP/0/RP0/CPU0:Router(config)# commit

Configuring PM Statistics Collection Templates

PM statistics collections are configured through PM statistics collection templates. A PM statistics collection template contains the entity, the sample interval, and the number of sampling operations to be performed before exporting the data to a TFTP server. When a PM statistics collection template is enabled, the PM statistics collection gathers statistics for all attributes from all instances associated with the entity configured in the template. You can define multiple templates for any given entity; however, only one PM statistics collection template for a given entity can be enabled at a time.

Guidelines for Configuring PM Statistics Collection Templates

When creating PM statistics collection templates, follow these guidelines:

  • You must configure a TFTP server resource or local dump resource if you want to export statistics data onto a remote TFTP server or local disk.

  • You can define multiple templates for any given entity, but only one PM statistics collection template for a given entity can be enabled at a time.
  • When configuring a template, you can designate the template for the entity as the default template using the default keyword or name the template. The default template contains the following default values:
    • A sample interval of 10 minutes.
    • A sample size of five sampling operations.
  • The sample interval sets the frequency of the sampling operations performed during the sampling cycle. You can configure the sample interval with the sample-interval command. The range is from 1 to 60 minutes.
  • The sample size sets the number of sampling operations to be performed before exporting the data to the TFTP server. You can configure the sample size with the sample-size command. The range is from 1 to 60 samples.


    Note


    Specifying a small sample interval increases CPU utilization, whereas specifying a large sample size increases memory utilization. The sample size and sample interval, therefore, may need to be adjusted to prevent system overload.


  • The export cycle determines how often PM statistics collection data is exported to the TFTP server. The export cycle can be calculated by multiplying the sample interval and sample size (sample interval x sample size = export cycle).
  • Once a template has been enabled, the sampling and export cycles continue until the template is disabled with the no form of the performance-mgmt apply statistics command.

  • You must specify either a node with the location command or enable the PM statistics collection for all nodes using the location all command when enabling or disabling a PM statistics collection for the following entities:
    • Node CPU
    • Node memory
    • Node process

Configuration Example

This example shows how to create and enable a PM statistics collection template.

RP/0/RP0/CPU0:Router# configure
RP/0/RP0/CPU0:Router(config)# performance-mgmt statistics interface generic-counters template 1
RP/0/RP0/CPU0:Router(config)# performance-mgmt statistics interface generic-counters template 1 sample-size 10 
RP/0/RP0/CPU0:Router(config)# performance-mgmt statistics interface generic-counters template 1 sample-interval 5
RP/0/RP0/CPU0:Router(config)# performance-mgmt apply statistics interface generic-counters 1 
RP/0/RP0/CPU0:Router(config)# commit
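
In this example, the export cycle works out to sample interval x sample size = 5 minutes x 10 samples, so the generic counters data is exported to the configured TFTP server or local disk approximately every 50 minutes.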

Configuring PM Threshold Monitoring Templates

The PM system supports the configuration of threshold conditions to monitor an attribute (or attributes) for threshold violations. Threshold conditions are configured through PM threshold monitoring templates. When a PM threshold template is enabled, the PM system monitors all instances of the attribute (or attributes) for the threshold condition configured in the template. If, at the end of the sample interval, a threshold condition is matched, the PM system generates a system logging message for each instance that matches the threshold condition. For the list of attributes and value ranges associated with each attribute for all the entities, see Performance Management: Details.

Guidelines for Configuring PM Threshold Monitoring Templates

While you configure PM threshold monitoring templates, follow these guidelines:

  • Once a template has been enabled, the threshold monitoring continues until the template is disabled with the no form of the performance-mgmt apply thresholds command.

  • Only one PM threshold template for an entity can be enabled at a time.

  • You must specify either a node with the location command or enable the PM statistic collections for all nodes using the location all command when enabling or disabling a PM threshold monitoring template for the following entities:
    • Node CPU
    • Node memory
    • Node process

Configuration Example

This example shows how to create and enable a PM threshold monitoring template. In this example, a PM threshold template is created for the CurrMemory attribute of the node memory entity. The threshold condition in this template monitors the CurrMemory attribute to determine whether current memory use is greater than 50 percent.


Router# conf t
Router(config)# performance-mgmt thresholds node memory template template20
Router(config-threshold-cpu)# CurrMemory gt 50 percent
Router(config-threshold-cpu)# sample-interval 5
Router(config-threshold-cpu)# exit
Router(config)# performance-mgmt apply thresholds node memory location 0/RP0/CPU0 template20
Router(config)# commit
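
With this configuration, the CurrMemory attribute is evaluated at the end of each 5-minute sample interval, and a system logging message is generated for the node memory instance on 0/RP0/CPU0 if its current memory use exceeds 50 percent.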

Configuring Instance Filtering by Regular Expression

This task explains how to define a regular expression group that can be applied to one or more statistics or threshold templates. You can also include multiple regular expression indices. Instance filtering by regular expression group provides these benefits:
  • You can apply the same regular expression group to multiple templates.

  • You can enhance flexibility by assigning the same index values.

  • You can enhance performance by applying regular expressions that use OR conditions.


Note


Instance filtering by regular expression is currently supported only for interface entities (interface basic-counters, generic-counters, and data-rates).


Configuration Example

This example shows how to define a regular expression group.

RP/0/RP0/CPU0:Router# configure
RP/0/RP0/CPU0:Router(config)# performance-mgmt regular-expression regexp
RP/0/RP0/CPU0:Router(config-perfmgmt-regex)# index 10 match
RP/0/RP0/CPU0:Router(config)# commit
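
In this sketch, regexp is the group name and 10 is an index value; the match keyword is followed by the regular-expression string to match (omitted here), which is applied against interface instance names. As described above, the group can then be applied to one or more interface statistics or threshold templates, and multiple index entries in the same group act as OR conditions.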

Performance Management: Details

This section contains additional information that is useful when you configure performance management.

This table describes the attributes and value ranges associated with each attribute for all the entities that constitute the PM system.

Table 2. Attributes and Values

Entity

Attributes

Description

Values

bgp

ConnDropped

Number of times the connection was dropped.

Range is from 0 to 4294967295.

ConnEstablished

Number of times the connection was established.

Range is from 0 to 4294967295.

ErrorsReceived

Number of error notifications received on the connection.

Range is from 0 to 4294967295.

ErrorsSent

Number of error notifications sent on the connection.

Range is from 0 to 4294967295.

InputMessages

Number of messages received.

Range is from 0 to 4294967295.

InputUpdateMessages

Number of update messages received.

Range is from 0 to 4294967295.

OutputMessages

Number of messages sent.

Range is from 0 to 4294967295.

OutputUpdateMessages

Number of update messages sent.

Range is from 0 to 4294967295.

interface data-rates

Bandwidth

Bandwidth in kbps.

Range is from 0 to 4294967295.

InputDataRate

Input data rate in kbps.

Range is from 0 to 4294967295.

InputPacketRate

Input packets per second.

Range is from 0 to 4294967295.

InputPeakRate

Peak input data rate.

Range is from 0 to 4294967295.

InputPeakPkts

Peak input packet rate.

Range is from 0 to 4294967295.

OutputDataRate

Output data rate in kbps.

Range is from 0 to 4294967295.

OutputPacketRate

Output packets per second.

Range is from 0 to 4294967295.

OutputPeakPkts

Peak output packet rate.

Range is from 0 to 4294967295.

OutputPeakRate

Peak output data rate.

Range is from 0 to 4294967295.

interface basic-counters

InPackets

Packets received.

Range is from 0 to 4294967295.

InOctets

Bytes received.

Range is from 0 to 4294967295.

OutPackets

Packets sent.

Range is from 0 to 4294967295.

OutOctets

Bytes sent.

Range is from 0 to 4294967295.

InputTotalDrops

Inbound correct packets discarded.

Range is from 0 to 4294967295.

InputQueueDrops

Input queue drops.

Range is from 0 to 4294967295.

InputTotalErrors

Inbound incorrect packets discarded.

Range is from 0 to 4294967295.

OutputTotalDrops

Outbound correct packets discarded.

Range is from 0 to 4294967295.

OutputQueueDrops

Output queue drops.

Range is from 0 to 4294967295.

OutputTotalErrors

Outbound incorrect packets discarded.

Range is from 0 to 4294967295.

interface generic-counters

InBroadcastPkts

Broadcast packets received.

Range is from 0 to 4294967295.

InMulticastPkts

Multicast packets received.

Range is from 0 to 4294967295.

InOctets

Bytes received.

Range is from 0 to 4294967295.

InPackets

Packets received.

Range is from 0 to 4294967295.

InputCRC

Inbound packets discarded with incorrect CRC.

Range is from 0 to 4294967295.

InputFrame

Inbound framing errors.

Range is from 0 to 4294967295.

InputOverrun

Input overruns.

Range is from 0 to 4294967295.

InputQueueDrops

Input queue drops.

Range is from 0 to 4294967295.

InputTotalDrops

Inbound correct packets discarded.

Range is from 0 to 4294967295.

InputTotalErrors

Inbound incorrect packets discarded.

Range is from 0 to 4294967295.

InUcastPkts

Unicast packets received.

Range is from 0 to 4294967295.

InputUnknownProto

Inbound packets discarded with unknown protocol.

Range is from 0 to 4294967295.

OutBroadcastPkts

Broadcast packets sent.

Range is from 0 to 4294967295.

OutMulticastPkts

Multicast packets sent.

Range is from 0 to 4294967295.

OutOctets

Bytes sent.

Range is from 0 to 4294967295.

OutPackets

Packets sent.

Range is from 0 to 4294967295.

OutputTotalDrops

Outbound correct packets discarded.

Range is from 0 to 4294967295.

OutputTotalErrors

Outbound incorrect packets discarded.

Range is from 0 to 4294967295.

OutUcastPkts

Unicast packets sent.

Range is from 0 to 4294967295.

OutputUnderrun

Output underruns.

Range is from 0 to 4294967295.

mpls ldp

AddressMsgsRcvd

Address messages received.

Range is from 0 to 4294967295.

AddressMsgsSent

Address messages sent.

Range is from 0 to 4294967295.

AddressWithdrawMsgsRcd

Address withdraw messages received.

Range is from 0 to 4294967295.

AddressWithdrawMsgsSent

Address withdraw messages sent.

Range is from 0 to 4294967295.

InitMsgsSent

Initial messages sent.

Range is from 0 to 4294967295.

InitMsgsRcvd

Initial messages received.

Range is from 0 to 4294967295.

KeepaliveMsgsRcvd

Keepalive messages received.

Range is from 0 to 4294967295.

KeepaliveMsgsSent

Keepalive messages sent.

Range is from 0 to 4294967295.

LabelMappingMsgsRcvd

Label mapping messages received.

Range is from 0 to 4294967295.

LabelMappingMsgsSent

Label mapping messages sent.

Range is from 0 to 4294967295.

LabelReleaseMsgsRcvd

Label release messages received.

Range is from 0 to 4294967295.

LabelReleaseMsgsSent

Label release messages sent.

Range is from 0 to 4294967295.

LabelWithdrawMsgsRcvd

Label withdraw messages received.

Range is from 0 to 4294967295.

LabelWithdrawMsgsSent

Label withdraw messages sent.

Range is from 0 to 4294967295.

NotificationMsgsRcvd

Notification messages received.

Range is from 0 to 4294967295.

NotificationMsgsSent

Notification messages sent.

Range is from 0 to 4294967295.

TotalMsgsRcvd

Total messages received.

Range is from 0 to 4294967295.

TotalMsgsSent

Total messages sent.

Range is from 0 to 4294967295.

node cpu

NoProcesses

Number of processes.

Range is from 0 to 4294967295.

node memory

CurrMemory

Current application memory (in bytes) in use.

Range is from 0 to 4294967295.

PeakMemory

Maximum system memory (in MB) used since bootup.

Range is from 0 to 4194304.

node process

NoThreads

Number of threads.

Range is from 0 to 4294967295.

PeakMemory

Maximum dynamic memory (in KB) used since startup time.

Range is from 0 to 4194304.

ospf v2protocol

InputPackets

Total number of packets received.

Range is from 0 to 4294967295.

OutputPackets

Total number of packets sent.

Range is from 0 to 4294967295.

InputHelloPackets

Number of Hello packets received.

Range is from 0 to 4294967295.

OutputHelloPackets

Number of Hello packets sent.

Range is from 0 to 4294967295.

InputDBDs

Number of DBD packets received.

Range is from 0 to 4294967295.

InputDBDsLSA

Number of LSA received in DBD packets.

Range is from 0 to 4294967295.

OutputDBDs

Number of DBD packets sent.

Range is from 0 to 4294967295.

OutputDBDsLSA

Number of LSA sent in DBD packets.

Range is from 0 to 4294967295.

InputLSRequests

Number of LS requests received.

Range is from 0 to 4294967295.

InputLSRequestsLSA

Number of LSA received in LS requests.

Range is from 0 to 4294967295.

OutputLSRequests

Number of LS requests sent.

Range is from 0 to 4294967295.

OutputLSRequestsLSA

Number of LSA sent in LS requests.

Range is from 0 to 4294967295.

InputLSAUpdates

Number of LSA updates received.

Range is from 0 to 4294967295.

InputLSAUpdatesLSA

Number of LSA received in LSA updates.

Range is from 0 to 4294967295.

OutputLSAUpdates

Number of LSA updates sent.

Range is from 0 to 4294967295.

OutputLSAUpdatesLSA

Number of LSA sent in LSA updates.

Range is from 0 to 4294967295.

InputLSAAcks

Number of LSA acknowledgements received.

Range is from 0 to 4294967295.

InputLSAAcksLSA

Number of LSA received in LSA acknowledgements.

Range is from 0 to 4294967295.

OutputLSAAcks

Number of LSA acknowledgements sent.

Range is from 0 to 4294967295.

OutputLSAAcksLSA

Number of LSA sent in LSA acknowledgements.

Range is from 0 to 4294967295.

ChecksumErrors

Number of packets received with checksum errors.

Range is from 0 to 4294967295.

ospf v3protocol

InputPackets

Total number of packets received.

Range is from 0 to 4294967295.

OutputPackets

Total number of packets sent.

Range is from 0 to 4294967295.

InputHelloPackets

Number of Hello packets received.

Range is from 0 to 4294967295.

OutputHelloPackets

Number of Hello packets sent.

Range is from 0 to 4294967295.

InputDBDs

Number of DBD packets received.

Range is from 0 to 4294967295.

InputDBDsLSA

Number of LSA received in DBD packets.

Range is from 0 to 4294967295.

OutputDBDs

Number of DBD packets sent.

Range is from 0 to 4294967295.

OutputDBDsLSA

Number of LSA sent in DBD packets.

Range is from 0 to 4294967295.

InputLSRequests

Number of LS requests received.

Range is from 0 to 4294967295.

InputLSRequestsLSA

Number of LSA received in LS requests.

Range is from 0 to 4294967295.

OutputLSRequests

Number of LS requests sent.

Range is from 0 to 4294967295.

OutputLSRequestsLSA

Number of LSA sent in LS requests.

Range is from 0 to 4294967295.

InputLSAUpdates

Number of LSA updates received.

Range is from 0 to 4294967295.

InputLSAUpdatesLSA

Number of LSA received in LSA updates.

Range is from 0 to 4294967295.

OutputLSAUpdates

Number of LSA updates sent.

Range is from 0 to 4294967295.

OutputLSAUpdatesLSA

Number of LSA sent in LSA updates.

Range is from 0 to 4294967295.

InputLSAAcks

Number of LSA acknowledgements received.

Range is from 0 to 4294967295.

InputLSAAcksLSA

Number of LSA received in LSA acknowledgements.

Range is from 0 to 4294967295.

OutputLSAAcks

Number of LSA acknowledgements sent.

Range is from 0 to 4294967295.

OutputLSAAcksLSA

Number of LSA sent in LSA acknowledgements.

Range is from 0 to 4294967295.

This table describes the commands used to enable entity instance monitoring for different entity instances.

Table 3. Entity Instances and Monitoring Commands

Entity

Command Description

BGP

Use the performance-mgmt apply monitor bgp command to enable entity instance monitoring for a BGP entity instance.

Syntax:

performance-mgmt apply monitor bgp ip-address {template-name | default}

RP/0/RP0/CPU0:Router(config)# performance-mgmt apply monitor bgp 10.12.0.4 default

Interface Data Rates

Use the performance-mgmt apply monitor data-rates command to enable entity instance monitoring for an interface data rates entity instance.

Syntax:

performance-mgmt apply monitor interface data-rates type interface-path-id {template-name | default}

RP/0/RP0/CPU0:Router(config)# performance-mgmt apply monitor interface data-rates HundredGigE 0/0/0/0 default

Interface Basic Counters

Use the performance-mgmt apply monitor interface basic-counters command to enable entity instance monitoring for an interface basic counters entity instance.

Syntax:

performance-mgmt apply monitor interface basic-counters type interface-path-id {template-name | default}

RP/0/RP0/CPU0:Router(config)# performance-mgmt apply monitor interface basic-counters HundredGigE 0/0/0/0 default

Interface Generic Counters

Use the performance-mgmt apply monitor interface generic-counters command to enable entity instance monitoring for an interface generic counters entity instance.

Syntax:

performance-mgmt apply monitor interface generic-counters type interface-path-id {template-name | default}

RP/0/RP0/CPU0:Router(config)# performance-mgmt apply monitor interface generic-counters HundredGigE 0/0/0/0 default

MPLS LDP

Use the performance-mgmt apply monitor mpls ldp command to enable entity instance monitoring for an MPLS LDP entity instance.

Syntax:

performance-mgmt apply monitor mpls ldp ip-address {template-name | default}

RP/0/RP0/CPU0:Router(config)# performance-mgmt apply monitor mpls ldp 10.34.64.154 default

Node CPU

Use the performance-mgmt apply monitor node cpu command to enable entity instance monitoring for a node CPU entity instance.

Syntax:

performance-mgmt apply monitor node cpu location node-id {template-name | default}

RP/0/RP0/CPU0:Router(config)# performance-mgmt apply monitor node cpu location 0/RP0/CPU0 default

Node Memory

Use the performance-mgmt apply monitor node memory command to enable entity instance monitoring for a node memory entity instance.

Syntax:

performance-mgmt apply monitor node memory location node-id {template-name | default}

RP/0/RP0/CPU0:Router(config)# performance-mgmt apply monitor node memory location 0/RP0/CPU0 default

Node Process

Use the performance-mgmt apply monitor node process command to enable entity instance monitoring collection for a node process entity instance.

Syntax:

performance-mgmt apply monitor node process location node-id pid {template-name | default}

RP/0/RP0/CPU0:Router(config)# performance-mgmt apply monitor node process location 0/RP0/CPU0 275 default

Check System Health

Proactively monitoring the systems in a network helps you prevent potential issues and take preventive action. This section illustrates how to monitor system health using the health check service. This service analyzes system health by monitoring, tracking, and analyzing metrics that are critical to the functioning of the router.

System health can be gauged from the values that these metrics report, in particular when a metric exceeds or approaches its configured threshold.

This table describes the system health check metrics.

Table 4. System Health Check Metrics

Metric

Parameter Tracked

Considered Unhealthy When

System Resources

CPU, free memory, file system, shared memory

The respective metric has exceeded the threshold

Infrastructure Services

Field Programmable Device (FPD), fabric health, platform, redundancy

Any component of the service is down or in an error state

Counters

Interface-counters, fabric-statistics, asic-errors

Any specific counter exhibits a consistent increase in drop/error count over the last n runs (n is configurable through CLI, default is 10)

By default, metrics for system resources are configured with preset threshold values. You can customize the metrics to be monitored by disabling or enabling metrics of interest based on your requirement.

Each metric is tracked and compared with its configured threshold, and the state of the resource is classified accordingly.

The system resources exhibit one of these states:

  • Normal: The resource usage is less than the threshold value.

  • Minor: The resource usage is more than the minor threshold, but less than the severe threshold value.

  • Severe: The resource usage is more than the severe threshold, but less than the critical threshold value.

  • Critical: The resource usage is more than the critical threshold value.

The infrastructure services show one of these states:

  • Normal: The resource operation is as expected.

  • Warning: The resource needs attention. For example, a warning is displayed when the FPD needs an upgrade.

The health check service is packaged as an optional RPM. This is not part of the base package and you must explicitly install this RPM.

You can configure the metrics and their values using the CLI. In addition to the CLI, the service supports a NETCONF client for applying configuration (Cisco-IOS-XR-healthcheck-cfg.yang) and retrieving operational data (Cisco-IOS-XR-healthcheck-oper.yang) using YANG data models. It also supports subscriptions to the metrics and their reports for streaming telemetry data. For more information about streaming telemetry data, see the Telemetry Configuration Guide for Cisco 8000 Series Routers.

Here is a sample of the Cisco-IOS-XR-health-check-cfg.yang data model that configures the health check metrics:

module Cisco-IOS-XR-health-check-cfg {
    namespace "http://cisco.com/ns/yang/Cisco-IOS-XR-health-check-cfg";
    prefix "health-check-cfg";

    import Cisco-IOS-XR-types {
        prefix "xr";
    }

    container metric-cfg {
        list cpu {
            key "name";
            description
                "system 15min avg cpu utilization";
            leaf name {
                type string;
            }
            leaf enabled {
                type boolean;
                default "true";
            }
            leaf threshold {
                type uint32 {
                    range "1..100";
                }
                units "percent";
                default "20";
                description
                    "cpu system utilization should be less than threshold";
            }
            leaf node {
                type string;
                default "all";
            }
        }
        list free_memory {
            key "name";
            description
                "system free RAM";
            leaf name {
                type string;
            }
            leaf enabled {
                type boolean;
                default "true";
            }
            leaf threshold {
                type uint32 {
                    range "1..4096";
                }
                units "MB";
                default "1024";
                description
                    "system free memory should be greater than threshold";
            }
            leaf node {
                type string;
                default "all";
            }
        }
    }
}


Restrictions of the Health Check Feature

The following restrictions are applicable for the system health check feature:

  • Once you configure the health check feature through the command-line interface, the feature is in the Configured state in the output of the show healthcheck status command. Enabling the netconf-yang agent is a prerequisite for enabling the health check feature. When you enable the netconf-yang agent, the status of the health check feature changes to Enabled.

  • The system health data is available only when the health check feature is in the Enabled state. If you do not configure the netconf-yang agent, the status of the health check feature remains in the Configured state.

  • If you disable the netconf-yang agent after enabling the health check feature, the status of the health check feature changes back to the Configured state.

  • The values reported in the interface-counters, asic-errors, and fabric-stats metrics are not accumulated statistics; rather, they show the actual number of drops or errors in the collection window. The collection window is the cadence for the collectors.

Monitoring Critical System Resources

This task explains how to check the health of a system using operational data from the network. The data can be queried by both CLI and NETCONF RPC, and can also be streamed using telemetry.

Before you begin

Enable the netconf-yang agent.

Procedure


Step 1

Enable NETCONF and SSH, and configure the management interface.

Example:


configure
line default
exec-timeout 0 0
session-timeout 0
!

line console
exec-timeout 0 0
session-timeout 0
!

vty-pool default 0 99 line-template default
!

netconf-yang agent ssh
!

ssh server v2
ssh server vrf default
ssh server netconf vrf default
ssh timeout 120
ssh server rate-limit 600
ssh server session-limit 110
!

commit
!

interface MgmtEth 0/RP0/CPU0/0
no shut
ipv4 address dhcp
commit
!

end

Step 2

Install the health check RPM.

Example:

install source <path-to-repository>/xr-healthcheck-<release-version>.x86_64.rpm

For instructions to download optional RPMs, see the Software Installation Guide for Cisco 8000 Series Routers.

Step 3

Check the status of all metrics, with their associated thresholds and configured parameters, in the system.

Example:

Router#show healthcheck status
Healthcheck status: Enabled

Collector Cadence: 60 seconds

System Resource metrics
  cpu
      Thresholds: Minor: 10%
                  Severe: 20%
                  Critical: 30%

       Tracked CPU utilization: 15 min avg utilization

   free-memory
        Thresholds: Minor: 10%
                    Severe: 8%
                    Critical: 5%

   filesystem
        Thresholds: Minor: 80%
                    Severe: 95%
                    Critical: 99%

   shared-memory
        Thresholds: Minor: 80%
                    Severe: 95%
                    Critical: 99%

Infra Services metrics
   fpd

   fabric-health

Step 4

View the health state for each enabled metric.

Example:


Router#show healthcheck report
Healthcheck report for enabled metrics

cpu
  State: Normal

free-memory
   State: Normal

shared-memory
   State: Normal

fpd
   State: Warning
One or more FPDs are in NEED UPGD state

fabric-health
   State: Normal

In the above output, the state of the FPD metric shows a warning, indicating that an FPD upgrade is required.

To investigate the warning further, check the metric information. In this example, check the FPD state.

FPD Metric State: Warning
Last Update Time: 17 Feb 18:28:57.917193
FPD Service State: Enabled
Number of Active Nodes: 69

Node Name: 0/0/CPU0
    Card Name: 8800-LC-48H
    FPD Name: Bios
    HW Version: 0.31
    Status: NEED UPGD
    Run Version: 5.01
    Programmed Version: 5.01

-------------Truncated for brevity---------------
The Last Update Time is the timestamp when the health for that metric was computed. This timestamp gets refreshed with each collector run based on the cadence.

Note

 
With the health check service enabled, no other configuration change is permitted. To change and commit a configuration, first disable the service:

Router#configure
Router(config)#no healthcheck enable
Router(config)#commit

Step 5

Customize the health check configuration for the following parameters:

  • Cadence: To change the preset cadence, use the command:
    Router(config)#healthcheck cadence <30-1800>
  • Metric: To list the metrics that can be configured, use the command:
    
    Router(config)#healthcheck metric ?
      cpu            cpu configurations(cisco-support)
      fabric-health  fabric configurations(cisco-support)
      filesystem     Filesystem usage configurations(cisco-support)
      fpd            FPD configurations(cisco-support)
      free-mem       free memory configurations(cisco-support)
      shared-mem     shared memory configurations(cisco-support)
    
    For example, to change the preset value of CPU metric, use the command:
    
    Router(config)#healthcheck metric cpu ?
       threshold minor, severe or critical threshold
       avg_cpu_util 1min, 5min or 15min
            Router(config)#healthcheck metric cpu threshold ?
            minor       minor threshold in %
            severe      severe threshold in %
            critical    critical threshold in %
          
  • Disable or enable metrics to selectively filter the metrics that are monitored. By default, all metrics are enabled. A combined sketch of these customization commands follows this list.
    
    Router(config)#[no] healthcheck metric cpu disable
    Router(config)#[no] healthcheck metric free-mem disable
    
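
The following is a hypothetical sketch that combines the customization commands shown above; the cadence, threshold, and metric choices are illustrative only, not recommendations.

Router# configure
Router(config)# healthcheck cadence 120
Router(config)# healthcheck metric cpu threshold minor 15
Router(config)# healthcheck metric free-mem disable
Router(config)# commit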

Monitoring Infrastructure Services

This task explains how to check the health of the infrastructure services of a system. The data can be queried by both CLI and NETCONF RPC, and can also be streamed using telemetry.

Before you begin

Enable the netconf-yang agent.

Perform steps 1 through 3 from the section Monitoring Critical System Resources.

Procedure


Step 1

Check the health status of the infrastructure metrics in the system. By default, the router software enables the health check for infrastructure services.

Example:

The following example shows how to obtain the health-check status for the platform metric:
Router# show healthcheck metric platform
Platform Metric State: Normal ==========> Health of the metric
Last Update Time: 25 Jun 05:17:03.508172 =====> Timestamp at which the metric data was collected
Platform Service State: Enabled =====> Service state of Platform
Number of Racks: 1 ======> Total number of racks in the testbed
Rack Name: 0 
Number of Slots: 12
Slot Name: RP0 
Number of Instances: 2
Instance Name: CPU0 
Node Name 0/RP0/CPU0 
Card Type 8800-RP 
Card Redundancy State Active 
Admin State NSHUT 
Oper State IOS XR RUN

To retrieve the health-check information for the platform metric using a NETCONF GET operation, use the subtree filter shown below:

<health-check xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-healthcheck-oper">
<metric-info>
<platform/>
</metric-info>
</health-check>
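
For reference, a complete NETCONF <get> request that carries this subtree filter looks like the following sketch (the message-id value is arbitrary):

<rpc message-id="101" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <get>
    <filter type="subtree">
      <health-check xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-healthcheck-oper">
        <metric-info>
          <platform/>
        </metric-info>
      </health-check>
    </filter>
  </get>
</rpc>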

Example:

The following example shows how to obtain the health-check status for the redundancy metric:
Router# show healthcheck metric redundancy
Redundancy Metric State: Normal ==========> Health of the metric
Last Update Time: 25 Jun 05:21:14.562291 =====> Timestamp at which the metric data was collected
Redundancy Service State: Enabled =====> Service state of the metric
Active: 0/RP0/CPU0
Standby: 0/RP1/CPU0
HA State: Node Ready
NSR State: Ready

To retrieve the health-check information for the redundancy metric using a NETCONF GET operation, use the subtree filter shown below:

<health-check xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-healthcheck-oper">
<metric-info>
<redundancy/>
</metric-info>
</health-check>

Step 2

Disable health-check of any of the metrics, if required. By default, all metrics are enabled.

Example:

The following example shows how to disable health check for the platform metric:
Router(config)# healthcheck metric platform disable
Router(config)# commit

To disable health check for the platform metric using NETCONF, use the configuration shown below:

<health-check xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-healthcheck-cfg">
<ord-b>
<metric>
<platform>
<disable/>
</platform>
</metric>
</ord-b>
</health-check>
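
For reference, this configuration snippet is delivered inside a NETCONF <edit-config> request against the candidate datastore and then committed with a separate <commit> RPC, mirroring the CLI commit step above. A minimal sketch of the enclosing request (message-id arbitrary) is:

<rpc message-id="102" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <edit-config>
    <target>
      <candidate/>
    </target>
    <config>
      <health-check xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-healthcheck-cfg">
        <ord-b>
          <metric>
            <platform>
              <disable/>
            </platform>
          </metric>
        </ord-b>
      </health-check>
    </config>
  </edit-config>
</rpc>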

Example:

The following example shows how to disable health check for the redundancy metric:
Router(config)# healthcheck metric redundancy disable
Router(config)# commit

To disable health check for the redundancy metric using NETCONF, use the configuration shown below:

<health-check xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-healthcheck-cfg">
<ord-b>
<metric>
<redundancy>
<disable/>
</redundancy>
</metric>
</ord-b>
</health-check>

Monitoring Counters

This task explains how to check the health of the counters of a system. The counter values that can be monitored are interface-counters, asic-errors and fabric-statistics.

Before you begin

Enable the netconf-yang agent.

Perform steps 1 through 3 from the section Monitoring Critical System Resources.

Procedure


Step 1

Configure the size of the buffer that stores the history of the counter values, as shown in the following examples.

Example:

The following example shows how to configure the buffer size for the interface-counters metric to store values for the last 5 cadence snapshots:
Router(config)# healthcheck metric intf-counters counter-size 5
Router(config)# commit

Example:

The following example shows how to configure the buffer size for the asic-errors counters to store values for the last 5 cadence snapshots:
Router(config)# healthcheck metric asic-errors counter-size 5
Router(config)# commit

Example:

The following example shows how to configure the buffer size for the fabric-stats counters to store values for the last 5 cadence snapshots:
Router(config)# healthcheck metric fabric-stats counter-size 5
Router(config)# commit

Step 2

Configure the list of interfaces for which the interface counters should be tracked, as shown in the following examples. This is possible only for the interface-counters metric.

Example:

The following example shows how to configure the list of interfaces for which the interface counters need to be tracked:
Router(config)# healthcheck metric intf-counters intf-list MgmtEth0/RP0/CPU0/0 HundredGigE0/0/0/0 
Router(config)# commit

Example:

The following example shows how to configure all interfaces so that the interface counters are tracked for them:
Router(config)# healthcheck metric intf-counters intf-list all 
Router(config)# commit

Step 3

By default, the router software enables the health check for counters. Check the health status of the counters in the system as shown in the following examples.

Example:

The following example shows how to obtain the health-check status for the interface-counters metric:
Router# show healthcheck metric interface-counters summary
Interface-counters Health State: Normal ==========> Health of the metric
Last Update Time: 25 Jun 05:59:33.965851 =====> Timestamp at which the metric data was collected
Interface-counters Service State: Enabled =====> Service state of the metric
Interface MgmtEth0/RP0/CPU0/0 =====> Configured interface for healthcheck monitoring
Counter-Names Count Average Consistently-Increasing
------------------------------------------------------------------------------------------------
output-buffers-failures 0 0 N
Counter-Names =====> Name of the counters
Count =====> Value of the counter collected at "Last Update Time"
Average =====> Average of all values available in buffer
Consistently-Increasing =====> Trend of the counter values, as per data available in buffer
Router# show healthcheck metric interface-counters detail all
Thu Jun 25 06:02:03.145 UTC
Last Update Time: 25 Jun 06:01:35.217089 =====> Timestamp at which the metric data was collected
Interface MgmtEth0/RP0/CPU0/0 =====> Configured interface for healthcheck monitoring
Following table displays data for last <x=5> values collected in periodic cadence intervals
-------------------------------------------------------------------------------------------------------
Counter-name Last 5 values
LHS = Earliest RHS = Latest
-------------------------------------------------------------------------------------------------------
output-buffers-failures 0 0 0 0 0 
parity-packets-received 0 0 0 0 0

Example:

The following example shows how to obtain the health-check status for the asic-errors counters:
Router# show healthcheck metric asic-errors summary 
Asic-errors Health State: Normal ==========> Health of the metric
Last Update Time: 25 Jun 06:20:47.65152 =====> Timestamp at which the metric data was collected
Asic-errors Service State: Enabled =====> Service state of the metric
Node Name: 0/1/CPU0 =====> Node name for healthcheck monitoring

Instance: 0 =====> Instance of the Node

Counter-Names Count Average Consistently-Increasing
------------------------------------------------------------------------------------------------
Link Errors 0 0 N
Counter-Names =====> Name of the counters
Count =====> Value of the counter collected at "Last Update Time"
Average =====> Average of all values available in buffer
Consistently-Increasing =====> Trend of the counter values, as per data available in buffer
Router# show healthcheck metric asic-errors detail all 
Thu Jun 25 06:25:13.778 UTC
Last Update Time: 25 Jun 06:24:49.510525 =====> Timestamp at which the metric data was collected
Node Name: 0/1/CPU0 =====> Node name for healthcheck monitoring
Instance: 0 =====> Instance of the Node
Following table displays data for last <x=5> values collected in periodic cadence intervals
-------------------------------------------------------------------------------------------------------
Counter-name Last 5 values
LHS = Earliest RHS = Latest
-------------------------------------------------------------------------------------------------------
Link Errors            0     0     0     0     0

Example:

The following example shows how to obtain the health-check status for the fabric-stats counters:
Router# show healthcheck metric fabric-stats summary 
Thu Jun 25 06:51:13.154 UTC
Fabric-stats Health State: Normal ==========> Health of the metric
Last Update Time: 25 Jun 06:51:05.669753 =====> Timestamp at which the metric data was collected
Fabric-stats Service State: Enabled =====> Service state of the metric
Fabric plane id 0 =====> Plane ID
Counter-Names Count Average Consistently-Increasing
------------------------------------------------------------------------------------------------
mcast-lost-cells 0 0 N
Counter-Names =====> Name of the counters
Count =====> Value of the counter collected at "Last Update Time"
Average =====> Average of all values available in buffer
Consistently-Increasing =====> Trend of the counter values, as per data available in buffer
Router# show healthcheck metric fabric-stats detail all 
Thu Jun 25 06:56:20.944 UTC
Last Update Time: 25 Jun 06:56:08.818528 =====> Timestamp at which the metric data was collected
Fabric Plane id 0 =====> Fabric Plane ID

Following table displays data for last <x=5> values collected in periodic cadence intervals
-------------------------------------------------------------------------------------------------------
Counter-name Last 5 values
LHS = Earliest RHS = Latest
-------------------------------------------------------------------------------------------------------
mcast-lost-cells 0 0 0 0 0

Step 4

If required, disable health-check of any of the counters. By default, all counters are enabled.

Example:

The following example shows how to disable health check for the interface-counters metric:
Router(config)# healthcheck metric intf-counters disable
Router(config)# commit

Example:

The following example shows how to disable health check for the asic-errors counters:
Router(config)# healthcheck metric asic-errors disable
Router(config)# commit

Example:

The following example shows how to disable health check for the fabric-stats counters:
Router(config)# healthcheck metric fabric-stats disable
Router(config)# commit