System Health Check

Proactively monitoring the systems in a network helps you detect potential issues early and take preventive action. This chapter describes how to configure and monitor the system health check.

System health checks

The Cisco NCS 1014 health check service is a system monitoring service that

  • monitors physical characteristics, current processing status, and the currently utilized resources to assess the condition of the device at any time,

  • analyzes the system health by tracking metrics that are critical for the functioning of Cisco NCS 1014, and

  • is installed with the Cisco NCS 1014 RPM.

Each system health metric is evaluated against thresholds set on the device to monitor the usage of CPU and other system resources.

System resource metrics states

You can evaluate the system's health by examining the metric values. Values that approach or cross the set thresholds suggest potential problems. By default, metrics for system resources are configured with preset threshold values. You can choose which metrics to monitor by enabling or disabling individual metrics based on your requirements.

Each metric is tracked and compared with its configured thresholds, and the state of the resource is classified accordingly.

The system resources metrics can be in one of these states:

  • Normal: The resource usage is less than the minor threshold value.

  • Minor: The resource usage is more than the minor threshold, but less than the severe threshold value.

  • Severe: The resource usage is more than the severe threshold, but less than the critical threshold value.

  • Critical: The resource usage is more than the critical threshold value.
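The classification above can be sketched in Python. This is a minimal illustration and not the device's implementation: the function name classify_usage is hypothetical, the default thresholds are the CPU values from the sample show healthcheck status output later in this chapter, and treating a value exactly equal to a threshold as not having crossed it is an assumption, because the states are defined only in terms of "more than" and "less than". The sketch applies to usage-style metrics, where a higher value is worse.

```python
def classify_usage(usage_pct, minor=20.0, severe=50.0, critical=75.0):
    """Map a usage percentage to Normal, Minor, Severe, or Critical.

    Hypothetical helper for illustration only. Defaults mirror the CPU
    thresholds in the sample 'show healthcheck status' output.
    """
    if not (minor < severe < critical):
        raise ValueError("thresholds must satisfy minor < severe < critical")
    # Assumption: a value equal to a threshold has not yet crossed it.
    if usage_pct > critical:
        return "Critical"
    if usage_pct > severe:
        return "Severe"
    if usage_pct > minor:
        return "Minor"
    return "Normal"


if __name__ == "__main__":
    for sample in (10, 30, 60, 90):
        print(sample, classify_usage(sample))
```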

Infrastructure service metrics states

The infrastructure services metrics can be in one of these states:

  • Normal: The resource operation is as expected.

  • Warning: The resource needs attention. For example, a warning is displayed when the Field-Programmable Device (FPD) needs an upgrade.

Supported system health check metrics

Cisco NCS 1014 supports the following system health check metrics:

  • communication-timeout

  • cpu

  • filesystem

  • fpd

  • free-mem

  • hw-monitoring

  • lc-monitoring

  • pci-monitoring

  • platform

  • process-resource

  • process-status

  • shared-mem

  • wd-monitoring

Enable the health check

Enable the health check service on Cisco NCS 1014 so that the system can monitor configured metrics and report the health state of system resources and infrastructure services.

Use this task to enable health check, allowing Cisco NCS 1014 to monitor system health.

Before you begin

Before enabling health check, ensure that:

  • An IP address and subnet mask are assigned to the management interface.

  • The IP address of the default gateway is configured with a static route.

For more details, see the Configure Management Interface section of the Cisco NCS 1014 System Setup and Software Installation Guide.

Procedure


Step 1

Enter the configuration mode.

Step 2

Run the healthcheck enable command to enable health check.

Example:

RP/0/RP0/CPU0:ios(config)#healthcheck enable

Step 3

Run the netconf-yang agent ssh command.

Example:

RP/0/RP0/CPU0:ios(config)#netconf-yang agent ssh

Step 4

Run the grpc local-connection command to enable Google Remote Procedure Call (gRPC).

Example:

RP/0/RP0/CPU0:ios(config)#grpc local-connection

Step 5

Commit the changes using the commit command.


The health check service is enabled on Cisco NCS 1014. The NETCONF and gRPC agents are running, allowing you to monitor the health status of system resources and infrastructure services.

Health check refresh cadence

A health check refresh cadence is a configurable system parameter that

  • determines how often Cisco NCS 1014 updates and reports health check status,

  • defines the time interval, in seconds, between consecutive metric refreshes, and

  • enables administrators to customize monitoring to match their operational requirements.

By default, the health check refresh cadence is set to 60 seconds, meaning health check status is updated every 60 seconds. You can change this interval by using the healthcheck cadence [cadence-value] command, where [cadence-value] specifies your preferred time interval in seconds.

The following example shows how to change the health check cadence value to 50 seconds so that health check status is updated every 50 seconds.

RP/0/RP0/CPU0:ios(config)#healthcheck cadence 50
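If you generate configuration from a script, the enable procedure and the optional cadence change can be assembled as an ordered list of commands. The following Python sketch is illustrative only: the function name healthcheck_enable_commands is hypothetical, the commands are the ones shown in this chapter, and the check that the cadence is a positive whole number is an assumption, because the range of values that the device accepts is not stated here.

```python
def healthcheck_enable_commands(cadence_seconds=None):
    """Return the CLI lines that enable health check, in order.

    Hypothetical helper. If cadence_seconds is given, a
    'healthcheck cadence' line is added before the commit.
    """
    commands = [
        "configure",
        "healthcheck enable",
        "netconf-yang agent ssh",
        "grpc local-connection",
    ]
    if cadence_seconds is not None:
        # Assumption: only positive whole seconds make sense here;
        # the range enforced by the device is not documented in this chapter.
        if (isinstance(cadence_seconds, bool)
                or not isinstance(cadence_seconds, int)
                or cadence_seconds <= 0):
            raise ValueError("cadence must be a positive integer (seconds)")
        commands.append(f"healthcheck cadence {cadence_seconds}")
    commands.append("commit")
    return commands


if __name__ == "__main__":
    print("\n".join(healthcheck_enable_commands(50)))
```

The sketch only builds the text of the session. How the lines are delivered to the device (console, SSH, or an automation tool) is outside its scope.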

View the status of all metrics

Check the overall health and configuration status of all metrics in the Cisco NCS 1014 system.

Use this task to view detailed information about supported health check metrics, including which metrics are enabled, their current state, threshold settings, and the health check manager’s operational status.

Procedure


Step 1

Run the show healthcheck status command to display which metrics are enabled, their threshold settings, and the collector cadence.

Example:

RP/0/RP0/CPU0:ios#show healthcheck status
Sat Jun 12 02:00:25.204 UTC

Healthcheck status: Enabled
Time started: 12 Jun 02:00:22.392972

Collector Cadence: 30 seconds

METRICS STATS 

System Resource metrics
   cpu
       Thresholds: Minor: 20%
                  Severe: 50%
                Critical: 75%

       Tracked CPU utilization: 15 min avg utilization

   free-memory
       Thresholds: Minor: 10%
                  Severe: 8%
                Critical: 5%

   filesystem
       Thresholds: Minor: 80%
                  Severe: 95%
                Critical: 99%
          
   shared-memory
       Thresholds: Minor: 80%
                  Severe: 95%
                Critical: 99%
          
Infra Services metrics
   platform
          
   fpd    
          
Install Custom Metrics
   process-status
          
   process-resource
          
   communication-timeout
          
   pci-monitoring
          
   hw-monitoring
          
   wd-monitoring
          
   lc-monitoring
          
Use case  
 Use cases are disabled

Step 2

Run the show healthcheck internal states command to check the internal health state and configuration status of the health check manager.

Example:

RP/0/RP0/CPU0:ios#show healthcheck internal states 
Sat Jun 12 02:00:55.425 UTC

 Internal Structure INFO 

 Current state: Enabled 

 Reason: Success 

 Netconf Config State: Enabled 

 Grpc Config State: Enabled 

 Nosi state: Initialized 

 Appmgr conn state: Connected 

 Nosi lib state: Not ready 

 Nosi client: Valid client  

Step 3

Run the show healthcheck report command to view the current health state for each enabled metric.

Example:

RP/0/RP0/CPU0:ios#show healthcheck report
Sat Jun 12 02:02:54.417 UTC

Healthcheck report
Last Update Time: 12 Jun 02:02:46.955241 
METRICS REPORT 

cpu
  State: Normal

free-memory
  State: Normal

filesystem
  State: Normal

shared-memory
  State: Normal

platform
  State: Warning
  Reason: One or more devices are not in operational state

fpd
  State: Warning
  Reason: One or more FPDs are not in CURRENT state
          
process-status
  State: Normal
          
process-resource
  State: Normal
          
communication-timeout
  State: Normal
          
pci-monitoring
  State: Normal
          
hw-monitoring
  State: Normal
          
wd-monitoring
  State: Normal
          
lc-monitoring
  State: Normal

In this output, the state of the Field-Programmable Device (FPD) shows a warning message that indicates an FPD upgrade is required.


The system displays the status of all configured health metrics, including thresholds and current health states. You can quickly identify any warnings or critical states that may need further action.
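If you capture the show healthcheck report output as text, a short script can pick out the metrics that are not in the Normal state. The following Python sketch is illustrative only: the function names are hypothetical, and the parsing rules (a metric name is a line of lowercase letters and hyphens, followed by State: and optional Reason: lines) are assumptions based on the sample output in this procedure, not a documented output format.

```python
import re

SAMPLE_REPORT = """\
Healthcheck report
Last Update Time: 12 Jun 02:02:46.955241
METRICS REPORT

cpu
  State: Normal

platform
  State: Warning
  Reason: One or more devices are not in operational state

fpd
  State: Warning
  Reason: One or more FPDs are not in CURRENT state
"""


def parse_healthcheck_report(text):
    """Return {metric: {"state": ..., "reason": ...}} from report text."""
    states = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("State:") and current is not None:
            state = line.split(":", 1)[1].strip()
            states[current] = {"state": state, "reason": None}
        elif line.startswith("Reason:") and current in states:
            states[current]["reason"] = line.split(":", 1)[1].strip()
        elif re.fullmatch(r"[a-z][a-z-]*", line):
            # Assumption: metric names are lowercase words with hyphens.
            current = line
    return states


def metrics_needing_attention(states):
    """Return the sorted names of metrics whose state is not Normal."""
    return sorted(name for name, info in states.items()
                  if info["state"] != "Normal")


if __name__ == "__main__":
    report = parse_healthcheck_report(SAMPLE_REPORT)
    for name in metrics_needing_attention(report):
        print(name, report[name]["state"], report[name]["reason"])
```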

Metric threshold values

A metric threshold value is a configurable system parameter that

  • defines the usage level at which an alert is raised for a monitored metric,

  • allows operators to tailor alerting behavior to their operational needs, and

  • supports adjustment for each metric based on specific operational requirements.

To change a threshold value, use this command syntax:

healthcheck metric metric-name threshold threshold-value

To change the minor threshold value of the CPU metric to 25%, enter this command:

healthcheck metric cpu minor threshold 25%

In this example, the threshold for minor alerts on CPU usage is set to 25%, allowing the system to notify you when CPU usage crosses this value.
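Before you change a threshold, it can help to confirm that the three levels of a metric still escalate in the expected direction. The following Python sketch is illustrative only: the function name check_threshold_order is hypothetical, and the ordering rule is an assumption inferred from the sample show healthcheck status output, where usage metrics such as cpu rise from minor to critical (20, 50, 75) and free-memory falls (10, 8, 5). Whether the device itself rejects values that are out of order is not stated in this chapter.

```python
def check_threshold_order(minor, severe, critical, higher_is_worse=True):
    """Return True if the thresholds escalate from minor to critical.

    Hypothetical helper. Use higher_is_worse=False for metrics such as
    free memory, where a lower value indicates a worse condition.
    """
    if higher_is_worse:
        return minor < severe < critical
    return minor > severe > critical


if __name__ == "__main__":
    # Raising the CPU minor threshold to 25 keeps the order 25 < 50 < 75.
    print(check_threshold_order(25, 50, 75))
    # Free-memory thresholds from the sample output fall: 10 > 8 > 5.
    print(check_threshold_order(10, 8, 5, higher_is_worse=False))
```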

View health status of an individual metric

Monitor and troubleshoot individual system metrics by checking their health status.

Use this task to check the health status of a chosen metric on Cisco NCS 1014 systems. This helps you focus on and address issues related to specific resources or services.

Procedure


Run the show healthcheck metric metric-name command.

Example:

This example shows how to obtain the health-check status for the filesystem metric:

RP/0/RP0/CPU0:ios#show healthcheck metric filesystem
Sat Jun 12 02:01:32.432 UTC
Filesystem Metric State: Normal
Last Update Time: 12 Jun 02:01:04.446619
Filesystem Service State: Enabled
Number of Active Nodes: 1
Configured Thresholds:
   Minor: 80%
   Severe: 95%
   Critical: 99%

Node Name: 0/RP0/CPU0
    Partition Count: 5

    Partition Name: tftp:
        Partition Access Attribute: rw
        Partition Type: network
        Partition Size: 0
        Partition Free Bytes: 0
        Partition Free Space in %: 0

    Partition Name: disk0:
        Partition Access Attribute: rw
        Partition Type: flash-disk
        Partition Size: 20024897536
        Partition Free Bytes: 19978481664
        Partition Free Space in %: 99
          
    Partition Name: /misc/config
        Partition Access Attribute: rw
        Partition Type: flash
        Partition Size: 151314698240
        Partition Free Bytes: 146903269376
        Partition Free Space in %: 97
          
    Partition Name: harddisk:
        Partition Access Attribute: rw
        Partition Type: harddisk
        Partition Size: 150114078720
        Partition Free Bytes: 144962641920
        Partition Free Space in %: 96
          
    Partition Name: ftp:
        Partition Access Attribute: rw
        Partition Type: network
        Partition Size: 0
        Partition Free Bytes: 0
        Partition Free Space in %: 0

Example:

This example shows how to obtain the health-check status for the platform metric:

RP/0/RP0/CPU0:ios#show healthcheck metric platform
Sat Jun 12 02:01:51.922 UTC
Platform Metric State: Warning
Last Update Time: 12 Jun 02:01:38.650003
Platform Service State: Enabled
Number of Racks: 1

Rack Name: 0
    Number of Slots: 5

    Slot Name: RP0
        Number of Instances: 1

    Instance Name: CPU0
        Node Name 0/RP0/CPU0
        Card Type NCS1K14-CNTLR-K9
        Card Redundancy State Active
        Admin State NSHUT,NMON
        Oper State IOS XR RUN

    Slot Name: PM1
        Number of Instances: 0

        Node Name 0/PM1
        Card Type NCS1K4-AC-PSU-2
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State OPERATIONAL
          
    Slot Name: FT1
        Number of Instances: 0
          
        Node Name 0/FT1
        Card Type NCS1K14-FAN
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State OPERATIONAL
          
    Slot Name: FT2
        Number of Instances: 0
          
        Node Name 0/FT2
        Card Type NCS1K14-FAN
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State OPERATIONAL
          
    Slot Name: 2
        Number of Instances: 1
          
    Instance Name: NXR0
        Node Name 0/2/NXR0
        Card Type NCS1K4-1.2T-K9
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State CARD FAILED

The health status for the specified metric is displayed in the CLI, allowing you to assess operational states, configured thresholds, and service health.
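When the platform metric reports a Warning, the node that caused it can be found by scanning the Oper State lines of the captured output. The following Python sketch is illustrative only: the function name find_unhealthy_nodes is hypothetical, and treating OPERATIONAL and IOS XR RUN as the only healthy operational states is an assumption based on the sample output above, not a complete list of states the device can report.

```python
# Assumption: these are the healthy states seen in the sample output.
HEALTHY_OPER_STATES = {"OPERATIONAL", "IOS XR RUN"}

SAMPLE_PLATFORM_OUTPUT = """\
    Slot Name: PM1
        Number of Instances: 0

        Node Name 0/PM1
        Card Type NCS1K4-AC-PSU-2
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State OPERATIONAL

    Slot Name: 2
        Number of Instances: 1

    Instance Name: NXR0
        Node Name 0/2/NXR0
        Card Type NCS1K4-1.2T-K9
        Card Redundancy State None
        Admin State NSHUT,NMON
        Oper State CARD FAILED
"""


def find_unhealthy_nodes(text):
    """Return (node, oper_state) pairs for nodes not in a healthy state."""
    unhealthy = []
    node = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Node Name"):
            # Tolerate an optional colon after the field name.
            node = line[len("Node Name"):].lstrip(":").strip()
        elif line.startswith("Oper State") and node is not None:
            state = line[len("Oper State"):].lstrip(":").strip()
            if state not in HEALTHY_OPER_STATES:
                unhealthy.append((node, state))
            node = None
    return unhealthy


if __name__ == "__main__":
    for node, state in find_unhealthy_nodes(SAMPLE_PLATFORM_OUTPUT):
        print(node, state)
```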

Disable health check

Stop monitoring the entire system or specific resources by disabling the health check service or individual health check metrics.

Health checks monitor various system and resource metrics on Cisco NCS 1014 devices. By default, all health check metrics are enabled. Disabling health checks can be necessary when you need to perform configuration changes (as some changes are blocked while health check service is active) or when you want to suspend monitoring for specific reasons.

Procedure


Step 1

Enter the command no healthcheck enable to disable health checks for all metrics.

Example:

RP/0/RP0/CPU0:ios#configure
RP/0/RP0/CPU0:ios(config)#no healthcheck enable
RP/0/RP0/CPU0:ios(config)#commit

Note

When the health check service is enabled, other configuration changes are not permitted. Disable the service before committing configuration changes.

Step 2

Enter the command healthcheck metric metric-name disable to disable health check for an individual metric.

Example:

RP/0/RP0/CPU0:ios(config)#healthcheck metric free-mem disable

Monitoring is disabled for the health check service or for the specified metric. You can now commit configuration changes if required, and resume monitoring later by re-enabling the health check service or metric.