SMI Application Level Statistics and KPI

Subscriber Microservices Infrastructure Monitoring Key Performance Indicators

This section describes the Key Performance Indicators (KPIs) that are useful for tracking the overall health of Subscriber Microservices Infrastructure (SMI).

Licensing KPIs

The following custom statistics and queries enable you to monitor the license count for a product's entitlement.

entitlement_status

Description

Captures the current requested license counts for a product’s entitlement.

Metric Type

Gauge

Data Type

Int

Statistics

entitlement_status{enforce_mode="Eval"}

NOTES:

  • enforce_mode: The current enforcement mode used by an entitlement. For example, Invalid, Init, Waiting, InCompliance, OutOfCompliance, Overage, Eval, EvalExpired, AuthorizedPeriodExpired, InvalidTag, ReservedInCompliance, and NotAuthorized.

  • tag: The entitlement tag that reports to the smart licensing server. For example, regid.2019-03.com.cisco.SMI-TEST-ALL,1.0_63366461-0177-4c93-8eea-c9b02e9843f8.
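
As an illustration of how the enforce_mode label can be used, the following Python sketch keeps only the entitlement_status samples whose enforcement mode is not InCompliance. The sample dictionaries, the tag values, and the helper name non_compliant are assumptions made for this example and are not part of SMI.

```python
# Hypothetical sample shape: one dict per entitlement_status time series.
SAMPLES = [
    {"tag": "tag-a", "enforce_mode": "InCompliance", "value": 4},
    {"tag": "tag-b", "enforce_mode": "Eval", "value": 2},
    {"tag": "tag-c", "enforce_mode": "OutOfCompliance", "value": 1},
]


def non_compliant(samples):
    """Return the samples whose enforce_mode is anything other than InCompliance."""
    return [s for s in samples if s["enforce_mode"] != "InCompliance"]


if __name__ == "__main__":
    for s in non_compliant(SAMPLES):
        print(s["tag"], s["enforce_mode"], s["value"])
```

Whether modes such as ReservedInCompliance should also be treated as compliant depends on the deployment; the sketch applies a strict inequality only.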

System Status KPIs

The following custom statistics and queries enable you to monitor the status of the system.

system_mode

Description

Indicates the mode in which the system is currently running.

Metric Type

Gauge

Data Type

Int

Statistics

system_mode

NOTES:

  • 0: The system is in shutdown mode.

  • 1: The system is running.

  • 2: The system is under maintenance.

  • -1: The system mode is unknown.
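
The values listed above can be translated into readable text when the gauge is consumed by a script. The following Python sketch shows one way to do that; the function name describe_system_mode and the fallback text are assumptions for illustration.

```python
# Meanings taken from the NOTES list for system_mode.
SYSTEM_MODES = {
    0: "shutdown",
    1: "running",
    2: "maintenance",
    -1: "unknown",
}


def describe_system_mode(value):
    """Translate a system_mode gauge value into a readable label."""
    # Gauge samples usually arrive as floats, so convert before the lookup.
    return SYSTEM_MODES.get(int(value), "unrecognized value")


if __name__ == "__main__":
    for sample in (1.0, 0.0, 2.0, -1.0):
        print(sample, describe_system_mode(sample))
```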

system_synch_running

Description

Specifies whether the system configuration sync process is running.

Metric Type

Gauge

Data Type

Int

Statistics

system_synch_running

NOTES:

  • 1: The system configuration sync process is running.

  • 0: The system configuration sync process is not running.

system_running_percent

Description

Captures the percentage of system pods that are currently running.

Metric Type

Gauge

Data Type

Percent

Statistics

system_running_percent

System Configuration KPIs

The following custom statistics and queries enable you to monitor the system configuration.

system_configuration_backup_total

Description

Captures the total number of system configuration backups that are executed.

Metric Type

Counter

Data Type

Configuration Count

Statistics

irate(system_configuration_backup_total[1m])

NOTES:

  • status: The status of the executed backups. For example, success or error.
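
The query for this statistic uses the PromQL irate() function, which derives a per-second rate from the last two samples inside the range window. The following Python sketch reproduces that arithmetic on hypothetical (timestamp, value) pairs. The function name and the sample data are assumptions for illustration only.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples.

    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    elapsed = t_last - t_prev
    if elapsed <= 0:
        return None
    # A drop in a counter means it was reset, so the last value is the increase.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / elapsed


# Hypothetical backup counter scraped every 15 seconds.
window = [(0, 10), (15, 12), (30, 15), (45, 21)]

if __name__ == "__main__":
    print(irate(window))
```

Because only the last two samples are used, the result reacts quickly to bursts but can look noisy on a dashboard.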

configuration_change_total

Description

Captures the total number of configuration changes that are executed.

Metric Type

Counter

Data Type

Configuration Count

Statistics

sum(irate(configuration_change_total[1m]))

Prometheus KPIs

The following custom statistics and queries enable you to monitor the Prometheus KPIs.

helm_chart_deploy_success

Description

Captures the helm chart deployment status.

Metric Type

Gauge

Data Type

Int

Statistics

helm_chart_deploy_success{chart="chart_name",release="release_name",chartVersion="chart_version"}

NOTES:

  • 1 : The helm chart deployment is successful.

  • 0 : The helm chart deployment failed.

system_synch_error

Description

Specifies the status of system synchronization.

Metric Type

Gauge

Data Type

Int

Statistics

system_synch_error

NOTES:

  • 1 : The system synchronization is successful.

  • 0 : The system synchronization failed.

system_synch_pending

Description

Specifies the status of the system synchronization progress.

Metric Type

Gauge

Data Type

Int

Statistics

system_synch_pending

NOTES:

  • 1 : The system synchronization is pending.

  • 0 : The system synchronization is not pending.

system_configuration_backup_pending

Description

Specifies the status of the DB to configmap backup progress.

Metric Type

Gauge

Data Type

Int

Statistics

system_configuration_backup_pending

NOTES:

  • 1 : The backup of DB to configmap is pending.

  • 0 : The backup of DB to configmap is not pending.

helm_repository_status

Description

Specifies the status of the helm repositories.

Metric Type

Gauge

Data Type

Int

Statistics

helm_repository_status

NOTES:

  • 1 : The helm repository is reachable.

  • 0 : The helm repository is unreachable.

Statistics from Open Source Collector Services

SMI exposes statistics from the following Open Source Collector Services:

Table 1. Open Source Collector Services

Collector Service     Documentation

kube-state-metrics    https://github.com/kubernetes/kube-state-metrics/tree/release-1.8/docs

cAdvisor              https://github.com/google/cadvisor/blob/v0.33.1/docs/storage/prometheus.md

node exporter         https://github.com/prometheus/node_exporter/tree/v0.18.1

For more information on the supported statistics, refer to the collector documentation.

SMI Bulkstatistics Support

SMI supports the configuration of bulkstatistics through the CEE Ops Center.

The following bulkstatistics are considered KPIs. Configure them to monitor your deployment effectively:

  • system-mode

  • system-running-percent

  • configuration-change-total

  • cpu-core-count

  • node-load-1

  • node-load-5

  • node-load-15

  • node-disk-rate-read-bytes

  • node-disk-write-read-bytes

  • node-memory-free-bytes

  • network-transmit-bond-bytes-total

  • network-receive-bond-bytes-total

  • network-carrier-bond-changes-total

  • network-transmit-ens-bytes-total

  • network-receive-ens-bytes-total

  • network-carrier-ens-changes-total

  • k8s-pods-status

  • active-alerts

  • filesystem-root-avail-bytes

  • filesystem-data-avail-bytes

  • kubelet-running-pod-count

  • entitlement-status

  • memory-used

  • cpu-idle

  • cpu-softirq

  • cpu-system

  • cpu-iowait

  • cpu-steal

  • cpu-user

  • kubelet-node-status

  • network-errors-total

  • daemonset-ready-percent

  • deployment-ready-percent

  • statefulset-ready-percent

Additional details on the above bulkstatistics are provided in Appendix A: Bulkstatistic KPI Details.

Appendix A: Bulkstatistic KPI Details

system-mode

Query Expression:

system_mode

Namespace Specific:

Y

Label:

Description:

Indicates whether the system is running or shut down: 1 is running, 0 is shut down.

Unit:

Boolean

Type:

Gauge

Threshold of Normal:

1

system-running-percent

Query Expression:

system_running_percent

Namespace Specific:

Y

Label:

Description:

The percentage of system pods that are currently running.

Unit:

Percentage

Type:

Gauge

Threshold of Normal:

100

configuration-change-total

Query Expression:

configuration_change_total

Namespace Specific:

Y

Label:

Description:

The total number of configuration changes executed.

Unit:

Changes

Type:

Counter

Threshold of Normal:

N/A

cpu-core-count

Query Expression:

count(count(node_cpu_seconds_total) without (mode)) without (cpu)

Namespace Specific:

N

Label:

hostname

Description:

The total CPU cores on a given host.

Unit:

Cores

Type:

Gauge

Threshold of Normal:

N/A

node-load-1

Query Expression:

node_load1

Namespace Specific:

N

Label:

hostname

Description:

The Linux load1 value.

Unit:

Load

Type:

Gauge

Threshold of Normal:

Ratio < 1 compared to cpu-core-count

node-load-5

Query Expression:

node_load5

Namespace Specific:

N

Label:

hostname

Description:

The Linux load5 value.

Unit:

Load

Type:

Gauge

Threshold of Normal:

Ratio < 1 compared to cpu-core-count

node-load-15

Query Expression:

node_load15

Namespace Specific:

N

Label:

hostname

Description:

The Linux load15 value.

Unit:

Load

Type:

Gauge

Threshold of Normal:

Ratio < 1 compared to cpu-core-count
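
The threshold for the three load KPIs is expressed as a ratio against cpu-core-count. The following Python sketch shows that comparison for one load value per host. The function name, the dictionaries, and the hostnames are hypothetical and serve only to illustrate the arithmetic.

```python
def overloaded_hosts(load_by_host, cores_by_host):
    """Return {hostname: ratio} for hosts whose load per core is 1 or more."""
    flagged = {}
    for host, load in load_by_host.items():
        cores = cores_by_host.get(host)
        if not cores:
            continue  # core count missing or zero, so no ratio can be computed
        ratio = load / cores
        if ratio >= 1:
            flagged[host] = ratio
    return flagged


# Hypothetical values for node-load-15 and cpu-core-count, keyed by hostname.
load15 = {"node-1": 3.2, "node-2": 9.0}
cores = {"node-1": 8, "node-2": 8}

if __name__ == "__main__":
    print(overloaded_hosts(load15, cores))
```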

node-disk-rate-read-bytes

Query Expression:

sum(rate(node_disk_read_bytes_total[5m])) by (hostname)

Namespace Specific:

N

Label:

hostname

Description:

The disk read byte rate (5 minute rate) for a host.

Unit:

Bytes per second

Type:

Gauge

Threshold of Normal:

Workload dependent

node-disk-write-read-bytes

Query Expression:

sum(rate(node_disk_written_bytes_total[5m])) by (hostname)

Namespace Specific:

N

Label:

hostname

Description:

The disk write byte rate (5 minute rate) for a host.

Unit:

Bytes per second

Type:

Gauge

Threshold of Normal:

Workload dependent

node-memory-free-bytes

Query Expression: node_memory_MemFree_bytes
Namespace Specific: N
Label: hostname
Description: The bytes of free memory for a host
Unit: Bytes
Type: Gauge
Threshold of Normal: < 1,000,000,000

network-transmit-bond-bytes-total

Query Expression: sum(node_network_transmit_bytes_total{device=~"bond[0-9]"}) by (hostname)
Namespace Specific: N
Label: hostname
Description: The bytes transmitted over the "bond" interfaces
Unit: Bytes
Type: Counter
Threshold of Normal:

network-receive-bond-bytes-total

Query Expression: sum(node_network_receive_bytes_total{device=~"bond[0-9]"}) by (hostname)
Namespace Specific: N
Label: hostname
Description: The bytes received over the "bond" interfaces
Unit: Bytes
Type: Counter
Threshold of Normal:

network-carrier-bond-changes-total

Query Expression:

sum(node_network_carrier_changes_total{device=~"bond[0-9]"}) by (hostname)

Namespace Specific:

N

Label:

hostname

Description:

The total instances of "bond" carrier changes.

Unit:

Changes

Type:

Counter

Threshold of Normal:

network-transmit-ens-bytes-total

Query Expression: sum(node_network_transmit_bytes_total{device=~"ens.*"}) by (hostname)
Namespace Specific: N
Label: hostname
Description: The bytes transmitted over the "ens" interfaces
Unit: Bytes
Type: Counter
Threshold of Normal:

network-receive-ens-bytes-total

Query Expression: sum(node_network_receive_bytes_total{device=~"ens.*"}) by (hostname)
Namespace Specific: N
Label: hostname
Description: The bytes received over the "ens" interfaces
Unit: Bytes
Type: Counter
Threshold of Normal:

network-carrier-ens-changes-total

Query Expression: sum(node_network_carrier_changes_total{device=~"ens.*"}) by (hostname)
Namespace Specific: N
Label: hostname
Description: The total instances of "ens" carrier changes.
Unit: Changes
Type: Counter
Threshold of Normal:
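
The device matchers in the bond and ens queries are regular expressions. PromQL anchors regex matchers at both ends, so bond[0-9] selects bond0 through bond9 but not bond10, while ens.* selects any interface whose name starts with ens. The following Python sketch imitates that behavior with re.fullmatch. Prometheus uses RE2 syntax rather than Python's re module, but these two simple patterns behave the same in both. The helper name and the interface names are assumptions for illustration.

```python
import re


def matches(pattern, device):
    """Mimic a PromQL regex matcher, which must match the whole label value."""
    return re.fullmatch(pattern, device) is not None


if __name__ == "__main__":
    for name in ["bond0", "bond10", "ens192", "eth0"]:
        print(name, matches("bond[0-9]", name), matches("ens.*", name))
```

If a host has more than ten bonded interfaces, a pattern such as bond[0-9]+ would be needed to include them.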

k8s-pods-status

Query Expression: sum(kube_pod_status_phase) by (phase)
Namespace Specific: N
Label: phase
Description: The total kubernetes pods by phase. Phases are "Running", "Pending", "Failed", "Succeeded"
Unit: Pods
Type: Gauge
Threshold of Normal:

active-alerts

Query Expression: sum(ALERTS{alertstate="firing"}) by (alertname)
Namespace Specific: N
Label: alertname
Description: The current active alerts.
Unit: Alerts
Type: Gauge
Threshold of Normal:

filesystem-root-avail-bytes

Query Expression: avg(node_filesystem_avail_bytes{device="/dev/sda1"}) by (hostname)
Namespace Specific: N
Label: hostname
Description: The current available bytes for root disk
Unit: Bytes
Type: Gauge
Threshold of Normal: < 10,000,000,000

filesystem-data-avail-bytes

Query Expression: avg(node_filesystem_avail_bytes{device="/dev/vda1"}) by (hostname)
Namespace Specific: N
Label: hostname
Description: The current available bytes for data disk
Unit: Bytes
Type: Gauge
Threshold of Normal: < 10,000,000,000

kubelet-running-pod-count

Query Expression: kubelet_running_pod_count
Namespace Specific: N
Label: hostname
Description: The current running pod count by host.
Unit: Pods
Type: Gauge
Threshold of Normal:

entitlement-status

Query Expression: entitlement_status{enforce_mode!="InCompliance"}
Namespace Specific: N
Label: tag
Description: The current out of compliance entitlements.
Unit: Int32
Type: Gauge
Threshold of Normal:

memory-used

Query Expression: sum(node_memory_MemTotal_bytes) by (hostname) - sum(node_memory_MemFree_bytes) by (hostname)
Namespace Specific: N
Label: hostname
Description: The bytes of used memory on the host.
Unit: Int64
Type: Gauge
Threshold of Normal:

cpu-idle

Query Expression: avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (hostname)*100.00
Namespace Specific: N
Label: hostname
Description: The 1 minute average CPU idle time on the node.
Unit: Int64
Type: Gauge
Threshold of Normal:

cpu-softirq

Query Expression: avg(rate(node_cpu_seconds_total{mode="softirq"}[1m])) by (hostname)*100.00
Namespace Specific: N
Label: hostname
Description: The 1 minute average CPU softirq time on the node.
Unit: Int64
Type: Gauge
Threshold of Normal:

cpu-system

Query Expression: avg(rate(node_cpu_seconds_total{mode="system"}[1m])) by (hostname)*100.00
Namespace Specific: N
Label: hostname
Description: The 1 minute average CPU system time on the node.
Unit: Int64
Type: Gauge
Threshold of Normal:

cpu-iowait

Query Expression: avg(rate(node_cpu_seconds_total{mode="iowait"}[1m])) by (hostname)*100.00
Namespace Specific: N
Label: hostname
Description: The 1 minute average CPU iowait time on the node.
Unit: Int64
Type: Gauge
Threshold of Normal:

cpu-steal

Query Expression: avg(rate(node_cpu_seconds_total{mode="steal"}[1m])) by (hostname)*100.00
Namespace Specific: N
Label: hostname
Description: The 1 minute average CPU steal time on the node.
Unit: Int64
Type: Gauge
Threshold of Normal:

cpu-user

Query Expression: avg(rate(node_cpu_seconds_total{mode="user"}[1m])) by (hostname)*100.00
Namespace Specific: N
Label: hostname
Description: The 1 minute average CPU user time on the node.
Unit: Int64
Type: Gauge
Threshold of Normal:
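
The six cpu-* KPIs share one pattern: take the per-second rate of node_cpu_seconds_total for a mode on each CPU, average across the CPUs of a host, and multiply by 100. The following Python sketch walks through that arithmetic. It is a simplification: PromQL rate() also extrapolates to the window boundaries and compensates for counter resets, which this sketch does not. The function name and the counter values are hypothetical.

```python
def cpu_mode_percent(start_seconds, end_seconds, window_seconds):
    """Average share of time, in percent, that a host's CPUs spent in one mode.

    start_seconds and end_seconds hold the per-CPU counter values for that
    mode at the beginning and end of the window.
    """
    per_cpu_rates = [
        (end - start) / window_seconds
        for start, end in zip(start_seconds, end_seconds)
    ]
    return sum(per_cpu_rates) / len(per_cpu_rates) * 100.0


# Hypothetical idle-mode counters for a 4-CPU host over a 60-second window.
start = [1000.0, 1000.0, 1000.0, 1000.0]
end = [1054.0, 1057.0, 1060.0, 1045.0]

if __name__ == "__main__":
    print(cpu_mode_percent(start, end, 60))
```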

kubelet-node-status

Query Expression: sum(kube_node_status_condition{status="true"}) by (condition)
Namespace Specific: N
Label: condition
Description: The kubelet node status by condition.
Unit: Int32
Type: Gauge
Threshold of Normal:

network-errors-total

Query Expression: sum(node_network_receive_errs_total) by (hostname)
Namespace Specific: N
Label: hostname
Description: The number of network errors by node.
Unit: Int64
Type: Counter
Threshold of Normal:

daemonset-ready-percent

Query Expression: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100
Namespace Specific: N
Label: daemonset
Description: The percent ready for the given daemonset.
Unit: Float
Type: Gauge
Threshold of Normal:

deployment-ready-percent

Query Expression: kube_deployment_status_replicas_available/ kube_deployment_status_replicas * 100
Namespace Specific: N
Label: deployment
Description: The percent ready for the given deployment.
Unit: Float
Type: Gauge
Threshold of Normal:

statefulset-ready-percent

Query Expression: kube_statefulset_status_replicas_ready/ kube_statefulset_status_replicas * 100
Namespace Specific: N
Label: statefulset
Description: The percent ready for the given statefulset
Unit: Float
Type: Gauge
Threshold of Normal:
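
The three ready-percent KPIs divide a ready or available count by the desired count and multiply by 100. The following Python sketch shows the same calculation and returns None when nothing is desired, since the division is undefined in that case. The function name is an assumption for illustration.

```python
def ready_percent(ready, desired):
    """Percentage of desired replicas that are ready, or None if none are desired."""
    if desired == 0:
        return None  # avoid dividing by zero when nothing is scheduled
    return ready / desired * 100


if __name__ == "__main__":
    print(ready_percent(3, 4))
    print(ready_percent(0, 0))
```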