System Health Monitoring KPIs

The following table lists the KPIs and thresholds to track the overall performance of the PCF deployment, including information about the underlying hardware.

CPU Utilization

Description: CPU is a critical system resource. When demand increases and CPU utilization exceeds 80%, CPU efficiency drops: application processing time and message response time increase, and message drops and timeouts occur.

Statistics/Formula: (avg without (cpu,mode)(irate(node_cpu_seconds_total{component="node-exporter",mode!="idle"}[1m])))

Warning Threshold: > 60% utilization over a 60-second period (assuming idle is less than 40%)

Major Threshold: > 80% utilization over a 60-second period (assuming idle is less than 20%)
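As a rough illustration of how the formula and thresholds above fit together, the sketch below mimics what `irate` does (a per-second rate taken from the last two samples of a counter) and classifies the result. It simplifies the KPI to a single core whose non-idle CPU seconds have already been summed. The helper names `irate` and `classify_cpu` and the sample values are hypothetical and are not part of the PCF product.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) counter samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be increasing")
    delta = v1 - v0
    if delta < 0:
        # Counter reset: treat the new value as the whole increase.
        delta = v1
    return delta / (t1 - t0)


def classify_cpu(utilization_pct, warning=60.0, major=80.0):
    """Map a utilization percentage to the thresholds listed above."""
    if utilization_pct > major:
        return "major"
    if utilization_pct > warning:
        return "warning"
    return "ok"


# Hypothetical samples: (seconds, cumulative non-idle CPU seconds) for one core.
non_idle = [(0.0, 100.0), (15.0, 110.5)]
utilization_pct = irate(non_idle) * 100
print(classify_cpu(utilization_pct))
```

Because only the last two samples are used, a single busy scrape interval can move the value sharply, which is one reason the thresholds are stated over a 60-second period.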

CPU Steal

Description: If multiple VMs on the same hypervisor and hardware have concurrent CPU demands, the hypervisor will “steal” CPU from one VM to satisfy another VM's CPU needs. A non-zero CPU Steal statistic indicates that not enough CPU is allocated for the VMs.

Statistics/Formula: (avg without (cpu,mode)(irate(node_cpu_seconds_total{component="node-exporter",mode="steal"}[1m])))

Warning Threshold: NA

Major Threshold: > 2% over a 60-second period

CPU I/O Wait

Description: This KPI monitors CPU I/O wait time. High I/O wait times may indicate that CPUs are waiting on disk access.

Statistics/Formula: (avg without (cpu,mode)(irate(node_cpu_seconds_total{component="node-exporter",mode="iowait"}[1m])))

Warning Threshold: > 30% for more than 5 min

Major Threshold: > 50% for more than 10 min
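The I/O wait thresholds are duration-based: the value must stay above the limit for the whole period, not just touch it once. A minimal sketch of that check follows, assuming samples arrive as (timestamp in seconds, value) pairs in time order. The function name `sustained_above` and the sample data are hypothetical.

```python
def sustained_above(samples, threshold, min_duration_s):
    """True if the most recent unbroken run of samples above `threshold`
    spans more than `min_duration_s` seconds."""
    run_start = None
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts  # a new run above the threshold begins here
        else:
            run_start = None  # any dip resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start > min_duration_s


# Hypothetical one-minute samples of I/O wait, held at 35 for seven minutes.
steady = [(t * 60, 35.0) for t in range(8)]
print(sustained_above(steady, threshold=30, min_duration_s=5 * 60))
```

A single sample at or below the limit restarts the clock, so a brief recovery prevents the condition from being reported until the value has again stayed high for the full period.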

Memory utilization

Description: Memory is a critical system resource, and memory utilization should stay below 80%. The swap threshold has been reduced, and swapping should occur only when system resources are exhausted and memory utilization reaches 99%.

Statistics/Formula: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes)

Warning Threshold: > 70% utilization over a 60-second period

Major Threshold: > 80% utilization over a 60-second period
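The memory formula can be checked by hand with a small worked example. The sketch below applies the same arithmetic to two byte counts; the function name `memory_utilization_pct` and the 16 GiB / 4 GiB figures are illustrative assumptions, not values from a real deployment.

```python
GIB = 1024 ** 3


def memory_utilization_pct(mem_available_bytes, mem_total_bytes):
    """100 - (MemAvailable * 100 / MemTotal), as in the formula above."""
    if mem_total_bytes <= 0:
        raise ValueError("total memory must be positive")
    return 100 - (mem_available_bytes * 100) / mem_total_bytes


# Hypothetical node: 16 GiB total, 4 GiB still available.
used_pct = memory_utilization_pct(4 * GIB, 16 * GIB)
print(used_pct)  # 75.0, above the 70% warning level but below the 80% major level
```

The formula is based on available memory rather than free memory, so memory the kernel can reclaim is not counted as used.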

Disk Utilization

Description: Disk storage is a critical system resource. When file system utilization exceeds 90%, the system can become less efficient. When file system utilization reaches 100%, the application can stop functioning.

Statistics/Formula: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})

Warning Threshold: > 80% utilization

Major Threshold: > 90% utilization

In Queue

Description: This statistic monitors how long a message waits in the application queue before it is serviced. The value should stay at or below 10 ms at all times. Higher values indicate that the application is too slow, short of resources, or overwhelmed.

Statistics/Formula: sum(irate(input_queue_duration_seconds[1m])) / sum(irate(input_queue_total[1m]))

Warning Threshold: NA

Major Threshold: More than 10 ms over 60 seconds
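The In Queue formula divides the rate at which queueing time accumulates by the rate at which messages are counted, which yields the average wait per message in seconds. The sketch below repeats that arithmetic and converts the result to milliseconds for comparison with the 10 ms limit. The function name `avg_queue_wait_ms` and the input rates are hypothetical.

```python
def avg_queue_wait_ms(duration_rate, message_rate):
    """Average time a message spends queued, in milliseconds.

    duration_rate: queued seconds accumulated per second
    message_rate:  messages counted per second
    Returns None when no messages were counted, since the ratio is undefined.
    """
    if message_rate == 0:
        return None
    return duration_rate / message_rate * 1000.0


# Hypothetical rates: 1.5 s of queueing per second across 100 messages per second.
wait_ms = avg_queue_wait_ms(1.5, 100.0)
print(wait_ms is not None and wait_ms > 10)  # True: about 15 ms per message
```

When traffic is idle, both rates are zero and the ratio has no meaningful value, so an alert on this KPI should tolerate missing data instead of treating it as a breach.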

Diameter 3xxx errors

Description: Diameter Too Busy and other 3xxx messages indicate that the PCF is overwhelmed or responding too slowly. This can be related to In Queue issues, system resources, database problems, network latency, or issues with other external nodes in the call flow.

Statistics/Formula: sum(irate(diameter_responses_total{result_code=~"3.*"}[1m])) *100/sum(irate(diameter_responses_total{result_code=~"2001"}[1m]))

Warning Threshold: > 0.5% over 30 minute period

Major Threshold: > 1% over 30 minute period

Diameter 5xxx errors

Description: Session Not Found and other Diameter 5xxx errors indicate a critical problem with the ability to process the incoming Diameter message. This can be related to exhausted system resources, an invalid session ID, bad message structure, length, or content, or even database corruption.

Statistics/Formula: sum(irate(diameter_responses_total{result_code=~"5.*"}[1m])) *100/sum(irate(diameter_responses_total{result_code=~"2001"}[1m]))

Warning Threshold: > 0.5% over 5 minute period

Major Threshold: > 1% over 5 minute period
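The Diameter error KPIs are ratios against successful (2001) responses, not against all responses, so the percentage can exceed the share of errors in total traffic. A minimal sketch of that calculation follows, assuming per-second response rates are already available keyed by result code. The function name `error_pct` and the rates are hypothetical.

```python
def error_pct(rates_by_code, error_prefix, success_code="2001"):
    """Error-class rate as a percentage of the success rate.

    rates_by_code: mapping of Diameter result code (string) to responses/second
    error_prefix:  "3" for 3xxx errors, "5" for 5xxx errors
    Returns None when there are no successful responses to compare against.
    """
    errors = sum(
        rate for code, rate in rates_by_code.items()
        if code.startswith(error_prefix)
    )
    successes = rates_by_code.get(success_code, 0.0)
    if successes == 0:
        return None
    return errors * 100 / successes


# Hypothetical rates: 3004 is Too Busy, 5002 is Unknown Session ID,
# 5012 is Unable To Comply.
rates = {"2001": 1000.0, "3004": 6.0, "5002": 4.0, "5012": 8.0}
print(error_pct(rates, "3"), error_pct(rates, "5"))
```

With these sample rates the 3xxx ratio is about 0.6% and the 5xxx ratio about 1.2%, so the first would cross only the warning level and the second the major level if sustained for the stated period.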

N7 5xx Errors

Description: N7 errors indicate that the PCF is unable to process N7 requests. This can be related to a service timeout, an unavailable service, or an internal error.

Statistics/Formula: sum(irate(incoming_request_total{interface_name="N7", result_code=~"5.*"}[1m])) * 100 / sum(irate(incoming_request_total{interface_name="N7", result_code=~"2.*"}[1m]))

Warning Threshold: > 0.5% over 30 minute period

Major Threshold: > 1% over 30 minute period

N28 5xx Errors

Description: N28 errors indicate that the PCF is unable to process N28 requests. This can be related to a service timeout, an unavailable service, or an internal error.

Statistics/Formula: sum(irate(outgoing_request_total{interface_name="N28", result_code=~"5.*"}[1m])) * 100 / sum(irate(outgoing_request_total{interface_name="N28", result_code=~"2.*"}[1m]))

Warning Threshold: > 0.5% over 30 minute period

Major Threshold: > 1% over 30 minute period

Active Session Count

Description: Number of total sessions currently active.

Statistics/Formula: avg(db_records_total{session_type="total"})

Warning Threshold: > 80% of the lesser of the dimensioned or licensed capacity for more than 1 hour, or = 0 for more than 5 minutes

Major Threshold: > 80% of the lesser of the dimensioned or licensed capacity for more than 10 minutes, or = 0 for more than 10 minutes
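The session count thresholds combine two conditions: too many sessions relative to the smaller of the two capacity figures, or no sessions at all. The sketch below evaluates only the instantaneous condition and leaves the duration requirement to the alerting layer. The function name `session_condition` and the capacity numbers are hypothetical.

```python
def session_condition(active_sessions, dimensioned_capacity, licensed_capacity):
    """Classify the current session count, ignoring how long it has persisted."""
    capacity = min(dimensioned_capacity, licensed_capacity)
    if active_sessions == 0:
        return "zero"  # may indicate a traffic or database outage
    # Integer form of: active_sessions > 80% of capacity
    if active_sessions * 100 > 80 * capacity:
        return "high"
    return "normal"


# Hypothetical deployment: dimensioned for 1,000,000 sessions, licensed for 800,000.
# The limit is therefore 80% of 800,000 = 640,000 sessions.
print(session_condition(650_000, 1_000_000, 800_000))  # high
```

Using the smaller of the two figures means the alert fires before whichever limit would be reached first, whether that is the hardware sizing or the license.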

% of Messages dropped due to SLA timeout

Description: Messages dropped due to SLA timeouts indicate that the PCF is overwhelmed, or responding too slowly. This can be related to In Queue issues, system resources, database problems, network latency, or issues with other external nodes in the call flow.

Statistics/Formula: sum(irate(input_queue_result_total{result="drop"}[1m]))*100/(sum(irate(incoming_request_total{result_code=~"2.*"}[1m])) + sum(irate(diameter_responses_total{result_code="2001"}[1m])))

Warning Threshold: > 0.5%

Major Threshold: > 1%