System Health Monitoring KPIs
The following table lists the KPIs and thresholds to track the overall performance of the PCF deployment, including information about the underlying hardware.
CPU Utilization
Description: CPU is a critical system resource. When demand pushes CPU utilization above 80%, CPU efficiency drops: application processing time and message response time increase, and drops and timeouts occur.
Statistics/Formula: (avg without (cpu,mode)(irate(node_cpu_seconds_total{component="node-exporter",mode!="idle"}[1m])))
Warning Threshold: > 60% utilization over 60 second period (assuming that idle is less than 40%)
Major Threshold: > 80% utilization over 60 second period (assuming idle is less than 20%)
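The two thresholds above can be read as a simple classification of non-idle CPU time. The following Python sketch illustrates that reading only. It assumes the idle CPU fraction has already been obtained from the monitoring system; the function name classify_cpu and the labels "ok", "warning", and "major" are hypothetical and not part of the PCF.

```python
def classify_cpu(idle_fraction: float) -> str:
    """Map an idle CPU fraction (0.0 to 1.0) to a severity label.

    Hypothetical helper: the 60% and 80% limits come from the thresholds
    above; the returned labels are illustrative only.
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0.0 and 1.0")
    # Utilization is everything that is not idle, expressed as a percentage.
    utilization_pct = (1.0 - idle_fraction) * 100.0
    if utilization_pct > 80.0:  # idle below 20%
        return "major"
    if utilization_pct > 60.0:  # idle below 40%
        return "warning"
    return "ok"
```

For example, an idle fraction of 0.30 corresponds to roughly 70% utilization and falls in the warning band.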
CPU Steal
Description: If multiple VMs on the same hypervisor and hardware have concurrent CPU demands, the hypervisor “steals” CPU from one VM to satisfy another VM's CPU needs. A non-zero CPU Steal statistic indicates that not enough CPU is allocated for the VMs.
Statistics/Formula: (avg without (cpu,mode)(irate(node_cpu_seconds_total{component="node-exporter",mode="steal"}[1m])))
Warning Threshold: NA
Major Threshold: > 2% over 60 second period
CPU I/O Wait
Description: This monitors CPU I/O wait time. High CPU wait times may indicate CPUs waiting on disk access.
Statistics/Formula: (avg without (cpu,mode)(irate(node_cpu_seconds_total{component="node-exporter",mode="iowait"}[1m])))
Warning Threshold: > 30% for more than 5 minutes
Major Threshold: > 50% for more than 10 minutes
Memory utilization
Description: Memory is a system resource whose utilization needs to stay below 80%. The swap threshold has been reduced, so swapping should occur when system resources are exhausted and memory utilization reaches 99%.
Statistics/Formula: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes)
Warning Threshold: > 70% utilization over 60 second period
Major Threshold: > 80% utilization over 60 second period
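As a worked illustration of the memory formula, the following Python sketch applies the same arithmetic to two byte counts. The function name and the sample values are assumptions chosen for the example, not values taken from a PCF deployment.

```python
def memory_utilization_pct(mem_available_bytes: int, mem_total_bytes: int) -> float:
    """Apply 100 - (available * 100 / total) to two byte counts.

    Hypothetical helper mirroring the formula above; the arguments stand in
    for node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes.
    """
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return 100.0 - (mem_available_bytes * 100.0) / mem_total_bytes


GIB = 1024 ** 3  # bytes in one gibibyte

# Illustrative host: 16 GiB total with 4 GiB available.
example_pct = memory_utilization_pct(4 * GIB, 16 * GIB)
```

In this example the utilization is 75%, which is above the 70% warning threshold but below the 80% major threshold.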
Disk Utilization
Description: Disk storage is a critical system resource. When file system utilization exceeds 90%, the system can become less efficient. When file system utilization reaches 100%, the application can stop functioning.
Statistics/Formula: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"})
Warning Threshold: > 80% utilization
Major Threshold: > 90% utilization
In Queue
Description: This statistic monitors how long a message waits in the application queue before it is serviced. The value should not exceed 10 ms at any time. Higher values indicate that the application is too slow, short of resources, or overwhelmed.
Statistics/Formula: sum(irate(input_queue_duration_seconds[1m])) / sum(irate(input_queue_total[1m]))
Warning Threshold: NA
Major Threshold: More than 10 ms over 60 seconds
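The In Queue formula divides the rate of accumulated wait time (seconds per second) by the rate of messages (messages per second), which gives the average wait per message in seconds. The following Python sketch shows that arithmetic and the conversion to milliseconds for comparison with the 10 ms threshold. The function name and the choice to return 0.0 when no messages are flowing are assumptions made for illustration.

```python
def average_queue_wait_ms(duration_rate_s: float, message_rate: float) -> float:
    """Average time a message spends in the input queue, in milliseconds.

    duration_rate_s: seconds of queue wait accumulated per second
                     (stands in for the input_queue_duration_seconds rate).
    message_rate:    messages per second
                     (stands in for the input_queue_total rate).
    Hypothetical helper: returns 0.0 when there is no traffic, which is an
    assumption of this sketch.
    """
    if message_rate <= 0:
        return 0.0
    return duration_rate_s / message_rate * 1000.0


def exceeds_major_threshold(wait_ms: float) -> bool:
    """True when the average wait is more than the 10 ms major threshold."""
    return wait_ms > 10.0
```

For example, 0.5 seconds of wait accumulated per second across 100 messages per second is an average of about 5 ms per message, which is within the threshold; 3 seconds per second across 200 messages per second is about 15 ms, which is not.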
Diameter 3xxx errors
Description: Diameter Too Busy 3xxx messages indicate that the PCF is overwhelmed or responding too slowly. This can be related to In Queue issues, system resources, database problems, network latency, or issues with other external nodes in the call flow.
Statistics/Formula: sum(irate(diameter_responses_total{result_code=~"3.*"}[1m])) *100/sum(irate(diameter_responses_total{result_code=~"2001"}[1m]))
Warning Threshold: > 0.5% over 30-minute period
Major Threshold: > 1% over 30-minute period
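Note that the formula expresses errors as a percentage of successful (2001) responses, not of all responses. The following Python sketch illustrates that ratio. The function name is hypothetical, and returning None when there are no successful responses is a choice made for this sketch only.

```python
from typing import Optional


def error_pct_of_success(error_rate: float, success_rate: float) -> Optional[float]:
    """Errors per second as a percentage of successful responses per second.

    Hypothetical helper mirroring the shape of the formula above:
    error_rate * 100 / success_rate. Returns None when there are no
    successful responses to compare against (an assumption of this sketch).
    """
    if success_rate <= 0:
        return None
    return error_rate * 100.0 / success_rate
```

For example, 2 error responses per second against 400 successful responses per second gives 0.5%, which sits exactly at the warning limit and does not exceed it.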
Diameter 5xxx errors
Description: Session Not Found and other Diameter 5xxx errors indicate a critical problem with the ability to process the incoming Diameter message. This can be related to exhausted system resources, an invalid session ID, bad message structure, length, or content, or even database corruption.
Statistics/Formula: sum(irate(diameter_responses_total{result_code=~"5.*"}[1m])) *100/sum(irate(diameter_responses_total{result_code=~"2001"}[1m]))
Warning Threshold: > 0.5% over 5-minute period
Major Threshold: > 1% over 5-minute period
N7 5xx Errors
Description: N7 errors indicate that the PCF is unable to process N7 requests. This can be related to a service timeout, service unavailability, or an internal error.
Statistics/Formula: sum(irate(incoming_request_total{interface_name="N7", result_code=~"5.*"}[1m])) * 100 / sum(irate(incoming_request_total{interface_name="N7", result_code=~"2.*"}[1m]))
Warning Threshold: > 0.5% over 30 minute period
Major Threshold: > 1% over 30 minute period
N28 5xx Errors
Description: N28 errors indicate that the PCF is unable to process N28 requests. This can be related to a service timeout, service unavailability, or an internal error.
Statistics/Formula: sum(irate(outgoing_request_total{interface_name="N28", result_code=~"5.*"}[1m])) * 100 / sum(irate(outgoing_request_total{interface_name="N28", result_code=~"2.*"}[1m]))
Warning Threshold: > 0.5% over 30 minute period
Major Threshold: > 1% over 30 minute period
Active Session Count
Description: Number of total sessions currently active.
Statistics/Formula: avg(db_records_total{session_type="total"})
Warning Threshold:
> 80% of the lesser of the dimensioned or licensed capacity for more than 1 hour
or
= 0 for more than 5 minutes
Major Threshold:
> 80% of the lesser of the dimensioned or licensed capacity for more than 10 minutes
or
= 0 for more than 10 minutes
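The session count thresholds combine a value condition with a duration. The following Python sketch shows one possible way to evaluate such a rule against a series of timestamped samples. It assumes that "for more than" means the condition held on every consecutive sample across the interval; the function names, the sample layout, and the capacity figures are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def capacity_limit(dimensioned: int, licensed: int) -> float:
    """80% of the lesser of the dimensioned and licensed capacity."""
    return min(dimensioned, licensed) * 80 / 100


def held_longer_than(
    samples: Iterable[Tuple[int, float]],
    condition: Callable[[float], bool],
    duration_s: int,
) -> bool:
    """True if condition held on consecutive samples spanning more than duration_s.

    samples: (unix_timestamp_seconds, value) pairs in ascending time order.
    Hypothetical helper; the streak restarts whenever one sample fails.
    """
    streak_start = None
    for timestamp, value in samples:
        if condition(value):
            if streak_start is None:
                streak_start = timestamp
            if timestamp - streak_start > duration_s:
                return True
        else:
            streak_start = None
    return False


# Illustrative data: one sample per minute for 12 minutes, all reporting zero sessions.
zero_samples = [(minute * 60, 0.0) for minute in range(12)]
major_zero_alarm = held_longer_than(zero_samples, lambda v: v == 0, 10 * 60)
```

With these samples the zero-session condition spans 11 minutes, so the 10-minute major rule is met. A capacity check would use the same helper with a condition such as lambda v: v > capacity_limit(1_000_000, 800_000).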
% of Messages dropped due to SLA timeout
Description: Messages dropped due to SLA timeouts indicate that the PCF is overwhelmed, or responding too slowly. This can be related to In Queue issues, system resources, database problems, network latency, or issues with other external nodes in the call flow.
Statistics/Formula: sum(irate(input_queue_result_total{result="drop"}[1m]))*100/(sum(irate(incoming_request_total{result_code=~"2.*"}[1m])) + sum(irate(diameter_responses_total{result_code="2001"}[1m])))
Warning Threshold: > 0.5%
Major Threshold: > 1%