Alerts Reference
Alerts Summary
Alarm Name | Description | Active Alert Duration * | System Impact | Associated Alerts | Validation | Resolution |
---|---|---|---|---|---|---|
server-alert |
UCS server hardware alerts |
Always |
Depending on the hardware component, it can impact the full node functionality and might cause all pods to go down on the node. |
k8s-node- not-ready kvm-node- not-ready |
Verify the CEE alerts. If there is a k8s-node-not-ready or kvm-node-not-ready alert, then it means the hardware issue might have caused the OS to crash. | Check the CIMC of the node to troubleshoot the problem and decide if an RMA is required. |
server-not- reachable-alert |
UCS server hardware alert |
Always |
CIMC server is not reachable. |
— |
Check CEE and if the alert is present, escalate immediately. |
— |
k8s-node-not- ready |
K8s node is in NotReady state for more than 1 minute | 30 minutes | The node is not reporting health-check. There could be multiple reasons for this alert, for example, node down, OS crash, network issues, hardware, or a bug. | - | Verify the node status on the cluster. Investigate potential reason for the node being not ready. | This may be expected during MW. If node stays in the non-ready state outside MW, then contact your Cisco Account representative to troubleshoot. |
k8s-node- status-change |
K8s node Ready status changed during the last 5 minutes | 30 minutes | The node changed state from Ready to Not Ready and back due to network issues, reboots, or other issues. | - | Verify the node status on the cluster. Investigate potential reason for the node changing status. | This may be expected during MW. If a node continues to change state outside MW, then contact your Cisco Account representative to troubleshoot. |
kvm-node-not -ready |
KVM node is not reachable for more than 1 minute | 1 hour | The node is not reporting health-check. There could be multiple reasons for this alert, for example, node down, OS crash, network issues, hardware, or a bug. | - | Verify the CEE alerts to ensure that the alerts are active. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
kvm-tunnels- flapping |
KVM node reachability changed during the last 5 minutes | 1 hour | The connectivity to KVM node is not stable. | - | Verify the CEE alerts to ensure that the alerts are active. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-pod- restarting |
Pod restarting one or more times during the last 5 minutes | 1 hour | The overall impact is minimal and is expected during MW. Verify the service alerts. | - | Verify the CEE alerts to ensure that the alerts are active. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-pod- crashing-loop |
Pod restarting two or more times during the last 5 minutes | 1 hour | The overall impact is minimal and is expected during MW. Verify the service alerts. This alert indicates a continuous problem that needs investigation. | - | Verify the CEE alerts to ensure that the alerts are active. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-pod- pending |
Pod is in pending state and cannot be scheduled for more than 1 minute | 1 hour | This alert is expected during MW or when a node is down. Otherwise, this alert might be because of deployment misconfiguration. |
k8s-node- not-ready |
Verify the CEE alerts to ensure that the alerts are active. Check if nodes are in the NotReady state. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-pod-not- ready |
Pod is not able to get into ready state for more than 1 minute | 1 hour | This alert is expected for a short duration during MW, otherwise it points to application issues that need to be investigated. | - | Verify the CEE alerts to ensure that the alerts are active. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-deployment- generation- mismatch |
An upgrade or change to a deployment failed | 1 hour | Some upgrade failed to run properly. | - | - | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-deployment- replica-mismatch |
Not all the pods in a replica are running for more than two minutes | 1 hour | This alert is triggered when the pods are crashing or are unable to be deployed as part of a deployment. If too many application pods in a replica are not running, then this can cause a service impact. |
k8s-pod- not-ready k8s-pod- pending k8s-pod- crashing-loop |
Verify the CEE alerts to investigate the reason for the pods not getting deployed or running. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-ss- mismatch |
Not all the pods in a stateful set are running for more than two minutes | 1 hour | This alert is triggered when the pods are crashing or are unable to be deployed as part of a stateful set. If too many application pods in a stateful set are not running, then this can cause a service impact. |
k8s-pod- not-ready k8s-pod- pending k8s-pod- crashing-loop |
Verify the CEE alerts to investigate the reason for the pods not getting deployed or running. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-ss-generation -mismatch |
An upgrade or change to a stateful set failed | 1 hour | Some upgrade failed to run properly. | - | - | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-ss-update -not-rolled-out |
The upgrade of a stateful set is stuck | 1 hour | Some upgrade failed to run properly. | - | - | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-daemonset- rollout-stuck |
Not all the pods in a daemonset are running for more than five minutes | 1 hour | Some daemonset pods are having issues to run on nodes. Daemon pods are critical for different functionalities. If they are not running, then there can be a short or long term impact to the service. |
k8s-pod- not-ready k8s-pod- pending k8s-pod- crashing-loop |
Verify the CEE alerts to investigate the reason for the pods not getting deployed or running. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-daemonset- not-scheduled |
Not all the pods in a daemonset are scheduled for more than 5 minutes | 1 hour | This is a rare case where daemon sets can not be scheduled, daemonsets have toleration for almost all taints and hence, the pods should get scheduled. This alert points to an issue with K8s or node resources. |
k8s-pod- pending |
Verify the CEE alerts to investigate the reason for the pods not getting deployed or running. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
k8s-daemonset- mischeduled |
Daemonset pods are running where these are not supposed to run | 1 hour | This alert is a rare case and can be triggered during some upgrade. This alert indicates a larger K8s issue. | - | Verify the CEE alerts to investigate the reason for the pods not getting deployed or running. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
pod-oom- killed |
Pod got killed and restarted due to OOM in the last five minutes | Always | The pod is crossing the memory limit. This is an application issue and indicates a misconfiguration or a memory leak issue. Depending on the functionality, there can be a service impact. | - | Verify the CEE alerts to investigate the reason for the pods not getting deployed or running. | Contact your Cisco Account representative to troubleshoot. |
container-memory- usage-high |
Pod crossed the 80% memory limit in the last five minutes | Always | The pod is crossing the memory limit. This is an application issue and indicates a misconfiguration or a memory leak issue. Depending on the functionality, there can be a service impact. | - | Verify the CEE alerts to investigate the reason for the pods not getting deployed or running. | Contact your Cisco Account representative to troubleshoot. |
pod-not-ready -but-all-containers -ready |
Pod is not ready, but all its containers are ready | 1 hour | This is a K8s issue and is caused when the pod is not correctly marked as running. Self-healing restarts the pod. If the alert stays on for too long, then it means either the self-healing is not working or there is a bigger issue. | - | Verify the CEE alerts and investigate the pod with the issue. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
vm-deployed | UPF VM deployed | None | This is a notification that the VM is deployed | - | - | - |
vm-alive | UPF VM running | None | This is a notification that the VM is running | - | - | - |
vm-error | UPF VM is in error state | 1 hour | The VM is in error state | - | Verify the CEE alerts to investigate the KVM VMs. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
vm-recovering | UPF VM is recovering | None | This alert is a notification that the VM is recovering. | - | - | - |
vm-recovery- failed |
UPF VM recovering failed | 1 hour | The VM failed to recover. | - | Verify the Ops Center to assess the reason for the failure. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
ops-system- sync-running |
The Ops Center is running a sync operation | None | This alert is triggered during an upgrade. The Ops Center is trying to either upgrade the application or apply recently changed configuration. | - | - | - |
ops-latest- sync-failed |
The Ops Center sync operation failed | 1 hour | The Ops Center sync operation failed to complete. This alert indicates an issue with either new release or with the latest configuration. | - | Verify the Ops Center to assess the reason for the failure. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
helm-deploy- failure |
An application helm deployment failed | 1 hour | This alerts is triggered when one of the applications fails to deploy the helm chart, usually due to misconfiguration or release issues. | - | Verify the Ops Center to assess the reason for the failure. | If the alert stays active, then contact your Cisco Account representative to troubleshoot. |
node-disk- running-full- 24hours |
The node disk is estimated to run out space in less than 24 hours | Always | The node is projected to run out of space in one of the partitions in less than 24 hours. | - | Verify the CEE alert and escalate immediately. | - |
node-disk- running-full -2hours |
The node disk is estimated to run out space in less than two hours | Always | The node is projected to run out of space in one of the partitions in less than 2 hours. If the K8s partition is affected, this alert might severely impact the services on the node. | - | Verify the CEE alert and escalate immediately. | - |
k8s-persisent- volume-usage |
The persistent volume has less than 3% free space | Always | The volume is almost full or full and most likely is impacting the application. | - | Verify the CEE alert and escalate immediately. | - |
k8s-persisent- volume-usage- projected-full |
The persistent volume will reach 100% usage in less than four days and is currently at 85% or more | Always | The volume is projected to get full in less than four days, potentially impacting the application. | - | Verify the CEE alert and escalate immediately. | - |
kube_certificate_ expiring |
Alert that the k8s certificates are about to expire | Always | The K8s certificates have to be renewed every year. With the automated process, this alert is not triggered. | - | Verify the CEE alert and escalate immediately. | - |
kubelet-too- many-pods |
Too many pods attempted to be deployed on one node | Always | The applications are designed to not exceed the max pod limit. If the alert is seen, then it indicates a misconfiguration. | - | Verify the CEE alert and escalate immediately. | - |
clock-skew- detected |
Clock skew detected on a node | Always | The node is having a wrong NTP configuration or the NTP servers have issues. This issue might be seen on new systems. | - | Verify the CEE alert and escalate immediately. | - |
clock-is-not -in-synch |
Clock is not in synch with NTP for last five minutes | Always | The node is not able to get the clock in sync with NTP. Either the NTP configuration is incorrect or there is some network issue. | - | Verify the CEE alert and escalate immediately. | - |
network-receive- errors |
Specific network is seeing receive errors in the last two minutes | Always | Networking issue with received packets. | - | Verify the CEE alert and escalate immediately. | - |
network-transmit- errors |
Specific network is seeing transmit errors in the last two minutes | Always | Networking issue with sent packets. | - | Verify the CEE alert and escalate immediately. | - |
network-interface- flapping |
Specific network up/down status is changing in the last two minutes | Always | Networking issue with specific interface. | - | Verify the CEE alert and escalate immediately. | - |
k8s-cpu- overcommit |
The CPU is overcommited compared to quota on namespaces | Always | This alert indicates a deployment misconfiguration. | - | Verify the CEE alert and escalate immediately. | - |
k8s-mem- overcommit |
The memory is overcommited compared to quota on namespaces | Always | This alert indicates a deployment misconfiguration. | - | Verify the CEE alert and escalate immediately. | - |
k8s-quota- exceeded |
A namespace is using more than 90% of its allocated CPU/memory quota | Always | This alert indicates a deployment misconfiguration. | - | Verify the CEE alert and escalate immediately. | - |
cpu-throttling- high |
K8s is throttling the pod CPU for more than 25% of the time in last five minutes | 1 hour | Application is running too hot and CPU throttling is too high. If the application pods are affected, then this can result in a service impact. | - | Verify the CEE alert and escalate immediately. | - |
cndp-ha- switchover |
Cluster Manager switchover from primary to standby | 1 hour | This alert indicates a CM failover either caused by MW, RMA or a hardware issue. | - | Verify the CEE alert and escalate immediately. | - |
backup-node- down |
Cluster Manager Backup node is in not reachable from primary | 1 hour | The backup CM is not reachable from primary either because of network issues or a hardware issue, RMA, or MW. | - | Verify the CEE alert and escalate immediately. | - |
user_password_ expiring |
User password will expire in less than <configured number of days> | Always | User password will expire and must be updated | - | Verify the CEE alert and escalate immediately. | - |
user_password_ expired |
User password expired | Always | User password expired | - | Verify the CEE alert and escalate immediately. | - |
k8s-persisent- volume-errors |
Persistent volume has issues | Always | Persistent volume has issues and can impact the application using it. Most likely, there is a hardware issue or it can be caused by a failed install or upgrade. | - | Verify the CEE alert and escalate immediately. | - |
k8s-version- mismatch |
The system has K8s components with different versions | Always | The components should always run on the same K8s version. This alert can be triggered by a failed upgrade. | - | Verify the CEE alert and escalate immediately. | - |
k8s-client- errors |
Specific K8s API server is having issues communicating with the API server | Always | This alert is triggered by either a failed upgrade, connectivity or incorrect/expired certificates. It indicates a major impact to applications or service. | - | Verify the CEE alert and escalate immediately. | - |
prometheus | Various errors related to Prometheus | 1 hour | This alert indicates issues with Prometheus or monitoring and must be investigated. | - | Verify the CEE alert and escalate immediately. | - |
node-disk- running-Low- 24hours |
The node disk partition is > 75%, and will be >80 in less than 24 hours |
Always |
Node disk partition is projected to be 80% full within the next 24 hours. |
- |
Verify the CEE alert and escalate immediately. |
- |
node-disk- running-Low- 2hours |
The node disk partition is > 75%, and will be >80% in less than 2 hours |
Always |
Node disk partition is projected to be 80% full within the next 2 hours. |
- |
Verify the CEE alert and escalate immediately. |
- |
* Escalate to investigate when the alert is active for longer than the specified time period. |
Alert Details
cndp-ha
Rules:
-
Alert: cndp-ha-switchover
-
Annotations:
-
Type: Switching Over To Primary
-
Summary: "CNDP-HA is switched {{ $labels.hostname }} over to primary."
-
-
Expression:
|
ha_is_failed_over == 1
-
For: 1m
-
Labels:
-
Severity: major
-
-
-
Alert: backup-node-down
-
Annotations:
-
Type: Backup node down
-
Summary: "The Backup CM node of {{ $labels.hostname }} is down."
-
-
Expression:
|
backup_node_status == 0
-
For: 1m
-
Labels:
-
Severity: major
-
-
kubernetes-apps
Rules:
-
Alert: pod-oom-killed
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} got OOM Killed.'
-
-
Expression:
|
sum_over_time(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[5m]) > 0
-
For: 1m
-
Labels:
-
Severity: critical
-
-
-
Alert: container-memory-usage-high
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: 'Pod {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.name }} uses high memory {{ printf "%.2f" $value }}%.'
-
-
Expression:
|
((container_memory_usage_bytes{pod!="",container!="POD",image!=""} - container_memory_cache{pod!="",container!="POD",image!=""}) / (container_spec_memory_limit_bytes{pod!="",container!="POD",image!=""} != 0)) * 100 > 80
-
For: 2m
-
Labels:
-
Severity: critical
-
-
-
Alert: pod-not-ready-but-all-containers-ready
-
Expression:
>
(count by (namespace, pod) (kube_pod_status_ready{condition="true"} == 0))
and
(
(count by (namespace, pod) (kube_pod_container_status_ready==1))
unless
(count by (namespace, pod) (kube_pod_container_status_ready==0))
)
-
For: 5m
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container}}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
-
-
Expression:
|
rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 > 0
-
For: 1m
-
Labels:
-
Severity: minor
-
-
-
Alert: k8s-pod-crashing-loop
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container}}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
-
-
Expression:
|
rate(kube_pod_container_status_restarts_total[5m]) * 60 * 5 >= 2
-
For: 1m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-pod-pending
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in pending state state for longer than 1 minute.
-
-
Expression:
|
sum by (namespace, pod) (kube_pod_status_phase{ phase=~"Failed|Pending|Unknown"}) > 0
-
For: 1m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-pod-not-ready
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 1 minute.
-
-
Expression:
|
sum by (namespace, pod) (kube_pod_status_ready{condition="false"}) > 0
-
For: 1m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-deployment-generation-mismatch
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back.
-
-
Expression:
|
kube_deployment_status_observed_generation
!=
kube_deployment_metadata_generation
-
For: 5m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-deployment-replica-mismatch
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 2 minutes.
-
-
Expression:
|
kube_deployment_spec_replicas
!=
kube_deployment_status_replicas_available
-
For: 2m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-ss-mismatch
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 5 minutes.
-
-
Expression:
|
kube_statefulset_status_replicas_ready
!=
kube_statefulset_status_replicas
-
For: 5m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-ss-generation-mismatch
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back.
-
-
Expression:
|
kube_statefulset_status_observed_generation
!=
kube_statefulset_metadata_generation
-
For: 5m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-ss-update-not-rolled-out
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out.
-
-
Expression:
|
max without (revision) (
kube_statefulset_status_current_revision
unless
kube_statefulset_status_update_revision
)
*
(
kube_statefulset_replicas
!=
kube_statefulset_status_replicas_updated
)
-
For: 5m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-daemonset-rollout-stuck
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Only {{ $value }}% of the desired Pods of DaemonSet {{ $labels.namespace }}/ {{ $labels.daemonset }} are scheduled and ready.
-
-
Expression:
|
kube_daemonset_status_number_ready
/
kube_daemonset_status_desired_number_scheduled * 100 < 100
-
For: 5m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-daemonset-not-scheduled
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.'
-
-
Expression:
|
kube_daemonset_status_desired_number_scheduled
kube_daemonset_status_current_number_scheduled > 0
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-daemonset-mischeduled
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.'
-
-
Expression:
|
kube_daemonset_status_number_misscheduled > 0
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-cronjob-running
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.
-
-
Expression:
|
time() - kube_cronjob_next_schedule_time > 3600
-
For: 1h
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-job-completion
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete.
-
-
Expression:
|
kube_job_spec_completions - kube_job_status_succeeded > 0
-
For: 1h
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-job-failed
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
-
-
Expression:
|
kube_job_status_failed > 0
-
For: 1h
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-pod-cpu-usage-high
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: '{{ $labels.namespace }}.{{ $labels.pod }} pod cpu usage is above 80%.'
-
-
Expression:
|
sum(rate(container_cpu_usage_seconds_total{container!="POD", pod!="", image!=""}[5m])) by (namespace, pod) * 100 / sum(kube_pod_container_resource_limits_cpu_cores) by (namespace, pod) > 80
-
For: 1m
-
Labels:
-
Severity: major
-
-
kubernetes-resources
Rules:
-
Alert: k8s-cpu-overcommit
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Cluster has overcommitted CPU resource requests for Namespaces.
-
-
Expression:
|
sum(kube_resourcequota{ type="hard", resource="cpu"})
/
sum(kube_node_status_allocatable_cpu_cores)
> 1.5
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-mem-overcommit
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Cluster has overcommitted memory resource requests for Namespaces.
-
-
Expression:
|
sum(kube_resourcequota{ type="hard", resource="memory"})
/
sum(kube_node_status_allocatable_memory_bytes)
> 1.5
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-quota-exceeded
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.
-
-
Expression:
|
100 * kube_resourcequota{ type="used"}
/ ignoring(instance, job, type)
(kube_resourcequota{ type="hard"} > 0)
> 90
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: cpu-throttling-high
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: '{{ printf "%0.0f" $value }}% throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.'
-
-
Expression:
"100 * sum(increase(container_cpu_cfs_throttled_periods_total{container!=\"\",
}[5m])) by (container, pod, namespace)\n /\nsum(increase(container_cpu_cfs_periods_total{}[5m]))
by (container, pod, namespace)\n > 25 \n"
-
For: 2m
-
Labels:
-
Severity: major
-
-
kubernetes-storage
Rules:
-
Alert: k8s-persisent-volume-usage
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ printf "%0.2f" $value }}% free.
-
-
Expression:
|
100 * kubelet_volume_stats_available_bytes
/
kubelet_volume_stats_capacity_bytes
< 3
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-persisent-volume-usage-projected-full
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ printf "%0.2f" $value }}% is available.
-
-
Expression:
|
100 * (
kubelet_volume_stats_available_bytes
/
kubelet_volume_stats_capacity_bytes
) < 15
and
predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
-
For: 5m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-persisent-volume-errors
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}.
-
-
Expression:
|
kube_persistentvolume_status_phase{phase=~"Failed|Pending",namespace=~"(kube-.*|default|logging)"} > 0
-
Labels:
-
Severity: critical
-
-
kubernetes-system
Rules:
-
Alert: k8s-node-not-ready
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: '{{ $labels.node }} has been unready for more than 1 minutes.'
-
-
Expression:
|
kube_node_status_condition{condition="Ready",status="true"} == 0
-
For: 1m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-node-status-change
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: '{{ $labels.node }} status was changed in the past 5 minutes.'
-
-
Expression:
|
changes(kube_node_status_condition{condition="Ready",status="true"}[5m]) > 0
-
For: 0m
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-version-mismatch
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: There are {{ $value }} different semantic versions of Kubernetes components running.
-
-
Expression:
|
count(count by (gitVersion) (label_replace(kubernetes_build_info,"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: k8s-client-errors
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ printf "%0.0f" $value }}% errors.'
-
-
Expression:
|
(sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job)
/
sum(rate(rest_client_requests_total[5m])) by (instance, job))
* 100 > 1
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: kubelet-too-many-pods
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Kubelet {{ $labels.instance }} is running {{ $value }} Pods, close to the limit of 110.
-
-
Expression:
|
kubelet_running_pod_count > 110 * 0.9
-
For: 5m
-
Labels:
-
Severity: critical
-
-
-
Alert: k8s-client-cert-expiration
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: A client certificate used to authenticate to the apiserver is expiring in less than 30 days
-
-
Expression:
|
apiserver_client_certificate_expiration_seconds_count > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket[5m]))) < 2592000
-
Labels:
-
Severity: warning
-
-
-
Alert: k8s-client-cert-expiration
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.
-
-
Expression:
|
apiserver_client_certificate_expiration_seconds_count > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket[5m]))) < 86400
-
Labels:
-
Severity: critical
-
-
general.rules
Rules:
-
Alert: watchdog
-
Annotations:
-
Type: Communications Alarm
-
Summary: | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing.
-
-
Expression:
vector(1)
-
Labels:
-
Severity: minor
-
-
sync.rules
Rules:
-
Alert: ops-system-sync-running
-
Annotations:
-
Type: Communications Alarm
-
Summary: | ops center system upgrade for {{ $labels.namespace }} is in progress
-
-
Expression:
system_ops_upgrade_running > 0
-
Labels:
-
Severity: minor
-
-
-
Alert: ops-latest-sync-failed
-
Annotations:
-
Type: Communications Alarm
-
Summary: | ops center latest system sync for {{ $labels.namespace }} failed
-
-
Expression:
system_synch_error > 0
-
Labels:
-
Severity: major
-
-
kube-prometheus-node-alerting.rules
Rules:
-
Alert: node-disk-running-full-24hours
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 24 hours.
-
-
Expression:
|
(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[6h], 3600 * 24) < 0)
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: node-disk-running-full-2hours
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Device {{ $labels.device }} of node-exporter {{ $labels.namespace }}/{{ $labels.pod }} will be full within the next 2 hours.
-
-
Expression:
|
(node:node_filesystem_usage: > 0.85) and (predict_linear(node:node_filesystem_avail:[30m], 3600 * 2) < 0)
-
Labels:
-
Severity: critical
-
-
node-time
Rules:
-
Alert: clock-skew-detected
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Clock skew detected on hostname {{ $labels.hostname }} . Ensure NTP is configured correctly on this host.
-
-
Expression:
|
abs(node_timex_offset_seconds) > 0.03
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: clock-is-not-in-synch
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Clock not in synch detected on hostname {{ $labels.hostname }} . Ensure NTP is configured correctly on this host.
-
-
Expression:
|
min_over_time(node_timex_sync_status[5m]) == 0
and
node_timex_maxerror_seconds >= 16
-
For: 10m
-
Labels:
-
Severity: major
-
-
node-network
Rules:
-
Alert: network-receive-errors
-
Annotations:
-
Type: Communications Alarm
-
Summary: Network interface "{{ $labels.device }}" showing receive errors on hostname {{ $labels.hostname }}"
-
-
Expression:
|
rate(node_network_receive_errs_total{device!~"veth.+"}[2m]) > 0
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: network-transmit-errors
-
Annotations:
-
Type: Communications Alarm
-
Summary: Network interface "{{ $labels.device }}" showing transmit errors on hostname {{ $labels.hostname }}"
-
-
Expression:
|
rate(node_network_transmit_errs_total{device!~"veth.+"}[2m]) > 0
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: network-interface-flapping
-
Annotations:
-
Type: Communications Alarm
-
Summary: Network interface "{{ $labels.device }}" changing it's up status often on hostname {{ $labels.hostname }}"
-
-
Expression:
|
changes(node_network_up{device!~"veth.+"}[2m]) > 2
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: kvm-tunnels flapping
-
Annotations:
-
Type: Communications Alarm
-
Summary: Pod {{ $labels.namespace }}/{{ $labels.pod }} tunnel to ({{ $labels.ip}}:{{$labels.port}}) is flapping {{ printf "%.2f" $value }} times / 5 minutes.
-
-
Expression:
|
changes(kvm_metrics_tunnels_up[5m]) > 2
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: kvm-node-not-ready
-
Annotations:
-
Type: Communications Alarm
-
Summary: KVM node {{ $labels.hostname }}({{ $labels.ip}}) is not reachable.
-
-
Expression:
|
changes(kvm_metrics_tunnels_up[2m]) > 0
-
For: 0m
-
Labels:
-
Severity: major
-
-
fluentbit.rules
Rules:
-
Alert: fluent-proxy-output-retries-failed
-
Annotations:
-
Type: Communications Alarm
-
Summary: 'Fluent-proxy {{ $labels.namespace }}/{{ $labels.pod }} output retries failed for target: {{ $labels.name }}'
-
-
Expression:
|
rate(fluentbit_output_retries_failed_total{pod=~"fluent-proxy.*"}[5m]) > 0
-
For: 5m
-
Labels:
-
Severity: major
-
-
prometheus.rules
Rules:
-
Alert: prometheus-config-reload-failed
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Reloading Prometheus' configuration failed
Reloading Prometheus' configuration has failed for {{ $labels.namespace }}/{{ $labels.pod }}
-
-
Expression:
|
prometheus_config_last_reload_successful == 0
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: prometheus-notification-q-running-full
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Prometheus' alert notification queue is running full
Prometheus' alert notification queue is running full for {{ $labels.namespace }}/{{ $labels.pod }}
-
-
Expression:
|
predict_linear(prometheus_notifications_queue_length[5m], 60 * 30) > prometheus_notifications_queue_capacity
-
For: 10m
-
Labels:
-
Severity: major
-
-
-
Alert: prometheus-error-sending-alerts
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Errors while sending alert from Prometheus
Errors while sending alerts from Prometheus {{ $labels.namespace }}/{{ $labels.pod }} to Alertmanager {{$labels.Alertmanager}}
-
-
Expression:
|
rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m]) > 0.01
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: prometheus-error-sending-alerts
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Errors while sending alerts from Prometheus
Errors while sending alerts from Prometheus {{ $labels.namespace }}/{{ $labels.pod }} to Alertmanager {{$labels.Alertmanager}}
-
-
Expression:
|
rate(prometheus_notifications_errors_total[5m]) / rate(prometheus_notifications_sent_total[5m]) > 0.03
-
Labels:
-
Severity: critical
-
-
-
Alert: prometheus-not-connected-to-alertmanagers
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Prometheus is not connected to any Alertmanagers
Prometheus {{ $labels.namespace }}/{{ $labels.pod }} is not connected to any Alertmanagers
-
-
Expression:
|
prometheus_notifications_alertmanagers_discovered < 1
-
For: 2m
-
Labels:
-
Severity: major
-
-
-
Alert: prometheus-tsdb-reloads-failing
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Prometheus has issues reloading data blocks from disk
'{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} reload failures over the last four hours.'
-
-
Expression:
|
increase(prometheus_tsdb_reloads_failures_total[2h]) > 0
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: prometheus-tsdb-compactions-failing
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Prometheus has issues compacting sample blocks
'{{$labels.job}} at {{$labels.instance}} had {{$value | humanize}} compaction failures over the last four hours.'
-
-
Expression:
|
increase(prometheus_tsdb_compactions_failed_total[2h]) > 0
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: prometheus-tsdb-wal-corruptions
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Prometheus write-ahead log is corrupted
'{{$labels.job}} at {{$labels.instance}} has a corrupted write-ahead log (WAL).'
-
-
Expression:
|
prometheus_tsdb_wal_corruptions_total > 0
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: prometheus-not-ingesting-samples
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Prometheus isn't ingesting samples
Prometheus {{ $labels.namespace }}/{{ $labels.pod }} isn't ingesting samples.
-
-
Expression:
|
rate(prometheus_tsdb_head_samples_appended_total[5m]) <= 0
-
For: 5m
-
Labels:
-
Severity: major
-
-
-
Alert: prometheus-target-scrapes-duplicate
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Prometheus has many samples rejected
'{{ $labels.namespace }}/{{ $labels.pod }} has many samples rejected due to duplicate timestamps but different values'
-
-
Expression:
|
increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
-
For: 10m
-
Labels:
-
Severity: warning
-
-
-
Alert: prometheus-remote-write-behind
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: Prometheus remote write is behind
'Prometheus {{ $labels.namespace }}/{{ $labels.pod }} remote write is {{ $value | humanize }} seconds behind for target: {{ $labels.url }}.'
-
-
Expression:
|
(
max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
ignoring(remote_name, url) group_right
max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
)
> 120
-
For: 15m
-
Labels:
-
Severity: major
-
-
-
Alert: ssl-earliest-cert-expiry
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: SSL cerificate expires in 30 days
'{{ $labels.namespace }}/{{ $labels.pod }} ssl certificate expires in 30 days'
-
-
Expression:
|
probe_ssl_earliest_cert_expiry - time() < 86400 * 30
-
Labels:
-
Severity: major
-
-
-
Alert: ssl-earliest-cert-expiry
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: SSL cerificate expires in 7 days
'{{ $labels.namespace }}/{{ $labels.pod }} ssl certificate expires in 7 days'
-
-
Expression:
|
probe_ssl_earliest_cert_expiry - time() < 86400 * 7
-
Labels:
-
Severity: critical
-
-
-
Alert: helm-deploy-failure
-
Annotations:
-
Type: Processing Error Alarm
-
Summary: 'Helm chart failed to deploy for 5 minutes'
'Helm chart {{$labels.chart}}/{{ $labels.namespace }} deployment failed'
-
-
Expression:
|
helm_chart_deploy_success < 1
-
Labels:
-
Severity: critical
-
-
-
For: 5m
server
Rules:
-
Alert: server-alert
-
Annotations:
-
Type: Equipment Alarm
-
dn: "{{ $labels.cluster }}/{{ $labels.server }}/{{ $labels.fault_id }}/{{ $labels.id }}"
-
Summary: "{{ $labels.description }}"
-
Expression:
|
sum(server_alert) by ( id, description, fault_id, server, cluster, severity, affectedDn) == 1
-
For: 1m
-
Alert: server-not-reachable-alert
-
Annotations:
-
Type: Equipment Alarm
-
Summary: "{{ $labels.description }}"
-
-
Expression:
|
sum(cimc_server_not_reachable_alert) by (server, cluster, description) == 1
-
For: 1m
-
Labels:
-
Severity: critical
-
k8s.rules
Rules:
-
Expression:
sum(rate(container_cpu_usage_seconds_total{ image!="", container!="POD"}[5m])) by (namespace)
-
Record: namespace:container_cpu_usage_seconds_total:sum_rate
-
-
Expression:
sum by (namespace, pod, container) (
rate(container_cpu_usage_seconds_total{ image!="", container!="POD"}[5m])
)
-
Record: namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
-
-
Expression:
sum(container_memory_usage_bytes{image!="", container!="POD"} - container_memory_cache{image!="", container!="POD"}) by (namespace)
-
Record: namespace:container_memory_usage_bytes:sum
-
-
Expression:
sum(
label_replace(
label_replace(
kube_pod_owner{ owner_kind="ReplicaSet"},
"replicaset", "$1", "owner_name", "(.*)"
) * on(replicaset, namespace) group_left(owner_name) kube_replicaset_owner,
"workload", "$1", "owner_name", "(.*)"
)
) by (namespace, workload, pod)
-
Labels:
-
workload_type: deployment
-
-
Record: mixin_pod_workload
-
-
Expression:
sum(
label_replace(
kube_pod_owner{ owner_kind="DaemonSet"},
"workload", "$1", "owner_name", "(.*)"
)
) by (namespace, workload, pod)
-
Labels:
-
workload_type: daemonset
-
-
Record: mixin_pod_workload
-
-
Expression:
sum(
label_replace(
kube_pod_owner{ owner_kind="StatefulSet"},
"workload", "$1", "owner_name", "(.*)"
)
) by (namespace, workload, pod)
-
Labels:
-
workload_type: statefulset
-
-
Record: mixin_pod_workload
-
node.rules
Rules:
-
Expression:
max(label_replace(kube_pod_info, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
-
Record: 'node_namespace_pod:kube_pod_info:'
-
-
Expression:
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))
-
Record: :node_cpu_utilisation:avg1m
-
-
Expression:
1 -
sum(node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes)
/
sum(node_memory_MemTotal_bytes)
-
Record: ':node_memory_utilisation:'
-
-
Expression:
sum(node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes)
-
Record: :node_memory_MemFreeCachedBuffers_bytes:sum
-
-
Expression:
sum(node_memory_MemTotal_bytes)
-
Record: :node_memory_MemTotal_bytes:sum
-
-
Expression:
avg(irate(node_disk_io_time_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m]))
-
Record: :node_disk_utilisation:avg_irate
-
-
Expression:
avg(irate(node_disk_io_time_weighted_seconds_total{device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m]))
-
Record: :node_disk_saturation:avg_irate
-
-
Expression:
max by (instance, namespace, pod, device) ((node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"}
- node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
/ node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
-
Record: 'node:node_filesystem_usage:'
-
-
Expression:
max by (instance, namespace, pod, device) (node_filesystem_avail_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"} / node_filesystem_size_bytes{fstype=~"ext[234]|btrfs|xfs|zfs"})
-
Record: 'node:node_filesystem_avail:'
-
-
Expression:
sum(irate(node_network_receive_bytes_total{device!~"veth.+"}[1m])) +
sum(irate(node_network_transmit_bytes_total{device!~"veth.+"}[1m]))
-
Record: :node_net_utilisation:sum_irate
-
-
Expression:
sum(irate(node_network_receive_drop_total{device!~"veth.+"}[1m])) +
sum(irate(node_network_transmit_drop_total{device!~"veth.+"}[1m]))
-
Record: :node_net_saturation:sum_irate
-
-
Expression:
max(
max(
kube_pod_info{host_ip!=""}
) by (node, host_ip)
* on (host_ip) group_right (node)
label_replace(
(max(node_filesystem_files{ mountpoint="/"}) by (instance)), "host_ip", "$1", "instance", "(.*):.*"
)
) by (node)
-
Record: ':node:node_inodes_total:'
-
-
Expression:
max(
max(
kube_pod_info{ host_ip!=""}
) by (node, host_ip)
* on (host_ip) group_right (node)
label_replace(
(max(node_filesystem_files_free{ mountpoint="/"}) by (instance)), "host_ip", "$1", "instance", "(.*):.*"
)
) by (node)
-
Record: ':node:node_inodes_free:'
-
-
Expression:
-sum by (node) (
(node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes)
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
-
Record: node:node_memory_bytes_available:sum
-
-
Expression:
-sum by (node) (
node_memory_MemTotal_bytes
* on (namespace, pod) group_left(node)
node_namespace_pod:kube_pod_info:
)
-
Record: node:node_memory_bytes_total:sum
-
-
Expression:
max without(endpoint, instance, job, pod, service) (kube_node_labels and on(node) kube_node_role{role="control-plane"})
-
Labels:
label_node_role_kubernetes_io: control-plane
Record: cluster:master_nodes
kube-prometheus-node-recording.rules
Rules:
-
Expression:
sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY
(instance)
-
Record: instance:node_cpu:rate:sum
-
-
Expression:
sum((node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}))
BY (instance)
-
Record: instance:node_filesystem_usage:sum
-
-
Expression:
sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
-
Record: instance:node_network_receive_bytes:rate:sum
-
-
Expression:
sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
-
Record: instance:node_network_transmit_bytes:rate:sum
-
-
Expression:
sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m])) WITHOUT
(cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total)
BY (instance, cpu)) BY (instance)
-
Record: instance:node_cpu:ratio
-
-
Expression:
sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m]))
-
Record: cluster:node_cpu:sum_rate5m
-
-
Expression:
cluster:node_cpu:sum_rate5m / ON(cluster) GROUP_LEFT() count(sum(node_cpu_seconds_total)
BY (cluster, cpu)) BY (cluster)
-
Record: cluster:node_cpu:ratio
-
kubernetes.rules
Rules:
-
Expression:
sum(container_memory_usage_bytes{container!="POD",container!="",pod!=""} - container_memory_cache{container!="POD",container!="",pod!=""})
BY (pod, namespace)
-
Record: pod:container_memory_usage_bytes:sum
-
-
Expression:
sum(container_spec_cpu_shares{container!="POD",container!="",pod!=""})
BY (pod, namespace)
-
Record: pod:container_spec_cpu_shares:sum
-
-
Expression:
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!="",pod!=""}[5m]))
BY (pod, namespace)
-
Record: pod:container_cpu_usage:sum
-
-
Expression:
sum(container_fs_usage_bytes{container!="POD",container!="",pod!=""})
BY (pod, namespace)
-
Record: pod:container_fs_usage_bytes:sum
-
-
Expression:
sum(container_memory_usage_bytes{container!=""} - container_memory_cache{container!=""}) BY (namespace)
-
Record: namespace:container_memory_usage_bytes:sum
-
-
Expression:
sum(container_spec_cpu_shares{container!=""}) BY (namespace)
-
Record: namespace:container_spec_cpu_shares:sum
-
-
Expression:
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!=""}[5m]))
BY (namespace)
-
Record: namespace:container_cpu_usage:sum
-
-
Expression:
sum(container_memory_usage_bytes{container!="POD",container!="",pod!=""} - container_memory_cache{container!="POD",container!="",pod!=""})
BY (cluster) / sum(machine_memory_bytes) BY (cluster)
-
Record: cluster:memory_usage:ratio
-
-
Expression:
sum(container_spec_cpu_shares{container!="POD",container!="",pod!=""})
/ 1000 / sum(machine_cpu_cores)
-
Record: cluster:container_spec_cpu_shares:ratio
-
-
Expression:
sum(rate(container_cpu_usage_seconds_total{container!="POD",container!="",pod!=""}[5m]))
/ sum(machine_cpu_cores)
-
Record: cluster:container_cpu_usage:ratio
-
-
Expression:
kube_node_labels and on(node) kube_node_spec_taint{key="node-role.kubernetes.io/master"}
-
Labels:
-
label_node_role_kubernetes_io: master
-
-
Record: cluster:master_nodes
-
-
Expression:
sum((cluster:master_nodes * on(node) group_left kube_node_status_capacity_cpu_cores)
or on(node) (kube_node_labels * on(node) group_left kube_node_status_capacity_cpu_cores))
BY (label_beta_kubernetes_io_instance_type, label_node_role_kubernetes_io)
-
Record: cluster:capacity_cpu_cores:sum
-
-
Expression:
sum((cluster:master_nodes * on(node) group_left kube_node_status_capacity_memory_bytes)
or on(node) (kube_node_labels * on(node) group_left kube_node_status_capacity_memory_bytes))
BY (label_beta_kubernetes_io_instance_type, label_node_role_kubernetes_io)
-
Record: cluster:capacity_memory_bytes:sum
-
-
Expression:
sum(node:node_memory_bytes_total:sum - node:node_memory_bytes_available:sum)
-
Record: cluster:memory_usage_bytes:sum
-
-
Expression:
sum(cluster:master_nodes or on(node) kube_node_labels ) BY (label_beta_kubernetes_io_instance_type,
label_node_role_kubernetes_io)
-
Record: cluster:node_instance_type_count:sum
-
-
Expression:
sum(etcd_object_counts) BY (instance)
-
Record: instance:etcd_object_counts:sum
-
user-password-expiry
Rules:
-
Alert: user_password_expiring
-
Annotations:
-
Type: Cluster Node User Password Expiring Alarm
-
Summary: "{{ $labels.user_name }} password on host: {{ $labels.node_name }} is expiring in {{ $labels.days_to_expire }} days."
-
-
Expression:
|
User_password_expiration == 1
-
Labels:
-
Severity: critical
-
-
-
Alert: user_password_expired
-
Annotations:
-
Type: Cluster Node User Password Expired Alarm
-
Summary: "{{ $labels.user_name }} password on host: {{ $labels.node_name }} is expired {{ $labels.days_to_expire }} days ago."
-
-
Expression:
|
User_password_expiration == 2
-
Labels:
-
Severity: critical
-
-
VM State Alert
Rules:
-
Alert: vm-deployed
-
Annotations:
-
Type: Equipment Alarm
-
Summary: "{{ $labels.vm_name }} is deployed."
-
-
Expression:
|
upf_state == 2
-
For: 5s
-
Labels:
-
Severity: minor
-
State: DEPLOYED
-
-
-
Alert: vm-alive
-
Annotations:
-
Type: Equipment Alarm
-
Summary: "{{ $labels.vm_name }} is alive."
-
-
Expression:
|
upf_state == 1 and changes(upf_state[2m]) > 0
-
For: 5s
-
Labels:
-
Severity: minor
-
State: ALIVE
-
-
-
Alert: vm-error
-
Annotations:
-
Type: Equipment Alarm
-
Summary: "{{ $labels.vm_name }} is down."
-
-
Expression:
|
upf_state == 0
-
For: 5s
-
Labels:
-
Severity: major
-
State: ERROR
-
-
-
Alert: vm-recovering
-
Annotations:
-
Type: Equipment Alarm
-
Summary: "{{ $labels.vm_name }} is recovering."
-
-
Expression:
|
upf_state == 3
-
For: 5s
-
Labels:
-
Severity: warning
-
State: RECOVERING
-
-
-
Alert: vm-recovery-failed
-
Annotations:
-
Type: Equipment Alarm
-
Summary: "{{ $labels.vm_name }} failed to recover."
-
-
Expression:
|
upf_state == 4
-
For: 5s
-
Labels:
-
Severity: critical
-
State: RECOVERY_FAILED
-
-
confd-user-status
Rules:
-
Alert: confd_user_password_expiring
-
Annotations:
-
Type: Confd User Status Alarm
-
Summary: "Confd user {{ $labels.namespace }}/{{ $labels.confdUser }} password is expiring in less than 60 days."
-
-
Expression:
|
confd_user_password_days_to_expiry changelesser 60 and confd_user_password_days_to_expiry >= 0
-
Labels:
-
Severity: major
-
-
-
Alert: confd_user_password_expired
-
Annotations:
-
Type: Confd User Status Alarm
-
Summary: "Confd user {{ $labels.namespace }}/{{ $labels.confdUser }} password is expired."
-
-
Expression:
|
confd_user_password_days_to_expiry < 0
-
Labels:
-
Severity: critical
-
-