Alerting rules define how alerts must be triggered based on conditional expressions on any available metric. For example,
it is possible to trigger an alert when any performance metric such as CPU usage, network throughput, or disk usage reaches
a certain threshold.
HA CVIM-MON is deployed with a set of default built-in alerting rules that cover the most important error conditions that
can occur in the pod.
You can customize alerting rules by using the following steps:
- Create a custom alerting rules configuration file to add new rules, or to modify or delete built-in rules.
- Verify that the custom alerting rules file is valid using the verification tool.
- Update the alerting rules by applying the custom alerting rules file.
Update Alerting Rules
The alerting rules update operation always merges the following two files: the default built-in alerting rules file and the custom alerting rules file being applied.
Applying a second custom alerting rules file does not preserve the alerting rules from a previously applied custom alerting rules file. The update operation never includes previously applied custom alerting rules.
To update alerting rules, run the k8s_runner.py command with the --alerting_rules_config option and the path to the custom_alerting_rules.yml file.
For example:
# ./bootstrap/k8s-infra/k8s_runner.py --alerting_rules_config /root/custom_alerting_rules.yaml
The merge tool output file consists of the unchanged built-in rules, the modified built-in rules, and the newly added custom rules, minus any deleted rules.
Format of Custom Alerting Rules File
The format of the custom_alerting_rules.yml file is identical to the format of the Prometheus alerting rules file, with a few additional semantic extensions that support the deletion and modification of existing built-in rules.
The groups entry contains a list of groups identified by group_name, where each group can include one or more rules. The labels
are used for determining the severity and other SNMP trap attributes.
The limitations when setting labels are as follows:
- You must set the values of severity, snmp_fault_code, and snmp_fault_severity to the values specified in the example below.
- You must set the value of snmp_fault_source to indicate the metric used in the alert expression.
- You must not change the value of snmp_node.
- You must set the value of snmp_podid to the pod name specified in setup_data.yaml.
groups:
- name: {group_name}
rules:
- alert: {alert_name}
annotations:
description: {alert_description}
summary: {alert_summary}
expr: {alert_expression}
for: {pending_time}
labels:
severity: {informational/warning/critical}
snmp_fault_code: {other/resourceUsage/resourceThreshold/serviceFailure/hardwareFailure/networkConnectivity}
snmp_fault_severity: {emergency/critical/major/alert/informational}
snmp_fault_source: {fault_source}
snmp_node: '{{ $labels.instance }}'
snmp_podid: {pod_id}
Adding Alert Rules
Any alert rule specified under a group that is not named change-rules or delete-rules is added to the merged output file. Custom rules take priority over preexisting rules: if an alert with the same name exists in both files, only the one from the custom file is retained as a result of the merge.
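The name-based precedence described above can be sketched as follows. This is a minimal illustration of the documented merge behavior, not the actual merge tool; the function name and the rule-dictionary shape are assumptions for the example.

```python
def merge_rules(builtin, custom):
    """Merge two lists of rule dicts, keyed by their 'alert' name.

    Custom rules override built-in rules that share an alert name;
    all remaining rules from both lists are kept.
    """
    merged = {rule["alert"]: rule for rule in builtin}
    for rule in custom:
        merged[rule["alert"]] = rule  # custom wins on a name collision
    return list(merged.values())

builtin = [{"alert": "cpu_idle", "expr": "cpu_usage_idle > 90"},
           {"alert": "reboot", "expr": "system_uptime < 300"}]
custom = [{"alert": "cpu_idle", "expr": "cpu_usage_idle > 80"}]
merged = merge_rules(builtin, custom)
# merged keeps both alerts, with the custom cpu_idle expression retained
```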
Modifying Alert Rules
You can modify any preexisting rule using the following syntax:
groups:
- name: change-rules
rules:
- alert: {alert_name}
expr: {new_alert_expression}
annotations:
summary: {new_alert_summary}
The merge script looks only for a group named change-rules and changes only the expression or summary of the matching alert.
If the alert to be changed does not exist, it is not created and no changes are made.
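The change-rules behavior can be sketched as follows: only the fields supplied in the change entry are overwritten on an existing rule, and entries naming a nonexistent alert are ignored. This is an illustrative approximation of the documented behavior, not the actual merge script.

```python
def apply_changes(rules, change_rules):
    """Apply change-rules entries in place.

    For each change entry, update only the fields it supplies
    (e.g. expr, annotations) on the rule with the same alert name.
    Entries naming a rule that does not exist are skipped: the rule
    is not created and nothing is changed.
    """
    by_name = {rule["alert"]: rule for rule in rules}
    for change in change_rules:
        target = by_name.get(change["alert"])
        if target is None:
            continue  # no matching rule: nothing to change
        for key, value in change.items():
            if key == "alert":
                continue
            if key == "annotations":
                target.setdefault("annotations", {}).update(value)
            else:
                target[key] = value
    return rules
```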
Deleting Alert Rules
You can delete any built-in rule by using the following construct:
custom_alerting_rules.yml
groups:
- name: delete-rules
rules:
- alert: {alert_name/regular_expression}
The merge script looks only for a group named delete-rules and deletes preexisting rules that match the provided names or regular expressions.
If the alert to be deleted does not exist, no changes are made.
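The delete-rules matching can be sketched as follows, treating each entry as a name or regular expression that must match the whole alert name (so that an entry such as mem.* deletes every rule whose name starts with mem). This is an illustrative sketch of the documented behavior, not the actual merge script.

```python
import re

def apply_deletes(rules, delete_patterns):
    """Remove rules whose alert name matches any delete-rules entry.

    Each pattern is matched against the full alert name; patterns
    with no matching rule leave the rule set unchanged.
    """
    compiled = [re.compile(pattern) for pattern in delete_patterns]
    return [rule for rule in rules
            if not any(p.fullmatch(rule["alert"]) for p in compiled)]
```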
The following custom configuration file includes examples of a new alerting rule, a modified alerting rule and a deleted alerting
rule:
groups:
- name: cpu
rules:
- alert: cpu_idle
annotations:
description: CPU idle usage is too high - resources underutilized
summary: CPU idle too high
expr: cpu_usage_idle > 80
for: 5m
labels:
severity: informational
snmp_fault_code: resourceUsage
snmp_fault_severity: informational
snmp_fault_source: cpu_usage_idle
snmp_node: '{{ $labels.instance }}'
snmp_podid: pod7
- alert: cpu_iowait
annotations:
description: CPU iowait usage is too high
summary: CPU iowait too high
expr: cpu_usage_iowait > 10
for: 3m
labels:
severity: warning
snmp_fault_code: resourceUsage
snmp_fault_severity: alert
snmp_fault_source: cpu_usage_iowait
snmp_node: '{{ $labels.instance }}'
snmp_podid: pod7
- name: change-rules
rules:
- alert: disk_used_percent
expr: disk_used_percent > 99
annotations:
summary: Disk used > 99%
- alert: reboot
annotations:
summary: Server rebooted
- alert: system_n_users
expr: system_n_users > 10
- name: delete-rules
rules:
- alert: disk_filling_up_in_4h
- alert: mem.*
Validation Script for Custom Alerting Rules
You must validate any custom alerting rules file before applying an update, using the following CLI command:
/opt/cisco/check_promtool.py -v <custom_alerts_file>
The validation script uses the Prometheus promtool script but skips some of its checks to allow the modification and deletion of rules. It also checks whether the SNMP severities and fault codes are supported.
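The semantic check on SNMP labels can be sketched as follows, using the supported values listed earlier in this section. This is an illustrative approximation of what check_promtool.py verifies, not its actual implementation.

```python
# Supported label values, as listed in the rule template above.
VALID_SEVERITY = {"informational", "warning", "critical"}
VALID_FAULT_CODE = {"other", "resourceUsage", "resourceThreshold",
                    "serviceFailure", "hardwareFailure",
                    "networkConnectivity"}
VALID_FAULT_SEVERITY = {"emergency", "critical", "major",
                        "alert", "informational"}

def check_labels(rule):
    """Return a list of error strings for unsupported label values."""
    labels = rule.get("labels", {})
    errors = []
    if labels.get("severity") not in VALID_SEVERITY:
        errors.append("unsupported severity: %r" % labels.get("severity"))
    if labels.get("snmp_fault_code") not in VALID_FAULT_CODE:
        errors.append("unsupported snmp_fault_code: %r"
                      % labels.get("snmp_fault_code"))
    if labels.get("snmp_fault_severity") not in VALID_FAULT_SEVERITY:
        errors.append("unsupported snmp_fault_severity: %r"
                      % labels.get("snmp_fault_severity"))
    return errors
```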
The following example shows the output of the promtool script in case of a successful validation:
# /opt/cisco/check_promtool.py -v /root/alerting_custom_rules.yaml
check_promtool.py: checking /root/alerting_custom_rules.yaml
check_promtool.py: success:
check_promtool.py: rules to be changed: 2
check_promtool.py: rules to be added: 2
The following example shows the output of the promtool script in case of a failure:
# /opt/cisco/check_promtool.py -v /root/alerting_custom_rules.yaml
check_promtool.py: checking /root/alerting_custom_rules.yaml
check_promtool.py: failure:
check_promtool.py: line 22: field for already set in type rulefmt.Rule
check_promtool.py: line 23: field labels already set in type rulefmt.Rule