OMD Director Alarms and Remediation
This appendix describes the default checks and their thresholds that are run by the sensu-client service on the OMD Monitor clients, which you can view and deactivate using OMD Director. To view these checks and their thresholds from OMD Director, choose Monitor > Alarms and click the Alarm Rules tab. This appendix also describes possible remediation steps that you can perform if an alarm is raised.

When you triage the alerts that are raised, check the following items:
- Any active network configuration change requests that could affect traffic
- The active Media Streamer change requests for the servers in question
- Whether the alert indicates that end-customer services are impacted, for example, the alert is on a cache node or the OMD Insights server is down
- Run the dmesg | grep -i error command and check for any errors (see the sketch after this list)
- Check the /var/log/messages files for any errors (grep -i error)
- Check OMD Insights to determine whether the affected node provided content during the error window
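
A minimal triage sketch of the log checks in the list above, assuming shell access to the affected node. The uptime and last reboot steps are an added suggestion for confirming a reboot during the error window, not commands mandated by this guide.

```bash
# Check the kernel ring buffer for hardware or driver errors.
dmesg | grep -i error

# Check the system log for errors.
grep -i error /var/log/messages

# Confirm whether the node rebooted during the error window
# (assumed helper commands, not part of the documented checklist).
uptime
last reboot | head -5
```
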
All notifications are generated based on the thresholds that are configured for the alarms. An alarm can have multiple thresholds configured for it so that notifications are sent at different severity levels. To view the existing thresholds and to create new thresholds, click the Alarm Rules tab on the Monitor > Alarms page. The first table below, "Checks and Thresholds Available in OMD Director", describes the default checks and thresholds that are available in OMD Director. The second table describes possible remediation steps that you can perform based on the alarms or alerts that are raised.

| Check Name | Entity Family (entity_family in InfluxDB) | Check Value | Alarm Condition / Check Result (0 = No Alert/Normal, 1 = Warning Alert, 2 = Critical Alert) | Severity/Priority |
|---|---|---|---|---|
| bond_interface | cache_nodes | Status of the bond interface | Warning: Alert raised if the bond interface is not configured. Critical: Alert raised if the bond interface is configured and down. | P4 |
| cpu_usage | common | CPU usage as a percentage (0 - 100%) | Warning: Alert raised if the configured warning threshold is crossed. Critical: Alert raised if the configured critical threshold is crossed. | Warning: N/A. Critical: P4 |
| disk_drives_count | cache_nodes | Count of current disk partitions | Critical: Alert raised if the number of disk drive partitions currently available is less than the original number of partitions. | P3 |
| disk_usage | common | Total disk usage as a percentage | Warning: Alert raised if the configured warning threshold is crossed. Critical: Alert raised if the configured critical threshold is crossed. | Warning: N/A. Critical: P3/P4 |
| dns_resolved_status | common | Status of DNS resolution | Critical: Alert raised if DNS is not resolvable (check result returns 0). | P4 |
| hardware | common | Hardware error message status from the dmesg command output | Critical: Hardware error reported in dmesg. | P3 |
| influx_connectivity | common | Checks connectivity to the OMD Monitor Influx nodes | Critical: Alert raised if none of the OMD Monitor Influx nodes are reachable. | P3 |
| influx_metrics_not_found | common | Availability of metrics from the OMD Monitor Influx nodes | Critical: Metrics not found from the OMD Monitor Influx nodes at this time. | — |
| insights_failover_check | splunk | Failover status of the OMD Insights nodes | Warning: Failover occurred; the standby node is currently the master. | P4 |
| interface | cache_nodes | Status of the slave interfaces of the bond interface | Warning: Alert raised if an interface that is part of the bond interface is not configured. Critical: Alert raised if an interface that is part of the bond interface is down. | P3 |
| kafka_broker_status_all | monitor-node | The number of servers that have a problem with the Kafka broker server. 0 means that all servers are fine or that no servers are configured for Kafka export. | Critical: Alert raised if any or all Kafka brokers are not able to export metrics. | P3/P4 |
| keep_alive | common | The status of the OMD Monitor node connection to the server node being monitored. To determine this, keepalives are sent by the sensu-client service on the server node to an OMD Monitor node every 30 seconds. | Warning: Alert raised if no keepalives have been received for 90 to 180 seconds. Critical: Alert raised if no keepalives have been received for more than 180 seconds. | P3/P2. The priority depends on the number of servers raising the keepalive alarm: a single server is a P3; a block of affected servers is a P2 and you should escalate immediately. |
| load_avg_1min | common | CPU load average, with regard to both CPU and I/O, over the last minute. This value is a decimal number, not a percentage, and is taken from the /proc/loadavg system file. | Warning: Alert raised if load_avg_1min >= 10. Critical: Alert raised if load_avg_1min >= 25. | Warning: N/A. Critical: P4 |
| load_avg_5min | common | CPU load average, with regard to both CPU and I/O, over the last 5 minutes. This value is a decimal number, not a percentage, and is taken from the /proc/loadavg system file. | Warning: Alert raised if load_avg_5min >= 20. Critical: Alert raised if load_avg_5min >= 50. | Warning: N/A. Critical: P4 |
| load_avg_15min | common | CPU load average, with regard to both CPU and I/O, over the last 15 minutes. This value is a decimal number, not a percentage, and is taken from the /proc/loadavg system file. | Warning: Alert raised if load_avg_15min >= 30. Critical: Alert raised if load_avg_15min >= 75. | Warning: N/A. Critical: P4 |
| log_failed_notallowed | common | The number of failed SSH login attempts due to access restrictions | Warning: Alert raised on 1 failed attempt. Critical: Alert raised on 2 or more failed attempts. | P4 |
| log_failed_password | common | The number of failed login attempts due to an incorrect passphrase | Warning: Alert raised on 1 failed attempt. Critical: Alert raised on 2 or more failed attempts. | P4 |
| netstat | ESTABLISHED | The number of connections in the ESTABLISHED state. An alarm is raised if the number of connections exceeds the configured threshold. A server has only a finite number of TCP connections available, so this alarm indicates when the server is nearing that limit. Connections are counted by looking for ESTABLISHED and TIME_WAIT entries in the netstat command output. | Warning: Alert raised if ESTABLISHED connections reach the configured warning threshold. Critical: Alert raised if ESTABLISHED connections reach the configured critical threshold. | Warning: N/A. Critical: P4 |
| netstat | TIME_WAIT | The number of connections in the TIME_WAIT state. An alarm is raised if the number of connections exceeds the configured threshold. A server has only a finite number of TCP connections available, so this alarm indicates when the server is nearing that limit. Connections are counted by looking for ESTABLISHED and TIME_WAIT entries in the netstat command output. | Warning: Alert raised if TIME_WAIT connections reach the configured warning threshold. Critical: Alert raised if TIME_WAIT connections reach the configured critical threshold. | Warning: N/A. Critical: P4 |
| ntp_sync_status | common | The status of the sync with the NTP server. 0 indicates the server is not in sync; 1 indicates the server is in sync. | Critical: Alert raised if ntp_offset >= 500 ms. | Critical: P4 |
| partition_usage | Name of each partition: /var, /sys/fs/cgroup, /run, /opt, /dev, /dev/shm, /boot, / | Disk usage for each partition, as a percentage. This check returns 1 (warning) if the configured warning or critical threshold values are more than 100. | Warning: Alert raised if any partition usage >= 85%. Critical: Alert raised if any partition usage >= 95%. | Warning: N/A. Critical: P4 |
| ram_usage | common | RAM usage as a percentage | Warning: Alert raised if RAM usage >= 80%. Critical: Alert raised if RAM usage >= 95%. | P4 |
| redis_status | monitor-node | The number of connected OMD Monitor nodes. This number should equal the number of OMD Monitor nodes available at install; if it does not, a warning is raised. | Warning: Alert raised if the number of connected OMD Monitor nodes is less than the number of OMD Monitor nodes available at initial installation. | N/A |
| service_status | grafana-server | The running status of the grafana-server service on the OMD Monitor node | Critical: Alert raised when the grafana-server service does not have the status of "running". | P4 |
| service_status | haproxy | The running status of the haproxy service on the OMD Monitor node | Critical: Alert raised when the haproxy service does not have the status of "running". | P4 |
| service_status | influxdb | The running status of the influxdb service on the OMD Monitor node | Critical: Alert raised when the influxdb service does not have the status of "running". | P4 |
| service_status | influxdb-relay | The running status of the influxdb-relay service on the OMD Monitor node | Critical: Alert raised when the influxdb-relay service does not have the status of "running". | P4 |
| service_status | rabbitmq-metrics-exporter | The running status of the rabbitmq-metrics-exporter service on the OMD Monitor node | Critical: Alert raised when the rabbitmq-metrics-exporter service is not in a running state. | P2/P3 |
| service_status | rabbitmq-server | The running status of the rabbitmq-server service on the OMD Monitor node | Critical: Alert raised when the rabbitmq-server service does not have the status of "running". | P4/P3 |
| service_status | redis | The running status of the redis service on the OMD Monitor node | Critical: Alert raised when the redis service does not have the status of "running". | P4/P3 |
| service_status | redis-sentinel | The running status of the redis-sentinel service on the OMD Monitor node | Critical: Alert raised when the redis-sentinel service does not have the status of "running". | P4/P3 |
| service_status | sensu-server | The running status of the sensu-server service on the OMD Monitor node | Critical: Alert raised when the sensu-server service does not have the status of "running". | P4/P3 |
| service_status | splunk | The running status of the splunk service on the OMD Monitor node | Critical: Alert raised when the splunk service does not have the status of "running". | P4/P3 |
| service_status | tcp-rabbitmq-exchange | The running status of the tcp-rabbitmq-exchange service on the OMD Monitor node | Critical: Alert raised when the tcp-rabbitmq-exchange service is not in a running state. | P2/P3 |
| sshd_running_status | common | The running status of the sshd service | Warning: Alert raised if the sshd service is not running. | P4 |
| swap_usage | common | Swap usage as a percentage | Warning: Alert raised when swap usage >= 95%. Critical: Alert raised when swap usage >= 98%. | P2 |
| tcp_connection | common, cache_nodes, traffic_router | The number of TCP connections, including TCP6 | Warning: Alert raised if TCP connections reach the configured warning threshold. Critical: Alert raised if TCP connections reach the configured critical threshold. | Warning: N/A. Critical: P4 |
| tripwire_violations | common | The number of Tripwire report violations. Local files are checked for changes. | Critical: Alert raised if any Tripwire violations are found (check result is 3). | P4 |
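
The check results in the preceding table follow the usual sensu-client convention of exit codes: 0 for OK, 1 for a warning alert, and 2 for a critical alert. As an illustration only (this is not one of the shipped OMD checks), a threshold-based check could be sketched as follows; the 85 and 95 percent values mirror the partition_usage defaults above, and everything else is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical threshold-check sketch: exits 0 (OK), 1 (warning), or 2 (critical),
# matching the check-result values used in the table above.
WARN=85
CRIT=95

# Highest usage percentage across mounted partitions, per df.
usage=$(df -P | awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > max) max = $5 + 0 } END { print max + 0 }')

if [ "$usage" -ge "$CRIT" ]; then
  echo "CRITICAL: partition usage ${usage}%"
  exit 2
elif [ "$usage" -ge "$WARN" ]; then
  echo "WARNING: partition usage ${usage}%"
  exit 1
fi
echo "OK: partition usage ${usage}%"
exit 0
```
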

| Check | Remediation |
|---|---|
| bond_interface | If this alarm has been raised, the alarm should already have cleared, so a keepalive alarm should have been raised. Also look at the IPMI Interface check in Traffic Ops to see whether the server has power. |
| cpu_usage | Escalate to Level 3 for investigation. |
| disk_drives_count | Log in to the cache node and use the ls /dev/sd* command to check the disk drive partition list. The list differs based on the Mid or Edge cache server profile. sda, sdb, sdc, and so on are the disk drives, and sda1, sda2, and so on are partitions within sda. Check whether any disk is missing based on the profile. |
| disk_usage | Level 2/Level 3: For critical alerts, log in and check how the different partitions are being used. Use the df command to determine which partitions have high disk usage. |
| dns_resolved_status | Level 1/Level 2: Make sure the servers have access to DNS servers and that those DNS servers are correct. |
| hardware | Indicates a possible hardware error. Details can be found in the /var/log/messages file. The check looks for "Hardware Error" in the dmesg command output. |
| influx_connectivity | Check whether a network event is causing the loss of connectivity. If no possible cause can be found, escalate to Level 2 or Level 3. |
| insights_failover_check | Check the connectivity between the primary and backup OMD Insights nodes; network connectivity issues between them may have resulted in this alarm. Check for any keepalive alerts from OMD Monitor for the OMD Insights primary nodes. |
| interface | Level 2: Verify that the issue is not external to the CDN cache nodes, and then escalate as needed. |
| kafka_broker_status_all | Check the accessibility of the Kafka servers from the OMD Monitor nodes. |
| keep_alive | Network connectivity between an OMD Monitor node and the server being monitored might have been lost, and additional insight into the network is required. If you can access the server using SSH, run commands to determine whether the system rebooted or whether there was a network issue. If those commands do not reveal an issue, perform an additional review of the relevant log files. |
| load_avg_1min | Escalate to Level 3 for investigation. |
| load_avg_5min | Escalate to Level 3 for investigation. |
| load_avg_15min | Escalate to Level 3 for investigation. |
| log_failed_notallowed | Level 3: SSH to the server as root and check the /var/log/secure file for a string similar to the following: "User root from 11.22.33.44 not allowed because not listed in AllowUsers". |
| log_failed_password | SSH to the server as root and check the /var/log/secure file for the string "Failed password". |
| netstat for ESTABLISHED | SSH to the server and check the output from the netstat command to learn more about the connections and which ports are being used. Look for TCP connections in the ESTABLISHED state, which can provide information about the related service. |
| netstat for TIME_WAIT | SSH to the server and check the output from the netstat command to learn more about the connections and which ports are being used. Look for TCP connections in the TIME_WAIT state, which can provide information about the related service. |
| ntp_sync_status | Level 2/Level 3: Check connectivity to the NTP server; other network resources may also not be responding. Use the ntpq -p command to check the NTP source stratum for the correct time. Level 3 should fix the NTP configuration as needed, or perform additional troubleshooting to ensure that NTP is synced. |
| partition_usage | Level 2/Level 3: SSH to the server, or view the Grafana dashboards for the server, to determine which partition has high usage. After connecting to the server using SSH, look for files that are consuming a large amount of disk space. |
| ram_usage | SSH to the server and check the memory usage. Use commands such as top to identify the running processes and determine which ones are consuming memory. Check the Grafana dashboards for any patterns in the RAM usage. |
| redis_status | A warning for this check indicates that an OMD Monitor node could not connect to one or more of the other OMD Monitor nodes. However, if OMD Monitor is installed in an HA configuration, the OMD Monitor nodes should continue to work without issue. Investigate the reason for the loss of connection between OMD Monitor nodes. If there is no issue with connectivity, review the /var/log files on the OMD Monitor node. |
| service_status for grafana-server | Connect to the OMD Monitor node and enter the systemctl status grafana-server command to check the status of the grafana-server service. If the service is not in a running state, enter the sudo systemctl restart grafana-server command to try to bring it up. Check the /var/log/grafana/grafana.log file for any error messages. If the alarm persists, escalate the problem for further investigation. |
| service_status for haproxy | Connect to the OMD Monitor node and enter the systemctl status haproxy command to check the status of the haproxy service. If the service is not in a running state, enter the sudo systemctl restart haproxy command to try to bring it up. Check the /var/log/haproxy.log and /var/log/haproxy-status.log files for any error messages. If the alarm persists, escalate the problem for further investigation. |
| service_status for influxdb | Connect to the OMD Monitor node and enter the systemctl status influxdb command to check the status of the influxdb service. InfluxDB metrics may have stopped on one OMD Monitor node; however, the remaining OMD Monitor nodes continue to store metrics. Enter the sudo systemctl restart influxdb command to try to bring up the influxdb service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation. |
| service_status for influxdb-relay | Connect to the OMD Monitor node and enter the systemctl status influxdb-relay command to check the status of the influxdb-relay service. InfluxDB metrics may have stopped on one OMD Monitor node; however, the remaining OMD Monitor nodes continue to store metrics. Enter the sudo systemctl restart influxdb-relay command to try to bring up the influxdb-relay service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation. |
| service_status for rabbitmq-metrics-exporter | The rabbitmq-metrics-exporter service consumes metrics data from RabbitMQ and exports it to the Monitor nodes, the InfluxDB instances, and the external Kafka servers. Failure of this service results in no metrics data in InfluxDB, but the Monitor mailer alerts and Uchiwa remain functional. Connect to the Monitor node, check the status of the process, inspect the logs for any error messages, and try to bring the service back up. |
| service_status for rabbitmq-server | Connect to the OMD Monitor node and enter the systemctl status rabbitmq-server command to check the status of the rabbitmq-server service. An OMD Monitor node whose rabbitmq-server service is not running does not monitor any clients. Clients previously connected to this node move to another OMD Monitor node, including the Sensu client of the OMD Monitor node whose rabbitmq-server service is not running. Inspect the /var/log/rabbitmq/ logs for any error messages. Enter the sudo systemctl restart rabbitmq-server command to try to bring up the rabbitmq-server service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation. |
| service_status for redis | Connect to the OMD Monitor node and enter the systemctl status redis command to check the status of the redis service. An OMD Monitor node whose redis service is not running does not monitor any clients. Clients previously connected to this node move to another OMD Monitor node, including the Sensu client of the OMD Monitor node whose redis service is not running. Inspect the /var/log/redis/ logs for any error messages. Enter the sudo systemctl restart redis command to try to bring up the redis service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation. |
| service_status for redis-sentinel | Connect to the OMD Monitor node and enter the systemctl status redis-sentinel command to check the status of the redis-sentinel service. Inspect the /var/log/redis/ logs for any error messages. Enter the sudo systemctl restart redis-sentinel command to try to bring up the redis-sentinel service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation. |
| service_status for sensu-server | Connect to the OMD Monitor node and enter the systemctl status sensu-server command to check the status of the sensu-server service. Inspect the /var/log/sensu/sensu-server.log file for any error messages. Enter the sudo systemctl restart sensu-server command to try to bring up the sensu-server service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation. |
| service_status for splunk | The Splunk service is down on the OMD Insights nodes. Log in to the OMD Insights nodes and enter the systemctl status splunk command to check the status of the Splunk service. The status of the service should show "running". |
| service_status for tcp-rabbitmq-exchange | The tcp-rabbitmq-exchange service collects Sensu check output from the TCP handler and sends the metrics data to RabbitMQ on the same Monitor node. Failure of this service results in no metrics data in InfluxDB, but the Monitor mailer alerts and Uchiwa remain functional. Connect to the Monitor node, check the status of the process, inspect the logs for any error messages, and try to bring the service back up. |
| sshd_running_status | Log in to the server through the console and check whether the sshd service is running. If it is not running, start the sshd service and escalate to Level 3. |
| swap_usage | Try to determine which processes are consuming the swap memory on the system raising the alarm. Escalate this issue as soon as possible. |
| tcp_connection | Level 3 should look at the running state of the system to determine whether there are any issues. Use the netstat command to check how the connections are distributed. |
| tripwire_running_status | Level 3 should address all issues with these alarms; Level 2 should take no action unless Level 3 determines there is an issue. To clear an alarm, connect to the server and switch to root. Look at the history for the tripwire command, run the command to set the files, and then run the /usr/sbin/tripwire --check command to ensure there are no violations. |
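
Most of the service_status remediation steps above follow the same pattern: check the service, restart it if needed, inspect its logs, and escalate if the alarm persists. The following is a minimal sketch of that flow for a systemd-managed service; the SERVICE value is a placeholder, and the journalctl step is a generic alternative to the service-specific log files listed in the table, not a command taken from this guide.

```bash
# Generic service_status remediation sketch; run on the node named in the alarm.
# SERVICE is a placeholder: substitute the service that raised the alarm.
SERVICE=grafana-server

# 1. Check the current status of the service.
systemctl status "$SERVICE"

# 2. If it is not running, try to bring it back up.
sudo systemctl restart "$SERVICE"

# 3. Look for recent errors (generic alternative to the per-service log files).
sudo journalctl -u "$SERVICE" --since "30 min ago" | grep -iE "error|fail"

# 4. Confirm the service stays active; if the alarm persists, escalate.
systemctl is-active "$SERVICE"
```
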