OMD Director Alarms and Remediation

This Appendix describes the default checks and their thresholds that the sensu-client service runs on the OMD Monitor clients; you can view and deactivate these checks using OMD Director. To view the checks and their thresholds from OMD Director, choose Monitor > Alarms and click the Alarm Rules tab. This Appendix also describes possible remediation steps that you can perform if an alarm is raised.

When you triage the alerts that are raised, you should check the following items:

  • Any active network configuration change requests that could affect traffic

  • The active Media Streamer change requests for the servers in question

  • Whether the alert indicates that end-customer services are impacted, for example, when the alert is on a Cache node or the OMD Insights server is down

  • Run the dmesg | grep -i error command and check for any errors (see the command sketch after this list)

  • Check the /var/log/messages files for any errors (grep -i error)

  • Check OMD Insights to determine whether the affected node provided content during the error window
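
A minimal sketch of the log-check commands from this list, run on the affected server (paths as listed above):

    # Check the kernel ring buffer for errors
    dmesg | grep -i error
    # Check the system log for errors
    grep -i error /var/log/messages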

All notifications are generated based on the thresholds that are configured for the alarms. An alarm can have multiple thresholds configured for it to send notifications about different severity levels. To view the existing thresholds and to create new thresholds, from the Monitor > Alarms page, click the Alarm Rules tab. Table 1 ("Checks and Thresholds Available in OMD Director") describes the default checks and thresholds that are available in OMD Director. Table 2 describes possible remediation steps that you can perform based on the alarms or alerts that are raised.

Table 1. Checks and Thresholds Available in OMD Director

Check Name

Entity Family

(entity_family in Influxdb)

Check Value

Alarm Condition/Check Result

[0 - No Alert/Normal]

[1 - Warning Alert]

[2 - Critical Alert]

Severity/Priority

bond_interface

cache_nodes

Status of the bond interface:

  • 0: If the bond interface is up and running

  • 1: If the bond interface is not configured

  • 2: If the bond interface is down

Warning: Alert raised if bond interface is not configured

Critical: Alert raised if bond interface is configured and down

P4

cpu_usage

common

CPU usage in percentage. (0 - 100%)

Warning alarm raised if:

  • >= 50% (Default)

  • >= 80% (Monitor Node)

  • >= 55% (Edge)

Critical alarm raised if:

  • >= 80% (Default)

  • >= 95% (Monitor Node)

  • >= 85% (Edge)

Warning: N/A

Critical: P4

disk_drives_count

cache_nodes

Count of current disk partitions

Critical: Alert raised if the number of disk drive partitions currently available is less than the original number of partitions

P3

disk_usage

common

Total disk usage in percentage

Warning alarm raised if:

  • >= 85% (Default)

  • >= 90%

    (Monitor Node)

Critical alarm raised if:

  • >= 95%

Warning: N/A

Critical: P3/P4

dns_resolved_status

common

Status of DNS resolution:

  • 0: DNS is not resolvable

  • 1: DNS is resolvable

Critical: Alert raised if DNS is not resolvable (check result returns 0)

P4

hardware

common

Hardware error message status from the dmesg command output:

  • 0: No errors found

  • 1: Errors found

Critical: Hardware error reported in dmesg.

P3

influx_connectivity

common

Checks connectivity to OMD Monitor Influx nodes.

  • 0: Normal connectivity

  • 2: No connectivity to any OMD Monitor Influx nodes

Critical: Alert raised if none of the OMD Monitor Influx nodes are reachable.

P3

influx_metrics_not_found

common

  • 0: Normal, no errors found

  • 2: Critical

Critical: Metrics not found from OMD Monitor Influx nodes at this time

insights_failover_check

splunk

  • 0: Failover did not happen and the primary OMD Insights Splunk nodes are active.

  • 1: Failover happened for the Splunk service on the backup OMD Insights Splunk nodes

Warning: Failover occurred; the standby node is currently the master.

P4

interface

cache_nodes

Status of the slave interfaces of the bond interface:

  • 0: If the interface is up and running

  • 1: If the interface is not configured

  • 2: If the interface is down

Warning: Alert raised if an interface that is part of the bond interface is not configured

Critical: Alert raised if an interface that is part of the bond interface is down

P3

kafka_broker_status_all

monitor-node

The number of servers that are having a problem with the Kafka broker server. 0 means all servers are fine or there are no servers configured for Kafka export.

Critical: Alert raised if any or all Kafka brokers are not able to export metrics

P3/P4

keep_alive

common

The status of the Monitor node connection to the server node being monitored. To determine this, keepalives are sent by the sensu-client service on the server node to a Monitor node every 30 seconds:

  • 0: Not reachable

  • 1: Reachable

Warning: Alert raised if no keepalives have been received for 90 to 180 seconds

Critical: Alert raised if no keepalives have been received for more than 180 seconds

P3/P2

The priority is determined based on the number of servers that are raising the keepalive alert. If a single server is in this state, it is a P3.

If there is a block of servers that are affected, it is a P2 and you should escalate immediately.

load_avg_1min

common

CPU load average values, with regard to both CPU and I/O, over the last minute. This value is a decimal number and not a percentage.

These values are taken from the /proc/loadavg system file.

Warning: Alert raised if load_avg_1min >= 10

Critical: Alert raised if load_avg_1min >= 25

Warning: N/A

Critical: P4

load_avg_5min

common

CPU load average values, with regard to both CPU and I/O, over the last 5 minutes. This value is a decimal number and not a percentage.

These values are taken from the /proc/loadavg system file.

Warning: Alert raised if load_avg_5min >= 20

Critical: Alert raised if load_avg_5min >= 50

Warning: N/A

Critical: P4

load_avg_15min

common

CPU load average values, with regard to both CPU and I/O, over the last 15 minutes. This value is a decimal number and not a percentage.

These values are taken from the /proc/loadavg system file.

Warning: Alert raised if load_avg_15min >= 30

Critical: Alert raised if load_avg_15min >= 75

Warning: N/A

Critical: P4

log_failed_notallowed

common

The number of SSH failed login attempts due to access restrictions

Warning: Alert raised if 1 failed attempt

Critical: Alert raised if 2 or more failed attempts

P4

log_failed_password

common

The number of SSH failed login attempts due to an incorrect password

Warning: Alert raised if 1 failed attempt

Critical: Alert raised if 2 or more failed attempts

P4

netstat

ESTABLISHED

This check indicates the number of connections in the ESTABLISHED state.

An alarm is raised if the number of connections exceeds the configured threshold. A server only has a finite number of TCP connections available, so this alarm indicates when the server is nearing that threshold.

Connections are determined by looking for ESTABLISHED and TIME_WAIT entries using the netstat command.

Warning: Alert is raised if ESTABLISHED connections are at the following thresholds:

  • >= 500 (Default)

  • >= 40000

    (cache nodes)

  • >= 700 (tcnode)

  • >= 1500

    (monitor node)

Critical: Alert is raised if ESTABLISHED connections are at the following thresholds:

  • >= 1000 (Default)

  • >= 45000

    (cache nodes)

  • >= 1200 (tcnode)

  • >= 2000

    (monitor node)

Warning: N/A

Critical: P4

netstat

TIME_WAIT

This check indicates the number of connections in the TIME_WAIT state.

An alarm is raised if the number of connections exceeds the configured threshold. A server only has a finite number of TCP connections available, so this alarm indicates when the server is nearing that threshold.

Connections are determined by looking for ESTABLISHED and TIME_WAIT entries using the netstat command.

Warning: Alert is raised if TIME_WAIT connections are at the following thresholds:

  • >= 500 (Default)

  • >= 40000

    (cache nodes)

  • >= 700 (tcnode)

  • >= 1500

    (monitor node)

Critical: Alert is raised if TIME_WAIT connections are at the following thresholds:

  • >= 1000 (Default)

  • >= 45000

    (cache nodes)

  • >= 1200 (tcnode)

  • >= 2000

    (monitor node)

Warning: N/A

Critical: P4

ntp_sync_status

common

The status of the sync with the NTP server. 0 indicates the server is not in sync. 1 indicates the server is in sync.

Critical alarm raised if ntp_offset >= 500ms

Critical: P4

partition_usage

Name of each partition:

/var

/sys/fs/cgroup

/run

/opt

/dev

/dev/shm

/boot

/

Disk usage for each partition, as a percentage.

This check will return 1 (warning) if the configured warning or critical threshold values are more than 100.

Warning: Alert raised if any partition usage >= 85%

Critical: Alert raised if any partition usage >= 95%

Warning: N/A

Critical: P4

ram_usage

common

RAM usage as a percentage

Warning: Alert raised if RAM Usage is >= 80%

Critical: Alert raised if RAM Usage is >= 95%

P4

redis_status

monitor-node

The number of connected OMD Monitor nodes. This number should be equal to the number of OMD Monitor nodes available at install. If it is not, a warning is raised.

Warning: Alert raised if the number of connected OMD Monitor nodes is less than the number of OMD Monitor nodes available at initial installation.

N/A

service_status

grafana-server

The running status of the Grafana server on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the grafana-server service does not have the status of “running”.

P4

service_status

haproxy

The running status of the haproxy service on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the haproxy service does not have the status of “running”.

P4

service_status

influxdb

The running status of the influxdb service on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the influxdb service does not have the status of “running”.

P4

service_status

influxdb-relay

The running status of the influxdb-relay service on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the influxdb-relay service does not have the status of “running”.

P4

service_status

rabbitmq-metrics-exporter

Running status of rabbitmq-metrics-exporter on monitor node

  • 0: Running

  • -1: Not Running

Critical when rabbitmq-metrics-exporter service is not in a running state.

P2/P3

service_status

rabbitmq-server

The running status of the rabbitmq-server service on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the rabbitmq-server service does not have the status of “running”.

P4/P3

service_status

redis

The running status of redis service on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the redis service does not have the status of “running”.

P4/P3

service_status

redis-sentinel

The running status of redis-sentinel service on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the redis-sentinel service does not have the status of “running”.

P4/P3

service_status

sensu-server

The running status of sensu-server service on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the sensu-server service does not have the status of “running”.

P4/P3

service_status

splunk

The running status of splunk service on the OMD Monitor node:

  • 0: Not running

  • 1: Running

Critical: Alert raised when the splunk service does not have the status of “running”.

P4/P3

service_status

tcp-rabbitmq-exchange

Running status of tcp-rabbitmq-exchange on monitor node:

  • 0: Running

  • -1: Not Running

Critical when tcp-rabbitmq-exchange service is not in a running state.

P2/P3

sshd_running_status

common

The running status of the sshd service:

  • 0: Not running

  • 1: Running

Warning: Alert raised if the sshd service is not running.

P4

swap_usage

common

The percentage of swap usage

Warning: Alert raised when swap usage is >= 95%

Critical: Alert raised when swap usage is >= 98%

P2

tcp_connection

common

cache_nodes

traffic_router

Number of TCP connections including TCP6

Warning: Alert raised if TCP connections are:

  • >= 500 (others)

  • >= 50000

    (Cache nodes)

  • >= 5000

    (Traffic router)

Critical: Alert raised if TCP connections are:

  • >= 1000 (others)

  • >= 55000

    (Cache nodes)

  • >= 10000

    (Traffic router)

Warning: N/A

Critical: P4

tripwire_violations

common

The number of tripwire report violations. Local files are checked for changes.

Critical: If any tripwire violations are found

(check result is 3)

P4

Table 2. Alarm Remediation Steps

Check

Remediation

bond_interface

If this alarm has been raised, the server has likely lost connectivity, so a keepalive alarm should also have been raised. Also look at the IPMI Interface check in Traffic Ops to see whether the server has power.

cpu_usage

Level 3:

  • Check the running state of the system to determine if there is an issue. If the alert is generated and resolved immediately, it can likely be ignored.

  • This may or may not be an issue depending on the load/traffic on the server. If the alert continuously persists, execute the top command on the client terminal to determine which process is running high on CPU.

  • Check the Grafana dashboard for any patterns in the usage.

disk_drives_count

Log in to the cache node and use the ls /dev/sd* command to check the disk drive partitions list.

The list will vary based on the profile of the Mid or Edge cache server. sda, sdb, sdc, and so on are the disk drives, and sda1, sda2, and so on are partitions in sda. Check whether any disk is missing based on the profile.
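
For example, a quick way to capture the current list for comparison against the expected profile (device names below are illustrative):

    # List the current disk devices and partitions
    ls /dev/sd*
    # Example output on a healthy node (varies by hardware profile):
    #   /dev/sda  /dev/sda1  /dev/sda2  /dev/sdb  /dev/sdc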

disk_usage

Level 2/Level 3: For critical alerts, log in and check how different partitions are being used. Use the df command to determine which partitions have high disk usage.
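
A short sketch of the disk-usage check (the awk filter assumes the default df output, where the fifth column is the usage percentage):

    # Show usage per filesystem, human-readable
    df -h
    # Flag filesystems at or above the 85% warning threshold
    df -h | awk '0+$5 >= 85'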

dns_resolved_status

Level 1/Level 2: Make sure servers have access to DNS servers and that those DNS servers are correct.

hardware

Indicates a possible hardware error. Details can be found in the /var/log/messages file. The check is looking for “Hardware Error” in the dmesg command output. Note that:

  • If “hardware error” is seen in the dmesg output, but has not been corrected in the mcelog (located in /var/log/messages), an alert will be raised immediately.

  • Alerts are not raised if dmesg lists the same error based on the dmesg time stamp.

  • Alerts are not raised for corrected errors shown in the mcelog (located in /var/log/messages)

  • Alerts are raised if correctable errors are reported for more than 6 hours of the day.

Perform the following steps:

  1. If the alert is repeatedly raised, then Stash (mute) the alert while investigating the issue.

  2. SSH to the affected system and change to the root user.

  3. Collect the relevant portion of the /var/log/messages file for information on the hardware error.

  4. Send an email for a case to be opened, including all of the collected information. See the note below.

  5. Back up the dmesg log using the following command:

    dmesg > /var/log/dmesg-`date +%Y%m%d_%I%M%S`
  6. Enter the dmesg -c command to clear the dmesg log.

  7. Unstash (unmute) the alert.

Note 

Level 1 should create a case with the hardware vendor and ask them to identify what the error means and to recommend the next steps. If the vendor recommends a DIMM replacement, ensure that the issue has actually been identified. Ask the vendor to explain why the DIMM needs to be replaced and to identify which slot is referenced in the error message.
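
A sketch of the commands behind steps 3 through 6 above, run as root (the snippet file name is only an example):

    # Step 3: collect the hardware error lines from the system log
    grep -i "hardware error" /var/log/messages > /tmp/hw-error-snippet.txt
    # Step 5: back up the dmesg log before clearing it
    dmesg > /var/log/dmesg-`date +%Y%m%d_%I%M%S`
    # Step 6: clear the dmesg log
    dmesg -c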

influx_connectivity

Check whether there is a network event that is causing the loss of connectivity. If no possible cause can be found, escalate to Level 2 or Level 3.

insights_failover_check

Check the connectivity between the primary and backup OMD Insights nodes. There may be network connectivity issues between the primary and backup OMD Insights nodes that resulted in this alarm.

Check for any keepalive alerts from OMD Monitor for the OMD Insights primary nodes.

interface

Level 2: Verify that the issue is not external to the CDN cache nodes and then escalate as needed.

kafka_broker_status_all

Check the accessibility of the Kafka servers from the OMD Monitor nodes:

  • Check the omd.conf pillar file on the Salt Master and verify the Kafka servers and their ports are configured correctly.

  • SSH to the OMD monitor node and verify connectivity to the Kafka servers.

  • Check the /var/log/sensu/sensu-server.log and /etc/sensu/kafka_broker_status.log log files for any error messages.
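
A sketch of these checks from the OMD Monitor node (the Kafka host name and port below are placeholders; take the real values from the omd.conf pillar file):

    # Verify connectivity to a configured Kafka broker
    nc -vz kafka-broker.example.com 9092
    # Review the relevant logs for error messages
    grep -i error /var/log/sensu/sensu-server.log | tail
    tail /etc/sensu/kafka_broker_status.log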

keep_alive

Network connectivity between an OMD Monitor node and the server being monitored might have been lost. Additional insight into the network is required.

If you can access the server using SSH, you can run the following commands to determine whether the system rebooted or if there was a network issue:

  • uptime: This command shows how long the system has been up. If it has been up since well before the alert, then use the next command to troubleshoot further.

  • grep “NIC Link” /var/log/messages*: This will display the details for the interface status, if it had a problem.

If neither of these commands reveal an issue, perform additional review of the following log files:

  • /var/log/sensu/sensu-client.log on the server

  • /var/log/sensu/sensu-server.log on Monitor nodes
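
A compact version of these checks, if the server is reachable over SSH:

    # A recent boot points to a reboot rather than a network issue
    uptime
    # Look for link-state changes on the NICs
    grep "NIC Link" /var/log/messages*
    # Review the Sensu logs (client log on the server, server log on the Monitor node)
    tail -n 100 /var/log/sensu/sensu-client.log
    tail -n 100 /var/log/sensu/sensu-server.log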

load_avg_1min

Level 3:

  • Check the running state of the system to determine if there is an issue. If the alert is generated and resolved immediately, it can likely be ignored.

  • This may or may not be an issue depending on the load/traffic on the server. If the alert continuously persists, use the top command from the client terminal to determine which process is running high on CPU.

  • Check the Grafana dashboard for any patterns in the usage.

load_avg_5min

Level 3:

  • Check the running state of the system to determine if there is an issue. If the alert is generated and resolved immediately, it can likely be ignored.

  • This may or may not be an issue depending on the load/traffic on the server. If the alert continuously persists, use the top command from the client terminal to determine which process is running high on CPU.

  • Check the Grafana dashboard for any patterns in the usage.

load_avg_15min

Level 3:

  • Check the running state of the system to determine if there is an issue. If the alert is generated and resolved immediately, it can likely be ignored.

  • This may or may not be an issue depending on the load/traffic on the server. If the alert continuously persists, use the top command from the client terminal to determine which process is running high on CPU.

  • Check the Grafana dashboard for any patterns in the usage.

log_failed_notallowed

Level 3: SSH to the server as root and check the /var/log/secure file for a string similar to the following:

"User root from 11.22.33.44 not allowed because not listed in AllowUsers"

log_failed_password

SSH to the server as root and check the /var/log/secure file for the string “Failed password”.
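
For either login-failure check, a quick way to pull the relevant lines (strings and path as given in these entries):

    # Attempts blocked by access restrictions (AllowUsers)
    grep "not allowed" /var/log/secure
    # Attempts that failed because of an incorrect password
    grep "Failed password" /var/log/secure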

netstat for ESTABLISHED

SSH to the server. Check the output from the netstat command to determine more information about the connections and which port is being used. Look for TCP connections in the ESTABLISHED state, which could provide information about the related service.

netstat for TIME_WAIT

SSH to the server. Check the output from the netstat command to determine more information about the connections and which port is being used. Look for TCP connections in the TIME_WAIT state, which could provide information about the related service.
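
A sketch of a connection breakdown that applies to both netstat checks (column positions assume the default netstat -ant output):

    # Count connections per TCP state
    netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn
    # List a sample of ESTABLISHED connections with their local and remote addresses
    netstat -ant | grep ESTABLISHED | head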

ntp_sync_status

Level 2/Level 3: Check connectivity to the NTP server; other network resources might also not be responding. Use the ntpq -p command to check the NTP source stratum for correct time.

Level 3 should fix the NTP configuration as needed, or perform additional troubleshooting to ensure that NTP is synced.
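
For reference, the offset column that ntpq reports is in milliseconds, which is what the 500 ms threshold in Table 1 refers to:

    # Show the configured NTP peers; '*' marks the currently selected source
    ntpq -pn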

partition_usage

Level 2/Level 3: SSH to the server or view Grafana dashboards for the server to determine which partition is high on usage. After connecting to the server using SSH, look for files that are consuming a large amount of disk space.
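
A short sketch for locating what is consuming space on the partition that raised the alert (/var is only an example):

    # Summarize the largest directories on the partition
    du -xsh /var/* 2>/dev/null | sort -h | tail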

ram_usage

SSH to the server and check the memory usage. Use commands such as top to identify the processes that are running and determine which processes are consuming memory.

Check Grafana dashboards for any patterns in the RAM usage.
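
A short sketch of the memory checks (standard free and ps options):

    # Overall memory and swap usage
    free -m
    # Top memory-consuming processes
    ps aux --sort=-%mem | head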

redis_status

A warning for this check indicates that an OMD Monitor node could not connect to one or more of the other OMD Monitor nodes. However, if OMD Monitor is installed in an HA configuration, the OMD Monitor nodes should continue to work without issue.

You should investigate the reason for the loss of connection between OMD Monitor nodes. If there is no issue with connectivity, review the /var/log files on the OMD Monitor node.

service_status for grafana-server

Connect to the OMD Monitor node and enter the systemctl status grafana-server command to check the status of the grafana-server service. If the service is not in a running state, enter the sudo systemctl restart grafana-server command to try to bring it up.

Note 

If the grafana-server service is not in a running state, the Grafana GUI will also fail to load when connected to this OMD Monitor node.

Check the /var/log/grafana/grafana.log file for any error messages.

If the alarm persists, escalate the problem for further investigation.

service_status for haproxy

Connect to the OMD Monitor node and enter the systemctl status haproxy command to check the status of the haproxy service. If the service is not in a running state, enter the sudo systemctl restart haproxy command to try to bring it up.

Check the /var/log/haproxy.log and /var/log/haproxy-status.log files for any error messages.

If the alarm persists, escalate the problem for further investigation.

service_status for influxdb

Connect to the OMD Monitor node and enter the systemctl status influxdb command to check the status of the influxdb service.

Influxdb metrics may have stopped on one OMD Monitor node; however, the remaining OMD Monitor nodes will continue to store metrics.

Enter the sudo systemctl restart influxdb command to try to bring up the influxdb service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation.

service_status for influxdb-relay

Connect to the OMD Monitor node and enter the systemctl status influxdb-relay command to check the status of the influxdb-relay service.

Influxdb metrics may have stopped on one OMD Monitor node; however, the remaining OMD Monitor nodes will continue to store metrics.

Enter the sudo systemctl restart influxdb-relay command to try to bring up the influxdb-relay service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation.

service_status for rabbitmq-metrics-exporter

The rabbitmq-metrics-exporter service pulls metrics data from RabbitMQ and exports it to the Monitor nodes, the InfluxDB instances, and the external Kafka servers. If this service fails, there will be no metrics data in Influxdb, but the Monitor mailer alerts and Uchiwa will still be functional.

Connect to the Monitor node and check the status of the process using systemctl status rabbitmq-metrics-exporter.

Inspect /var/log/messages on Monitor node for any error messages.

Try bringing up the service using sudo systemctl restart rabbitmq-metrics-exporter on the Monitor node. If the alarm persists, then escalate for investigation.

service_status for rabbitmq-server

Connect to the OMD Monitor node and enter the systemctl status rabbitmq-server command to check the status of the rabbitmq-server service.

An OMD Monitor node whose rabbitmq-server service is not running will not be monitoring any clients. Clients previously connected to this node will move to another OMD Monitor node, including the Sensu client of the OMD Monitor node whose rabbitmq-server service is not running.

Inspect the /var/log/rabbitmq/ logs for any error messages.

Enter the sudo systemctl restart rabbitmq-server command to try to bring up the rabbitmq-server service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation.

service_status for redis

Connect to the OMD Monitor node and enter the systemctl status redis command to check the status of the redis service.

An OMD Monitor node whose redis service is not running will not be monitoring any clients. Clients previously connected to this node will move to another OMD Monitor node, including the Sensu client of the OMD Monitor node whose redis service is not running.

Inspect the /var/log/redis/ logs for any error messages.

Enter the sudo systemctl restart redis command to try to bring up the redis service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation.

service_status for redis-sentinel

Connect to the OMD Monitor node and enter the systemctl status redis-sentinel command to check the status of the redis-sentinel service.

Inspect /var/log/redis/ logs for any error messages.

Enter the sudo systemctl restart redis-sentinel command to try to bring up the redis-sentinel service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation.

service_status for sensu-server

Connect to the OMD Monitor node and enter the systemctl status sensu-server command to check the status of the sensu-server service.

Inspect the /var/log/sensu/sensu-server.log file for any error messages.

Enter the sudo systemctl restart sensu-server command to try to bring up the sensu-server service on the OMD Monitor node. If the alarm persists, escalate the problem for investigation.

service_status for splunk

The Splunk service is down on the OMD Insights nodes. Log in to the OMD Insights nodes and enter the systemctl status splunk command to check the status of the Splunk service. The status of the service should show “running”.

service_status for tcp-rabbitmq-exchange

The tcp-rabbitmq-exchange service collects Sensu checks output from the TCP handler and sends the metrics data to the RabbitMQ on the same Monitor node. Failure of this service would result in no metrics data in Influxdb, but Monitor mailer alerts and Uchiwa will still be functional.

Connect to the Monitor node and check the status of the process using the command systemctl status tcp-rabbitmq-exchange.

Inspect /var/log/messages on the Monitor node for any error messages.

Try bringing up the service using sudo systemctl restart tcp-rabbitmq-exchange on the Monitor node. If the alarm persists, then escalate for investigation.

sshd_running_status

Log in to the server through the console and check whether the sshd service is running. If it is not running, start the sshd service and escalate to Level 3.

swap_usage

Try to determine which processes are consuming the swap memory on the system raising the alarm. This issue needs to be escalated ASAP.

tcp_connection

Level 3 should look at the running state of the system to determine if there are any issues.

Use the netstat command to check how the connections are distributed.
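
A sketch of commands for checking how the connections are distributed (ss is shown as a faster alternative if it is available on the node; the awk fields assume the default netstat -ant output):

    # Summary counts of sockets by state
    ss -s
    # Count established connections per local port
    netstat -ant | grep ESTABLISHED | awk '{print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -rn | head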

tripwire_violations

Level 3 should address all issues with these alarms. Level 2 should take no action unless Level 3 determines there is an issue.

To clear an alarm, connect to the server and switch to the root user. Look at the shell history for the tripwire command. Run that command to set the file baseline, and then run /usr/sbin/tripwire --check to ensure there are no violations.