Alarms, Events, and Statistics
This chapter describes the procedures for viewing, backing up, and restoring V2PC statistics. It also lists the alarms and events that V2PC master and worker nodes can generate, and the statistics V2PC can collect.
Caution Statistics data is not highly available. If the hosting server crashes and cannot recover data, that data will be lost. This chapter provides guidance on preventing loss of statistics in the event of a fatal crash.
Introduction
InfluxDB, a database for time series data, aggregates control plane statistics for the V2PC platform and its applications. The InfluxDB database instance runs on the ELK node VM.
All statistics from the various V2PC nodes are sent to InfluxDB through the Sensu monitoring platform. The Sensu server runs in a high-availability (HA) configuration on the V2PC master nodes. Through Sensu checks, the statistics handler collects statistics from each node or service and sends them to InfluxDB running on the ELK node.
Note For additional information on InfluxDB, see the documentation for InfluxDB version 0.12 at: https://docs.influxdata.com/influxdb/v0.12
Figure 12-1 V2PC Statistics Aggregation
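The collection pipeline is driven by Sensu check definitions on each node: a check of type metric runs a collection command on an interval and routes its output to a handler, which forwards the data points to InfluxDB on the ELK node. The following is a minimal sketch of such a definition; the check name, collection script, and handler name are illustrative assumptions, not the actual V2PC configuration.

{
  "checks": {
    "v2pc_node_stats": {
      "type": "metric",
      "command": "collect-node-stats.sh",
      "subscribers": ["all"],
      "interval": 60,
      "handlers": ["influxdb"]
    }
  }
}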
The InfluxDB metastore contains internal information about the status of the system, including user information, database and shard metadata, and enabled retention policies. Because there is only one ELK node instance in a V2PC deployment, there is no high availability on the statistics data aggregated in InfluxDB. As a result, the loss of the ELK node or InfluxDB instance can result in loss of this data.
We therefore recommend backing up statistics data on a regular schedule. The following sections provide instructions for logging in to the InfluxDB shell, backing up the database, and, if ever necessary, restoring the database from a previous backup.
Viewing Current InfluxDB Data
To log in to InfluxDB on the ELK node and use the database shell to view stored statistics data:
Step 1 On the master node, execute the command consul members to identify the active ELK node.
Step 2 Log in to the ELK node via SSH. The default login credentials are:
- User ID: root
- Password: cisco
Step 3 On the command line, type influx to access the InfluxDB command shell prompt.
Step 4 Type auth admin default to authenticate with the default credentials (user admin, password default).
Step 5 Type show databases.
Step 6 Type use stats_system to access the stats_system database, where V2PC stores statistics data.
Step 7 Type show measurements to show a list of measurements (similar to tables in an SQL database).
For example, COS node statistics are stored in the aic_cosnodestats measurement.
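For reference, the full sequence of shell commands from Steps 3 through 7 looks like this (a query for the fields of aic_cosnodestats is added at the end as an example):

$ influx
> auth admin default
> show databases
> use stats_system
> show measurements
> show field keys from aic_cosnodestats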
Backing Up InfluxDB Data
While the node is running, you can back up the metastore of an InfluxDB instance by executing the command influxd backup /influxdb_backups/metadata. This command backs up the InfluxDB metadata to the directory /influxdb_backups/metadata.
To back up the InfluxDB database:
Step 1 Log in to the ELK node as described in Viewing Current InfluxDB Data.
Step 2 Run the command influxd backup -database stats_system /influxdb_backups/database to back up the InfluxDB database stats_system to the directory /influxdb_backups/database.
To back up a remote node, specify the host and port of the remote instance using the -host configuration switch, as shown in the following example:
influxd backup -database stats_system -host 10.0.0.1:8088 /tmp/mysnapshot
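To automate the recommended regular backups, a cron job on the ELK node can run both backup commands on a schedule. The following is a minimal sketch assuming local backups under /influxdb_backups and a daily schedule; adjust the schedule, paths, and influxd location to your environment, and copy the backup directory off the ELK node so it survives a loss of the VM.

# /etc/cron.d/influxdb-backup (illustrative)
# Back up the metastore and the stats_system database at 01:30 every day
30 1 * * * root influxd backup /influxdb_backups/metadata && influxd backup -database stats_system /influxdb_backups/database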
Restoring InfluxDB Data
Note Restoring from a backup is only supported while the InfluxDB daemon is stopped. To restore from a backup, you must provide the path to the backup.
To restore the InfluxDB database from an earlier backup:
Step 1 Log in to the ELK node as described in Viewing Current InfluxDB Data.
Step 2 Run the command influxd restore /tmp/backup to restore the contents of /tmp/backup.
The following optional flags are available for this command:
- -metadir <path to meta directory> – The path to the meta directory to which the metastore backup should be recovered. For packaged installations, specify /var/lib/influxdb/meta.
- -datadir <path to data directory> – The path to the data directory to which the database backup should be recovered. For packaged installations, specify /var/lib/influxdb/data.
- -database <database> – The database to which the data should be restored. This option is required if no -metadir option is provided.
- -retention <retention policy> – The target retention policy to which the stored data should be restored.
- -shard <shard id> – The shard data that should be restored. If specified, -database and -retention must also be set.
Step 3 Restore the backup in two steps, as follows:
a. Restore the metastore so that InfluxDB knows which databases exist:
$ influxd restore -metadir /var/lib/influxdb/meta /tmp/backup/<metastore>
Where <metastore> is the metastore to be restored. For example, to restore meta.00:
$ influxd restore -metadir /var/lib/influxdb/meta /tmp/backup/meta.00
b. Recover the backed up stats_system database as follows:
$ influxd restore -database stats_system -datadir /var/lib/influxdb/data /tmp/backup
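Because a restore is only supported while the daemon is stopped, the full recovery sequence on a packaged installation typically looks like the following. The service commands and the influxdb:influxdb file ownership are assumptions about a standard packaged install; adjust them to match how InfluxDB is managed on the ELK node.

$ service influxdb stop
$ influxd restore -metadir /var/lib/influxdb/meta /tmp/backup/meta.00
$ influxd restore -database stats_system -datadir /var/lib/influxdb/data /tmp/backup
$ chown -R influxdb:influxdb /var/lib/influxdb
$ service influxdb start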
Sample Queries
Use the following queries from the database shell to show database, measurement, tags, and field information:
- Show tags for all measurements: show tag keys
- Show fields for all measurements: show field keys
- Show all series in the database: show series
- Show tags for measurement aic_cosnodestats: show tag keys from aic_cosnodestats
- Show fields for measurement aic_cosnodestats: show field keys from aic_cosnodestats
- Show series for measurement aic_cosnodestats: show series from aic_cosnodestats
- Show values of the cosNodeID tag: show tag values with key = cosNodeID
- Show retention policies for stats_system: show retention policies on stats_system
Use the following queries from the database shell to show COS node statistics:
- Show COS node statistics from the last minute: select * from aic_cosnodestats where time > now() - 1m
- Count the recorded storageUsed values: select count(storageUsed) from aic_cosnodestats
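Queries can also aggregate over time windows. For example, the following illustrative query averages the storageUsed field over 10-minute buckets for the past hour:

select mean(storageUsed) from aic_cosnodestats where time > now() - 1h group by time(10m)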
Use HTTP queries such as the following against the InfluxDB API, for example from a browser or with curl (a curl sketch follows the list):
- https://172.20.235.82:8443/db/query?&u=admin&p=default&db=stats_system&pretty=true&q=show%20measurements
- https://172.20.235.82:8443/db/query?&u=admin&p=default&db=stats_system&pretty=true&q=select+*+from+aic_cosnodestats+where+cosNodeID=%27335574603%27
- http://172.22.120.80:8086/query?&u=admin&p=default&db=stats_system&pretty=true&q=select * from aic_cosnodestats where cosNodeID='335574603' and time > now() - 6d
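When building these requests by hand, a tool such as curl can handle the URL encoding of the query string. A minimal sketch against the direct InfluxDB port (8086) follows; the host address is taken from the example above and will differ in your deployment.

$ curl -G 'http://172.22.120.80:8086/query' \
    --data-urlencode "u=admin" --data-urlencode "p=default" \
    --data-urlencode "db=stats_system" --data-urlencode "pretty=true" \
    --data-urlencode "q=select * from aic_cosnodestats where cosNodeID='335574603' and time > now() - 6d"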
Alarms at Master Nodes
OS Level
- CPU warning if usage > 80%
- CPU critical if usage >= 100%
- Disk warning if usage >= 85%
- Disk critical if usage >= 95%
- Disk warning if inode usage >= 85%
- Disk critical if inode usage >= 95%
- Memory warning if usage >= 80%
- Memory critical if usage >= 95%
- NTP warning if abs(offset) >= 10 ms
- NTP critical if abs(offset) >= 100 ms
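These thresholds are evaluated by Sensu checks running on each master node. A sketch of how the CPU thresholds above might be expressed as a Sensu check definition follows; the check name and plugin command are assumptions for illustration and may differ from the actual V2PC check set.

{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -w 80 -c 100",
      "subscribers": ["all"],
      "interval": 60
    }
  }
}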
Third-Party Application Health Check
- Consul critical if all consul-agents are not available
- Consul warning if some consul-agents are not available
- Mongo critical if all mongodbs are not available
- Mongo critical if mongodb master is not available
- Mongo warning if some mongodb slaves are not available
- RabbitMQ critical if all rabbitmqs are not available
- RabbitMQ warning if some rabbitmqs are not available
- Redis critical if all redis-servers are not available
- Redis critical if redis-server master is not available
- Redis warning if some redis-server slaves are not available
- RSyslog critical if rsyslog is not running
- Salt-master critical if all salt-masters are not available
- Salt-master warning if some salt-masters are not available
- Sensu-api critical if all sensu-apis are not available
- Sensu-api warning if some sensu-apis are not available
- Sensu-server critical if all sensu-servers are not available
- Sensu-server warning if some sensu-servers are not available
- Zookeeper critical if all zookeepers are not available
- Zookeeper warning if some zookeepers are not available
V2PC Component Health Check
- Docserver critical if all docservers are not available
- Docserver critical if docserver master is not available
- Docserver warning if some docserver slaves are not available
- ResourceManager critical if all RMs are not available
- ResourceManager critical if RM master is not available
- ResourceManager warning if some RM slaves are not available
- ServiceManager critical if all SMs are not available
- ServiceManager critical if SM master is not available
- ServiceManager warning if some SM slaves are not available
- Unified_errorlogd critical if unified_errorlogd is not running
- Unified_translogd critical if unified_translogd is not running
- AICM critical if all AICMs are not available
- AICM critical if AICM master is not available
- AICM warning if some AICM slaves are not available
- MFCM critical if all MFCMs are not available
- MFCM critical if MFCM master is not available
- MFCM warning if some MFCM slaves are not available
- PICM critical if all PICMs are not available
- PICM critical if PICM master is not available
- PICM warning if some PICM slaves are not available
- EAM critical if all EAMs are not available
- EAM critical if EAM master is not available
- EAM warning if some EAM slaves are not available
- v2p-ui critical if all UIs are not available
- v2p-ui critical if UI master is not available
- v2p-ui warning if some UI slaves are not available
- v2pc-dns critical if all v2pc-dns are not available
- v2pc-dns critical if v2pc-dns master is not available
- v2pc-dns warning if some v2pc-dns slaves are not available
- AIC/MFC/PIC instance critical if it is not running
Connection with Worker Nodes
- keep-alive warning if no response within 25 seconds
- keep-alive critical if no response within 300 seconds
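These keep-alive thresholds correspond to Sensu client keepalive settings on the worker nodes. A minimal sketch of such a client definition follows; the client name, address, and subscriptions are illustrative, not the actual V2PC values.

{
  "client": {
    "name": "worker-node-01",
    "address": "10.0.0.10",
    "subscriptions": ["worker"],
    "keepalive": {
      "thresholds": {
        "warning": 25,
        "critical": 300
      }
    }
  }
}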
Alarms at Worker Nodes
OS Level
- CPU warning if usage > 80%
- CPU critical if usage >= 100%
- Disk warning if usage >= 85%
- Disk critical if usage >= 95%
- Disk warning if inode usage >= 85%
- Disk critical if inode usage >= 95%
- Memory warning if usage >= 80%
- Memory critical if usage >= 95%
- NTP warning if abs(offset) >= 10 seconds
- NTP critical if abs(offset) >= 100 seconds
Statistics at Master and Worker Nodes
OS Level
- CPU usage percentage
- Disk used, available, usage percentage
- Memory used, available, usage percentage
- Network received Kbytes per second
- Network sent Kbytes per second
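To confirm that these statistics are reaching InfluxDB, you can query the most recent points from the database shell. The measurement name below is a placeholder; use show measurements (as described in Viewing Current InfluxDB Data) to find the actual OS-level measurement names in your deployment.

> show measurements
> select * from <measurement> where time > now() - 1m limit 5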