This document describes the Drain of Unprocessed Events and Frequent Drain of Events health alerts on Firepower Management Center (FMC) as well as how to troubleshoot this type of issue. Although these events are generated and shown on the FMC they relate to a managed device sensor whether it is a Firepower Threat Defense (FTD) device or a Next-Generation Intrusion Prevention System (NGIPS) device. For the rest of this document, the term sensor refers to both FTD and NGIPS devices alike unless otherwise specified.
Contributed by Joao Delgado, Mikis Zafeiroudis, Cisco TAC Engineers.
The FMC generates one of these health alerts:
Frequent drain of Unified Low Priority Events and/or
Drain of unprocessed events from Unified Low Priority Events
This is the health alert structure:
Frequent drain of <SILO NAME>
Drain of unprocessed events from <SILO NAME>
In this example, the SILO NAME is Unified Low Priority Events. This is one of the disk manager silos (see Background Information section for a more comprehensive explanation).
Although any silo can technically generate a Frequent drain of <SILO NAME> health alert the most commonly seen are the ones related to events and, amongst them, the Low Priority Events simply because these are the type of events more often generated by the sensors.
A “Frequent drain of <SILO NAME>" event has a Warning severity in the case it is an event-related silo since, if this were processed (explanation about what constitutes an unprocessed event is given next), they are in the FMC database.
For a non-event related silo, such as the “Backups” silo, the Alert is Critical since this information is lost.
Only event type silos generate a Drain of unprocessed events from <SILO NAME> health alert. This alert always has Critical severity.
Additional symptoms can include:
Slowness on the FMC UI
Loss of events
Common Troubleshoot Scenarios
A Frequent drain of <SILO NAME> event is caused by too much input into the silo for its size. In this case, the disk manager drains (purges) that file at least twice in the last 5-minute interval. In an event type silo, this is typically caused by excessive logging of that event type. In the case of a Drain of unprocessed events of <SILO NAME> health alert, this can also be caused by a bottleneck in the event processing path.
In the diagram there are 3 potential bottlenecks:
The EventHandler process on FTD is oversubscribed (it reads slower than what Snort writes)
The Eventing interface is oversubscribed
The SFDataCorrelator process on FMC is oversubscribed
As stated in the previous section, one of the most common causes for the health alerts of this type is excessive input.
The difference between the Low Water Mark (LWM) and High Water Mark (HWM) gathered from the show disk-manager CLISH command shows how much space there is need to take on that silo to go from LWM (freshly drained) to the HWM value. If there are frequent drain of events (with or without unprocessed events) the first thing you should review is the logging configuration.
Case 3. A Bottleneck in the SFDataCorrelator Process
The last scenario to be covered is when the bottleneck happens on the SFDataCorrelator side (FMC).
The first step is to look into the diskmanager.log file as there is important information to be gathered such as:
The frequency of the drain.
The number of files with Unprocessed Events drained.
The occurrence of the drain with Unprocessed Events.
For information about the diskmanager.log file and how to interpret it, you can refer to the Disk Manager section. The information gathered from the diskmanager.log can be used to help narrow down the subsequent steps.
Additionally, you need to look at the correlator performance statistics:
Note that these statistics are for the FMC and they correspond to the aggregate of all the sensors managed by it. In the case of Unified low priority events you mainly look for:
Total flows per second of any event type to evaluate possible oversubscription of the SFDataCorrelator process.
The two rows highlighted in the previous output:
rna flows/second – Indicates the rate of low priority events processed by the SFDataCorrelator.
rna dup flows/second - Indicates the rate of duplicated low priority events processed by the SFDataCorrelator. This is generated by double logging as discussed in the previous scenario.
Based on the output it can be concluded that:
There is no duplicate logging as indicated by the rna dup flows/second row.
In the rna flows/second row, the Maximum value is much higher than the Average value so there was a spike in the rate of events processed by the SFDataCorrelator process. This could be expected if you look at this early morning when your user's workday has just started, but in general, it is a red flag and requires further investigation.
More information about the SFDataCorrelator process can be found under the Event Processing section.
First, you need to determine when the spike occurred. To do this you need to look at the correlator statistics per each 5-minute sample interval. The information gathered from the diskmanager.log can help you to go straight to the important timeframe.
Tip: Pipe the output to the Linux pager less so that you can easily search.
Determine the 5-minute interval when the spike happened.
In the previous example, there is an obvious spike in the rate of events received at 16:06:39 and beyond. Note that these are 5-minute averages so the increase may be more abrupt than shown (burst) but diluted in this 5-minute interval if it started towards the end of it.
Although this leads to the conclusion that this spike of events caused the Drain of unprocessed events you can take a look at the connection events from the FMC Graphical User Interface (GUI) with the appropriate time window to understand what type of connections traversed the FTD box in this spike:
Apply this time window to get the filtered Connection Events, do not forget to account for the timezone. In this example, the sensor uses UTC and the FMC UTC+1. Use the Table View to see the events that triggered the overload of events and take action accordingly:
Based on the timestamps (time of the first and last packet) it can be seen that these are short-lived connections. Furthermore, the Initiator and Responder Packets columns show that there was only 1 packet exchanged in each direction. This confirms that the connections were short-lived and exchanged very little data.
You can also see that all these flows target the same responder IPs and port. Also, they are all reported by the same sensor (which alongside Ingress and Egress interface information can speak to the place and direction of this flows). Additional actions:
Check the Syslogs on the destination endpoint.
Implement DOS/DDOS protection, or take other preventive measures.
Note: The intent of this article is to provide guidelines to troubleshoot the Drain of Unprocessed Events alert. This example used hping3 to generate a TCP SYN flood to the destination server. For FTD device hardening check the Cisco Firepower Threat Defense Hardening Guide
Items to Collect Before you Contact Cisco Technical Assistance Center (TAC)
It is highly recommended to collect these items before you contact Cisco TAC:
Screenshot of the health alerts seen.
Troubleshoot file generated from the FMC.
Troubleshoot file generated from the affected sensor.
Date and Time when the problem was first seen.
Information about any recent changes done to the policies (if applicable).
The output of the stats_unified.pl command as described in the Event Processing section highlighting the affected sensors.
This section covers an in-depth explanation of the various components that can take part in this type of health alerts. This includes:
Event Processing - Covers the path events take on both the sensor devices and the FMC. This is mainly useful when the health alert refers to an event-type Silo.
Disk Manager - Covers the disk manager process, silos, and how they are drained.
Health Monitor - Covers how the Health Monitor modules are used to generate health alerts.
Log to Ramdisk - Covers the logging to ramdisk feature and its potential impact on health alerts.
To understand the Drain of Events health alerts and be able to identify potential failure points there is a need to look into how these components work and interact with each other.
Even though the Frequent Drain type of health alerts can be triggered by silos that are not event-related, the vast majority of the cases seen by Cisco TAC are related to drain of event-related information. Additionally, to understand what constitutes a drain of unprocessed events there is a need to take a look at the event processing architecture and the components that constitute it.
When a Firepower sensor receives a packet from a new connection the snort process generates an event in unified2 format which is a binary format that allows for quicker read/writes as well as lighter events.
The output shows the FTD command system support trace where you can see a new connection created. The important parts are highlighted and explained:
192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 Packet: TCP, SYN, seq 3310981951 192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 Session: new snort session 192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 AppID: service unknown (0), application unknown (0) 192.168.0.2-42310 > 192.168.1.10-80 6 AS 1-1 I0 new firewall session 192.168.0.2-42310 > 192.168.1.10-80 6 AS 1-1 I0 using HW or preset rule order 4, 'Default Inspection', action Allow and prefilter rule 0 192.168.0.2-42310 > 192.168.1.10-80 6 AS 1-1 I0 HitCount data sent for rule id: 268437505, 192.168.0.2-42310 > 192.168.1.10-80 6 AS 1-1 I0 allow action 192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 Firewall: allow rule, 'Default Inspection', allow 192.168.0.2-42310 - 192.168.1.10-80 6 AS 1-1 CID 0 Snort id 0, NAP id 1, IPS id 0, Verdict PASS
Snort unified_events files are generated per instance under the path [/ngfw]var/sf/detection_engine/*/instance-N/, where:
* is the Snort UUID. This is unique per appliance.
N is the Snort instance ID which can be calculated as the instance ID from the previous output (the highlighted 0 in the example) + 1
There can be 2 types of unified_events files in any given Snort instance folder:
unified_events-1 (which contains high priority events).
A high priority event is an event that corresponds to a potentially malicious connection.
Types of events and their priority:
High Priority (1)
Low Priority (2)
Associated Connection events
The next output shows an event that belongs to the new connection traced in the previous example. The format is unified2 and is taken from the output of the respective unified event log located under [/ngfw]/var/sf/detection_engine/*/instance-1/ where 1 is the snort instance id in bold in the previous output +1. The unified event log format's name follows the syntax unified_events-2.log.1599654750 where 2 stands for the priority of the events as shown in the table and the last portion in bold (1599654750) is the timestamp (Unix time) of when the file was created.
Tip: You can use the Linux date command to convert the Unix time into a readable date: admin@FP1120-2:~$ sudo date -d@1599654750 Wed Sep 9 14:32:30 CEST 2020
This allows the disk manager process to know which events have already been processed (sent to FMC) and which ones have not.
Note that when the disk manager drains an event silo it removes unified event files. For more information about the drain of silos read the Disk Manager section.
A drained unified file is deemed to have unprocessed events when one of these is true:
The bookmark timestamp is lower than the file creation time.
The bookmark timestamp is the same as the file creation time and the position in Bytes in the file is lower than it’s size.
The EventHandler process reads events from the unified files and streams them to the FMC (as metadata) via sftunnel, which is the process responsible for encrypted communication between the sensor and the FMC. This is a TCP based connection so the event streaming is acknowledged by the FMC
You can see these messages in the [/ngfw]/var/log/messages file:
sfpreproc:OutputFile [INFO] *** Opening /ngfw/var/sf/detection_engines/77d31ce2-c2fc-11ea-b470-d428d53ed3ae/instance-1/unified_events-2.log.1597810478 for output" in /var/log/messages
EventHandler:SpoolIterator [INFO] Opened unified event file /var/sf/detection_engines/77d31ce2-c2fc-11ea-b470-d428d53ed3ae/instance-1/unified_events-2.log.1597810478
sftunneld:FileUtils [INFO] Processed 10334 events from log file var/sf/detection_engines/77d31ce2-c2fc-11ea-b470-d428d53ed3ae/instance-1/unified_events-2.log.1597810478
This output provides this information:
Snort opened the unified_events file for output (to write in it).
Event Handler opened the same unified_events file (to read from it).
sftunnel reported the number of events processed from that unified_events file.
The bookmark file is then updated accordingly. The sftunnel uses 2 different channels called Unified Events (UE) Channel 0 and 1 for high and low priority events respectively.
With the sfunnel_status CLI command on the FTD, you can see the number of events that were streamed.
Priority UE Channel 1 service
TOTAL TRANSMITTED MESSAGES <530541> for UE Channel service RECEIVED MESSAGES <424712> for UE Channel service SEND MESSAGES <105829> for UE Channel service FAILED MESSAGES <0> for UE Channel service HALT REQUEST SEND COUNTER <17332> for UE Channel service STORED MESSAGES for UE Channel service (service 0/peer 0) STATE <Process messages> for UE Channel service REQUESTED FOR REMOTE <Process messages> for UE Channel service REQUESTED FROM REMOTE <Process messages> for UE Channel service
In the FMC the events are received by the SFDataCorrelator process.
The status of events that were processed from each sensor can be seen with the stats_unified.pl command:
admin@FMC:~$ sudo stats_unified.pl Current Time - Fri Sep 9 23:00:47 UTC 2020
********************************************************************************** * FTD - 60a0526e-6ddf-11ea-99fa-89a415c16717, version 22.214.171.124 ********************************************************************************** Channel Backlog Statistics (unified_event_backlog) Chan Last Time Bookmark Time Bytes Behind 0 2020-09-09 23:00:30 2020-09-07 10:41:50 0 1 2020-09-09 23:00:30 2020-09-09 22:14:58 6960
This command shows the status of the backlog of events for a certain device per channel, the Channel ID used is the same as the sftunnel.
The Bytes Behind value can be computed as the difference between the position shown in the unified event bookmark file and the size of the unified event file, plus any subsequent file with a higher timestamp than the one in the bookmark file.
The SFDataCorrelator process also stores performance statistics, those are saved in /var/sf/rna/correlator-stats/. One file is created per day to store the performance statistics for that day in CSV format. The name of the file uses the format “YYYY-MM-DD” and the file correspondent to the current day is called now.
The statistics are gathered every 5 minutes (there is one line per each 5-minute interval).
The output of this file can be read with the perfstats command. Note that this is command is also used to read snort performance statistics files, so the appropriate flags must be used:
-C: Instructs perfstats that the input is a correlator-stats file (without this flag perfstats assumes the input is a snort performance statistics file).
-q: Quiet mode, prints only the summary for the file.
Each row in the summary has 3 values in this order: Average, Minimum, Maximum.
If you print without the -q flag you also see the 5-minute interval values. The summary is shown in the end.
Note that each FMC has a Maximum flow rate described on its datasheet. The next table contains the values per module taken from the respective datasheet:
Maximum Flow Rate (fps)
Note that these values are for the aggregate of all event types shown in bold on the SFDataCorrelator statistics output.
If you look at the output and we size our FMC in such a way as we are prepared for the worst-case scenario (when all the maximum values happen at the same time), then the rate of events this FMC is seeing is 48.65 + 348.15 + 4.63 + 3.25 + 0.06 = 404.74 fps.
This total value can be compared with the value from the datasheet of the respective model.
The SFDataCorrelator can also make additional work on top of the received events (such as for Correlation Rules), it then stores them into the database which is queried to populate various information in the FMC Graphical User Interface (GUI) such as Dashboards and Event Views.
The next logical diagram shows the logical components for both the Health Monitor and Disk Manager processes as they are intertwined for the generation of disk-related health alerts.
In a nutshell, the disk manager process manages the disk usage of the box and it has its configuration files in the [/ngfw]/etc/sf/ folder. There are multiple configuration files for the disk manager process that are used under certain circumstances:
diskmanager.conf - Standard configuration file.
diskmanager_2hd.conf - Used when the box has 2 hard drives installed. The second hard drive is the one related to the Malware Expansion, used to store files as defined in the file policy.
ramdisk-diskmanager.conf - Used when Log to Ramdisk is enabled. For more information check the Log to Ramdisk section.
Each type of file monitored by the disk manager is assigned a Silo. Based on the amount of disk space available on the system the disk manager computes a High Water Mark (HWM) and a Low Water Mark (LWM) for each silo.
When the disk manager process drains a silo it does so up until the point where the LWM is reached. Since events are drained per file this threshold may be crossed.
To check the status of the silos on a sensor device you can use this command:
Each time the disk manager process runs it generates an entry for each of the different silos on its own log file which is located under [/ngfw]/var/log/diskmanager.log and has data in a CSV format.
Next, is shown a sample line from the diskmanager.log file, taken from a sensor that triggered the Drain of unprocessed events from Unified Low Priority Events health alert, as well as the breakdown of the respective columns:
This information is then read by the respective Health Monitor module to trigger the related health alert.
Drain Manually a Silo
In certain scenarios, you may want to drain manually a silo. For example, to clear disk space with manual silo drain instead of manual file removal has the benefit of the disk manager to decide which files to keep and which to delete. The disk manager keeps the most recent files for that silo.
Any silo can be drained and this works as already described (the disk manager drains data until the amount of data goes under the LWM threshold). The command system support silo-drain is available on FTD CLISH mode and it provides a list of the available silos (name + numeric id).
This is an example of a manual drain of the Unified Low Priority Events silo:
Any health alert seen on the FMC in the Health Monitor menu or under the Health tab in the Message Center is generated by the Health Monitor process.
This process monitors the health of the system, both for the FMC and the managed sensors and it is composed of a number of different modules.
Health alert modules are defined in the Health Policy which can be attached per device.
Health alerts are generated by the Disk Usage module that can run on each of the sensors managed by the FMC.
When the Health Monitor process on the FMC runs (once every 5 minutes or when a manual run is triggered) the Disk Usage module looks into the diskmanager.log file and, if the correct conditions are met, the respective health alert is triggered.
For a Drain of Unprocessed events health alert to be triggered All these conditions must be true:
KBytes drained field is greater than 0 (this indicates that data from this silo was drained).
The number of files with unprocessed events drained greater than 0 (this indicates that there were unprocessed events within the drained data).
The time of the drain is within the last 1 hour.
For a Frequent Drain of events health alert to be triggered these conditions must be true:
The last 2 entries on the diskmanager.log file need to:
Have KBytes drained field greater than 0 (this indicates that data from this silo was drained).
Be less than 5 minutes apart.
The Time of drain of the last entry for this silo is within the last 1 hour.
The results gatherer from the disk usage module (as well as the results gathered by the other modules) are sent to the FMC via sftunnel. You can see counters for the Health Events exchanged over sftunnel with the sftunnel_status command:
TOTAL TRANSMITTED MESSAGES <3544> for Health Events service RECEIVED MESSAGES <1772> for Health Events service SEND MESSAGES <1772> for Health Events service FAILED MESSAGES <0> for Health Events service HALT REQUEST SEND COUNTER <0> for Health Events service STORED MESSAGES for Health service (service 0/peer 0) STATE <Process messages> for Health Events service REQUESTED FOR REMOTE <Process messages> for Health Events service REQUESTED FROM REMOTE <Process messages> for Health Events service
Log to Ramdisk
Even though most events are stored in disk, the device is configured by default to log to ramdisk to prevent gradual damage to the SSD that may be caused by constant writes and deletes of events to disk.
In this scenario, the events are not stored under [/ngfw]/var/sf/detection_engine/*/instance-N/, but they are located in [/ngfw]/var/sf/detection_engines/*/instance-N/connection/, which is a symbolic link to /dev/shm/instance-N/connection. In this case, the events reside in virtual memory rather than physical.
To verify what the device is currently configured to do run the command show log-events-to-ramdisk from the FTD CLISH. You can also change this if you use the command configure log-events-to-ramdisk <enable/disable>:
> show log-events-to-ramdisk Logging connection events to RAM Disk.
> configure log-events-to-ramdisk Enable or Disable enable or disable (enable/disable)
When you log to ramdisk the main drawback is that the respective silo has a smaller space allocated and thus drains them more often under the same circumstances. The next output is the disk manager from an FPR 4140 with and without the log events to ramdisk enabled for comparison.
The smaller size of the silo is compensated by the higher speed to access the Events and stream them to the FMC. Although this is a better option under proper conditions, the drawback regarding potential smaller disk life must be considered.
Frequently Asked Questions (FAQ)
Are the Drain of Events health alerts only generated by Connection Events?
Alerts of Frequent Drain can be generated by any disk manager silo.
Alerts of Drain of Unprocessed Events can be generated by any event-related silo.
Connection Events are the most common culprit.
Is it always advisable to disable Log to Ramdisk when a Frequent Drain health alert is seen?
No. Only in Excessive Logging scenarios, when the affected Silo is the Connection Events silo, and only in cases where it is not possible to further tune the Logging Settings.
What constitutes an unprocessed event?
Events are not individually marked as unprocessed. A file has unprocessed events when:
Its creation timestamp is higher than the timestamp field in the respective bookmark file.
Its creation timestamp is equal to the timestamp field in the respective bookmark file and its size is higher than the position in Bytes field on the respective bookmark file.
How does the FMC know the number of Bytes behind for a particular sensor?
By sending metadata regarding the unified_events file name and size as well as the information on the bookmark files the sensor gives the FMC enough information to compute the bytes behind as:
Current unified_events file size - Position in Bytes" field from bookmark file + Size of all unified_events files with higher timestamp than the timestamp in the respective bookmark file.