Abstract. Commercially available network intrusion detection systems (NIDS) came onto the market over six years ago. These systems have gained acceptance as a viable means of monitoring the security of consumer networks, yet no commercial standards exist to help consumers understand the capacity characteristics of these devices. Existing NIDS tests are flawed. These tests resemble the same tests used with other networking equipment, such as switches and routers. However, switches and routers do not conduct the same level of deep packet inspection, nor do they require the higher-level protocol awareness that a NIDS demands. Therefore, the current testing does not allow consumers to infer any expected performance in their environments. Designing a new set of tests that are specific to the weak areas, or bottlenecks, of a NIDS is the key to discovering metrics meaningful to the consumers. Consumers of NIDS technology can then examine the metrics used in the tests and profile their network traffic based on these same metrics. Consumers can use standard test results to accurately predict performance on their networks. This paper proposes a test methodology for standardized capacity benchmarking of NIDS. The test methodology starts with examination of the bottlenecks in a NIDS, then maps these bottlenecks to metrics that can be tested, and finally explores some results from tests conducted.
Currently, no industry standards exist for testing any aspect of network intrusion detection systems (NIDS). The NIDS industry is maturing along the same lines as the router, switch, and firewall industries that came before it, and has now reached the point where standardization of testing and benchmarking is possible. Attempting to define a testing standard is beyond the scope of this paper. Instead, the metrics and methodology used to properly verify the capacity of high-speed NIDS are explored. Performance of NIDS is usually defined by false positive and false negative ratios, and speed or capacity. This paper addresses the issue of benchmarking the capacity of a NIDS. This paper uses capacity to refer to the ability of a NIDS to capture, process, and perform at the same level of accuracy under a given network load as it does on a quiescent network.
Gauging the capacity of a NIDS is difficult. Several variables in the characteristics of the network traffic affect the performance of a NIDS. The last year has seen claims of NIDS performing at or near gigabit speeds. In every case, however, further investigation by reasonably sophisticated NIDS practitioners revealed critical flaws in the testing methodology.
Most NIDS employ a mix of all these methods. Some of the metrics discussed in this paper do not apply to all the technologies. Choosing metrics and test methods valid for all NIDS in existence is impossible. Choosing a broad set of metrics that is generally applicable to most NIDS is possible. This paper focuses on two questions: What are the proper metrics for performance testing? What testing methodology best evaluates these metrics? The testing metrics and methodology described are intended for use on a NIDS located at the edge of an internal network functioning near the firewall or border router. The focus is further refined by looking at how these metrics apply to a NIDS using a combination of the technologies listed previously. However, many of the metrics and methods included also apply to the performance of a NIDS inside the core of an enterprise network and to a NIDS employing other methods of detecting intrusions such as pure anomaly-based systems.
Most NIDS capacity benchmarks to date have been run by independent third parties either for publication in a trade magazine or at the request of the vendor for inclusion in marketing material. The test methodologies were developed based on experiences in the router- and switch-testing arenas. These tests are generally not adequate for the purposes of developing a NIDS performance profile because the benchmark tests for switch and router capacity often forward packets of various sizes without regard for any protocol above IP or even the validity of the packets used. Although routers and switches are typically not concerned with Layer 4 and above, NIDS may discard packets that are not interesting. A NIDS also needs to look much deeper in a packet than a switch or a router to follow Layer 4 and above. For example, a NIDS may discard TCP streams that are not opened using a valid three-way handshake. If a switch or router test is used, most of the traffic might be ignored. The NIDS then performs very little deep packet inspection.
Because the results of a NIDS performance test based on these types of test methodologies are often skewed in the favor of the vendor, a consumer may believe these results are valid for a deployment and encounter strikingly different performance characteristics after the NIDS is fielded on the network.
For example, NIDS tests to date from Mier Communications are flawed [1]. Mier Labs concluded that two different NIDS could perform at gigabit line rates. Although the lab report is technically accurate, there is no mention anywhere in the report that a test using TCP port 0 packets was not representative of the performance most consumers would experience. Using this type of testing methodology for NIDS is, therefore, flawed. Marcus Ranum also mentions a few other flawed testing methodologies in "Experiences Benchmarking Intrusion Detection Systems" [2]. Ranum does an excellent job explaining why benchmarking NIDS is difficult.
Defining metrics for any type of testing is difficult. For example, the materials used in bridge construction need to be tested to ensure the integrity of the bridge. A common approach to defining metrics for these types of tests involves asking the bridge engineer to identify the weak spots in the bridge design. Where is the most stress concentrated? What has the highest potential for failure when the load exceeds design specifications? The same premise can be used for defining the metrics used for testing the performance of NIDS. The stress points for most protocol-decode and pattern-matching NIDS are the same.
All computing devices have a similar list of fixed resources. Only so many cycles are available on the CPU, and only so many bits are available to store all the program code, state information, and runtime conditions. In addition, only so many transmission cycles are available on the various buses of the computing architecture. The upper limit for system performance is approached as one or more of these resources approaches its upper limit. Therefore, the test methods should include metrics that apply to these resources. The following resources are the most important to the performance of NIDS:
NIDS products monitor network traffic, and NIDS packet capture architectures impose physical limits on the type of traffic that can be observed. For example, a NIDS built with a standard Gigabit Ethernet card cannot observe all minimum-sized Ethernet frames sent at line rate. Minimum-sized frames at gigabit line rate equate to approximately 1.49 million packets per second, and no currently shipping standard gigabit network adapter can handle many more than 700 to 800 thousand packets per second. However, if the NIDS uses dedicated hardware or network processing units (NPUs), then it can capture at the full line rate. The host software platform for the NIDS can also have a significant impact on its ability to capture packets. Many NIDS running on a host operating system do not bind the monitoring interface to the operating system's IP stack, and their architectures include custom network interface drivers.
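The minimum-frame line-rate figure can be checked with a little arithmetic from the Ethernet framing overheads (a quick sanity check using constants from the Ethernet specification, not vendor data):

```python
# Frame rate at gigabit line rate, accounting for per-frame overhead.
LINE_RATE_BPS = 1_000_000_000  # 1 Gbps
PREAMBLE = 8    # preamble + start-of-frame delimiter, bytes
IFG = 12        # 96-bit interframe gap, bytes

def max_frames_per_second(frame_bytes: int) -> int:
    """Frames per second at line rate, including preamble and interframe gap."""
    slot_bits = (frame_bytes + PREAMBLE + IFG) * 8
    return LINE_RATE_BPS // slot_bits

print(max_frames_per_second(64))    # minimum frames: ~1.49 million pps
print(max_frames_per_second(1518))  # maximum frames: ~81 thousand pps
```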
Many of the more recently published NIDS performance tests actually tested only the interface bandwidth of the NIDS. This type of testing has limited use because it shows only the upper limit of how the NIDS performs if no other fixed resources are used. Typically this type of test lets the consumer understand how quickly a NIDS can ignore traffic that is uninteresting. Knowing the performance of only the packet capture architecture is useful, but it does not provide the information needed to quantify the performance of the entire system.
Packet flow architecture is the overall architecture for data flow within the NIDS and includes the packet capture architecture. The metrics used in the packet capture architecture section are also valid for the packet flow architecture, assuming the packets used are of interest to the NIDS, that they cause deep packet inspection, and that they make proper use of protocols of interest to the NIDS. Using Hypertext Transfer Protocol (HTTP) traffic to test the packet flow architecture is generally a good choice. For a NIDS that employs some method of protocol decode and state aware reassembly, HTTP traffic flows through a major portion of the packet flow architecture.
In addition, not all packets take the same amount of time to process. Buffering in the packet flow architecture allows a NIDS to recover from the packets that take a long time to inspect. Packet buffering is an important feature for reducing the number of dropped packets. Therefore, when testing a NIDS with buffering in the packet flow architecture, it is important to test with sustained rates for a length of time to ensure that the buffer is not inflating performance.
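How long a buffer can mask an overload follows from simple arithmetic; the buffer size and rates below are illustrative assumptions, not measurements of any product:

```python
def absorb_seconds(buffer_pkts: int, arrival_pps: int, service_pps: int) -> float:
    """Seconds until a FIFO packet buffer overflows when the arrival rate
    exceeds the inspection rate (infinity when it never does)."""
    excess = arrival_pps - service_pps
    return float('inf') if excess <= 0 else buffer_pkts / excess

# A 100,000-packet buffer with inspection at 500 kpps hides a 600 kpps
# burst for a full second; a sustained multi-minute test does not.
print(absorb_seconds(100_000, 600_000, 500_000))  # 1.0
```

This is why a short burst test can report a drop-free result that a sustained run at the same rate would contradict.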
Any NIDS that performs TCP state tracking, IP fragment reassembly, and detection of sweeps and floods must keep track of the state of the traffic on the network. Many of the signatures used in this type of NIDS are based on a certain threshold of events that occur within a specified period of time. The only way to assess event thresholds is to keep a count until the time has expired or the threshold has been exceeded. A NIDS must track information about both the source and destination of the packets, the state of TCP connections, and the type and number of packets going to or from the hosts. All this information must be stored somewhere within the NIDS. The storage medium for this information is the state database.
Database benchmarking is very mature, and the database industry understands the weak points in database-like applications. The most important metrics include the time needed to insert, delete, and search through data, and how those transaction times scale with the size of the data set and the frequency of transactions. How do those database metrics correlate to NIDS metrics?
State information must be inserted into the state database as new network hosts and unique connections are observed. The state information is typically removed from the database after either an alarm event has occurred or some time has elapsed. State database searches are conducted any time the incoming packet may need to refer to prior state information. Depending on the types of signatures used, searching the database may need to be done for each packet.
The size of the state database derives from the unique hosts and connections that the NIDS considers interesting and maintains prior information about. The following metrics directly affect the performance of the state database:
The connection duration in number of packets or time can be used as an indirect metric for testing the performance of the state database because the duration of a session is related to the number of connections over time as well as the number of concurrent connections. Therefore, to accurately measure the capacity of a NIDS, one must vary the number of new connections per second, the number of simultaneous open connections, and the total number of unique hosts that the NIDS must track. The ability of the NIDS to handle network loads varies as these variables are adjusted.
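The relationship between these variables is Little's law: the steady-state number of open connections equals the new-connection rate multiplied by the average connection duration. A minimal sketch:

```python
def expected_concurrent(new_conns_per_sec: float, avg_duration_s: float) -> float:
    """Little's law: steady-state open connections = arrival rate x duration."""
    return new_conns_per_sec * avg_duration_s

# 2500 new connections/s held open for ~4 s (as in the delayed-server
# test later in the paper) implies ~10,000 concurrent sessions to track.
print(expected_concurrent(2500, 4))  # 10000
```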
Memory bandwidth and memory latency are large factors in the performance of a NIDS. Much of the memory bandwidth use and latency are caused by access to memory while inspecting the packets. Different NIDS architectures exhibit different use patterns for memory. A NIDS that relies solely on regular expression matching consumes the most bandwidth and induces the most latency in the system. Inspecting each character in the packet payload and advancing a regular expression state are expensive operations.
Protocol analysis helps reduce the number of bytes that must be inspected. Every NIDS does some type of protocol decode, even if it is limited to just the IP header. Many of the commercial NIDS decode most Layer 7 protocols and perform regular expression inspection on only a subset of the entire packet.
The size of the packets, therefore, plays an important role in determining the capacity of a NIDS. Testing performance with the smallest possible average packet size reduces the amount of time available per packet for inspections. Increasing the average packet size allows more time for inspection and increases the use of memory.
The average packet size of typical traffic on the Internet is about 450 to 550 bytes per packet [3,4]. However, the average on some networks is much larger or much smaller. The average packet size is an important metric in capacity testing.
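The per-packet inspection budget follows directly from the load and the average packet size; a small sketch (the 1 Gbps load is an illustrative assumption):

```python
def per_packet_budget_ns(bandwidth_bps: float, avg_packet_bytes: float) -> float:
    """Average time (ns) available to inspect each packet at a given load."""
    pps = bandwidth_bps / (avg_packet_bytes * 8)
    return 1e9 / pps

# At a full gigabit, shrinking the average packet size shrinks the budget:
print(per_packet_budget_ns(1e9, 500))  # 4000.0 ns per packet
print(per_packet_budget_ns(1e9, 64))   #  512.0 ns per packet
```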
The generation of the alarm event expends CPU cycles that would otherwise be available for analysis. Additionally, the event needs to be stored in nonvolatile storage. This usually means that it must be written to disk, a relatively slow operation, or sent over a network connection. Under normal circumstances this does not affect the operation of a NIDS. However, as the rate of alarm production increases and/or the load on the network increases, alarm event production and log maintenance can have a significant effect on NIDS performance. The event generation component of a NIDS must be able to handle the events generated by the high rates of traffic. The ability of the NIDS to notify the user varies as the alarm event rate is adjusted.
The metric used to test this component of a NIDS is simply the number of alarms per second. Tools such as stick and Nessus easily set off alarm events in NIDS products. In addition, packet generators can be used to generate single packets that cause an alarm event. Testing the alarm channel does not require the traffic causing the alarms to originate from real hosts.
With the major stress points of a NIDS identified, it is now possible to focus on defining the metrics that can be used to quantify the capacity of a NIDS. Table 1 defines the test metrics and how they are related to the use of the fixed resources described in the section "Fixed Resource Limitations."
Table 1. NIDS Test Metrics and Corresponding Resources Used
In practice, all these metrics are related. For example, a network does not typically raise only the number of hosts without also increasing the packets and bytes per second. However, to the extent possible, the tests should attempt to stress only one component or metric of the NIDS at a time. Prospective consumers of a NIDS can then use the test results for each metric and a profile of their networks to determine if the NIDS is even capable of sustaining inspection of the traffic.
Developing the tests to quantify the metrics for the potential weak points of a NIDS is a significant task. This section explores traffic mix selection and simplification, potential problems with a test network, and a set of tests and the intended stress points under test.
Many of the metrics defined in the previous section are directly and indirectly derived from the network traffic. What is the correct mix of traffic to measure? The correct mix for each user is one that best matches the traffic where that user plans to deploy the NIDS. Obviously a test cannot be designed that contains all the traffic mixes for all potential consumers of NIDS technology. However, after the NIDS industry agrees on a standard methodology for testing the stress points, consumers could profile their traffic mix and get a reasonable idea about how well the various products perform.
How do we define the tests to be used? Studies have been performed that describe the mix of traffic seen on the major network trunks. "The Nature of the Beast: Recent Traffic Measurements from an Internet Backbone" [3] and "Trends in Wide Area IP Traffic Patterns" [4] are two good resources for defining a general test. The information in these papers is from 1998 and 2000. For more recent data, Table 2 includes results from profiling three data sets from 2002. A major university, a major U.S. government site, and a large online retailer provided the data sets. Although this data is not necessarily representative of a traffic mix found on a corporate network, it is representative of the mix that would be seen at the edge of most large networks. The metrics in this traffic provide a good starting point for the mix of the loading traffic for the test.
Table 2. Traffic Metrics from Three Customer Sites (Site 1 is a major university, Site 2 is a major U.S. government site, and Site 3 is an online retailer. In the Layer 4 OTHER field, no individual protocol grouped into this field consisted of more than 3 percent of the total traffic.)
Table 2 shows the same general traffic characteristics found in the CAIDA data [3,4]. Currently there are no data sets for networks running at or near gigabit-per-second speeds. Further research is needed in this area. Because of the limited scope of this paper, it is assumed that the traffic mix will scale evenly with the increased bandwidth.
With a general understanding of the type of traffic found at the edge of protected networks, it is now possible to explore crafting tests that quantify the metrics at a level useful for NIDS consumers.
Typically the HTTP protocol is not blocked outbound from firewalls and is a dominant portion of the traffic on the Internet. The servers and clients that implement HTTP have garnered the attention of many crackers and security professionals. HTTP-based signatures make up the majority of signatures in NIDS. Fortunately this situation allows for simplification of the testing in the general case. When HTTP traffic is used to test the capacity of a NIDS, it obviously stresses a large portion of the packet flow architecture.
The use of HTTP traffic has a few other advantages as well. HTTP traffic is relatively easy and inexpensive (in time and money) to produce. Web server testing tools, such as WebBench, can be used to generate traffic and reproduce tests inexpensively. In addition, several vendors sell network test equipment that uses real TCP/IP stack implementations instead of "canned" traffic. We are most familiar with the products from Caw Networks and Antara. For the test conducted in the section "Example Test Results," Caw Networks WebReflector and WebAvalanche products were used.
Most NIDS shipping today perform some level of protocol decode and state tracking. Therefore, it is very important that any load traffic exhibit the same characteristics found on a consumer's network. Most of the products that allow for high-speed traffic generation have some critical flaws that make them unsuitable for testing NIDS at high speeds. Some of these issues include:
- Inability to create valid checksums for all layers at high speeds
- Inability to vary the IP addresses in a more random manner than a straight increment
- Inability to maintain state of TCP connections and issue resets if packets are dropped
- Inability to play a large mix of traffic due to the limitations in buffer size for the transmitters
These issues do not plague the replay devices at slower speeds. At high speeds, the buffer size for the replay devices prohibits large traffic samples. Using replay requires the tester to use more replay interfaces. Adding more source interfaces when testing a high aggregate rate presents problems for the test network, as described in the next section. Therefore, using test devices that use real TCP/IP implementation to generate the traffic is preferred.
The typical network setup for testing a gigabit NIDS includes several traffic generators, an attack network, and a victim network. All these devices are typically connected to a switch, and the traffic is then port mirrored, spanned, or copy captured to the NIDS. For high-speed tests, the interface for the NIDS is Gigabit Ethernet. The inter-packet arrival gap on Gigabit Ethernet is 96 ns. Inter-packet arrival gap becomes important as more interfaces are added to the switch. Each interface used to generate traffic, regardless of interface speed, increases the chances that traffic destined for the NIDS will be dropped at the switch.
Imagine that ten Fast Ethernet ports, each generating approximately 80 Mbps of traffic, are used during testing. Eventually several of these ports begin to transmit packets all within a few nanoseconds of each other. Because each of these transmitted packets must be copied to the NIDS, the switch forwards each of the packets to the port where the NIDS is connected. Unfortunately the NIDS is using Ethernet, which demands the 96-ns delay between packets. Because several packets are arriving at the port at very nearly the same time, the port buffer fills up and the switch port drops packets. This problem does not manifest itself if a choke point, such as a firewall or router, is used in the test network. But, if the industry tests require a router or firewall capable of the same high traffic speeds to reproduce the tests, it raises the cost of testing significantly. It is, therefore, better to use fewer ports for generating traffic on the switch when testing.
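The congestion at the mirror port can be sketched with simple arithmetic; the frame size and port count below are illustrative assumptions, not measurements:

```python
# When several generator ports transmit at nearly the same instant, the
# mirror (SPAN) port must serialize all of those frames back-to-back.
PREAMBLE_IFG = 20  # preamble + interframe gap overhead per frame, bytes

def mirror_backlog_bytes(frame_bytes: int, n_simultaneous: int) -> int:
    """Bytes queued at the mirror port for one simultaneous burst: every
    frame but the one currently being serialized waits in the port buffer."""
    return (n_simultaneous - 1) * (frame_bytes + PREAMBLE_IFG)

# Ten ports bursting 1518-byte frames together queue ~13.8 KB at once;
# bursts that repeat faster than the port can drain overflow the buffer.
print(mirror_backlog_bytes(1518, 10))  # 13842
```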
No single test provides all the information needed to quantify the capacity of a NIDS. A suite of tests is used to quantify each portion of a NIDS. Only when looking at the output from all these tests can a consumer infer performance on his network.
Establishing the Peak—Testing the network interface bandwidth establishes the peak for packet capture for the NIDS. The NIDS is never able to perform above this absolute peak on any further tests. Testing the network interface bandwidth is simple. Choose a packet of no interest to the NIDS and resend it at a high rate until the NIDS cannot count all the packets. Repeat the tests for minimum-sized packets and for maximum-sized packets. This reveals the maximum packets per second and the maximum bytes per second, respectively. A good example packet for this test is a UDP packet with ports set to 0 (assuming the NIDS does not send alarms on such a packet).
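The rate ramp can be automated as a binary search over send rates. The sketch below is illustrative; `counts_all_packets` is a hypothetical stand-in for one full test run (send the uninteresting packet at the given rate for a fixed interval, then compare the sensor's counter with the number sent) and is not part of any real tool:

```python
def find_capture_limit(counts_all_packets, low_pps: int, high_pps: int,
                       tol_pps: int = 1000) -> int:
    """Binary search for the highest rate (pps) at which the sensor still
    counts every packet, to within tol_pps."""
    while high_pps - low_pps > tol_pps:
        mid = (low_pps + high_pps) // 2
        if counts_all_packets(mid):
            low_pps = mid   # no drops at this rate; push higher
        else:
            high_pps = mid  # drops observed; back off
    return low_pps

# Illustration against a simulated sensor that drops above 750 kpps:
limit = find_capture_limit(lambda rate: rate <= 750_000, 0, 1_488_095)
print(limit)  # within 1000 pps of 750,000
```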
The Alarm Channel—Testing the alarm channel capacity of a NIDS can be accomplished with a similar test. Choose a packet that causes an alarm. The "Ping of Death" is a good packet for this test. Send this packet at different rates of speed and check packet and alarm counts. Some NIDS buffer alarms when under a heavy load, so a quiescent period after the packets are sent may be necessary before collecting counts.
Stressing the State Database—Inserting, searching, and deleting the state information from the state database are all potential bottlenecks for the NIDS. Varying the IP addresses of traffic requiring state tracking adds load to the database. Opening a large number of TCP connections causes the state database to contain many records. The search performance for a database is affected by the size of the data set. For this type of test, open a large number of concurrent TCP connections and then run one of the more general tests described. The open connections will stress both the database and the overall system architecture.
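A toy model of such a state database illustrates the three operations under discussion; the age-based eviction policy and the 4-tuple key are assumptions for illustration, not a description of any shipping product:

```python
from collections import OrderedDict

class StateTable:
    """Toy NIDS state database keyed by the TCP 4-tuple, with entries
    evicted once they exceed a maximum age."""
    def __init__(self, max_age_s: float = 60.0):
        self.max_age_s = max_age_s
        self.entries = OrderedDict()  # key -> (created, state), oldest first

    def insert(self, key, state, now: float):
        self.entries[key] = (now, state)

    def search(self, key):
        hit = self.entries.get(key)
        return None if hit is None else hit[1]

    def expire(self, now: float) -> int:
        """Delete entries older than max_age_s; returns the number removed."""
        removed = 0
        while self.entries:
            key, (created, _) = next(iter(self.entries.items()))
            if now - created < self.max_age_s:
                break
            del self.entries[key]
            removed += 1
        return removed

table = StateTable(max_age_s=60.0)
table.insert(("10.0.0.1", 1025, "10.0.0.2", 80), "SYN_SEEN", now=0.0)
table.insert(("10.0.0.1", 1026, "10.0.0.2", 80), "SYN_SEEN", now=30.0)
print(table.search(("10.0.0.1", 1025, "10.0.0.2", 80)))  # SYN_SEEN
print(table.expire(now=65.0))  # 1 entry aged out
```

Each of the test's stressors maps onto one of these operations: varying IP addresses drives inserts, per-packet lookups drive searches, and aging connections drive maintenance deletes.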
General Tests with Configurable Metrics—Establishing a baseline of traffic and then varying one of the metrics can expose a NIDS weakness in certain environments. Because each specific component of the NIDS is being tested, it is not necessary to ensure that the traffic looks exactly like the traffic of the end user. If users can extract the same metrics from their traffic, then performance on their networks can be inferred from test results using simpler data. In the example tests found in the section "Example Test Results," the traffic mix consists of only HTTP. HTTP rides on TCP, which requires state tracking. Depending on the level of protocol decoding used, HTTP may also require state tracking. In addition, most of the signatures found in a NIDS are HTTP signatures. Therefore, using an HTTP-only traffic mix still stresses the NIDS in many areas. The example tests in the section "Example Test Results" could have also included additional protocols such as Domain Name System (DNS), SMTP, and NNTP. However, due to time and space constraints these protocols were omitted for this test.
Table 3 shows the characteristics of the traffic mix when using Caw Networks WebReflector and WebAvalanche products. This test equipment allows for high-speed testing using only two ports on the switch for generating traffic. The Caw Networks equipment also has the ability to randomly drop packets. The dropped packets cause its systems to retransmit, so the traffic looks more like real-world traffic. By simply varying the HTTP transaction size, many characteristics of the traffic can be manipulated. Using HTTP transaction size is just one example. The maximum segment size (MSS)¹ for the server or client can also be varied to affect the characteristics in other ways.
Table 3. Traffic Mix Characteristics when Using Caw Networks WebReflector and WebAvalanche to Generate HTTP Traffic for General Stress Tests
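A rough model shows how the transaction-size knob moves the average packet size; the packet counts, the 200-byte request payload, and the single-response assumption below are illustrative guesses, not a description of the Caw equipment:

```python
import math

def avg_packet_bytes(response_bytes: int, mss: int = 1460,
                     overhead: int = 40) -> float:
    """Mean IP packet size over one modeled HTTP connection: handshake,
    one small request, the response split into MSS-sized segments, and
    teardown. overhead = IP + TCP headers; control packets carry no payload."""
    data_pkts = math.ceil(response_bytes / mss)
    request_pkts = 1
    control_pkts = 5  # SYN, SYN/ACK, ACK, FIN exchange (approximate)
    total_pkts = data_pkts + request_pkts + control_pkts
    total_bytes = response_bytes + 200 + total_pkts * overhead  # ~200 B request
    return total_bytes / total_pkts

# Larger transactions raise the average packet size toward the MSS:
print(round(avg_packet_bytes(10_000)))  # 825
```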
Using the lessons from the previous section, this section explores two simple tests that evaluate some aspects of the capacity of a NIDS. Other tests are then suggested as a way to further refine the knowledge gained from the example tests.
A Block Diagram Showing the Network Layout for the Example Tests (The WebReflector acts as all the Web servers. The WebAvalanche acts as all the Web clients. The Cisco Catalyst(r) Switch copy captures the traffic to the NIDS.)
The first test baselines the capture efficiency of a NIDS in a pure HTTP environment. A full analysis load is assumed by turning on all default signatures. However, the traffic generated does not trigger alarms or events. The number of client hosts on the network is fixed at 5080, and the number of servers is fixed at 2. In this first test, TCP sessions are allowed to run to completion as quickly as possible; therefore, the number of simultaneous open sessions is fixed at fewer than 30 for all cases. The average packet size is varied through manipulation of the HTTP transaction size. This also results in variations in the average number of packets per TCP connection, the packets per second (kpps), and the overall bandwidth used. The final variable manipulated is the number of new TCP connections per second. Each test is run for three minutes before capacity measurements are made. The results of the baseline test are given in Table 4.
Table 4. Results for Baseline Test (Traffic was HTTP only with 5080 client IP addresses and 2 server IP addresses. The test runs for 3 minutes, with no server delay.)
The second test introduces simultaneous open TCP sessions. A 4-second delay is introduced on the server response to cause sessions to remain open. All other test variables remain constant. The bandwidth consumed, the packets per second, and the average packet size at each data point are somewhat affected by the open connections. In the absence of a correlation study, it is unclear if these factors are statistically significant. For the purposes of this paper, they are assumed to be insignificant. The results from the open connection test are given in Table 5.
Table 5. Results for Open Connection Test (Traffic was HTTP only with 5080 client IP addresses and 2 server IP addresses. The test runs for 3 minutes, with a 4-second forced server delay.)
The most significant variations in capture efficiency are in the 5000-connections-per-second tests. Capture efficiency also varies in the 2500-connections-per-second tests, but that variation does not appear to be statistically meaningful.
The two tests differ only in the number of concurrent open TCP sessions. This implies that the state database is the component under stress. With only these results it is not possible to precisely identify which operation within the database is causing the drop in capacity. The traffic may have exceeded the insertion rate of the database, its search capacity, or the ability of the system to perform maintenance on the database and delete aging entries.
Regardless, it is apparent that a consumer whose network has an average number of open TCP sessions near 10,000, an average new TCP connections-per-second rate of no more than 2500 per second, and a bandwidth consumption less than approximately 400 Mbps could field the tested NIDS with confidence that the capture efficiency would be at or near 100 percent.
The example tests do not provide enough information on which to base a full-confidence decision. Nevertheless, by using the same methodology for developing further tests, the NIDS industry or independent labs could establish a suite of tests that could provide quantifiable results for each of the different stress points.
For example, the total number of database insertions per second for the database can be quantified by establishing a test that runs at a very low rate for all other stress points of the NIDS. The traffic must be crafted such that the database needs to start maintaining state on many different key values. One possible way of shaping the traffic is to use a packet generator and generate a valid TCP session with a full three-way handshake. The rate at which the connections are introduced must be ramped upward until the NIDS starts dropping traffic. Because these simple TCP connections consist of small packets, the bandwidth should remain low. Results need to be cross checked with the raw packet capture architecture evaluation results as described in the section "Potential Test Suite" to ensure that it is the database inserts and not the packets-per-second limit that has been reached.
As the NIDS industry matures, standardized testing will become prevalent. Developing these tests can be done using the same concepts of standardized testing found in other industries. Hopefully, the information found in this paper will stimulate the development of standardized tests, providing the NIDS consumer the information that is currently missing. The same techniques used for capacity testing can be extended to other performance areas such as false positive ratios.
1. Mier Communications: Test report for ManHunt from Recourse Inc. and test report for Intrusion.com's NIDS. At:
2. Ranum, M.: Experiences Benchmarking Intrusion Detection Systems. At:
3. Claffy, K., Miller, G., and Thompson, K.: The Nature of the Beast: Recent Traffic Measurements from an Internet Backbone. At:
4. McCreary, S., and Claffy, K.: Trends in Wide Area IP Traffic Patterns: A View from Ames Internet Exchange. At:
1 Used in TCP to specify the maximum amount of TCP data in a single IP datagram that the local system can accept. The MSS is typically the maximum transmission unit of the outgoing interface minus 40 bytes for the IP and TCP headers.