Guest

High Availability

Performance Management: Best Practices White Paper

Document ID: 15115

Updated: Jan 16, 2007

   Print

Introduction

Performance management involves optimization of network service response time and management of the consistency and quality of individual and overall network services. The most important service is the need to measure the user/application response time. For most users, response time is the critical performance success factor. This variable shapes the perception of network success by both your users and application administrators.

Background Information

Capacity planning is the process by which you determine requirements for future network resources in order to prevent a performance or availability impact on business-critical applications. In the area of capacity planning, the network baseline (CPU, memory, buffers, in/out octets, etc.) can affect response time. Therefore, keep in mind that performance problems often correlate with capacity. In networks, this is typically bandwidth and data that must wait in queues before it can be transmitted through the network. In voice applications, this wait time almost certainly impacts users because factors such as delay and jitter affect the quality of the voice call.

Another major issue that complicates performance management is that although high network availability is mission-critical for both large enterprise and service provider networks, the tendency is to seek short-term economic gains at the risk of (often unforeseen) higher costs in the long run. During every budget cycle, network administrators and project implementation personnel struggle to find a balance between performance and fast implementation. Further, network administrators face challenges that include rapid product development in order to meet narrow market windows, complex technologies, business consolidation, competing markets, unscheduled downtime, lack of expertise, and often insufficient tools.

In light of these challenges, how does performance fit within the network management framework? The primary function of an ideal network management system is to optimize the operational capabilities of a network. Once you accept this as the ultimate goal for network management, then the focus of network management is to keep network operation at peak performance.

An ideal network management system includes these principle operations:

  • Informs the operator of impending performance deterioration.

  • Provides easy alternative routing and workarounds when performance deterioration or failure takes place.

  • Provides the tools to pinpoint causes of performance deterioration or failure.

  • Serves as the main station for network resiliency and survivability.

  • Communicates performance in real time.

Based on this definition for an ideal system, performance management becomes essential to network management. These performance management issues are critical:

  • User performance

  • Application performance

  • Capacity planning

  • Proactive fault management

It is important to note that with newer applications like voice and video, performance is the key variable to success and if you cannot achieve consistent performance, the service is considered of low value and fails. In other cases, users simply suffer from variable performance with intermittent application timeouts that degrade productivity and user satisfaction.

This document details the most critical performance management issues, which include critical success factors, key performance indicators, and a high-level process map for performance management. It also discusses the concepts of availability, response time, accuracy, utilization, and capacity planning and includes a short discussion on the role of proactive fault analysis within performance management and the ideal network management system.

Critical Success Factors

Critical success factors identify the requirements for implementation best practices. In order to qualify as a critical success factor, a process or procedure must improve availability or the absence of the procedure must decrease availability. In addition, the critical success factor should be measurable so that the organization can determine the extent of their success.

Note: See Performance Management Indicators for detailed information.

These are the critical success factors for performance management:

  • Gather a baseline for both network and application data.

  • Perform a what-if analysis on your network and applications.

  • Perform exception reporting for capacity issues.

  • Determine the network management overhead for all proposed or potential network management services.

  • Analyze the capacity information.

  • Periodically review capacity information for both network and applications, as well as baseline and exception.

  • Have upgrade or tuning procedures set up to handle capacity issues on both a reactive and long-term basis.

Indicators for Performance Management

Performance indicators provide the mechanism by which an organization can measure critical success factors. Performance indicators for performance planning include:

  • Document the network management business objectives. This could be a formal concept of operations for network management or a less formal statement of required features and objectives.

  • Create detailed and measurable service level objectives.

  • Provide documentation of the service level agreements with charts or graphs that show the success or failure of how these agreements are met over time.

  • Collect a list of the variables for the baseline, such as polling interval, network management overhead incurred, possible trigger thresholds, whether the variable is used as a trigger for a trap, and trending analysis used against each variable.

  • Have a periodic meeting that reviews the analysis of the baseline and trends.

  • Have a what-if analysis methodology documented. This should include modeling and verification where applicable.

  • When thresholds are exceed, develop documentation on the methodology used to increase network resources. One item to document is the time line required to put in additional WAN bandwidth and a cost table.

Performance Management Process Flow

These steps provide a high-level process flow for performance management:

  1. Develop a Network Management Concept of Operations

    1. Define the Required Features: Services, Scalability and Availability Objectives

    2. Define Availability and Network Management Objectives

    3. Define Performance SLAs and Metrics

    4. Define SLAs

  2. Measure Performance

    1. Gather Network Baseline Data

    2. Measure Availability

    3. Measure Response Time

    4. Measure Accuracy

    5. Measure Utilization

    6. Capacity Planning

  3. Perform a Proactive Fault Analysis

    1. Use Thresholds for Proactive Fault Management

    2. Network Management Implementation

    3. Network Operation Metrics

Develop a Network Management Concept of Operations

Before you define the detailed performance and capacity variables for a network, you must look at the overall concept of operation for network management within your organization. When you define this overall concept, it provides a business foundation upon which you can build precise definitions of the features desired in you network. If you fail to develop an operational concept for network management, it can lead to a lack of goals or goals that constantly shift due to customer demands.

You normally produce the network management concept of operations as the first step in the system definition phase of the network management program. The purpose is to describe the overall desired system characteristics from an operational standpoint. The use of this document is to coordinate the overall business (nonquantitative) goals of network operations, engineering, design, other business units, and the end users. The focus of this document is to form the long range operational planning activities for network management and operation. It also provides guidance for the development of all subsequent definition documentation, such as service level agreements. This initial set of definitions obviously cannot focus too narrowly on the management of specific network problems, but on those items that emphasize importance to the overall organization and in relationship to the costs that must be managed as well. Some objectives are:

  • Identify those characteristics essential to efficient use of the network infrastructure.

  • Identify the services/applications that the network supports.

  • Initiate end-to-end service management.

  • Initiate performance-based metrics to improve overall service.

  • Collect and distribute performance management information.

  • Support strategic evaluation of the network with feedback from users.

In other words, the network management concept of operations should focus on the overall organizational goals and your philosophy to meet those goals. The primary ingredients consist of the higher level definitions of the mission, mission objectives, system goals, organizational involvement, and overall operational philosophy.

As a network manager, you are in the position to unify often inconsistent performance expectations of your users. For instance, if the primary requirement for the network is the transfer of large files from one location to another, you want to focus on high throughput and less on the response times of interactive users. Be careful not to limit your view of performance unless you consider a variety of issues. For instance, when you test a network, look at the load levels that are used. The load is often based on very small packets and the throughput on very large packets. Either of these performance tests might produce a very positive picture, but based on your network traffic load, the tests might not present a true picture of performance. Study the network performance under as many possible workload conditions as possible and the performance documented.

Also, while many network management organizations have effective alarm techniques to notify technicians about a device failure, it is much more difficult to define and implement an assessment process for the end-to-end application performance. Therefore, while the network operations center (NOC) can respond quickly to a downed router or switch, the network conditions that might undermine network performance and affect user perception might easily go unnoticed until that perception becomes negative. However difficult, this second process can provide immense benefit to both the business organization and network management.

Finally, ensure that you do not create unrealistic expectations of your network performance. Unrealistic expectations are usually created when you misunderstand the details of networking protocols or the applications. Often times poor performance is not the fault of the network, but rather a result of poor application design. The only way to document and measure application performance is to have a baseline of the network performance prior to application installation.

Define the Required Features: Services, Scalability, and Availability Objectives

The first step of performance management, continuous capacity planning, and network design is to define the required features and/or services. This step requires that you understand applications, basic traffic flows, user and site counts, and required network services. The first use of this information is to determine the criticality of the application to the organizational goals. You can also apply this information to create a knowledge base for use in the logical design in order to understand bandwidth, interface, connectivity, configuration, and physical device requirements. This initial step enables your network architects to create a model of your network.

Create solution scalability objectives in order to help network engineers design networks that meet future growth requirements and ensure that proposed designs do not experience resource constraints due to growth or extension of the network. Resource constraints can include:

  • Overall traffic

  • Volume

  • Number of routes

  • Number of virtual circuits

  • Neighbor counts

  • Broadcast domains

  • Device throughput

  • Media capacity

Network planners should determine the required life of the design, expected extensions or sites required through the life of the design, volume of new users, and expected traffic volume or change. This plan helps to ensure that the proposed solution meets growth requirements over the projected life of the design.

When you do not investigate solution scalability, you might be forced to implement major reactive design changes. This design change can include additional hierarchy, media upgrades, or hardware upgrades. In organizations that rely on fairly precise budget cycles for major hardware purchases, these changes can be a major inhibitor to overall success. In terms of availability, networks can experience unexpected resource limitations that cause periods of nonavailability and reactive measures.

Interoperability and interoperability testing can be critical to the success of new solution deployments. Interoperability can refer to different hardware vendors, or different topologies or solutions that must mesh together during or after a network implementation. Interoperability problems can include hardware signaling up through the protocol stack to routing or transport problems. Interoperability issues can occur before, during, or after migration of a network solution. Interoperability planning should include connectivity between different devices and topology issues that might occur during migrations.

Solution comparison is the practice in which you compare different potential designs in relation to other solution requirement practices. This practice helps to ensure that the solution is the best fit for a particular environment and that personal bias does not drive the design process. Comparison can include different factors such as cost, resiliency, availability, risk, interoperability, manageability, scalability, and performance. All of these can have a major effect on overall network availability once the design is implemented. You can also compare media, hierarchy, redundancy, routing protocols, and similar capabilities. Create a chart with factors on the X-axis and potential solutions on the Y-axis help in order to summarize solution comparisons. Detailed solution comparison in a lab environment also helps to objectively investigate new solutions and features in relation to the different comparison factors.

As part of the network management concept of operations, it is essential to define the goals for the network and supported services in a way that all users can understand. The activities that follow the development of the operational concept are greatly influenced by the quality of that document.

These are the standard performance goals:

  • Response time

  • Utilization

  • Throughput

  • Capacity (maximum throughput rate)

While these measurements might be trivial for a simple LAN, they can be very difficult on a switched campus network or a multi-vendor enterprise network. When you use a well thought out concept of operations plan, each of the performance goals is defined in a measurable way. For instance, the minimum response time for application "x" is 500 Ms or less during peak business hours. This defines the information to identify the variable, the way to measure it, and the period of day on which the network management application should focus.

Define Availability and Network Management Objectives

Availability objectives define the level of service or service level requirements for a network service. This helps to ensure the solution meets end availability requirements. Define different classes of service for a particular organization and detail network requirements for each class that are appropriate to the availability requirement. Different areas of the network might also require different levels of availability. A higher availability objective might necessitate increased redundancy and support procedures. When you define an availability objective for a particular network service and measure the availability, your network organization can understand components and service levels required to achieve projected SLAs.

Define manageability objectives in order to ensure that overall network management does not lack management functionality. In order to set manageability objectives, you must understand the support process and associated network management tools for your organization. Manageability objectives should include knowledge of how new solutions fit into the current support and tool model with references to any potential differences or new requirements. This is critical to network availability since the ability to support new solutions is paramount to deployment success and to meet availability targets.

Manageability objectives should uncover all important MIB or network tool information required to support a potential network, training required to support the new network service, staffing models for the new service and any other support requirements. Often times this information is not uncovered prior to deployment and overall availability suffers as a result of the lack of resources assigned to support the new network design.

Define Performance SLAs and Metrics

Performance SLAs and metrics help define and measure the performance of new network solutions to ensure they meet performance requirements. The performance of the proposed solution might be measured with performance monitoring tools or with a simple ping across the proposed network infrastructure. The performance SLAs should include the average expected volume of traffic, peak volume of traffic, average response time, and maximum response time allowed. This information can then be used later in the solution validation section and ultimately helps determine the required performance and availability of the network.

Define SLAs

An important aspect of network design is when you define the service for users or customers. Enterprises call these service level agreements while service providers refer to it as service level management. Service level management typically includes definitions for problem types and severity and help desk responsibilities, such as escalation path and time before escalation at each tier support level, time to start work on the problem, and time to close targets based on priority. Other important factors are what service is provided in the area of capacity planning, proactive fault management, change management notification, thresholds, upgrade criteria, and hardware replacement.

When organizations do not define service levels up front, it becomes difficult to improve or gain resource requirements identified at a later date. It also becomes difficult to understand what resources to add in order to help support the network. In many cases, these resources are applied only after problems are discovered.

Measure Performance

Performance management is an umbrella term that incorporates the configuration and measurement of distinct performance areas. This section describes these six concepts of performance management:

Gather Network Baseline Data

Most corporate intranets have sufficient bandwidth. However, without adequate data, you might not be able to rule out network congestion as a contributor to poor application performance. One of the clues for congestion or errors is if the poor performance is intermittent or time-of-day dependent. An example of this condition is when performance is adequate late in the evening, but very slow in the morning and during peak business hours.

Once you have defined the network management concept of operations and defined the needed implementation data, it is necessary to gather this data over time. This type of collection is the foundation for the network baseline.

Perform a baseline of the current network prior to a new solution (application or IOS change) deployment and after the deployment in order to measure expectations set for the new solution. This baseline helps determine if the solution meets performance and availability objectives and benchmark capacity. A typical router/switch baseline report includes capacity issues related to CPU, memory, buffer management, link/media utilization, and throughput. There are other types of baseline data that you might also include, based on the defined objectives in the concept of operations. For instance, an availability baseline demonstrates increased stability/availability of the network environment. Perform a baseline comparison between old and new environments in order to verify solution requirements.

Another specialized baseline is the application baseline, which is valuable when you trend application network requirements. This information can be used for billing and/or budgeting purposes in the upgrade cycle. Application baselines can also be important in the area of application availability in relation to preferred services or qualities of service per application. Application baseline information mainly consists of bandwidth used by applications per time period. Some network management applications can also baseline application performance. A breakdown of the traffic type (Telnet or FTP) is also important for planning. In some organizations, more critical resource-constrained areas of the network are monitored for top talkers. The network administrators can use this information in order to budget, plan, or tune the network. When you tune the network, you might modify quality of service or queue parameters for the network service or application.

Measure Availability

One of the primary metrics used by network managers is availability. Availability is the measure of time for which a network system or application is available to a user. From a network perspective, availability represents the reliability of the individual components in a network.

For example, in order to measure availability, you might coordinate the help desk phone calls with the statistics collected from the managed devices. However, availability tools cannot determine all of the reasons for failure.

Network redundancy is another factor to consider when you measure availability. Loss of redundancy indicates service degradation rather than total network failure. The result might be slower response time and a loss of data due to dropped packets. It is also possible the results show up in the other areas of performance measurement such as utilization and response time.

Finally, if you deliver against an SLA, you should take into account scheduled outages. These outages could be the result of moves, adds, and changes, plant shutdowns, or other events that you might not want reported. This is not only a difficult task, but might also be a manual task.

Measure Response Time

Network response time is the time required for traffic to travel between two points. Response times slower than normal, seen through a baseline comparison or that exceed a threshold, might indicate congestion or a network fault.

Response time is the best measure of customer network use and can help you gauge the effectiveness of your network. No matter what the source of the slow response is, users get frustrated as a result of delayed traffic. In distributed networks, many factors affect the response time, such as:

  • Network congestion

  • Less than desirable route to destination (or no route at all)

  • Underpowered network devices

  • Network faults such as a broadcast storm

  • Noise or CRC errors

In networks that employ QoS-related queuing, response time measurement is important in order to determine if the correct types of traffic move through the network as expected. For instance, when you implement voice traffic over IP networks, voice packets must be delivered on time and at a constant rate in order to maintain good voice quality. You can generate traffic classified as voice traffic in order to measure the response time of the traffic as it appears to users.

You can measure response time in order to help resolve the battles between application servers and network managers. Network administrators are often presumed guilty when an application or server appears to be slow. The network administrator must prove that the network is not the problem. Response time data collection provides an indisputable means to prove or disprove that the network is the source of application troubles.

Whenever possible, you should measure response time as it appears to users. A user perceives response as the time from when they press Enter or click a button until the screen displays. This elapsed time includes the time required for each network device, the user workstation, and the destination server to process the traffic.

Unfortunately, measurement at this level is nearly impossible due to the number of users and lack of tools. Further, when you incorporate user and server response time, it provides little value when you determine future network growth or troubleshooting network problems.

You can use the network devices and servers to measure response time. You can also use tools like ICMP to measure transactions, although it does not take into account any delays introduced into a system as the upper layers process it. This approach solves the problem of network performance knowledge.

At a simplistic level, you can time the response to pings from the network management station to key points in the network, such as a mainframe interface, end point of a service provider connection, or key user IP addresses, in order to measure response time. The problem with this method is it does not accurately reflect the user perception of response time between their machine and the destination machine. It simply collects information and reports response time from the network management station perspective. This method also masks response time issues on a hop-by-hop basis throughout the network.

An alternative to server-centric polling is to distribute the effort closer to the source and destination you wish to simulate for measure. Use distributed network management pollers and implement Cisco IOS Service Assurance Agent (SAA) functionality. You can enable SAA on routers in order to measure response time between a router and a destination device such as a server or another router. You can also specify a TCP or UDP port, which forces traffic to be forwarded and directed in the same manner as the traffic it simulates.

With the integration of voice, video, and data on multiservice networks, customers implement QoS prioritization in their network. Simple ICMP or UDP measurement do not accurately reflect response time since different applications receive different priorities. Also, with tag switching, traffic routing might vary based on the application type contained in a specific packet. So an ICMP ping might receive different priorities in how each router handles it and might receive different, less efficient routes.

In this case, the only way to measure response time is to generate traffic that resembles the particular application or technology of interest. This forces the network devices to handle the traffic as they would for the real traffic. You might be able to achieve this level with SAA or through the use of third-party application-aware probes.

Measure Accuracy

Accuracy is the measure of interface traffic that does not result in error and can be expressed in terms of a percentage that compares the success rate to total packet rate over a period of time. You must first measure the error rate. For instance, if two out of every 100 packets result in error, the error rate would be 2% and the accuracy rate would be 98%.

With earlier network technologies, especially in the wide area, a certain level of errors was acceptable. However, with high-speed networks and present-day WAN services, transmission is considerably more accurate, and error rates are close to zero unless there is an actual problem. Some common causes of interface errors include:

  • Out-of-specification wiring

  • Electrical interference

  • Faulty hardware or software

Use a decreased accuracy rate to trigger a closer investigation. You might discover that a particular interface exhibits problems and decides that the errors are acceptable. In this case, you should adjust the accuracy threshold for this interface in order to reflect where the error rate is unacceptable. The unacceptable error rate might have been reported in an earlier baseline.

The variables described in this table are used in accuracy and error rate formulas:

Notation Description
ΔifInErrors The delta (or difference) between two poll cycles that collect the snmp ifInErrors object, which represents the count of inbound packets with an error.
ΔifInUcastPkts The delta between two poll cycles that collect the snmp ifInUcastPkts object, which represents the count of inbound unicast packets.
ΔifInNUcastPkts The delta between the two poll cycles that collect the snmp ifInNUcastPkts object, which represents the count of inbound non-unicast packets (multicast and broadcast).

The formula for error rate is usually expressed as a percentage:

Error Rate = (ΔifInErrors) *100

-------------------------------------

(ΔifInUcastPkts + (ΔifInNUcastPkts )

Notice that outbound errors are not considered in the error rate and accuracy formulas. That is because a device should never knowingly place packets with errors on the network, and the outbound interface error rates should never increase. Hence, inbound traffic and errors are the only measures of interest for interface errors and accuracy.

The formula for accuracy takes the error rate and subtracts it from 100 (again, in the form of a percentage):

Accuracy = 100 - (ΔifInErrors) *100

-----------------------------------------

(ΔifInUcastPkts + (ΔifInNUcastPkts)

These formulas reflect error and accuracy in terms of MIB II interface (RFC 2233) generic counters. The result is expressed in terms of a percentage that compares errors to total packets seen and sent. The error rate that results is subtracted from 100, which produces the accuracy rate. An accuracy rate of 100% is perfect.

Since the MIB II variables are stored as counters, you must take two poll cycles and figure the difference between the two (hence the Delta used in the equation).

Measure Utilization

Utilization measures the use of a particular resource over time. The measure is usually expressed in the form of a percentage in which the usage of a resource is compared with its maximum operational capacity. Through utilization measures, you can identify congestion (or potential congestion) throughout the network. You can also identify underutilized resources.

Utilization is the principle measure to determine how full are the network pipes (links). Measure CPU, interface, queuing, and other system-related capacity measurements in order to determine the extent to which network system resources are consumed.

High utilization is not necessarily bad. Low utilization might indicate traffic flows in unexpected places. As lines become overutilized, the effects can become significant. Overutilization occurs when there is more traffic queued to pass over an interface than it can handle. Sudden jumps in resource utilization can indicate a fault condition.

As an interface becomes congested, the network device must either store the packet in a queue or discard it. If a router attempts to store a packet in a full queue, the packet is dropped. Dropped packets result when traffic is forwarded from a fast interface to a slower interface. This is indicated in the formula Q = u / (1-u) where u is utilization, and Q is the average queue depth (random traffic assumed). So high utilization levels on links result in high average queue depths, which is predictable latency if you know the packet size. Some of the network-reporting vendors indicate that you can order up less bandwidth and pay less for your WAN. However, latency implications appear when you run WAN links at 95% utilization. Furthermore, as networks are migrated to VoIP, the network administrators might need to change their policies and run WAN links at approximately 50% utilization.

When a packet is dropped, the higher layer protocol might force a retransmit of the packet. If several packets are dropped, excessive retry traffic can result. This type of reaction can result in backups on devices further down the line. In order to resolve this issue, you might set different degrees of thresholds.

The primary measure used for network utilization is interface utilization. Use the formulas described in this table based on whether the connection you measure is half-duplex or full duplex:

Notation Description
ΔifInOctets The delta (or difference) between two poll cycles that collect the snmp ifInOctets object, which represents the count of inbound octets of traffic.
ΔifOutOctets The delta between two poll cycles that collect the snmp ifOutOctets object which represents the count of outbound octets of traffic.
ifSpeed The speed of the interface as reported in the snmp ifSpeed object. Note that ifSpeed might not accurately reflect the speed of a WAN interface.

Shared LAN connections tend to be half-duplex mainly because contention detection requires that a device listen before it transmits. WAN connections are typically full duplex because the connection is point to point; both devices can transmit and receive at the same time since they know there is only one other device that shares the connection.

Since the MIB II variables are stored as counters, you must take two poll cycles and figure the difference between the two (hence the Delta used in the equation).

For half-duplex media, use this formula for interface utilization:

(ΔifInOctets + ΔifOutOctets) * 8 * 100

----------------------------------------------------

(number of seconds in Δ) * ifSpeed

For full-duplex media, the utilization calculation is more complex. For example, with a full T-1 serial connection, the line speed is 1.544 Mbps. This means that a T-1 interface can both receive and transmit 1.544 Mbps for a combined possible bandwidth of 3.088 Mbps.

When you calculate interface bandwidth for full-duplex connections, you can use this formula in which you take the larger of the in and out values and generate a utilization percentage:

max(ΔifInOctets, (ΔifOutOctets) * 8 * 100

-----------------------------------------

(number of seconds in Δ) * ifSpeed

However, this method hides the utilization of the direction that has the lesser value and provides less accurate results. A more accurate method is to measure the input utilization and output utilization separately, such as:

Input Utilization = ΔifInOctets *8 * 100

-------------------------------------

(number of seconds in Δ) * ifSpeed

And

Output Utilization = ΔifOutOctets *8 * 100

------------------------------------

(number of seconds in Δ) * ifSpeed

While these formulas are somewhat simplified, they do not take into consideration overhead associated with a particular protocol. More precise formulas exist to handle the unique aspects of each protocol. As an example, RFC 1757 contains Ethernet utilization formulas that take into consideration packet overhead. However, the high availability team has found that the general formulas presented here can be used reliably across both LAN and WAN interfaces in most cases.

Capacity Planning

As stated earlier, capacity planning is the process in which you determine the likely future network resource requirements to prevent a performance or availability impact on business-critical applications. Refer to the Capacity and Performance Management: Best Practices White Paper for more detailed information on this topic.

Perform a Proactive Fault Analysis

Proactive fault analysis is essential to performance management. The same type of data that is collected for performance management can be used for proactive fault analysis. However, the timing and use of this data is different between proactive fault management and performance management.

Proactive fault management is the way that the ideal network management system can achieve the goals you determined. The relation to performance management is through the baseline and the data variables that you use. Proactive fault management integrates customized events, an event correlation engine, trouble ticketing, and the statistical analysis of the baseline data in order to tie together fault, performance, and change management in an ideal, effective network management system.

Where performance data polling is normally accomplished every 10, 15, or even 30 minutes, recognition of a fault condition must be at a much shorter time interval. One method of proactive fault management is through the use of RMON alarms and event groups. You can set thresholds on your devices that are not polled by external devices so the thresholds are much shorter. Another method, which is not covered in this document, is through the use of a distributed management system that enables polling at a local level with aggregation of data at a manager of managers.

Use Thresholds for Proactive Fault Management

Thresholding is the process in which you define points of interest in specific data streams and generate events when thresholds are triggered. Use your network performance data to set those thresholds.

There are several different types of thresholds, some of which are more applicable to certain types of data. Thresholds are only applicable to numeric data so convert any textual data into discrete numeric values. Even if you do not know all of the possible text strings for an object, you can still enumerate the "interesting" strings and assign all other strings to a set value.

There are two classes of thresholds for the two classes of numeric data: continuous and discrete. Continuous thresholds apply to continuous or time series data such as data stored in SNMP counters or gauges. Discrete thresholds apply to enumerated objects or any discrete numeric data. Boolean objects are enumerated values with two values: true or false. Discrete data can also be called event data because events mark the transition from one value to the next.

Continuous thresholds can trigger events when the time series object crosses the specified value of the threshold. The object value either rises above the threshold or falls below it. It can also be useful to set separate rising and falling thresholds. This technique, known as a hysteresis mechanism, helps reduce the number of events generated from this class of data. The hysteresis mechanism works to reduce the volume of events generated by thresholds on rapidly varying time-series data. This mechanism can be used with any threshold technique on time-series data.

The volume of events is reduced by an alarm that is generated to track the value of an object. Rising and falling thresholds are assigned to this alarm. The alarm is only triggered when the rising threshold is crossed. Once this threshold is crossed, a rising alarm is not generated again until the falling threshold is crossed. And the same mechanism prevents the generation of falling thresholds until the rising threshold is crossed again. This mechanism can drastically reduce the volume of events and does not eliminate information required in order to determine if a fault exists.

Time series data can be represented either as counters, where each new data point is added to the sum of the previous data points, or as a gauge, where the data is represented as a rate over a time interval. There are two different forms of continuous thresholds applicable to each data type: absolute continuous thresholds and relative continuous thresholds. Use absolute continuous thresholds with gauges and relative continuous thresholds with counters.

In order to determine the threshold values for your network, complete these steps:

  1. Select the objects.

  2. Select the devices and interfaces.

  3. Determine the threshold values for each object or object/interface type.

  4. Determine the severity for the event generated by each threshold.

A fair amount of work is required in order to determine what thresholds to use on which objects (and for which devices and interfaces). Fortunately, if you collected a baseline of performance data, you have done a significant amount of that work already. Also, NSA and the high availability service (HAS) program can make recommendations that help you set objects and create ranges. However, you must tailor these recommendations for your particular network.

As you have collected performance data for the network, the HAS program recommends that you group your interfaces by categories. This simplifies setting thresholds because you might need to determine thresholds for the media type of each category rather than for each device and object on that device. For example, you would want to set different thresholds for Ethernet and FDDI networks. It is commonly thought that you can run FDDI networks at closer to 100% utilization than you can a shared Ethernet segment. However, full-duplex Ethernet can be run much closer to 100% utilization because they are not subject to collisions. You might want to set your thresholds for collisions very low for full-duplex links because you should never see a collision.

You can also consider the combination of the interface importance and the category/severity of the threshold type. Use these factors to set the priority of the event and, therefore, the importance of the event and its attention by the network operations staff.

The grouping and categorizing of network devices and interfaces cannot be overemphasized. The more you are able to group and categorize, the easier you can integrate the threshold events into your network management platform. Use the baseline as the principle resource for this information. Refer to the Capacity and Performance Management: Best Practices White Paper for more information.

Network Management Implementation

The organization should have an implemented network management system that is able to detect the defined threshold values and report on the values for specified time periods. Use a RMON network management system that can archive threshold messages in a log file for daily review or a more complete database solution that allows searches for threshold exceptions for a given parameter. The information should be available to the network operations staff and manager on a continuous basis. The network management implementation should include the ability to detect software/hardware crashes or tracebacks, interface reliability, CPU, link utilization, queue or buffer misses, broadcast volume, carrier transitions, and interface resets.

Network Operations Metrics

A final area of proactive fault management that overlaps with performance management is network operations metrics. These metrics provide valuable data for fault management process improvement. At a minimum, these metrics should include a breakdown of all problems that occurred during a given period. The breakdown should include information such as:

  • Number of problems that occur by call priority

  • Minimum, maximum, and average time to close in each priority

  • Breakdown of problems by problem type (hardware, software crash, configuration, power, user error)

  • Breakdown of time to close for each problem type

  • Availability by availability group or SLA

  • How often you met or missed SLA requirements

The help desk often has a reporting system with the ability to generate metrics or reports. Another means to gather this data is the use of an availability monitoring tool. Overall metrics should be made available on a monthly basis. Process improvement based on the discussion should be implemented in order to improve missed service level agreement requirements or in order to improve how certain problem types are handled.

Performance Management Indicators

Performance indicators provide the mechanism by which an organization measures critical success factors.

Document the Network Management Business Objectives

This document could be a formal concept of operations for network management or a less formal statement of required features and objectives. However, the document should assist the network manager as they measure success.

This document is the organization network management strategy and should coordinate the overall business (nonquantitative) goals of network operations, engineering, design, other business units, and the end users. This focus enables the organization to form the long range planning activities for network management and operation, which includes the budgeting process. It also provides guidance for the acquisition of tools and the integration path required to accomplish the network management goals, such as SLAs.

This strategic document cannot focus too narrowly on the management of specific network problems, but on those items important to the overall organization, which include budgetary issues. For example:

  • Identify a comprehensive plan with achievable goals.

  • Identify each business service/application that require network support.

  • Identify those performance-based metrics needed to measure service.

  • Plan the collection and distribution of the performance metric data.

  • Identify the support needed for network evaluation and user feedback.

  • Have documented, detailed, and measurable service level objectives.

Document the Service Level Agreements

In order to properly document the SLAs, you must fully define the service level objective metrics. This documentation should be available to users for evaluation. It provides the feedback loop to ensure that the network management organization continues to measure the variables needed to maintain the service agreement level.

SLAs are "living" documents because the business environment and the network are dynamic by nature. What works today to measure an SLA might become obsolete tomorrow. Only when they institute a feedback loop from users and act on that information can network operations maintain the high availability numbers required by the organization.

Create a List of Variables for the Baseline

This list includes items such as polling interval, network management overhead incurred, possible trigger thresholds, whether the variable is used as a trigger for a trap, and trending analysis used against each variable.

These variables are not limited to the metrics needed for the service level objectives mentioned above. At a minimum, they should include these variables: router health, switch health, routing information, technology-specific data, utilization, and delay. These variables are polled periodically and stored in a database. Reports can then be generated against this data. These reports can assist the network management operations and planning staff in these ways:

  • Reactive issues can often be solved faster with a historical database.

  • Performance reporting and capacity planning require this type of data.

  • The service level objectives can be measured against it.

Review the Baseline and Trends Analyses

Network management personnel should conduct meetings to periodically go through specific reports. This provides additional feedback, as well as a proactive approach to potential problems in the network.

These meetings should include both operational and planning personnel. This provides an opportunity for the planners to receive operational analysis of the baseline and trended data. It also puts the operational staff "in the loop" for some of the planning analysis.

Another type of item to include in these meetings is the service level objectives. As objective thresholds are approached, network management personnel can take actions in order to prevent missing an objective and, in some cases, this data can be used as a partial budgetary justification. The data can show where service level objectives go to be breached if proper measures are not taken. Also, because these objectives have been identified by business services and applications, they are easier to justify on a financial basis.

Conduct these reviews every two weeks and hold a more thorough analytical meeting every six to twelve weeks. These meetings allow you to address both short and long term issues.

Document a What-if Analysis Methodology

A what-if analysis involves modeling and verification of solutions. Before you add a new solution to the network (either a new application or a change in the Cisco IOS release), document some of the alternatives.

The documentation for this analysis includes the major questions, the methodology, data sets, and configuration files. The main point is that the what-if analysis is an experiment that someone else should be able to recreate with the information provided in the document.

Document the Methodology used for Increasing Network Performance

This documentation includes additional WAN bandwidth and a cost table that helps increase the bandwidth for a particular type of link. This information helps the organization realize how much time and money it costs to increase the bandwidth. Formal documentation allows performance and capacity experts to discover how and when to increase performance, as well as the time line and costs for such an endeavor.

Periodically review this documentation, perhaps as a part of the performance review quarterly, in order to ensure that it remains up to date.

Summary

The only way to achieve the goals of the ideal network management system is to actively integrate the components of performance management into the system. This goal should include the use of availability and response time metrics tied into a system of notification when thresholds are exceeded thresholds. It would have to include the use of a baseline for capacity planning that would have links to a heuristic model for provisioning and exception reporting. It could have a built-in modeling or simulation engine that enables the model to be updated in real time and provide a level of both planning and troubleshooting through software simulations.

While much of this system might seem an impossible ideal that could never be achieved, each of the components is currently available today. Further, the tools to integrate these components also exist in programs like MicroMuse. We should continue to work toward this ideal as it is more realistic today than ever.

Related Information

Updated: Jan 16, 2007
Document ID: 15115