Proactive Monitoring - Tenant and Fabric Policies
Proactive monitoring is a very important piece of the network administrator's job, but it is often neglected because putting out fires in the network usually takes priority. Because the Application Policy Infrastructure Controller (APIC) makes it easy to gather statistics and perform analyses, proactive monitoring with the APIC saves network administrators both time and frustration. Since statistics are gathered automatically and policies can be re-used in other places, human effort and the opportunity for human error are minimal.
Statistics gathering has been a somewhat manual and even resource-intensive process for ACME in the past. Even when they have used tools to gather data on layer one through seven devices, it has still been necessary to manually specify which devices are to be monitored and how they should be monitored. For example, SNMP and a third-party tool may have been used to monitor the CPU of switches or bandwidth utilization on ports, but they struggled with entering correct SNMP information on each device, or often forgot to add a new device to their Network Monitoring System (NMS). Cisco Application Centric Infrastructure (ACI) provides an APIC which does all of the statistics gathering, and provides the ability to proactively monitor the entire environment without the hassle of maintaining a third-party monitoring tool.
The APIC, whether accessed through the GUI, CLI, or API, can be used to drill into any of the components and provides the ability to click on a Stats tab to display on-demand statistics, but more importantly it enables the setup of policies to keep persistent data to analyze trends in the environment, as well as to troubleshoot or predict any issues that may be arising. When planning to move an application from a legacy network to the ACI infrastructure, it is sensible to start by testing before going straight to production. Add test VMs to port groups on either a DVS or AVS associated with the APIC, and add physical test servers to VPCs on the leaf switches. This could also be in a testing tenant which is completely separate from the production environment. At this point the APIC is already gathering statistics for the VMM domain and the physical devices. The next step is to configure a policy for trend analysis.
There are four different scopes for statistics gathering: Common (Fabric Wide), Fabric, Tenant, and Access. A Fabric Wide policy is created as a default policy that applies to all tenants. To deviate from that default for a particular tenant, create a tenant policy; the tenant policy overrides the Fabric policy for that tenant. In the following testing example, a Tenant policy is created to gather statistics. Even if this tenant is shared with other applications, customers, or test cases, it will provide a real world example of how the application will behave in a production environment.
Create Tenant Monitor Policy
To create a tenant monitoring policy:
-
On the menu bar, choose Tenants > ALL TENANTS.
-
In the Work pane, choose the Tenant_Name.
-
In the Navigation pane, choose Tenant_Name > Monitoring Policies.
-
In the Work pane, choose Actions > Create Monitoring Policies.
-
In the Create Monitoring Policies dialog box, perform the following actions:
-
In the Name field enter a name for the Monitoring Policy.
-
Click Submit.
-
-
In the Navigation pane, choose Tenant_Name > Monitoring Policies > Policy_Name to display the following information:
-
Stats Collection Policies
-
Stats Export Policies
-
Callhome, SNMP, and Syslog
-
Event Severity Assignment Policies
-
Fault Severity Assignment Policies
-
Fault Lifecycle Policies
-
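The GUI steps above can also be approximated through the APIC REST API. The following is a minimal sketch rather than a definitive procedure: it only builds the request URL and JSON body, and it assumes that a tenant monitoring policy is represented by the `monEPGPol` class under `fvTenant` (verify the class name with the API Inspector on your own APIC). The host name, tenant name, and policy name are hypothetical.

```python
import json


def tenant_monitoring_policy_request(apic_host, tenant, policy_name):
    """Build the URL and JSON body for creating a tenant monitoring policy.

    Assumption: the policy is a monEPGPol child of fvTenant. Nothing is
    sent over the network here; the caller would POST the body to the URL
    after authenticating to the APIC.
    """
    url = f"https://{apic_host}/api/mo/uni.json"
    body = {
        "fvTenant": {
            "attributes": {"name": tenant},
            "children": [
                # Assumed class name for the tenant-scoped monitoring policy.
                {"monEPGPol": {"attributes": {"name": policy_name}}}
            ],
        }
    }
    return url, json.dumps(body)


# Hypothetical values for illustration only.
url, body = tenant_monitoring_policy_request("apic.example.com", "Test_Tenant", "Test_Mon_Pol")
print(url)
print(body)
```

Keeping the request construction separate from the HTTP call makes it easy to review the payload before anything is pushed to a production fabric.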
Stats Collection Policies
Clicking on Stats Collection Policies will display the default retention periods and admin states (Enabled/Disabled) for ALL Monitored Objects. Most likely the defaults will be kept, but double-clicking an entry allows its admin state or retention period to be changed. For example, to poll a component every 5 minutes but retain the data for 2 hours, click on the policy that specifies a 5 minute granularity and change the retention period to 2 hours. It is similarly possible to change the policies for specific Monitoring Objects. A monitoring object tells the APIC which components to gather statistics about. For example, to change the information gathered for Bridge Domains, use the Bridge Domain (infra.RSOInfraBD) Monitoring Object.
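To get a feel for how granularity and retention interact, the arithmetic can be sketched in a few lines. This is only an illustration of the 5 minute / 2 hour example above; the function name is hypothetical and it does not reflect how the APIC stores samples internally.

```python
def samples_retained(granularity_minutes, retention_minutes):
    """Return how many complete samples fit in the retention period."""
    if granularity_minutes <= 0:
        raise ValueError("granularity must be positive")
    if retention_minutes < 0:
        raise ValueError("retention must not be negative")
    return retention_minutes // granularity_minutes


# 5 minute granularity retained for 2 hours (120 minutes).
print(samples_retained(5, 120))  # 24
```

A finer granularity with a long retention period multiplies the number of samples kept per monitored object, which is worth considering before changing the defaults for ALL.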
To add monitoring objects:
-
On the menu bar, choose Tenants > ALL TENANTS.
-
In the Work pane, choose the Tenant_Name.
-
In the Navigation pane, choose Tenant_Name > Monitoring Policies > Monitoring Policy_Name > Stats Collection Policies.
-
Click on the Pencil icon to edit the Monitoring Objects.
-
Put a checkmark next to the Monitoring Objects to be included, and remove any checkmarks next to Monitoring Objects to be left out.
-
Click Submit.
-
For this example, changes might be made to Monitoring Object policies for Tenant, VXLAN Pool, Leaf Port, and/or Taboo Contract. There are several options and this will all depend on what is important to monitor in the environment. Click on the pull down menu to select a monitoring object and add a retention policy to it.
To add a policy to a Monitoring Object:
-
On the menu bar, choose Tenants > ALL TENANTS.
-
In the Work pane, choose the Tenant_Name.
-
In the Navigation pane choose Monitoring Policies > Monitoring Policy_Name > Stats Collection Policies.
-
In the Work pane, in the Stats Collection Policy dialog box, perform the following actions:
-
Select the Monitoring Object.
-
Click + to add the policy.
-
Select the granularity with which it is to poll.
-
Leave the state as inherited to stick with the defaults as set for ALL, or explicitly select enabled or disabled.
-
The retention policy may either be inherited or explicitly specified as enabled or disabled as well.
-
Click Update.
-
Stats Export Policies
It is desirable to collect these ongoing statistics as well as to see how this data behaves over time. Use the Stats Export Policies option in the left navigation pane, located under the monitoring policy. Much like the Stats Collection Policies, it is possible to create a policy for ALL monitoring objects, or select specific monitoring objects and specify where this information will be saved.
To create a Stats Export Policy:
-
On the menu bar, choose Tenants > ALL TENANTS.
-
In the Work pane, choose the Tenant_Name.
-
In the Navigation pane choose Tenant_Name > Monitoring Policies > Monitoring Policy_Name > Stats Export Policies.
-
In the Work pane, in the Stats Export Policy dialog box, perform the following actions:
-
Select ALL or a specific monitoring object from the drop-down list.
-
Click + to add the policy.
-
Now define the Stats Export Policy in the wizard.
-
Choose either JSON or XML as the format. Both carry the same data, so the choice comes down to personal preference or to what the tool used to read the export requires.
-
Choose to compress it using GZIP, or leave it uncompressed.
-
Click + under Export Destinations to specify a server where this information is to be collected. Another wizard will pop up to enable specification of the protocols and credentials used to connect to this server.
-
Click Ok.
-
-
Click Submit.
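Once exports arrive on the destination server, a downstream tool has to cope with both the compressed and uncompressed variants. The sketch below shows one way to read a JSON export regardless of whether GZIP was selected. The file naming (a `.gz` suffix for compressed files) and the record layout in the usage example are assumptions for illustration, not the actual export schema.

```python
import gzip
import json


def read_stats_export(path):
    """Load a JSON stats export, transparently handling GZIP compression.

    Assumption: compressed exports are recognizable by a ".gz" suffix.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def write_sample_export(path, records):
    """Write a sample export file, used here only to demonstrate the reader."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as handle:
        json.dump(records, handle)
```

For example, writing a hypothetical record list with `write_sample_export("stats.json.gz", [{"object": "leaf-101", "value": 42}])` and then calling `read_stats_export("stats.json.gz")` returns the same list.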
Diagnostics Policies Using the GUI
The diagnostics policies are in the navigation pane on the left. This feature allows the setup of diagnostics tests for the Monitoring Objects that were specified in the Stats Collection Policies. Next to the Monitoring Object is the Pencil button, which enables selection of the monitoring objects to be configured with diagnostics policies. There are two different kinds of policies for configuration: Boot-Up diagnostics and Ongoing diagnostics.
To configure diagnostic policies:
-
On the menu bar, choose Fabric > Fabric Policies.
-
In the Navigation pane choose Tenant_Name > Monitoring Policies > Diagnostics Policies.
-
In the Work pane, in the Diagnostic Policies dialog box, perform the following actions:
Click on the Pencil Icon and put checks next to the Monitoring Objects to which diagnostics tests are to be added.
-
Select one of the Monitoring Objects.
-
Click + to add an Object.
-
Select either Boot-Up or Ongoing.
-
Boot-Up runs the tests while the devices are booting, and Ongoing will run the tests as often as specified within the wizard.
-
In the wizard give it a name and select the admin state.
-
There are five different diagnostics tests available: ASIC, CPU, Internal Connectivity, Peripherals, and System Memory. Double-click on each to obtain the option of specifying no tests, full tests, or recommended tests.
-
Click Submit.
-
-
The diagnostics found here can be useful in finding failed components before they cause major issues within your environment.
Call Home/SNMP/Syslog
There are a few different ways to set up notification or alert policies. The Call Home/SNMP/Syslog policy allows alerting to be configured in a flexible manner. Cisco Call Home is a feature in many Cisco products that provides email or web-based notification alerts in several different formats for critical events. This allows administrators to resolve issues before they turn into outages. SNMP or syslog policies can also be used with existing notification systems. Different logging levels may be selected for notifications, and alert levels can be specified for the Monitoring Objects from which alerts are to be received.
Event Severity and Fault Severity Assignments
Event and fault severities can be changed for events raised by Monitoring Objects. Most likely, the default severity assignments for Events and Faults will be kept, but there are cases where an ACI administrator may decide that an event or fault is more or less severe than the default value. For example, if notifications are sent only for critical faults, but there is a major fault that also warrants immediate notification, the severity for that particular fault code can be changed.
To change a fault severity assignment:
-
On the menu bar, choose Tenants > ALL TENANTS.
-
In the Work pane, choose the Tenant_Name.
-
In the Navigation pane, choose Tenant_Name > Monitoring Policies > Monitoring_Policy > Fault Severity Assignment Policies.
-
In the Work pane, in the Fault Severity Assignment Policies dialog box, perform the following actions:
-
Select a Monitoring Object, which will dictate the fault codes for which you are changing the fault severity.
-
Click + to add an Object.
-
Select the particular fault code for which severity is to be changed.
-
Select the severity: Cleared, Critical, Major, Minor, Squelched, Inherit, Warning, Info.
Squelched gives the fault a weight of 0%, meaning it does not affect health scores.
-
-
Click Update.
The Event Severity Assignment Policies are configured in the same way.
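Before reassigning severities, it can help to see which faults currently sit at a given severity. One way to do that is a class query against the APIC REST API. The sketch below only builds the query URL, assuming the `faultInst` class and the `query-target-filter` parameter of the REST API; the host name is hypothetical and authentication is left out.

```python
from urllib.parse import quote

# Severities that can appear on a fault instance.
FAULT_SEVERITIES = {"critical", "major", "minor", "warning", "info", "cleared"}


def faults_by_severity_url(apic_host, severity):
    """Build a REST class-query URL listing faults with the given severity."""
    if severity not in FAULT_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    query_filter = f'eq(faultInst.severity,"{severity}")'
    # Percent-encode the quotes but keep the filter punctuation readable.
    encoded = quote(query_filter, safe="(),.")
    return f"https://{apic_host}/api/class/faultInst.json?query-target-filter={encoded}"


print(faults_by_severity_url("apic.example.com", "major"))
```

Reviewing the major faults returned by such a query is a reasonable way to decide which fault codes deserve a changed severity.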
Fault Lifecycle Policies
Fault Lifecycle is the term Cisco uses to describe the life of a fault. Once a fault is detected, it is in the "soaking" state. After a certain amount of time, referred to as the "soaking interval," it moves on to the "raised" state. "Raised" means the fault is still present after the soaking interval. After the fault clears, it is in a state called "raised clearing." It is only in this state briefly and moves on to the "clearing time" state. It remains in the "clearing time" state for the amount of time specified in the "clearing interval." Lastly, it moves on to the "retaining" state and does not get removed until the end of the "retaining interval."
To change Fault Lifecycle Intervals:
-
On the menu bar, choose Tenants > ALL TENANTS.
-
In the Work pane, choose the Tenant_Name.
-
In the Navigation pane, choose Tenant_Name > Monitoring Policies > Monitoring_Policy > Fault Lifecycle Policies.
-
In the Work pane, in the Fault Lifecycle Policies dialog box, perform the following actions:
-
Select a Monitoring Object, which will dictate the fault codes for which you are changing the default intervals.
-
Click +.
-
Specify times for the Clearing Interval, Retention Interval, and Soaking Interval (all in seconds).
Note: The default for the Clearing Interval is 120 seconds; the Retention Interval is 3600 seconds; and the Soaking Interval is 120 seconds.
-
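The effect of the three intervals is easier to see on a timeline. The following sketch models only the path described above: a fault that persists past the soaking interval, clears once, and never recurs. The brief "raised clearing" transition is folded into the clearing period, and the function and class names are hypothetical. The defaults come from the note above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifecycleIntervals:
    """Fault lifecycle intervals in seconds (defaults from the note above)."""
    soaking: int = 120
    clearing: int = 120
    retention: int = 3600


def fault_state(t, cleared_at, intervals=LifecycleIntervals()):
    """Return the lifecycle state t seconds after the fault was detected.

    Simplifying assumptions: the condition clears exactly once, at
    cleared_at seconds after detection, no earlier than the end of the
    soaking interval, and it does not come back.
    """
    if t < 0:
        raise ValueError("t must not be negative")
    if cleared_at < intervals.soaking:
        raise ValueError("this sketch only covers faults that outlast soaking")
    if t < intervals.soaking:
        return "soaking"
    if t < cleared_at:
        return "raised"
    if t < cleared_at + intervals.clearing:
        return "clearing"
    if t < cleared_at + intervals.clearing + intervals.retention:
        return "retaining"
    return "removed"


# A fault that clears 10 minutes (600 seconds) after detection.
for second in (60, 300, 650, 720, 4320):
    print(second, fault_state(second, cleared_at=600))
```

With the default intervals, this fault stays visible until 4320 seconds after detection, which is mostly retention time. Shortening the Retention Interval removes cleared faults sooner, at the cost of less history when troubleshooting.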
At this point there will be a fully working tenant monitoring policy. ACME will have other policies to configure in the fabric as outlined in the following sections.
TCAM Policy Usage
The physical ternary content-addressable memory (TCAM) in which policy is stored for enforcement is an expensive component of switch hardware, so it tends to either limit policy scale or raise hardware costs. Within the Cisco Application Centric Infrastructure (Cisco ACI) fabric, policy is applied based on the EPG rather than the endpoint itself. Policy size can be expressed as n*m*f, where n is the number of sources, m is the number of destinations, and f is the number of policy filters. Within the Cisco ACI fabric, sources and destinations become one entry for a given EPG, which reduces the number of total entries required.
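A quick calculation illustrates why grouping by EPG matters. The numbers below are hypothetical and the model is deliberately simplified to the n*m*f expression; actual TCAM consumption depends on the hardware and on how contracts are defined.

```python
def policy_entries(sources, destinations, filters):
    """Policy size as n * m * f."""
    return sources * destinations * filters


# Per-endpoint policy: 100 source endpoints, 100 destination endpoints, 4 filters.
per_endpoint = policy_entries(100, 100, 4)

# Per-EPG policy: the same endpoints grouped into one source EPG and one
# destination EPG, so each side counts as a single entry.
per_epg = policy_entries(1, 1, 4)

print(per_endpoint)  # 40000
print(per_epg)       # 4
```

The difference grows with the number of endpoints in each EPG, which is why EPG-based policy keeps TCAM usage manageable as the environment scales.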
TCAM is a fabric resource that should be monitored. There is a system wide view of available TCAM resources, reached from the menu bar, in which the Work pane displays a table that summarizes the capacity for all nodes.
TCAM is a critical system resource in a Cisco ACI fabric and should be monitored for utilization. The architecture/design team should articulate what the assumptions were for TCAM utilization. There is a Fabric Resource Calculation tool on GitHub that will help with estimation of normal operating parameters: https://github.com/datacenter/FabricResourceCalculation.
As a general rule, the default monitoring policies will alert you to a resource shortage and lower overall fabric health score. If your environment has a high rate of change, or you anticipate the possibility of consistently being oversubscribed, you may want to set different thresholds.
Create TCAM Policy Monitor
-
On the menu bar, choose Fabric > Fabric Policies.
-
In the Navigation pane, choose Monitor Policies > default > Stats Collection Policies.
-
In the Work pane, in the Stats Collection Policies dialog box, perform the following actions:
-
Select the Monitoring Object Equipment Capacity Entity (eqptcapacity.Entity).
-
Select the Stats Type Policy Entry.
-
Click + under Config Thresholds.
-
In the Thresholds For Collection 5 Minute window, select the blue pencil icon next to policy CAM entries usage current value.
-
TCAM Prefix Usage
This procedure configures monitoring thresholds for TCAM prefix usage.
-
On the menu bar, choose Fabric > Fabric Policies.
-
In the Navigation pane, choose Monitor Policies > default > Stats Collection Policies.
-
In the Work pane, in the Stats Collection Policies dialog box, perform the following actions:
-
Select the Monitoring Object Equipment Capacity Entity (eqptcapacity.Entity).
-
Select the Stats Type Layer3 Entry.
-
Click + under Config Thresholds.
-
In the Thresholds For Collection 5 Minute window, select the blue pencil icon next to policy CAM entries usage current value.
-
Health Score Evaluation Policy
-
On the menu bar, choose Fabric > Fabric Policies.
-
In the Navigation pane, choose Monitor Policies > Common Policies > Health Score Evaluation Policy > Health Score Evaluation Policy.
-
In the Work pane, in the Properties dialog box, perform the following actions:
-
In the Penalty of fault severity critical dropdown menu, select the desired %.
-
In the Penalty of fault severity major dropdown menu, select the desired %.
-
In the Penalty of fault severity minor dropdown menu, select the desired %.
-
In the Penalty of fault severity warning dropdown menu, select the desired %.
-
-
Click Submit.
Communication Policy
-
On the menu bar, choose Fabric > Fabric Policies.
-
In the Navigation pane, expand Pod Policies > Policies > Communication.
-
In the Work pane, choose Actions > Create Communication Policy.
-
In the Create Communication Policy dialog box, perform the following actions:
-
Enter Communication Policy Name.
-
From the HTTP Admin State dropdown menu select the desired state.
-
From the HTTP Port dropdown menu select the desired port.
-
Select the desired HTTP redirect state.
-
From the HTTPS Admin State dropdown menu select the desired state.
-
From the HTTPS Port dropdown menu select the desired port.
-
Select the desired HTTPS redirect state.
-
From the SSH Admin State dropdown menu select the desired state.
-
From the Telnet Admin State dropdown menu select the desired state.
-
From the Telnet Port dropdown menu select the desired port.
-
-
Click Submit.
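The same settings can be expressed as a REST payload. The sketch below is an approximation under stated assumptions: it assumes the communication policy is a `commPol` object with `commHttp`, `commHttps`, `commSsh`, and `commTelnet` children whose attributes are named `adminSt` and `port`. Confirm these names with the API Inspector before using them. The redirect settings are omitted, and the function and policy names are hypothetical.

```python
def communication_policy_payload(name, http_enabled=False, http_port=80,
                                 https_enabled=True, https_port=443,
                                 ssh_enabled=True,
                                 telnet_enabled=False, telnet_port=23):
    """Build a dict describing a communication policy.

    Class and attribute names are assumptions about the APIC object model.
    """
    def state(flag):
        return "enabled" if flag else "disabled"

    for port in (http_port, https_port, telnet_port):
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")

    return {
        "commPol": {
            "attributes": {"name": name},
            "children": [
                {"commHttp": {"attributes": {"adminSt": state(http_enabled), "port": str(http_port)}}},
                {"commHttps": {"attributes": {"adminSt": state(https_enabled), "port": str(https_port)}}},
                {"commSsh": {"attributes": {"adminSt": state(ssh_enabled)}}},
                {"commTelnet": {"attributes": {"adminSt": state(telnet_enabled), "port": str(telnet_port)}}},
            ],
        }
    }


payload = communication_policy_payload("ACME_Comm_Pol")
print(payload["commPol"]["children"][1])
```

The defaults in this sketch leave HTTP and Telnet disabled and keep HTTPS and SSH enabled, which is a common hardening choice rather than something the procedure above mandates.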