This document describes why vManage might report high CPU usage for vEdge 5000/2000/1000/100B and vEdge Cloud platforms even though the platforms perform normally and top shows no high CPU usage.
Understand High CPU Utilization that is Reported on vEdge 5000/2000/1000/100B and vEdge Cloud Platforms
With the 17.2.x and later releases, higher CPU and memory consumption for vEdge and vEdge Cloud platforms can be observed. This is noticed on the vManage Dashboard for a given device. In some cases, this also leads to an increased number of alerts and warnings in vManage.
High CPU usage is reported while the device performs normally with normal, low, or no load because the formula used to calculate usage changed. With the 17.2 releases, CPU utilization is computed based on the Load Average from show system status on the vEdge.
vManage shows real-time CPU utilization for a device. It pulls the 1 Minute Average [min1_avg] and 5 Minutes Average [min5_avg] based on historical data. Load Average, by definition, includes more than CPU cycles: IO wait time, process pending time, and other values also contribute to the value presented for the platform. In this case, the values shown for the CPU states and CPU values in the top command from vShell are ignored.
Here is an example of how CPU utilization, which is actually the 1 minute load average, gets calculated and shown in the vManage dashboard:
When you check load from a vEdge CLI, this can be seen:
In this case, CPU utilization is computed as Load Average / number of cores (vCPUs). For this example, the node has 4 cores. The Load Average is multiplied by 100 before it is divided by the number of cores. When you take the Load Average and multiply it by 100, you arrive at a value of ~310. Dividing this value by 4 yields a CPU reading of 77.5%, which aligns with the value seen in the real-time graph in vManage captured around the time the CLI output was collected, as shown in the image.
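The arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in this document, not vManage code; the function name load_based_cpu_percent is hypothetical, and the load average of 3.10 is inferred from the ~310 figure in the example.

```python
def load_based_cpu_percent(load_avg: float, cores: int) -> float:
    """Return the load-average-based CPU figure: (load_avg * 100) / cores.

    Hypothetical helper that mirrors the formula described in the text.
    """
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return load_avg * 100 / cores


# Example from the text: a load average of about 3.10 on a 4-core node.
reading = load_based_cpu_percent(3.10, 4)
print(f"{reading:.1f}%")
```

Because the input is a load average and not a CPU counter, the result can be high even when the CPU states shown in top look idle.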
In order to see the load averages and the number of CPU cores in the system, the output of top can be consulted from vShell on the device.
In the example here, the vEdge contains 4 vCPUs. The first core (Cpu0) is used for Control (seen through the lower user utilization) while the remaining 3 cores are used for Data:
As the process list shows, the fp-um process consumes the three data cores.
fp-um is a process that uses a poll-mode driver, which means that it constantly polls the underlying port for packets so that it can process any frame as soon as it is received. This process handles forwarding and is equivalent to fast-path forwarding in the vEdge 1000, vEdge 2000, and vEdge 100. This poll-mode architecture is used by Intel for efficient packet processing based on the Data Plane Development Kit (DPDK) framework. Because packet forwarding is implemented in a tight loop, CPU remains at or close to 100% at all times. Despite this, no latency is introduced on these CPUs; this is expected behavior.
Background information on DPDK polling can be found here.
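To illustrate why a poll-mode loop keeps a core busy regardless of traffic, here is a toy Python sketch. It is not DPDK or fp-um code: the function poll_loop, the in-memory queue, and the fixed iteration count are all assumptions made for the illustration (a real poll-mode driver loops indefinitely on a hardware receive queue).

```python
from collections import deque


def poll_loop(rx_queue: deque, iterations: int) -> tuple[int, int]:
    """Poll rx_queue a fixed number of times without ever sleeping.

    Returns (frames_processed, empty_polls). Every iteration consumes CPU
    time, whether or not a frame was waiting.
    """
    processed = 0
    empty_polls = 0
    for _ in range(iterations):
        if rx_queue:
            rx_queue.popleft()  # "forward" the frame immediately
            processed += 1
        else:
            empty_polls += 1  # nothing to do, but the core still spins
    return processed, empty_polls


# Three frames waiting, ten polls: seven polls find nothing.
print(poll_loop(deque(["f1", "f2", "f3"]), 10))
```

The empty polls are the reason a data core appears fully used even when little or no traffic is forwarded.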
The vEdge Cloud and vEdge 5000 platforms use the same forwarding architecture and exhibit the same behavior in this regard. Here is an example from a vEdge 5000 pulled from the top output. It has 28 cores, of which 2 (Cpu0 and Cpu1) are used for Control (like the vEdge 2000) and 26 are used for Data.
Here, the Load Average is always high because 26 out of the 28 processors run at 100% due to the fp-um process.
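As a rough estimate of what the load-based formula would report on such a system, assume each of the 26 constantly running data-core threads contributes about 1 to the Load Average and that control-plane load is negligible. Both are assumptions for this sketch, not measured values, and the function name is hypothetical.

```python
def idle_load_based_reading(busy_poll_cores: int, total_cores: int) -> float:
    """Estimate the load-based CPU figure when only the polling threads run.

    Assumes each busy-polling thread adds roughly 1.0 to the load average.
    """
    if total_cores <= 0 or not 0 <= busy_poll_cores <= total_cores:
        raise ValueError("invalid core counts")
    assumed_load_avg = float(busy_poll_cores)
    return assumed_load_avg * 100 / total_cores


# vEdge 5000 example from the text: 26 data cores out of 28.
print(f"{idle_load_based_reading(26, 28):.1f}%")
```

Under these assumptions the reading sits above 90% even with no traffic, which is consistent with the high CPU alarms described in this document.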
The reported CPU usage in vManage for 17.2.x releases prior to 17.2.7 is not the actual CPU usage but is instead calculated based on the Load Average. This may lead to confusion in understanding the reported value and lead to false alarms related to high CPU while the platform operates normally with normal, low, or no actual traffic/network load.
This behavior is changed with the 17.2.7 and 18.2 releases so that the CPU reading is now accurate and based on the cpu_user reading from top.