Healing is configured and enabled for each VM group. It is performed at two
stages, before the service is alive and after the service is alive, using the
recovery policy defined in the data model.
The VMs are deployed and monitored. If ESC receives a VM Down event after it
has received a VM Alive event, the healing workflow attempts to recover the VM
according to the recovery policy.
If ESC does not receive a VM Alive event after deployment, it recovers the VM
according to the recovery policy when the timeout expires. All recovery
procedures depend on the recovery policy configuration. For example, if the
user configures one of the recovery policies, such as Reboot Only, Redeploy
Only, or Reboot and Redeploy, ESC follows that policy.
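The two stages can be summarized as a small decision function. The following Python sketch is illustrative only: the function name, the event labels, and the boolean flag are assumptions made for this example and are not part of the ESC API.

```python
def should_start_recovery(alive_seen: bool, event: str) -> bool:
    """Return True when, in this simplified model, healing would start.

    alive_seen -- True if a VM Alive event was already received
    event      -- "VM_DOWN" or "TIMEOUT" (hypothetical labels for this sketch)
    """
    if alive_seen:
        # After the service is alive: a VM Down event triggers recovery.
        return event == "VM_DOWN"
    # Before the service is alive: recovery starts only when the wait
    # for the VM Alive event times out.
    return event == "TIMEOUT"
```

In both cases the recovery itself is carried out according to the configured recovery policy.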
ESC provides a YANG-based data model with comprehensive details of all the
parameters and descriptions needed to define healing. ESC uses two sections in
the data model XML file to define the events and rules:
- The <kpi> section defines the type of monitoring, events, polling interval, and other parameters.
- The <rule> section defines the actions taken when the KPI monitoring events are triggered.
For more information on KPIs, rules, and the data model, see
KPIs, Rules and Metrics.
The configuration involves the following steps:
- Define the KPI
- Define the rules
The following example shows how to configure the KPI in the data model:
<kpi>
  <event_name>VM_ALIVE</event_name>
  <metric_value>1</metric_value>
  <metric_cond>GT</metric_cond>
  <metric_type>UINT32</metric_type>
  <metric_collector>
    <type>ICMPPing</type>
    <nicid>0</nicid>
    <poll_frequency>3</poll_frequency>
    <polling_unit>seconds</polling_unit>
    <continuous_alarm>false</continuous_alarm>
  </metric_collector>
</kpi>
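To check a KPI definition before it is used in a deployment, the fragment can be read with a standard XML parser. The following Python sketch uses only the standard library and the element names from the example above; the `summary` dictionary and its keys are hypothetical names chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# The KPI fragment from the example above.
KPI_XML = """
<kpi>
  <event_name>VM_ALIVE</event_name>
  <metric_value>1</metric_value>
  <metric_cond>GT</metric_cond>
  <metric_type>UINT32</metric_type>
  <metric_collector>
    <type>ICMPPing</type>
    <nicid>0</nicid>
    <poll_frequency>3</poll_frequency>
    <polling_unit>seconds</polling_unit>
    <continuous_alarm>false</continuous_alarm>
  </metric_collector>
</kpi>
"""

kpi = ET.fromstring(KPI_XML.strip())
collector = kpi.find("metric_collector")

frequency = collector.findtext("poll_frequency")
unit = collector.findtext("polling_unit")

# Collect the settings that matter most when reviewing a KPI.
summary = {
    "event": kpi.findtext("event_name"),
    "collector": collector.findtext("type"),
    "nicid": int(collector.findtext("nicid")),
    "poll_every": f"{frequency} {unit}",
}
print(summary)
```

For this KPI the summary shows that the VM_ALIVE event is driven by an ICMPPing collector on nicid 0, polled every 3 seconds.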
The following example shows how to configure the rules for every event:
<rules>
  <admin_rules>
    <rule>
      <event_name>VM_ALIVE</event_name>
      <action>ALWAYS log</action>
      <action>FALSE recover autohealing</action>
      <action>TRUE servicebooted.sh</action>
    </rule>
  </admin_rules>
</rules>
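Each action in the rule example starts with a keyword (ALWAYS, TRUE, or FALSE) followed by the action name. The following Python sketch groups the actions by that keyword so they are easier to review. It assumes that the first word of each action is the condition keyword, which is how the example reads; the function name and the returned structure are hypothetical.

```python
import xml.etree.ElementTree as ET

# The rules fragment from the example above.
RULES_XML = """
<rules>
  <admin_rules>
    <rule>
      <event_name>VM_ALIVE</event_name>
      <action>ALWAYS log</action>
      <action>FALSE recover autohealing</action>
      <action>TRUE servicebooted.sh</action>
    </rule>
  </admin_rules>
</rules>
"""

def actions_by_condition(rules_xml: str) -> dict:
    """Map each event name to {condition keyword: [action names]}."""
    root = ET.fromstring(rules_xml.strip())
    result = {}
    for rule in root.iter("rule"):
        event = rule.findtext("event_name")
        for action in rule.findall("action"):
            # Split "FALSE recover autohealing" into keyword and action name.
            condition, _, name = (action.text or "").strip().partition(" ")
            result.setdefault(event, {}).setdefault(condition, []).append(name)
    return result

parsed = actions_by_condition(RULES_XML)
print(parsed)
```

For the VM_ALIVE rule, this groups `log` under ALWAYS, `recover autohealing` under FALSE, and `servicebooted.sh` under TRUE.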
In the KPI and rule examples above, the KPI monitors ICMP ping on nicid 0 and
sets the metric condition and polling attributes. Based on the KPI, the
VM_ALIVE event is triggered with the appropriate values. The actions in the
corresponding rule define the next steps.
If recovery is triggered on a VM whose recovery policy is configured with the
reboot then redeploy option, ESC first reboots the VM. If the reboot fails, the
VM is undeployed and a new VM with the same day-0 configuration is deployed.
ESC tries to reuse the network configuration of the previous VM, such as the
MAC and IP addresses.
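The order of the recovery steps can be sketched as follows. This is a simplified Python model, not ESC code: the policy labels, the `recover` function, and the `reboot` and `redeploy` callbacks are all hypothetical names introduced for this illustration. Each callback is assumed to return True on success.

```python
from typing import Callable, List

# Hypothetical labels for the three recovery policies named in this section.
POLICIES = ("REBOOT_ONLY", "REDEPLOY_ONLY", "REBOOT_THEN_REDEPLOY")

def recover(policy: str,
            reboot: Callable[[], bool],
            redeploy: Callable[[], bool]) -> List[str]:
    """Run the steps for the given policy and return the steps attempted."""
    if policy not in POLICIES:
        raise ValueError(f"unknown recovery policy: {policy}")

    attempted = []
    if policy != "REDEPLOY_ONLY":
        attempted.append("reboot")
        # Stop if the reboot worked, or if the policy allows no further step.
        if reboot() or policy == "REBOOT_ONLY":
            return attempted

    # Reached for REDEPLOY_ONLY, or when the reboot failed under
    # REBOOT_THEN_REDEPLOY.
    attempted.append("redeploy")
    redeploy()
    return attempted
```

For example, `recover("REBOOT_THEN_REDEPLOY", lambda: False, lambda: True)` attempts the reboot first and then falls back to the redeploy, which mirrors the sequence described above.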
Typically, ESC starts VM recovery on every VM that becomes unreachable. During
a network outage, however, ESC suspends VM recovery for the duration of the
outage, which delays the recovery. When ESC detects an unreachable VM, it first
evaluates the reachability of the gateway to determine whether a network
failure is present.
If ESC cannot ping the gateway, it takes no action to recover the VM. VM
recovery resumes when the gateway becomes reachable.
In a double fault condition, that is, when the network gateway and the VM fail
at the same time, ESC automatically performs VM monitoring after the gateway is
reachable again.
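The gateway check described above amounts to a simple decision. The following Python sketch models it under stated assumptions: the function name and the returned labels are hypothetical, and the two flags stand in for the results of the reachability checks.

```python
def recovery_decision(vm_reachable: bool, gateway_reachable: bool) -> str:
    """Return the action this simplified model takes for one VM.

    "none"    -- the VM is reachable, nothing to do
    "suspend" -- the gateway is unreachable, so recovery is held back
    "recover" -- the VM is down while the network is up, so recovery starts
    """
    if vm_reachable:
        return "none"
    if not gateway_reachable:
        # A network outage is assumed; the VM is checked again once the
        # gateway becomes reachable.
        return "suspend"
    return "recover"
```

In the double fault case the model first returns `"suspend"`; when the gateway is reachable again and the VM is still down, the same call returns `"recover"`.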
For information on healing a VNF using the ETSI API, see the Cisco Elastic Services Controller NFV MANO Guide.