The Cisco NFVI management node hosts the Cisco VIM REST API service,
Cobbler for PXE services, ELK for logging and Kibana dashboard services, and VMTP
for cloud validation. Because the management node currently does not have
redundancy, understanding its points of failure and recovery scenarios is
important. These are described in this topic.
The management node architecture includes a Cisco UCS C240 M4 server
with a dual CPU socket. It has a 1 Gbps on-board (LOM) NIC and a 10 Gbps Cisco
VIC mLOM. HDDs are used in 8, 16, or 24 disk configurations. The following
figure shows the high-level management node hardware and software architecture.
Figure 1. Cisco NFVI Management Node Architecture
Different management node hardware or software failures can cause Cisco
NFVI service disruptions and outages. Some failed services can be recovered
through manual intervention. However, even when the system remains operational
through a single failure, a double fault might not be recoverable. The following
table lists the management node failure scenarios and their recovery options.
Table 1. Management Node Failure Scenarios
| Scenario # | Failure/Trigger | Recoverable? | Operational Impact |
| 1 | Failure of 1 or 2 active HDDs | Yes | No |
| 2 | Simultaneous failure of more than 2 active HDDs | No | Yes |
| 3 | Spare HDD failure: 4 spares for 24 HDDs, 2 spares for 16 HDDs, or 1 spare for 8 HDDs | Yes | No |
| 4 | Power outage/hard reboot | Yes | Yes |
| 5 | Graceful reboot | Yes | Yes |
| 6 | Docker daemon start failure | Yes | Yes |
| 7 | Service container (Cobbler, ELK) start failure | Yes | Yes |
| 8 | One link failure on the bond interface | Yes | No |
| 9 | Two link failures on the bond interface | Yes | Yes |
| 10 | REST API service failure | Yes | No |
| 11 | Graceful reboot with Cisco VIM Insight | Yes | Yes; CLI alternatives exist during reboot. |
| 12 | Power outage/hard reboot with Cisco VIM Insight | Yes | Yes |
| 13 | Cisco VIM Insight container reinstallation | Yes | Yes; CLI alternatives exist during reinstallation. |
| 14 | Cisco VIM Insight container reboot | Yes | Yes; CLI alternatives exist during reboot. |
| 15 | Intel I350 1 Gbps LOM failure | Yes | Yes |
| 16 | Cisco VIC 1227 10 Gbps mLOM failure | Yes | Yes |
| 17 | DIMM memory failure | Yes | No |
| 18 | One CPU failure | Yes | No |
Scenario 1: Failure of one or two active HDDs
The management node has 8, 16, or 24 HDDs. The HDDs are configured
with RAID 6, which provides data redundancy and storage performance and
tolerates unforeseen HDD failures:
- When 8 HDDs are installed, 7 are active disks and 1 is a spare disk.
- When 16 HDDs are installed, 14 are active disks and 2 are spare disks.
- When 24 HDDs are installed, 20 are active disks and 4 are spare disks.
With RAID 6, up to two simultaneous active HDD failures can be tolerated. When
an active HDD fails, the system starts automatic recovery by moving a spare disk to
the active state and begins rebuilding the new active HDD. It takes
approximately four hours to rebuild the new disk and reach a synchronized
state. During this operation, the system is completely functional and no
impact is seen. However, you must monitor the system to ensure that an
additional failure does not occur and create a double-fault situation.
You can use the storcli commands to check the disk and RAID state, as
shown below.
Note: Make sure the node is running with hardware RAID by checking the
storcli output and comparing it to the sample output shown below.
[root@mgmt-node ~]# /opt/MegaRAID/storcli/storcli64 /c0 show
<…snip…>
TOPOLOGY:
========
-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type State BT Size PDC PI SED DS3 FSpace TR
-----------------------------------------------------------------------------
0 - - - - RAID6 Optl N 4.087 TB dflt N N dflt N N
0 0 - - - RAID6 Optl N 4.087 TB dflt N N dflt N N <== RAID 6 in optimal state
0 0 0 252:1 1 DRIVE Onln N 837.258 GB dflt N N dflt - N
0 0 1 252:2 2 DRIVE Onln N 837.258 GB dflt N N dflt - N
0 0 2 252:3 3 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 3 252:4 4 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 4 252:5 5 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 5 252:6 6 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 6 252:7 7 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 - - 252:8 8 DRIVE DHS - 930.390 GB - - - - - N
-----------------------------------------------------------------------------
<…snip…>
PD LIST:
=======
-------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
-------------------------------------------------------------------------
252:1 1 Onln 0 837.258 GB SAS HDD N N 512B ST900MM0006 U <== all disks functioning
252:2 2 Onln 0 837.258 GB SAS HDD N N 512B ST900MM0006 U
252:3 3 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:4 4 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:5 5 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:6 6 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:7 7 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:8 8 DHS 0 930.390 GB SAS HDD N N 512B ST91000640SS D
-------------------------------------------------------------------------
The following sample output shows the state after an active disk has failed and the spare disk has joined the drive group and started rebuilding:
[root@mgmt-node ~]# /opt/MegaRAID/storcli/storcli64 /c0 show
<…snip…>
TOPOLOGY :
========
-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type State BT Size PDC PI SED DS3 FSpace TR
-----------------------------------------------------------------------------
0 - - - - RAID6 Pdgd N 4.087 TB dflt N N dflt N N <== RAID 6 in degraded state
0 0 - - - RAID6 Dgrd N 4.087 TB dflt N N dflt N N
0 0 0 252:8 8 DRIVE Rbld Y 930.390 GB dflt N N dflt - N
0 0 1 252:2 2 DRIVE Onln N 837.258 GB dflt N N dflt - N
0 0 2 252:3 3 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 3 252:4 4 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 4 252:5 5 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 5 252:6 6 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 6 252:7 7 DRIVE Onln N 930.390 GB dflt N N dflt - N
-----------------------------------------------------------------------------
<…snip…>
PD LIST :
=======
-------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
-------------------------------------------------------------------------
252:1 1 UGood - 837.258 GB SAS HDD N N 512B ST900MM0006 U <== active disk in slot 1 disconnected from drive group 0
252:2 2 Onln 0 837.258 GB SAS HDD N N 512B ST900MM0006 U
252:3 3 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:4 4 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:5 5 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:6 6 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:7 7 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:8 8 Rbld 0 930.390 GB SAS HDD N N 512B ST91000640SS U <== spare disk in slot 8 joined drive group 0 and in rebuilding state
-------------------------------------------------------------------------
[root@mgmt-node ~]# /opt/MegaRAID/storcli/storcli64 /c0/e252/s8 show rebuild
Controller = 0
Status = Success
Description = Show Drive Rebuild Status Succeeded.
------------------------------------------------------
Drive-ID Progress% Status Estimated Time Left
------------------------------------------------------
/c0/e252/s8 20 In progress 2 Hours 28 Minutes <== spare disk in slot 8 rebuild status
------------------------------------------------------
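You can re-run the rebuild status command periodically until the drive reaches a synchronized state; for example, with watch every 10 minutes (a minimal sketch using the enclosure and slot values from the sample output above; substitute your own):
# watch -n 600 '/opt/MegaRAID/storcli/storcli64 /c0/e252/s8 show rebuild'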
To replace the failed disk and add it back as a spare:
[root@mgmt-node ~]# /opt/MegaRAID/storcli/storcli64 /c0/e252/s1 add hotsparedrive dg=0
Controller = 0
Status = Success
Description = Add Hot Spare Succeeded.
[root@mgmt-node ~]# /opt/MegaRAID/storcli/storcli64 /c0 show
<…snip…>
TOPOLOGY :
========
-----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type State BT Size PDC PI SED DS3 FSpace TR
-----------------------------------------------------------------------------
0 - - - - RAID6 Pdgd N 4.087 TB dflt N N dflt N N
0 0 - - - RAID6 Dgrd N 4.087 TB dflt N N dflt N N
0 0 0 252:8 8 DRIVE Rbld Y 930.390 GB dflt N N dflt - N
0 0 1 252:2 2 DRIVE Onln N 837.258 GB dflt N N dflt - N
0 0 2 252:3 3 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 3 252:4 4 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 4 252:5 5 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 5 252:6 6 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 0 6 252:7 7 DRIVE Onln N 930.390 GB dflt N N dflt - N
0 - - 252:1 1 DRIVE DHS - 837.258 GB - - - - - N
-----------------------------------------------------------------------------
<…snip…>
PD LIST :
=======
-------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp
-------------------------------------------------------------------------
252:1 1 DHS 0 837.258 GB SAS HDD N N 512B ST900MM0006 U <== replacement disk added back as spare
252:2 2 Onln 0 837.258 GB SAS HDD N N 512B ST900MM0006 U
252:3 3 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:4 4 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:5 5 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:6 6 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:7 7 Onln 0 930.390 GB SAS HDD N N 512B ST91000640SS U
252:8 8 Rbld 0 930.390 GB SAS HDD N N 512B ST91000640SS U
-------------------------------------------------------------------------
Scenario 2: Simultaneous failure of more than two active HDDs
If more than two HDD failures occur at the same time, the management
node goes into an unrecoverable failure state because RAID 6 allows for
recovery of up to two simultaneous HDD failures. To recover the management
node, reinstall the operating system.
Scenario 3: Spare HDD failure
When the management node has 24 HDDs, four are designated as spares.
Failure of any of the spare disks does not impact RAID or system functionality.
Cisco recommends replacing failed spare disks as soon as they fail (see the steps in
Scenario 1) so that standby disks remain available and an auto-rebuild is
triggered when an active disk fails.
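To confirm that the spare disks are present and healthy, you can filter the physical drive list for hot-spare states (a quick check built from the storcli command shown in Scenario 1; DHS and GHS are the dedicated and global hot-spare states that storcli reports):
# /opt/MegaRAID/storcli/storcli64 /c0 show | grep -E "DHS|GHS"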
Scenario 4: Power outage/hard reboot
If a power outage or hard system reboot occurs, the system boots up
and comes back to an operational state. Services running on the management node
during the downtime are disrupted. See Scenario 7 for the list of
commands to check the service status after recovery.
Scenario 5: Graceful reboot
If a graceful system reboot occurs, the system boots up and comes
back to an operational state. Services running on the management node during the
downtime are disrupted. See Scenario 7 for the list of commands to
check the service status after recovery.
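As a quick post-reboot health check for Scenarios 4 and 5, you can verify the Docker daemon, the service containers, and the REST API with the commands described later in this topic (a minimal sketch; run the restapi.py check from the installer directory recorded at install time):
# systemctl status docker
# docker ps -a
# systemctl status mercury-restapi.service
# cd installer-<tagid>/tools && ./restapi.py -a status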
Scenario 6: Docker daemon start failure
The management node runs the services using Docker containers. If the
Docker daemon fails to come up, it causes services such as ELK, Cobbler and
VMTP to go into down state. You can use the
systemctl command to check the status of the Docker daemon, for
example:
# systemctl status docker
docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2016-08-22 00:33:43 CEST; 21h ago
Docs: http://docs.docker.com
Main PID: 16728 (docker)
If the Docker daemon is in a down state, use the
systemctl restart docker
command to restart the Docker service. Run the commands listed in
Scenario 7 to verify that all the Docker services are active.
Scenario 7: Service container (Cobbler, ELK) start failure
As described in Scenario 6, all the services run as Docker containers on
the management node. To find all services running as containers, use the
docker ps -a command. If any service is in the Exit state, use the
systemctl command and grep for docker to find the exact service
name, for example:
# systemctl | grep docker- | awk '{print $1}'
docker-cobbler-tftp.service
docker-cobbler-web.service
docker-cobbler.service
docker-container-registry.service
docker-elasticsearch.service
docker-kibana.service
docker-logstash.service
docker-vmtp.service
If any services need restarting, use the
systemctl command. For example, to restart a Kibana service:
# systemctl restart docker-kibana.service
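If several containers are down at the same time, you can loop over the docker-* units and restart only the ones that are not active (a minimal sketch that combines the systemctl commands above; systemctl is-active returns a non-zero status for inactive units):
# for svc in $(systemctl | grep docker- | awk '{print $1}'); do systemctl is-active --quiet $svc || systemctl restart $svc; done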
Scenario 8: One link failure on the bond interface
The management node is set up with two different networks: br_api and
br_mgmt. The br_api interface is external; it is used to access outside
services such as the Cisco VIM REST API, Kibana, and Cobbler. The br_mgmt
interface is internal; it is used for provisioning and to provide management
connectivity to all OpenStack nodes (control, compute, and storage). Each
network has two ports that are bonded to provide redundancy. If one port fails,
the system remains completely functional through the other port. If a port
fails, check the physical network connectivity and the remote switch
configuration to debug the underlying cause of the link failure.
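To check the state of each bonded port from the operating system, you can inspect the Linux bonding driver status (a minimal sketch; the bond device names under /proc/net/bonding/ depend on your configuration, so list them first):
# ls /proc/net/bonding/
# grep -E "Slave Interface|MII Status" /proc/net/bonding/<bond-name>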
Scenario 9: Two link failures on the bond interface
As described in Scenario 8, each network is configured with two bonded ports.
If both ports are down, the system is not reachable and management node
services can be disrupted. After the ports come back up, the system is fully
operational. Check the physical network connectivity and the remote switch
configuration to debug the underlying cause of the link failure.
Scenario 10: REST API service failure
The management node runs the REST API service for Cisco VIM clients to
reach the server. If the REST service is down, Cisco VIM clients cannot reach
the server to trigger any server operations. However, with the exception of the
REST service, other management node services remain operational.
To verify that the management node REST services are fully operational, use
the following commands to check that the httpd and mercury-restapi services are
in the active and running state:
# systemctl status httpd
httpd.service - The Apache HTTP Server
Loaded: loaded (/usr/lib/systemd/system/httpd.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2016-08-22 00:22:10 CEST; 22h ago
# systemctl status mercury-restapi.service
mercury-restapi.service - Mercury Restapi
Loaded: loaded (/usr/lib/systemd/system/mercury-restapi.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2016-08-22 00:20:18 CEST; 22h ago
A tool is also provided so that you can check the REST API server status
and the directory it is running from. To run it, execute the following
commands:
# cd installer-<tagid>/tools
#./restapi.py -a status
Status of the REST API Server: active (running) since Thu 2016-08-18 09:15:39 UTC; 9h ago
REST API launch directory: /root/installer-<tagid>/
Confirm that the server status is active and that the restapi launch
directory matches the directory where the installation was launched. The
restapi tool also provides options to launch, tear down, and reset the password
for the restapi server, as shown below:
# ./restapi.py -h
usage: restapi.py [-h] --action ACTION [--yes] [--verbose]
REST API setup helper
optional arguments:
-h, --help show this help message and exit
--action ACTION, -a ACTION
setup - Install and Start the REST API server.
teardown - Stop and Uninstall the REST API
server.
restart - Restart the REST API server.
regenerate-password - Regenerate the password for
REST API server.
reset-password - Reset the REST API password with
user given password.
status - Check the status of the REST API server
--yes, -y Skip the dialog. Yes to the action.
--verbose, -v Perform the action in verbose mode.
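For example, based on the help output above, you can restart the server non-interactively (shown as an illustration; the other actions follow the same pattern):
# ./restapi.py --action restart --yes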
If the REST API server is not running, any ciscovim command, such as the
following, fails with an error message:
# cd installer-<tagid>/
# ciscovim -setupfile ~/Save/<setup_data.yaml> run
If the REST API state is not correct, or the server points to an
incorrect REST API launch directory, go to the
installer-<tagid>/tools directory and execute:
# ./restapi.py --action setup
To confirm that the REST API server state and launch directory are
correct, run the following command:
# ./restapi.py --action status
Scenario 11: Graceful reboot with Cisco VIM Insight
Cisco VIM Insight runs as a container on the management node. After a
graceful reboot of the management node, the VIM Insight container and its
associated database containers come up automatically, so there is no recovery impact.
Scenario 12: Power outage or hard reboot with VIM Insight
The Cisco VIM Insight container will come up automatically following a
power outage or hard reset of the management node.
Scenario 13: Cisco VIM Insight reinstallation
If the management node that runs Cisco VIM Insight fails and
cannot come up, you must uninstall and reinstall Cisco VIM Insight. After
the VIM Insight container comes up, perform the relevant bootstrap steps listed
in the installation guide to register the pod. VIM Insight then automatically
detects the installer status and reflects the current status appropriately.
To clean up and reinstall Cisco VIM Insight, run the following commands:
# cd /root/installer-<tagid>/insight/
# ./bootstrap_insight.py -a uninstall -o standalone -f </root/insight_setup_data.yaml>
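After the uninstall completes, Insight can be reinstalled from the same directory. A minimal sketch, assuming your release supports an install action with the same flags (verify the available actions with ./bootstrap_insight.py -h):
# ./bootstrap_insight.py -a install -o standalone -f </root/insight_setup_data.yaml>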
Scenario 14: VIM Insight container reboot
After a reboot of the VIM Insight container, services continue to work
as before.
Scenario 15: Intel I350 1 Gbps LOM failure
The management node is set up with an Intel I350 1 Gbps LOM for API
connectivity. Two 1 Gbps ports are bonded to provide connectivity redundancy.
No operational impact occurs if one of these ports goes down. However, if both
ports fail, or the LOM network adapter fails, the system cannot be reached
through the API IP address. If this occurs, you must replace the server because
the LOM is connected to the system motherboard. To recover the management node
with a new server, complete the following steps. Make sure the new management
node hardware profile matches the existing server and that the Cisco IMC IP
address is assigned.
- Shut down the existing management node.
- Unplug the power from the existing and new management nodes.
- Remove all HDDs from the existing management node and install them in the same slots of the new management node.
- Plug in the power to the new management node, but do not boot the node.
- Verify that the configured boot order is set to boot from the local HDD.
- Verify that the Cisco NFVI management VLAN is configured on the Cisco VIC interfaces.
- Boot the management node so that the operating system starts.
After the management node is up, the bond interfaces are down because the ifcfg files still reference the MAC addresses of the old node's network cards.
- Update the MAC addresses in the ifcfg files under /etc/sysconfig/network-scripts (see the sketch after this list).
- Reboot the management node.
It comes up fully operational. All interfaces should be in an up state and reachable.
- Verify that the Kibana and Cobbler dashboards are accessible.
- Verify that the REST API services are up. See Scenario 10 for any recovery steps.
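A minimal sketch of the MAC address update step above, assuming the bond member ifcfg files carry HWADDR or MACADDR entries; the interface name is a placeholder:
# ip link show                                         <== note the MAC addresses of the new NICs
# grep -riE "hwaddr|macaddr" /etc/sysconfig/network-scripts/
# vi /etc/sysconfig/network-scripts/ifcfg-<interface>  <== replace the old MAC address with the new one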
Scenario 16: Cisco VIC 1227 10Gbps mLOM failure
The management node is configured with a Cisco VIC 1227 dual port 10
Gbps mLOM adapter for connectivity to the other Cisco NFVI nodes. Two 10 Gbps
ports are bonded to provide connectivity redundancy. If one of the 10 Gbps
ports goes down, no operational impact occurs. However, if both Cisco VIC 10
Gbps ports fail, the system goes into an unreachable state on the management
network. If this occurs, you must replace the VIC network adapters; otherwise,
pod management and the Fluentd forwarding service are disrupted. If you
replace a Cisco VIC, update the management and provisioning VLAN for the VIC
interfaces using Cisco IMC, and update the MAC addresses in the interface
configuration files under /etc/sysconfig/network-scripts.
Scenario 17: DIMM memory failure
The management node is set up with multiple DIMMs across different
slots. Failure of one or more memory modules can cause the system to go into an
unstable state, depending on how many DIMM failures occur. DIMM memory
failures are standard system failures, as on any other Linux server. If a
DIMM fails, replace the memory module as soon as possible to keep the
system in a stable state.
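To identify a failed or missing DIMM from the operating system, you can use standard Linux tools (a minimal sketch; exact log messages vary by platform, and Cisco IMC also reports DIMM faults):
# dmidecode -t memory | grep -E "Locator|Size|Error"
# grep -iE "edac|memory error" /var/log/messages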
Scenario 18: One CPU failure
Cisco NFVI management nodes have two Intel CPUs (CPU1 and CPU2).
If one CPU fails, the system remains operational. However, CPU failures are
standard system failures, as on any other Linux server; replace a failed CPU
immediately to keep the system in a stable state.
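To confirm how many CPUs the operating system sees and which sockets are populated, you can use standard tools (a minimal sketch; Cisco IMC also reports CPU faults):
# lscpu | grep "Socket(s)"
# dmidecode -t processor | grep -E "Socket Designation|Status"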