This document describes the steps required to replace a faulty motherboard of a server in an Ultra-M setup that hosts CPS Virtual Network Functions (VNFs).
Ultra-M is a pre-packaged and validated virtualized mobile packet core solution that is designed in order to simplify the deployment of VNFs. OpenStack is the Virtualized Infrastructure Manager (VIM) for Ultra-M and consists of these node types:
Compute
Object Storage Disk - Compute (OSD - Compute)
Controller
OpenStack Platform - Director (OSPD)
The high-level architecture of Ultra-M and the components involved are depicted in this image:
This document is intended for Cisco personnel who are familiar with the Cisco Ultra-M platform, and it details the steps that are required to be carried out at the OpenStack and CPS VNF level at the time of the motherboard replacement in a server.
Note: The Ultra M 5.1.x release is considered in order to define the procedures in this document.
VNF   Virtual Network Function
ESC   Elastic Service Controller
MoP   Method of Procedure
OSD   Object Storage Disks
HDD   Hard Disk Drive
SSD   Solid State Drive
VIM   Virtual Infrastructure Manager
UAS   Ultra Automation Services
UUID  Universally Unique IDentifier
Workflow of the MoP
Motherboard Replacement in Ultra-M Setup
In an Ultra-M setup, there can be scenarios where a motherboard replacement is required in the following server types: Compute, OSD-Compute and Controller.
Note: The boot disks with the OpenStack installation are retained after the replacement of the motherboard; hence, there is no requirement to add the node back to the overcloud. Once the server is powered on after the replacement activity, it enrolls itself back to the overcloud stack.
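Once the server is powered on, you can confirm from the OSPD that the node is back in the overcloud; a minimal check (the stackrc path is an assumption for illustration):

[stack@director ~]$ source /home/stack/stackrc
[stack@director ~]$ nova list
[stack@director ~]$ openstack baremetal node list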
Motherboard Replacement in Compute Node
Before the activity, the VMs hosted in the Compute node are gracefully shut off. Once the motherboard has been replaced, the VMs are restored.
Identify the VMs Hosted in the Compute Node
Identify the VMs that are hosted on the compute server.
The compute server contains CPS or Elastic Services Controller (ESC) VMs:
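One way to list them from the OSPD is with nova; a sketch, in which the rc file path and the compute hostname pod1-compute-10 are placeholders for your deployment:

[stack@director ~]$ source /home/stack/overcloudrc
[stack@director ~]$ nova list --fields name,host --all-tenants | grep pod1-compute-10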
Note: In the output shown here, the first column corresponds to the Universally Unique IDentifier (UUID), the second column is the VM name, and the third column is the hostname where the VM is present. The parameters from this output are used in subsequent sections.
Graceful Power Off
Compute Node Hosts CPS/ESC VMs
Step 1. Log in to the ESC node that corresponds to the VNF and check the status of the VMs.
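The VM states can be read from the ESC datamodel; a sketch from the ESC CLI (the tags matched by the filter are whatever your VNF deployment uses):

[admin@VNF2-esc-esc-0 ~]$ cd /opt/cisco/esc/esc-confd/esc-cli
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli get esc_datamodel | egrep --color "<state>|<vm_name>|<vm_id>|<deployment_name>"

In a healthy deployment, the VMs report VM_ALIVE_STATE.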
Step 4. Log in to the ESC hosted in the compute node and check if it is in the master state. If yes, switch the ESC to standby mode:
[admin@VNF2-esc-esc-0 esc-cli]$ escadm status
0 ESC status=0 ESC Master Healthy

[admin@VNF2-esc-esc-0 ~]$ sudo service keepalived stop
Stopping keepalived: [ OK ]

[admin@VNF2-esc-esc-0 ~]$ escadm status
1 ESC status=0 In SWITCHING_TO_STOP state. Please check status after a while.

[admin@VNF2-esc-esc-0 ~]$ sudo reboot
Broadcast message from email@example.com (/dev/pts/0) at 13:32 ...
The system is going down for reboot NOW!
Step 1. ESC has 1:1 redundancy in the Ultra-M solution. Two ESC VMs are deployed, and they support a single failure in Ultra-M; that is, the system can be recovered if there is a single failure in the system.
Note: If there is more than a single failure, it is not supported and may require redeployment of the system.
ESC backup details:
ConfD CDB DB
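A hedged sketch of how the backup can be taken with the esc_dbtool.py script that ships with ESC (the scp user, host, and file path are placeholders):

[admin@VNF2-esc-esc-0 ~]$ sudo /opt/cisco/esc/esc-scripts/esc_dbtool.py backup --file scp://<user>:<password>@<backup-host>:/tmp/esc_backup.tar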
Step 2. The frequency of the ESC DB backup needs to be handled carefully, since ESC monitors and maintains the various state machines for the VNF VMs that are deployed. It is advised that these backups are performed after the activities that follow in the given VNF/POD/site.
Step 3. Verify that the health of ESC is good with the use of the health.sh script.
[root@auto-test-vnfm1-esc-0 admin]# escadm status
0 ESC status=0 ESC Master Healthy

[root@auto-test-vnfm1-esc-0 admin]# health.sh
esc ui is disabled -- skipping status check
esc_monitor start/running, process 836
esc_mona is up and running ...
vimmanager start/running, process 2741
esc_confd is started
tomcat6 (pid 2907) is running...                          [ OK ]
postgresql-9.4 (pid 2660) is running...
ESC service is running...
Active VIM = OPENSTACK
ESC Operation Mode=OPERATION
/opt/cisco/esc/esc_database is a mountpoint
============== ESC HA (MASTER) with DRBD =================
Step 5. If the ESC VM is unrecoverable and requires the restore of the database, restore the database from the previously taken backup.
Step 6. For the ESC database restore, ensure that the esc service is stopped before you restore the database. For ESC HA, execute in the secondary VM first and then in the primary VM.
# service keepalived stop
Step 7. Check the ESC service status and ensure that everything is stopped in both the primary and secondary VMs for HA.
# escadm status
Step 8. Execute the script to restore the database. As part of the restoration of the DB to the newly created ESC instance, the tool also promotes one of the instances to be the primary ESC, mounts its DB folder to the DRBD device, and starts the PostgreSQL database.
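A sketch of that restore invocation, assuming the backup was taken with the esc_dbtool.py script and is reachable over scp (all targets are placeholders):

# /opt/cisco/esc/esc-scripts/esc_dbtool.py restore --file scp://<user>:<password>@<backup-host>:/tmp/esc_backup.tar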
Step 9. Restart the ESC service in order to complete the database restore.
For HA, restart the keepalived service in both VMs:
# service keepalived start
Step 10. Once the VM is successfully restored and running, ensure that all the syslog-specific configuration is restored from the previously known successful backup. Ensure that it is restored in all the ESC VMs:
[admin@auto-test-vnfm2-esc-1 ~]$ cd /etc/rsyslog.d
[admin@auto-test-vnfm2-esc-1 rsyslog.d]$ ls /etc/rsyslog.d/00-escmanager.conf
00-escmanager.conf
Step 1. In some cases, ESC fails to start the VM due to an unexpected state. A workaround is to perform an ESC switchover with a reboot of the master ESC. The ESC switchover takes about a minute. Execute health.sh on the new master ESC in order to verify that it is up. When the ESC becomes master, ESC can fix the VM state and start the VM. Since this operation is scheduled, you must wait 5-7 minutes for it to complete.
Step 2. You can monitor /var/log/esc/yangesc.log and /var/log/esc/escmanager.log. If you do not see the VM recover after 5-7 minutes, perform a manual recovery of the impacted VM(s), as shown in the sketch that follows.
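A sketch of such a manual recovery from the ESC CLI (the VM name is a placeholder; recovery-vm-action DO triggers the recovery workflow for one VM):

[admin@VNF2-esc-esc-0 ~]$ cd /opt/cisco/esc/esc-confd/esc-cli
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli recovery-vm-action DO <VM_NAME>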
Step 3. Once the VM is successfully restored and running, ensure that all the syslog-specific configuration is restored from the previously known successful backup. Ensure that it is restored in all the ESC VMs:
total 28
drwxr-xr-x  2 root root 4096 Jun  7 18:38 ./
drwxr-xr-x 86 root root 4096 Jun  6 20:33 ../
-rw-r--r--  1 root root  319 Jun  7 18:36 00-vnmf-proxy.conf
-rw-r--r--  1 root root  317 Jun  7 18:38 01-ncs-java.conf
-rw-r--r--  1 root root  311 Mar 17  2012 20-ufw.conf
-rw-r--r--  1 root root  252 Nov 23  2015 21-cloudinit.conf
-rw-r--r--  1 root root 1655 Apr 18  2013 50-default.conf

root@abautotestvnfm1em-0:/etc/rsyslog.d# ls /etc/rsyslog.conf
rsyslog.conf
Motherboard Replacement in OSD-Compute Node
Before the activity, the VMs hosted in the OSD-Compute node are gracefully shut off and CEPH is put into maintenance mode. Once the motherboard has been replaced, the VMs are restored and CEPH is moved out of maintenance mode.
Put CEPH in Maintenance Mode
Step 1. Verify that the ceph osd tree status is up in the server:
[heat-admin@pod1-osd-compute-1 ~]$ sudo ceph osd tree
ID WEIGHT   TYPE NAME                      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 13.07996 root default
-2  4.35999     host pod1-osd-compute-0
 0  1.09000         osd.0                       up  1.00000          1.00000
 3  1.09000         osd.3                       up  1.00000          1.00000
 6  1.09000         osd.6                       up  1.00000          1.00000
 9  1.09000         osd.9                       up  1.00000          1.00000
-3  4.35999     host pod1-osd-compute-2
 1  1.09000         osd.1                       up  1.00000          1.00000
 4  1.09000         osd.4                       up  1.00000          1.00000
 7  1.09000         osd.7                       up  1.00000          1.00000
10  1.09000         osd.10                      up  1.00000          1.00000
-4  4.35999     host pod1-osd-compute-1
 2  1.09000         osd.2                       up  1.00000          1.00000
 5  1.09000         osd.5                       up  1.00000          1.00000
 8  1.09000         osd.8                       up  1.00000          1.00000
11  1.09000         osd.11                      up  1.00000          1.00000
Step 2. Log in to the OSD Compute node and put CEPH in the maintenance mode.
[root@pod1-osd-compute-1 ~]# sudo ceph osd set norebalance
[root@pod1-osd-compute-1 ~]# sudo ceph osd set noout
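Before you power off the node, you can confirm that both flags are set; the flags line of the output must list noout,norebalance:

[root@pod1-osd-compute-1 ~]# sudo ceph osd stat

After the replacement activity, the same flags are removed with ceph osd unset norebalance and ceph osd unset noout in order to move CEPH out of maintenance mode.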
Graceful Power Off
Case 1. OSD-Compute Node Hosts ESC
The procedure to gracefully power off ESC or CPS VMs is the same irrespective of whether the VMs are hosted in the Compute or OSD-Compute node.
Follow the steps from "Motherboard Replacement in Compute Node" in order to gracefully power off the VMs.
Step 1. The steps in order to replace the motherboard in a UCS C240 M4 server can be referred to from:
The procedure to restore the CPS/ESC VMs is the same irrespective of whether the VMs are hosted in the Compute or OSD-Compute node.
Follow the steps from "Compute Node Hosts CPS/ESC VMs" in order to restore the VMs.
Motherboard Replacement in Controller Node
Verify Controller Status and Put the Cluster in Maintenance Mode
From the OSPD, log in to the controller and verify that pcs is in a good state: all three controllers online, and galera shows all three controllers as master.
[heat-admin@pod1-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod1-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Dec  4 00:46:10 2017
Last change: Wed Nov 29 01:20:52 2017 by hacluster via crmd on pod1-controller-0
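In order to put the cluster in maintenance mode, move the impacted controller to standby; a sketch, assuming pod1-controller-0 is the controller whose motherboard is replaced:

[heat-admin@pod1-controller-0 ~]$ sudo pcs cluster standby pod1-controller-0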
[heat-admin@pod1-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod1-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Dec  4 00:48:24 2017
Last change: Mon Dec  4 00:48:18 2017 by root via crm_attribute on pod1-controller-0
openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started pod1-controller-2
my-ipmilan-for-controller-0    (stonith:fence_ipmilan):        Started pod1-controller-1
my-ipmilan-for-controller-1    (stonith:fence_ipmilan):        Started pod1-controller-1
my-ipmilan-for-controller-2    (stonith:fence_ipmilan):        Started pod1-controller-2
Step 1. The steps in order to replace the motherboard in a UCS C240 M4 server can be referred to from:
Log in to the impacted controller and remove the standby mode with unstandby, as shown in the sketch that follows. Verify that the controller comes online with the cluster and that galera shows all three controllers as master. This can take a few minutes.
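A sketch of that step, again assuming pod1-controller-0 is the replaced controller:

[heat-admin@pod1-controller-0 ~]$ sudo pcs cluster unstandby pod1-controller-0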
[heat-admin@pod1-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod1-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Dec  4 01:08:10 2017
Last change: Mon Dec  4 01:04:21 2017 by root via crm_attribute on pod1-controller-0