This document describes the steps that are required to replace a faulty Object Storage Disk (OSD)-Compute server in an Ultra-M setup that hosts StarOS Virtual Network Functions (VNFs).
Ultra-M is a pre-packaged and validated virtualized mobile packet core solution designed in order to simplify the deployment of VNFs. OpenStack is the Virtualized Infrastructure Manager (VIM) for Ultra-M and consists of these node types:
OSD - Compute
OpenStack Platform - Director (OSPD)
The high-level architecture of Ultra-M and the components involved are depicted in this image:
This document is intended for Cisco personnel who are familiar with the Cisco Ultra-M platform, and it details the steps required to be carried out at the OpenStack and StarOS VNF levels at the time of the Compute server replacement.
Note: Ultra M 5.1.x release is considered in order to define the procedures in this document.
Workflow of the MoP
VNF - Virtual Network Function
ESC - Elastic Service Controller
MoP - Method of Procedure
OSD - Object Storage Disks
HDD - Hard Disk Drive
SSD - Solid State Drive
VIM - Virtual Infrastructure Manager
UAS - Ultra Automation Services
UUID - Universally Unique Identifier
Before you replace an OSD-Compute node, it is important to check the current state of your Red Hat OpenStack Platform environment in order to avoid complications while the Compute replacement process is in progress. This flow of replacement helps achieve that.
In order to enable recovery later, Cisco recommends that you take a backup of the OSPD database (DB) with these steps:
[root@director ~]# mysqldump --opt --all-databases > /root/undercloud-all-databases.sql
[root@director ~]# tar --xattrs -czf undercloud-backup-`date +%F`.tar.gz /root/undercloud-all-databases.sql /etc/my.cnf.d/server.cnf /var/lib/glance/images /srv/node /home/stack
tar: Removing leading `/' from member names
This process ensures that a node can be replaced without affecting the availability of any instances. It is also recommended that you back up the StarOS configuration, especially if the Compute node to be replaced hosts the CF VM.
Identify the VMs Hosted in the OSD-Compute Node
Identify the VMs that are hosted on the Compute server. There can be two possibilities:
The OSD-Compute server contains EM/UAS/Auto-Deploy/Auto-IT combination of VMs:
Note: In the output shown here, the first column corresponds to the UUID, the second column is the VM name and the third column is the hostname where the VM is present. The parameters from this output will be used in subsequent sections.
Verify that Ceph has available capacity in order to allow a single OSD server to be removed:
[root@pod1-osd-compute-1 ~]# sudo ceph df
    SIZE       AVAIL      RAW USED     %RAW USED
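As a sanity check, the `%RAW USED` value can be compared against a capacity threshold before you proceed. The helper below is a sketch, not part of the original procedure: it parses the GLOBAL row of `ceph df` output; the function name, the 70% threshold, and the sample numbers are illustrative. On a live pod, pipe `sudo ceph df` into it instead of the sample.

```shell
# Hypothetical helper (not from the original document): read the GLOBAL row
# of `ceph df` output and report whether %RAW USED is below a threshold.
check_ceph_capacity() {
  local threshold="$1"
  awk -v max="$threshold" '
    # Match the first data row: numeric SIZE in field 1, numeric %RAW USED in field 4
    $1 ~ /^[0-9]/ && $4 ~ /^[0-9.]+$/ {
      used = $4 + 0
      if (used < max) print "OK: raw used " used "% (below " max "%)"
      else            print "WARN: raw used " used "% (at or above " max "%)"
      exit
    }'
}

# Illustrative sample output (placeholder numbers, not from a live pod).
# Live usage: sudo ceph df | check_ceph_capacity 70
sample='GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    13393G     11804G        1589G         11.87'
printf '%s\n' "$sample" | check_ceph_capacity 70
```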
Verify that the ceph osd tree status is up on the OSD-Compute server:
[heat-admin@pod1-osd-compute-1 ~]$ sudo ceph osd tree
ID WEIGHT   TYPE NAME                 UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 13.07996 root default
11 1.09000 osd.11 up 1.00000 1.00000
Verify that the Ceph processes are active on the OSD-Compute server:
After all OSD processes have been migrated/deleted, the node can be removed from the overcloud.
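The migration/deletion of each OSD follows the standard upstream Ceph retirement sequence. The sketch below is an assumption based on that standard sequence, not a command list from the original document; it prints the commands in dry-run form (remove the `echo` prefixes to execute), and the OSD IDs 1, 4, 7, 10 are placeholders taken from the example tree above.

```shell
# Minimal sketch (assumed, standard Ceph sequence) to retire one OSD hosted
# on the server being replaced. Printed in dry-run form for safety.
remove_osd() {
  local id="$1"
  echo "sudo ceph osd out ${id}"                # stop placing data on the OSD
  echo "sudo systemctl stop ceph-osd@${id}"     # stop the OSD daemon
  echo "sudo ceph osd crush remove osd.${id}"   # drop it from the CRUSH map
  echo "sudo ceph auth del osd.${id}"           # remove its auth key
  echo "sudo ceph osd rm ${id}"                 # delete the OSD entry
}

# IDs are illustrative; use the IDs listed for your host in `sudo ceph osd tree`.
for id in 1 4 7 10; do remove_osd "$id"; done
```

Between `ceph osd out` and the removal steps, wait for the cluster to rebalance (`ceph -w`) so the data held by that OSD is re-replicated elsewhere.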
Note: When Ceph is removed, the VNF HD RAID goes into the Degraded state, but the hard disk must still be accessible.
Graceful Power Off
Case 1. OSD-Compute Node Hosts CF/ESC/EM/UAS
Migrate CF Card to Standby State
Log in to the StarOS VNF and identify the card that corresponds to the CF VM. Use the UUID of the CF VM identified from the section Identify the VMs hosted in the OSD-Compute Node, and find the card that corresponds to the UUID.
[local]VNF2# show card hardware
Tuesday May 08 16:49:42 UTC 2018
<snip>
Card 2:
  Card Type          : Control Function Virtual Card
  CPU Packages       : 8 [#0, #1, #2, #3, #4, #5, #6, #7]
  CPU Nodes          : 1
  CPU Cores/Threads  : 8
  Memory             : 16384M (qvpc-di-large)
  UUID/Serial Number : F9C0763A-4A4F-4BBD-AF51-BC7545774BE2
<snip>
Check the status of the card:
[local]VNF2# show card table
Tuesday May 08 16:52:53 UTC 2018
Slot         Card Type                               Oper State     SPOF  Attach
-----------  --------------------------------------  -------------  ----  ------
 1: CFC      Control Function Virtual Card           Standby        -
 2: CFC      Control Function Virtual Card           Active         No
 3: FC       4-Port Service Function Virtual Card    Active         No
 4: FC       4-Port Service Function Virtual Card    Active         No
 5: FC       4-Port Service Function Virtual Card    Active         No
 6: FC       4-Port Service Function Virtual Card    Active         No
 7: FC       4-Port Service Function Virtual Card    Active         No
 8: FC       4-Port Service Function Virtual Card    Active         No
 9: FC       4-Port Service Function Virtual Card    Active         No
10: FC       4-Port Service Function Virtual Card    Standby        -
If the card is in the active state, move the card to standby state:
[local]VNF2# card migrate from 2 to 1
Shutdown CF and EM VM from ESC
Log in to the ESC node that corresponds to the VNF and check the status of the VMs:
Log in to the ESC hosted in the compute node and check if it is in the master state. If yes, switch the ESC to standby mode:
[admin@VNF2-esc-esc-0 esc-cli]$ escadm status
0 ESC status=0 ESC Master Healthy

[admin@VNF2-esc-esc-0 ~]$ sudo service keepalived stop
Stopping keepalived: [ OK ]

[admin@VNF2-esc-esc-0 ~]$ escadm status
1 ESC status=0 In SWITCHING_TO_STOP state. Please check status after a while.

[admin@VNF2-esc-esc-0 ~]$ sudo reboot
Broadcast message from email@example.com (/dev/pts/0) at 13:32 ...
The system is going down for reboot NOW!
Remove the OSD-Compute Node from Nova Aggregate List
List the nova aggregates and identify the aggregate that corresponds to the Compute server based on the VNF hosted by it. Usually, it would be of the format <VNFNAME>-EM-MGMT<X> and <VNFNAME>-CF-MGMT<X>:
In this case, the OSD-Compute server belongs to VNF2. So, the aggregates that correspond would be VNF2-CF-MGMT2 and VNF2-EM-MGMT2.
Remove the OSD-Compute node from the aggregate identified:
nova aggregate-remove-host <Aggregate> <Host>
[stack@director ~]$ nova aggregate-remove-host VNF2-CF-MGMT2 pod1-osd-compute-0.localdomain
[stack@director ~]$ nova aggregate-remove-host VNF2-EM-MGMT2 pod1-osd-compute-0.localdomain
[stack@director ~]$ nova aggregate-remove-host POD1-AUTOIT pod1-osd-compute-0.localdomain
Verify that the OSD-Compute node has been removed from the aggregates and that the host is no longer listed under them:
nova aggregate-show <aggregate-name>
[stack@director ~]$ nova aggregate-show VNF2-CF-MGMT2
[stack@director ~]$ nova aggregate-show VNF2-EM-MGMT2
[stack@director ~]$ nova aggregate-show POD1-AUTOIT
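The check can be scripted so it is repeatable. The helper below is a sketch, not from the original document: it reads `nova aggregate-show` output on stdin and reports whether a host still appears, so it can also be exercised offline with sample text. The sample hosts line is illustrative.

```shell
# Hypothetical helper: report whether a host still appears in the piped-in
# `nova aggregate-show` output. Live usage:
#   nova aggregate-show VNF2-CF-MGMT2 | host_absent_from_aggregate pod1-osd-compute-0.localdomain
host_absent_from_aggregate() {
  if grep -q "$1"; then echo "still present"; else echo "removed"; fi
}

# Offline example against a sample (illustrative) empty hosts column:
printf '| hosts | [] |\n' | host_absent_from_aggregate pod1-osd-compute-0.localdomain
```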
Case 2. OSD-Compute Node Hosts Auto-Deploy/Auto-IT/EM/UAS
Backup the CDB of Auto-Deploy
Back up the Auto-Deploy confd CDB data periodically, or after every activation/deactivation, and save the file to a backup server. Auto-Deploy is not redundant, and if this data is lost, it will be difficult to deactivate the deployment.
Log in to the Auto-Deploy VM and back up the confd CDB directory:
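The backup can be sketched as a small archive-and-copy helper. This is an assumption-laden illustration, not the exact commands from the original document: the CDB path and backup destination in the usage comment are placeholders to be replaced with your deployment's actual paths.

```shell
# Hedged sketch: archive a confd CDB directory and print the archive path.
backup_cdb() {   # usage: backup_cdb <cdb_dir> <output_dir>
  local cdb="$1" out="$2" stamp archive
  stamp=$(date +%F-%H%M%S)
  archive="${out}/autodeploy-cdb-${stamp}.tar.gz"
  # Archive relative to the parent dir so the tarball contains just the cdb/ tree
  tar -czf "$archive" -C "$(dirname "$cdb")" "$(basename "$cdb")" && echo "$archive"
}

# Example usage (both paths are placeholders, not from the original document):
#   archive=$(backup_cdb /opt/cisco/usp/uas/confd/var/confd/cdb /home/ubuntu/backups)
#   scp "$archive" backup-server:/var/backups/
```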
Verify the status of the physical drives. It must be Unconfigured Good.
Create a Virtual drive from the physical drives with RAID Level 1
Storage > Cisco 12G SAS Modular Raid Controller (SLOT-HBA) > Physical Drive Info
Note: This image is for illustration purposes only. In the actual OSD-Compute CIMC, you see seven physical drives in slots (1, 2, 3, 7, 8, 9, 10) in the Unconfigured Good state, as no virtual drives have been created from them yet.
Storage > Cisco 12G SAS Modular Raid Controller (SLOT-HBA) > Controller Info > Create Virtual Drive from Unused Physical Drives
Select the VD and configure “Set as Boot Drive”
Enable IPMI over LAN: Admin > Communication Services > Communication Services
Similar to the BOOTOS VD created with physical drives 1 and 2, create four more virtual drives as follows:
JOURNAL > From physical drive number 3
OSD1 > From physical drive number 7
OSD2 > From physical drive number 8
OSD3 > From physical drive number 9
OSD4 > From physical drive number 10
At the end, the physical drives and virtual drives must be as shown in the image:
Note: The image shown here and the configuration steps mentioned in this section are with reference to the firmware version 3.0(3e) and there might be slight variations if you work on other versions.
Add the New OSD-Compute Node to the Overcloud
The steps mentioned in this section are common irrespective of the VM hosted by the compute node.
Add Compute server with a different index.
Create an add_node.json file with only the details of the new Compute server to be added. Ensure that the index number for the new OSD-Compute server has not been used before. Typically, increment the next highest compute value.
Example: In this 2-VNF system, osd-compute-0 is being replaced and the highest existing index is osd-compute-2, so the new server is created as osd-compute-3.
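The shape of add_node.json can be sketched as below. Every value (MAC address, IPMI address, credentials, CPU/memory/disk sizes) is a placeholder that must be taken from the actual replacement server's CIMC; only the structure is meaningful, and the `node:osd-compute-3` capability carries the new index. Import the file with the baremetal import command used in your original deployment before the introspection step shown next.

```shell
# Illustrative add_node.json for the replacement OSD-Compute with index 3.
# All field values are placeholders, not real inventory data.
cat > add_node.json <<'EOF'
{
  "nodes": [
    {
      "mac": ["00:11:22:33:44:55"],
      "capabilities": "node:osd-compute-3,boot_option:local",
      "cpu": "24",
      "memory": "256000",
      "disk": "3000",
      "arch": "x86_64",
      "pm_type": "pxe_ipmitool",
      "pm_user": "admin",
      "pm_password": "<PASSWORD>",
      "pm_addr": "192.100.0.5"
    }
  ]
}
EOF
```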
[stack@director ~]$ openstack overcloud node introspect 7eddfa87-6ae6-4308-b1d2-78c98689a56e --provide
Started Mistral Workflow. Execution ID: e320298a-6562-42e3-8ba6-5ce6d8524e5c
Waiting for introspection to finish...
Successfully introspected all nodes.
Introspection completed.
Started Mistral Workflow. Execution ID: c4a90d7b-ebf2-4fcb-96bf-e3168aa69dc9
Successfully set all nodes to available.
[stack@director ~]$ ironic node-list | grep available
| 7eddfa87-6ae6-4308-b1d2-78c98689a56e | None | None | power off | available | False |
Add IP addresses to custom-templates/layout.yml under OsdComputeIPs. In this case, when you replace OSD-Compute-0 you add that address to the end of the list for each type:
- 22.214.171.124 <<< take osd-compute-0 .43 and add here
- 126.96.36.199 << and here
- 188.8.131.52 << and here
- 184.108.40.206 << and here
Run the deploy.sh script that was previously used to deploy the stack, in order to add the new Compute node to the overcloud stack. Once the deployment completes, verify the Ceph OSD tree from the new node:
[heat-admin@pod1-osd-compute-3 ~]$ sudo ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 13.07996 root default
-2 0 host pod1-osd-compute-0
-3 4.35999 host pod1-osd-compute-2
1 1.09000 osd.1 up 1.00000 1.00000
4 1.09000 osd.4 up 1.00000 1.00000
7 1.09000 osd.7 up 1.00000 1.00000
10 1.09000 osd.10 up 1.00000 1.00000
-4 4.35999 host pod1-osd-compute-1
2 1.09000 osd.2 up 1.00000 1.00000
5 1.09000 osd.5 up 1.00000 1.00000
8 1.09000 osd.8 up 1.00000 1.00000
11 1.09000 osd.11 up 1.00000 1.00000
-5 4.35999 host pod1-osd-compute-3
0 1.09000 osd.0 up 1.00000 1.00000
3 1.09000 osd.3 up 1.00000 1.00000
6 1.09000 osd.6 up 1.00000 1.00000
9 1.09000 osd.9 up 1.00000 1.00000
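A quick offline-testable way to confirm the replacement node carries its expected four OSDs is to count the "up" `osd.N` entries under its host section in the `ceph osd tree` output. The helper below is a sketch (the function name is an assumption); the embedded sample mirrors the tree shown above, and on a live pod you would pipe `sudo ceph osd tree` in instead.

```shell
# Hypothetical helper: count "up" OSDs under one host in `ceph osd tree` output.
count_up_osds() {   # usage: sudo ceph osd tree | count_up_osds <host>
  awk -v host="$1" '
    $3 == "host" { in_host = ($4 == host); next }   # enter/leave host sections
    in_host && $3 ~ /^osd\./ && $4 == "up" { n++ }  # count up OSDs in section
    END { print n + 0 }'
}

# Sample mirroring the tree above (weights abbreviated):
sample_tree='-5 4.35999 host pod1-osd-compute-3
0 1.09000 osd.0 up 1.00000 1.00000
3 1.09000 osd.3 up 1.00000 1.00000
6 1.09000 osd.6 up 1.00000 1.00000
9 1.09000 osd.9 up 1.00000 1.00000'
printf '%s\n' "$sample_tree" | count_up_osds pod1-osd-compute-3
```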
Post Server Replacement Settings
After you add the server to the overcloud, refer to the link below in order to apply the settings that were previously present on the old server:
Restore the VMs
Case 1. OSD-Compute Node Hosting CF, ESC, EM and UAS
Addition to Nova Aggregate List
Add the OSD-Compute node to the aggregate-hosts and verify if the host has been added. In this case, the OSD-Compute node must be added to both the CF and EM host aggregates.
nova aggregate-add-host <Aggregate> <Host>

[stack@director ~]$ nova aggregate-add-host VNF2-CF-MGMT2 pod1-osd-compute-3.localdomain
[stack@director ~]$ nova aggregate-add-host VNF2-EM-MGMT2 pod1-osd-compute-3.localdomain
[stack@director ~]$ nova aggregate-add-host POD1-AUTOIT pod1-osd-compute-3.localdomain
nova aggregate-show <Aggregate>

[stack@director ~]$ nova aggregate-show VNF2-CF-MGMT2
[stack@director ~]$ nova aggregate-show VNF2-EM-MGMT2
[stack@director ~]$ nova aggregate-show POD1-AUTOIT
Recovery of UAS VM
Check the status of the UAS VM in the nova list and delete it:
[stack@director ~]$ nova list | grep VNF2-UAS-uas-0
| 307a704c-a17c-4cdc-8e7a-3d6e7e4332fa | VNF2-UAS-uas-0 | ACTIVE | - | Running | VNF2-UAS-uas-orchestration=220.127.116.11; VNF2-UAS-uas-management=18.104.22.168 |

[stack@director ~]$ nova delete VNF2-UAS-uas-0
Request to delete server VNF2-UAS-uas-0 has been accepted.
In order to recover the autovnf-uas VM, run the uas-check script to check its state; it must report an ERROR. Then run it again with the --fix option in order to recreate the missing UAS VM:
[stack@director ~]$ cd /opt/cisco/usp/uas-installer/scripts/
[stack@director scripts]$ ./uas-check.py auto-vnf VNF2-UAS
2017-12-08 12:38:05,446 - INFO: Check of AutoVNF cluster started
2017-12-08 12:38:07,925 - INFO: Instance 'vnf1-UAS-uas-0' status is 'ERROR'
2017-12-08 12:38:07,925 - INFO: Check completed, AutoVNF cluster has recoverable errors
[stack@director scripts]$ ./uas-check.py auto-vnf VNF2-UAS --fix
2017-11-22 14:01:07,215 - INFO: Check of AutoVNF cluster started
2017-11-22 14:01:09,575 - INFO: Instance 'VNF2-UAS-uas-0' status is 'ERROR'
2017-11-22 14:01:09,575 - INFO: Check completed, AutoVNF cluster has recoverable errors
2017-11-22 14:01:09,778 - INFO: Removing instance 'VNF2-UAS-uas-0'
2017-11-22 14:01:13,568 - INFO: Removed instance 'VNF2-UAS-uas-0'
2017-11-22 14:01:13,568 - INFO: Creating instance 'VNF2-UAS-uas-0' and attaching volume 'VNF2-UAS-uas-vol-0'
2017-11-22 14:01:49,525 - INFO: Created instance 'VNF2-UAS-uas-0'
Log in to autovnf-uas. Wait for a few minutes; the UAS must return to the good state:
VNF2-autovnf-uas-0#show uas
uas version 1.0.1-1
uas state ha-active
uas ha-vip 172.17.181.101
INSTANCE IP     STATE  ROLE
-----------------------------------
172.17.180.6    alive  CONFD-SLAVE
172.17.180.7    alive  CONFD-MASTER
172.17.180.9    alive  NA
Note: If uas-check.py --fix fails, you might need to copy this file and run it again.
Check the status of the ESC VM from the nova list and delete it:
[stack@director scripts]$ nova list | grep ESC-1
| c566efbf-1274-4588-a2d8-0682e17b0d41 | VNF2-ESC-ESC-1 | ACTIVE | - | Running | VNF2-UAS-uas-orchestration=22.214.171.124; VNF2-UAS-uas-management=126.96.36.199 |

[stack@director scripts]$ nova delete VNF2-ESC-ESC-1
Request to delete server VNF2-ESC-ESC-1 has been accepted.
From AutoVNF-UAS, find the ESC deployment transaction and in the log for the transaction find the boot_vm.py command line in order to create the ESC instance:
ubuntu@VNF2-uas-uas-0:~$ sudo -i
root@VNF2-uas-uas-0:~# confd_cli -u admin -C
Welcome to the ConfD CLI
admin connected from 127.0.0.1 using console on VNF2-uas-uas-0
VNF2-uas-uas-0#show transaction
TX ID                                 TX TYPE          DEPLOYMENT ID    TIMESTAMP                         STATUS
-----------------------------------------------------------------------------------------------------------------------
35eefc4a-d4a9-11e7-bb72-fa163ef8df2b  vnf-deployment   VNF2-DEPLOYMENT  2017-11-29T02:01:27.750692-00:00  deployment-success
73d9c540-d4a8-11e7-bb72-fa163ef8df2b  vnfm-deployment  VNF2-ESC         2017-11-29T01:56:02.133663-00:00  deployment-success
Save the boot_vm.py line to a shell script file (esc.sh) and update all the username ***** and password ***** lines with the correct information (typically core/<PASSWORD>). You need to remove the --encrypt_key option as well. For user_pass and user_confd_pass, use the format username:password (example: admin:<PASSWORD>).
Find the bootvm.py URL from the running-config and wget the bootvm.py file to the autovnf-uas VM. In this case, 10.1.2.3 is the Auto-IT VM's IP:
root@VNF2-uas-uas-0:~# confd_cli -u admin -C
Welcome to the ConfD CLI
admin connected from 127.0.0.1 using console on VNF2-uas-uas-0
VNF2-uas-uas-0#show running-config autovnf-vnfm:vnfm
…
configs bootvm
 value http://10.1.2.3:80/bundles/5.1.7-2007/vnfm-bundle/bootvm-2_3_2_155.py
!
root@VNF2-uas-uas-0:~# wget http://10.1.2.3:80/bundles/5.1.7-2007/vnfm-bundle/bootvm-2_3_2_155.py
--2017-12-01 20:25:52-- http://10.1.2.3/bundles/5.1.7-2007/vnfm-bundle/bootvm-2_3_2_155.py
Connecting to 10.1.2.3:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127771 (125K) [text/x-python]
Saving to: 'bootvm-2_3_2_155.py'

100%[=====================================================================================>] 127,771 --.-K/s in 0.001s

2017-12-01 20:25:52 (173 MB/s) - 'bootvm-2_3_2_155.py' saved [127771/127771]
ubuntu@VNF2-uas-uas-0:~$ ssh firstname.lastname@example.org
…
####################################################################
# ESC on VNF2-esc-esc-1.novalocal is in BACKUP state.
####################################################################

[admin@VNF2-esc-esc-1 ~]$ escadm status
0 ESC status=0 ESC Backup Healthy

[admin@VNF2-esc-esc-1 ~]$ health.sh
============== ESC HA (BACKUP) ===================================================
ESC HEALTH PASSED
Recover CF and EM VMs from ESC
Check the status of the CF and EM VMs from the nova list. They must be in the ERROR state:
Log in to new EM and verify that the EM state is up:
ubuntu@VNF2vnfddeploymentem-1:~$ /opt/cisco/ncs/current/bin/ncs_cli -u admin -C
admin connected from 172.17.180.6 using ssh on VNF2vnfddeploymentem-1
admin@scm# show ems
EM  VNFM
ID  SLA  SCM  PROXY
---------------------
2   up   up   up
3   up   up   up
Log in to the StarOS VNF and verify that the CF card is in the standby state.
Case 2. OSD-Compute Node Hosting Auto-IT, Auto-deploy, EM and UAS
Recovery of Auto-Deploy VM
From the OSPD, if the Auto-Deploy VM was impacted but still shows ACTIVE/Running, delete it first. If Auto-Deploy was not impacted, skip to Recovery of the Auto-IT VM:
[stack@director ~]$ nova list | grep auto-deploy
| 9b55270a-2dcd-4ac1-aba3-bf041733a0c9 | auto-deploy-ISO-2007-uas-0 | ACTIVE | - | Running | mgmt=172.16.181.12, 10.1.2.7 |

[stack@director ~]$ cd /opt/cisco/usp/uas-installer/scripts
Note: The recovery procedures for the EM and UAS VMs are the same in both cases. Refer to the Case 1 section for them.
Handle ESC Recovery Failure
In cases where the ESC fails to start the VM due to an unexpected state, Cisco recommends that you perform an ESC switchover with a reboot of the Master ESC. The ESC switchover takes about a minute. Run the health.sh script on the new Master ESC in order to check that the status is up; the new Master ESC then starts the VM and fixes the VM state. This recovery task takes up to five minutes to complete.
You can monitor /var/log/esc/yangesc.log and /var/log/esc/escmanager.log. If the VM is not recovered after 5 to 7 minutes, you need to perform a manual recovery of the impacted VM(s).
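The log watch can be sketched as below. This is an illustration, not from the original document: the helper name and the matched pattern are assumptions, so adjust the pattern to the recovery-complete messages your yangesc.log actually emits.

```shell
# Hedged helper: read ESC log lines on stdin and stop at the first line that
# looks like a recovery-complete message (pattern is an assumption).
watch_for_recovery() {
  grep -m1 -i "recovery.*complete" && echo "VM recovery observed"
}

# Live usage sketch, bounded to the ~7-minute window described above:
#   timeout 420 tail -F /var/log/esc/yangesc.log /var/log/esc/escmanager.log | watch_for_recovery

# Offline example with a sample (made-up) log line:
printf 'VM_RECOVERY_COMPLETE esc-vm-1\n' | watch_for_recovery
```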
Auto-Deploy Configuration Update
From the Auto-Deploy VM, edit autodeploy.cfg and replace the old OSD-Compute server with the new one. Then run load replace in confd_cli. This step is required for successful deployment deactivation later.
root@auto-deploy-iso-2007-uas-0:/home/ubuntu# confd_cli -u admin -C
Welcome to the ConfD CLI
admin connected from 127.0.0.1 using console on auto-deploy-iso-2007-uas-0
auto-deploy-iso-2007-uas-0#config
Entering configuration mode terminal
auto-deploy-iso-2007-uas-0(config)#load replace autodeploy.cfg
Loading. 14.63 KiB parsed in 0.42 sec (34.16 KiB/sec)
Restart uas-confd and Auto-Deploy services after the configuration change:
root@auto-deploy-iso-2007-uas-0:~# service uas-confd restart
uas-confd stop/waiting
uas-confd start/running, process 14078

root@auto-deploy-iso-2007-uas-0:~# service uas-confd status
uas-confd start/running, process 14078

root@auto-deploy-iso-2007-uas-0:~# service autodeploy restart
autodeploy stop/waiting
autodeploy start/running, process 14017

root@auto-deploy-iso-2007-uas-0:~# service autodeploy status
autodeploy start/running, process 14017
In order to enable the syslogs for the UCS server, the OpenStack components, and the recovered VMs, follow the sections "Re-Enable syslog for UCS and Openstack components" and "Enable syslog for the VNFs" in the link below: