This document describes the steps required to replace a faulty Hard Disk Drive (HDD) in a server in an Ultra-M setup that hosts Cisco Policy Suite (CPS) Virtual Network Functions (VNFs).
Ultra-M is a pre-packaged and validated virtualized mobile packet core solution designed to simplify the deployment of VNFs. OpenStack is the Virtualized Infrastructure Manager (VIM) for Ultra-M and consists of these node types:
Object Storage Disk - Compute (OSD - Compute)
OpenStack Platform - Director (OSPD)
The high-level architecture of Ultra-M and the components involved are as shown in this image:
Note: The procedures in this document are defined against the Ultra M 5.1.x release. This document is intended for Cisco personnel who are familiar with the Cisco Ultra-M platform, and it details the steps to be carried out at the OpenStack level at the time of the HDD replacement.
VNF: Virtual Network Function
ESC: Elastic Services Controller
MoP: Method of Procedure
OSD: Object Storage Disks
HDD: Hard Disk Drive
SSD: Solid State Drive
VIM: Virtual Infrastructure Manager
UAS: Ultra Automation Services
UUID: Universally Unique IDentifier
Workflow of the MoP
Single HDD Failure
1. Each baremetal server is provisioned with two HDDs that act as the boot disk in a RAID 1 configuration. Because RAID 1 provides redundancy, a single faulty HDD can be Hot Swapped.
3. In case of a single HDD failure, only the faulty HDD is Hot Swapped, so no BIOS upgrade procedure is required after you replace the disk.
4. After you replace the disk, wait for the data to sync between the disks. This might take a couple of hours to complete.
5. In an OpenStack-based (Ultra-M) solution, a UCS 240M4 baremetal server can take one of these roles: Compute, OSD-Compute, Controller, or OSPD.
6. The steps required to handle a single HDD failure are the same for each of these server roles. This section describes the health checks to perform before the Hot Swap of the disk.
Single HDD Failure on a Compute Server
1. If an HDD failure is observed in a UCS 240M4 that acts as a Compute node, perform these health checks before you initiate the Hot Swap of the faulty disk.
2. Identify the VMs that run on this server and verify that their functions are healthy.
Identify the VMs Hosted in the Compute Node
Identify the VMs that are hosted on the Compute server and verify that they are active and running.
The Compute server hosts a combination of CPS VMs and Elastic Services Controller (ESC) VMs:
[stack@director ~]$ nova list --field name,host | grep compute-8
| 507d67c2-1d00-4321-b9d1-da879af524f8 | VNF2-DEPLOYM_XXXX_0_c8d98f0f-d874-45d0-af75-88a2d6fa82ea | pod1-compute-8.localdomain | ACTIVE |
| f9c0763a-4a4f-4bbd-af51-bc7545774be2 | VNF2-DEPLOYM_c2_0_df4be88d-b4bf-4456-945a-3812653ee229 | pod1-compute-8.localdomain | ACTIVE |
| 75528898-ef4b-4d68-b05d-882014708694 | VNF2-ESC-ESC-0 | pod1-compute-8.localdomain | ACTIVE |
Note: In the output shown here, the first column is the Universally Unique IDentifier (UUID), the second column is the VM name, the third column is the hostname where the VM is present, and the fourth column is the VM status.
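The check that every VM on the affected Compute node is ACTIVE can be scripted. This is a minimal sketch, not part of the original procedure: the function name `non_active_vms` is hypothetical, and it assumes the four-column row layout shown in the output above (UUID, VM name, host, status).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original document).
# Reads `nova list` table rows on stdin and prints "<vm-name> <status>" for
# every VM on the given host whose status is not ACTIVE.
# Assumes rows of the form: | UUID | name | host | status |
non_active_vms() {
    local host="$1"
    awk -F'|' -v host="$host" '
        {
            # Trim the padding around every column.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        # Match the exact host name or the host name followed by a domain,
        # so that compute-8 does not also match compute-80.
        ($4 == host || index($4, host ".") == 1) && $5 != "ACTIVE" {
            print $3 " " $5
        }
    '
}

# Example usage on the director (prints nothing when all VMs are ACTIVE):
#   nova list --field name,host | non_active_vms pod1-compute-8
```

An empty result means every VM reported for that host is ACTIVE. Any line that is printed names a VM to investigate before the Hot Swap.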
1. Log in to the ESC hosted in the compute node and check the status.
[admin@VNF2-esc-esc-0 esc-cli]$ escadm status
0 ESC status=0 ESC Master Healthy
2. Log in to the UAS hosted in the compute node and check the status.
ubuntu@autovnf2-uas-1:~$ sudo su
root@autovnf2-uas-1:/home/ubuntu# confd_cli -u admin -C
Welcome to the ConfD CLI
admin connected from 127.0.0.1 using console on autovnf2-uas-1
autovnf2-uas-1#show uas ha
uas ha-vip 172.18.181.101
autovnf2-uas-1#
autovnf2-uas-1#
autovnf2-uas-1#show uas
uas version 1.0.1-1
uas state ha-active
uas ha-vip 172.18.181.101
INSTANCE IP    STATE  ROLE
-----------------------------------
172.18.180.4   alive  CONFD-SLAVE
172.18.180.5   alive  CONFD-MASTER
172.18.180.8   alive  NA

autovnf2-uas-1#show errors
% No entries found.
3. If the health checks pass, proceed with the Hot Swap of the faulty disk and wait for the data sync, which might take a couple of hours to complete. Refer to Replacing the Server Components.
4. Repeat these health checks to confirm that the health of the VMs hosted on the Compute node is restored.
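Because the same ESC and UAS checks run both before and after the Hot Swap, it can help to evaluate the command output mechanically. The sketch below is illustrative only: the function names are hypothetical, and the parsing assumes the output formats shown in the steps above (the `escadm status` summary line and the INSTANCE IP / STATE / ROLE table from `show uas`).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original document).

# Succeeds if the `escadm status` output on stdin reports status 0 and Healthy.
esc_is_healthy() {
    grep -q 'ESC status=0 .*Healthy'
}

# Reads `show uas` output on stdin and prints the IP of every instance
# whose STATE column is not "alive". Only three-column rows that start
# with an IPv4 address are treated as instance rows.
uas_not_alive() {
    awk 'NF == 3 && $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $2 != "alive" { print $1 }'
}

# Example usage with output saved to files beforehand:
#   esc_is_healthy < esc_status.txt && echo "ESC healthy"
#   uas_not_alive  < show_uas.txt
```

Saving the output before the swap and again after the data sync also gives a simple before/after record for the final verification step.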
Single HDD Failure on a Controller Server
1. If an HDD failure is observed in a UCS 240M4 that acts as a Controller node, perform these health checks before you initiate the Hot Swap of the faulty disk.
2. Check the Pacemaker status on the controllers.
3. Log in to one of the active controllers and check the Pacemaker status. All services must be running on the available controllers and stopped on the failed controller.
[heat-admin@pod1-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod1-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Thu Jun 28 07:53:06 2018 Last change: Wed Jan 17 11:38:00 2018 by root via cibadmin on pod1-controller-0
4. Check the MariaDB status in the active controllers.
[stack@director ~]$ nova list | grep control
| 4361358a-922f-49b5-89d4-247a50722f6d | pod1-controller-0 | ACTIVE | - | Running | ctlplane=184.108.40.206 |
| d0f57f27-93a8-414f-b4d8-957de0d785fc | pod1-controller-1 | ACTIVE | - | Running | ctlplane=220.127.116.11 |

[stack@director ~]$ for i in 18.104.22.168 22.214.171.124 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_local_state_comment'\" ; sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_cluster_size'\""; done
*** 126.96.36.199 ***
Variable_name Value
wsrep_local_state_comment Synced
Variable_name Value
wsrep_cluster_size 2
*** 188.8.131.52 ***
Variable_name Value
wsrep_local_state_comment Synced
Variable_name Value
wsrep_cluster_size 2
Verify that these lines are present for each active controller:
wsrep_local_state_comment: Synced
wsrep_cluster_size: 2
5. Check the RabbitMQ status on the active controllers.
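The MariaDB check in step 4 reduces to two conditions per controller: the node state is Synced and the cluster size matches the number of active controllers. As a rough sketch (the function name `galera_node_ok` and the placeholder `<controller-ip>` are assumptions for illustration), the output of the two `SHOW STATUS` queries can be validated like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original document).
# Reads the output of the two SHOW STATUS queries on stdin and succeeds only
# if wsrep_local_state_comment is "Synced" and wsrep_cluster_size equals the
# expected size passed as the first argument.
galera_node_ok() {
    local expected_size="$1"
    awk -v want="$expected_size" '
        $1 == "wsrep_local_state_comment" { state = $2 }
        $1 == "wsrep_cluster_size"        { size = $2 }
        END { exit !(state == "Synced" && size == want) }
    '
}

# Example usage from the director, reusing the queries shown in step 4:
#   ssh heat-admin@<controller-ip> "sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_local_state_comment'\" ; sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_cluster_size'\"" \
#       | galera_node_ok 2 || echo "controller not healthy"
```

The expected size is passed as an argument rather than hard-coded because it depends on how many controllers are active at the time of the check.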
neutron-dhcp-agent.service loaded active running OpenStack Neutron DHCP Agent
neutron-openvswitch-agent.service loaded active running OpenStack Neutron Open vSwitch Agent
neutron-ovs-cleanup.service loaded active exited OpenStack Neutron Open vSwitch Cleanup Utility
neutron-server.service loaded active running OpenStack Neutron Server
openstack-aodh-evaluator.service loaded active running OpenStack Alarm evaluator service
openstack-aodh-listener.service loaded active running OpenStack Alarm listener service
openstack-aodh-notifier.service loaded active running OpenStack Alarm notifier service
openstack-ceilometer-central.service loaded active running OpenStack ceilometer central agent
openstack-ceilometer-collector.service loaded active running OpenStack ceilometer collection service
openstack-ceilometer-notification.service loaded active running OpenStack ceilometer notification agent
openstack-glance-api.service loaded active running OpenStack Image Service (code-named Glance) API server
openstack-glance-registry.service loaded active running OpenStack Image Service (code-named Glance) Registry server
openstack-heat-api-cfn.service loaded active running Openstack Heat CFN-compatible API Service
openstack-heat-api.service loaded active running OpenStack Heat API Service
openstack-heat-engine.service loaded active running Openstack Heat Engine Service
openstack-ironic-api.service loaded active running OpenStack Ironic API service
openstack-ironic-conductor.service loaded active running OpenStack Ironic Conductor service
openstack-ironic-inspector-dnsmasq.service loaded active running PXE boot dnsmasq service for Ironic Inspector
openstack-ironic-inspector.service loaded active running Hardware introspection service for OpenStack Ironic
openstack-mistral-api.service loaded active running Mistral API Server
openstack-mistral-engine.service loaded active running Mistral Engine Server
openstack-mistral-executor.service loaded active running Mistral Executor Server
openstack-nova-api.service loaded active running OpenStack Nova API Server
openstack-nova-cert.service loaded active running OpenStack Nova Cert Server
openstack-nova-compute.service loaded active running OpenStack Nova Compute Server
openstack-nova-conductor.service loaded active running OpenStack Nova Conductor Server
openstack-nova-scheduler.service loaded active running OpenStack Nova Scheduler Server
openstack-swift-account-reaper.service loaded active running OpenStack Object Storage (swift) - Account Reaper
openstack-swift-account.service loaded active running OpenStack Object Storage (swift) - Account Server
openstack-swift-container-updater.service loaded active running OpenStack Object Storage (swift) - Container Updater
openstack-swift-container.service loaded active running OpenStack Object Storage (swift) - Container Server
openstack-swift-object-updater.service loaded active running OpenStack Object Storage (swift) - Object Updater
openstack-swift-object.service loaded active running OpenStack Object Storage (swift) - Object Server
openstack-swift-proxy.service loaded active running OpenStack Object Storage (swift) - Proxy Server
openstack-zaqar.service loaded active running OpenStack Message Queuing Service (code-named Zaqar) Server
openstack-zaqar@1.service loaded active running OpenStack Message Queuing Service (code-named Zaqar) Server Instance 1
openvswitch.service loaded active exited Open vSwitch

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

37 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
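With a unit listing of this length, any service that is not active is easy to miss by eye. The following sketch filters the listing down to the problem rows. The function name `inactive_units` is hypothetical, and the parsing assumes the UNIT / LOAD / ACTIVE / SUB column order described in the legend above, with an optional leading marker column that systemctl prints in front of failed units.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original document).
# Reads `systemctl list-units` rows on stdin and prints the name of every
# .service unit whose ACTIVE column is not "active".
inactive_units() {
    awk '{
        # The unit name is normally field 1. It is field 2 when systemctl
        # prefixes the row with a status marker (assumed for failed units).
        for (i = 1; i <= 2; i++) {
            if ($i ~ /\.service$/) {
                if ($(i + 2) != "active") print $i
                break
            }
        }
    }'
}

# Example usage with a listing saved to a file beforehand:
#   inactive_units < unit_listing.txt
```

An empty result indicates that every listed service is in the active state. `systemctl list-units --state=failed` is an alternative that asks systemd directly for failed units.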
4. If the health checks pass, proceed with the Hot Swap of the faulty disk and wait for the data sync, which might take a couple of hours to complete. Refer to Replacing the Server Components.
5. Repeat these health checks to confirm that the health of the OSPD node is restored.