Replacement of Faulty Components on Server UCS C240 M4 - vEPC

Updated: July 2, 2018

Document ID: 213464

Introduction

Background Information

Abbreviations

Workflow of the MoP

Prerequisites

Backup

Component RMA - Compute/OSD-Compute Node

Identify the VMs Hosted in the Compute/OSD-Compute Node

Graceful Power Off

Case 1. Compute Node Hosts only SF VM

Case 2. Compute/OSD-Compute Node Hosts CF/ESC/EM/UAS

Replace the Faulty Component from the Compute/OSD-Compute Node

Restore the VMs

Case 1. Compute Node Hosts only SF VM

Case 2. Compute/OSD-Compute Node Hosts CF, ESC, EM and UAS

Handle ESC Recovery Failure

Auto-Deploy Configuration Update

Component RMA - Controller Node

Pre-Check

Move Controller Cluster to Maintenance Mode

Replace the Faulty Component from the Controller Node

Power On the Server

Introduction

This document describes the steps required to replace the faulty components listed here in a Unified Computing System (UCS) server in an Ultra-M setup that hosts StarOS Virtual Network Functions (VNFs).

Dual In-line Memory Module (DIMM) Replacement MOP
FlexFlash Controller Failure
Solid State Drive (SSD) Failure
Trusted Platform Module (TPM) Failure
RAID Cache Failure
RAID Controller/Host Bus Adapter (HBA) Failure
PCI Riser Failure
PCIe adapter Intel X520 10G Failure
Modular LAN-on Motherboard (MLOM) Failure
Fan tray RMA
CPU Failure

Background Information

Ultra-M is a pre-packaged and validated virtualized mobile packet core solution that is designed in order to simplify the deployment of VNFs. OpenStack is the Virtualized Infrastructure Manager (VIM) for Ultra-M and consists of these node types:

Compute
Object Storage Disk - Compute (OSD - Compute)
Controller
OpenStack Platform - Director (OSPD)

The high-level architecture of Ultra-M and the components involved are depicted in this image:

This document is intended for Cisco personnel who are familiar with the Cisco Ultra-M platform. It details the steps to be carried out at the OpenStack and StarOS VNF level when a component is replaced in the server.

Note: Ultra M 5.1.x release is considered in order to define the procedures in this document.

Abbreviations

VNF	Virtual Network Function
CF	Control Function
SF	Service Function
ESC	Elastic Services Controller
MOP	Method of Procedure
OSD	Object Storage Disks
HDD	Hard Disk Drive
SSD	Solid State Drive
VIM	Virtualized Infrastructure Manager
VM	Virtual Machine
EM	Element Manager
UAS	Ultra Automation Services
UUID	Universally Unique IDentifier

Workflow of the MoP

Prerequisites

Backup

Before you replace a faulty component, check the current state of your Red Hat OpenStack Platform environment in order to avoid complications while the replacement is in progress. The replacement flow in this document is designed to achieve this.

So that the environment can be recovered if required, Cisco recommends that you take a backup of the OSPD database with these steps:

[root@director ~]# mysqldump --opt --all-databases > /root/undercloud-all-databases.sql
[root@director ~]# tar --xattrs -czf undercloud-backup-`date +%F`.tar.gz /root/undercloud-all-databases.sql /etc/my.cnf.d/server.cnf /var/lib/glance/images /srv/node /home/stack
tar: Removing leading `/' from member names

This process ensures that a node can be replaced without affecting the availability of any instances. Cisco also recommends that you back up the StarOS configuration, especially if the compute/OSD-compute node to be replaced hosts the Control Function (CF) Virtual Machine (VM).

Note: If the server is the Controller node, proceed to the section "Component RMA - Controller Node"; otherwise, continue with the next section.

Component RMA - Compute/OSD-Compute Node

Identify the VMs Hosted in the Compute/OSD-Compute Node

Identify the VMs that are hosted on the server. There can be two possibilities:

The server contains only Service Function (SF) VM:

[stack@director ~]$ nova list --field name,host | grep compute-10
| 49ac5f22-469e-4b84-badc-031083db0533 | VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d     | pod1-compute-10.localdomain    |

The server contains a combination of Control Function (CF)/Elastic Services Controller (ESC)/Element Manager (EM)/Ultra Automation Services (UAS) VMs:

[stack@director ~]$ nova list --field name,host | grep compute-8
| 507d67c2-1d00-4321-b9d1-da879af524f8 | VNF2-DEPLOYM_XXXX_0_c8d98f0f-d874-45d0-af75-88a2d6fa82ea | pod1-compute-8.localdomain     |
| f9c0763a-4a4f-4bbd-af51-bc7545774be2 | VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229     | pod1-compute-8.localdomain     |
| 75528898-ef4b-4d68-b05d-882014708694 | VNF2-ESC-ESC-0                                             | pod1-compute-8.localdomain     |
| f5bd7b9c-476a-4679-83e5-303f0aae9309 | VNF2-UAS-uas-0                                             | pod1-compute-8.localdomain     |

Note: In the output shown here, the first column corresponds to the Universally Unique IDentifier (UUID), the second column is the VM name and the third column is the hostname where the VM is present. The parameters from this output will be used in subsequent sections.
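
The three columns described in the note can also be pulled out of the table programmatically. The snippet below is only a sketch: the function name parse_nova_rows is hypothetical, and it assumes the output of nova list --field name,host has been captured as text in the row layout shown above.

```python
def parse_nova_rows(text):
    """Return (uuid, vm_name, host) tuples from `nova list --field name,host` rows."""
    rows = []
    for line in text.splitlines():
        # Data rows look like: | <uuid> | <name> | <host> |
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only rows whose first cell has the shape of a UUID.
        if len(cells) == 3 and len(cells[0]) == 36 and cells[0].count("-") == 4:
            rows.append(tuple(cells))
    return rows


sample = """\
| 507d67c2-1d00-4321-b9d1-da879af524f8 | VNF2-DEPLOYM_XXXX_0_c8d98f0f-d874-45d0-af75-88a2d6fa82ea | pod1-compute-8.localdomain     |
| 75528898-ef4b-4d68-b05d-882014708694 | VNF2-ESC-ESC-0                                             | pod1-compute-8.localdomain     |
"""
vms = parse_nova_rows(sample)
for uuid, name, host in vms:
    print(uuid, name, host)
```

The UUID and VM name returned here are the values that the later StarOS and ESC steps refer to.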

Graceful Power Off

Case 1. Compute Node Hosts only SF VM

Migrate SF Card to Standby State

Log in to the StarOS VNF and identify the card that corresponds to the SF VM. Use the UUID of the SF VM identified from the section "Identify the VMs hosted in the Compute/OSD-Compute Node", and identify the card that corresponds to the UUID:

[local]VNF2# show card hardware
Tuesday May 08 16:49:42 UTC 2018
<snip>
Card 8:
  Card Type               : 4-Port Service Function Virtual Card
  CPU Packages            : 26 [#0, #1, #2, #3, #4, #5, #6, #7, #8, #9, #10, #11, #12, #13, #14, #15, #16, #17, #18, #19, #20, #21, #22, #23, #24, #25]
  CPU Nodes               : 2
  CPU Cores/Threads       : 26
  Memory                  : 98304M (qvpc-di-large)
  UUID/Serial Number      :  49AC5F22-469E-4B84-BADC-031083DB0533
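
Matching the UUID by eye is error-prone because nova prints it in lowercase while StarOS prints it in uppercase. As a rough illustration, and assuming the show card hardware output has been saved as text, a case-insensitive lookup could look like this (card_for_uuid is a hypothetical helper name, not a StarOS or Cisco tool):

```python
import re


def card_for_uuid(show_card_hardware, vm_uuid):
    """Return the card number whose UUID/Serial Number matches vm_uuid, else None."""
    card = None
    for line in show_card_hardware.splitlines():
        header = re.match(r"\s*Card (\d+):", line)
        if header:
            card = int(header.group(1))  # remember which card the next fields belong to
            continue
        field = re.match(r"\s*UUID/Serial Number\s*:\s*(\S+)", line)
        # Compare case-insensitively: nova uses lowercase, StarOS uppercase.
        if field and field.group(1).lower() == vm_uuid.lower():
            return card
    return None


sample = """\
Card 8:
  Card Type               : 4-Port Service Function Virtual Card
  UUID/Serial Number      : 49AC5F22-469E-4B84-BADC-031083DB0533
"""
card = card_for_uuid(sample, "49ac5f22-469e-4b84-badc-031083db0533")
print(card)
```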

Check the status of the card:

[local]VNF2# show card table
Tuesday May 08 16:52:53 UTC 2018
Slot         Card Type                               Oper State     SPOF  Attach
-----------  --------------------------------------  -------------  ----  ------
 1: CFC      Control Function Virtual Card           Active         No         
 2: CFC      Control Function Virtual Card           Standby        -          
 3: FC       4-Port Service Function Virtual Card    Active         No         
 4: FC       4-Port Service Function Virtual Card    Active         No         
 5: FC       4-Port Service Function Virtual Card    Active         No         
 6: FC       4-Port Service Function Virtual Card    Active         No         
 7: FC       4-Port Service Function Virtual Card    Active         No         
 8: FC       4-Port Service Function Virtual Card    Active         No         
 9: FC       4-Port Service Function Virtual Card    Active         No         
10: FC       4-Port Service Function Virtual Card    Standby        -

If the card is in the active state, move the card to standby state:

[local]VNF2# card migrate from 8 to 10
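
The choice of migration target can be sketched in code. The example below is an illustration under stated assumptions: it parses show card table text in the layout shown above, and the helper names parse_card_table and standby_target are hypothetical. It only suggests a standby slot of the same type; the migration itself is still issued on the StarOS CLI.

```python
import re

# slot number, slot type, card description, operational state
ROW = re.compile(r"^\s*(\d+):\s+(\S+)\s+(.+?)\s{2,}(\S+)")


def parse_card_table(text):
    """Map slot number -> (slot type, oper state) from `show card table` rows."""
    cards = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            cards[int(m.group(1))] = (m.group(2), m.group(4))
    return cards


def standby_target(cards, slot):
    """Return a standby slot of the same type as `slot`, or None."""
    slot_type, state = cards[slot]
    if state != "Active":
        return None  # the card is not active, so there is nothing to migrate
    for other, (other_type, other_state) in sorted(cards.items()):
        if other_type == slot_type and other_state == "Standby":
            return other
    return None


sample = """\
 2: CFC      Control Function Virtual Card           Standby        -
 8: FC       4-Port Service Function Virtual Card    Active         No
 9: FC       4-Port Service Function Virtual Card    Active         No
10: FC       4-Port Service Function Virtual Card    Standby        -
"""
cards = parse_card_table(sample)
target = standby_target(cards, 8)
print(f"card migrate from 8 to {target}")
```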

Shutdown SF VM from ESC

[admin@VNF2-esc-esc-0 ~]$ cd /opt/cisco/esc/esc-confd/esc-cli
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli get esc_datamodel | egrep --color "<state>|<vm_name>|<vm_id>|<deployment_name>"
<snip>
<state>SERVICE_ACTIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229</vm_name>
                    <state>VM_ALIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d</vm_name>
                    <state>VM_ALIVE_STATE</state>
<snip>

Stop the SF VM with the use of its VM Name. (VM Name noted from section "Identify the VMs hosted in the Compute/OSD-Compute Node"):

[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli vm-action STOP VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d

Once it is stopped, the VM must enter the SHUTOFF state:

[admin@VNF2-esc-esc-0 ~]$ cd /opt/cisco/esc/esc-confd/esc-cli
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli get esc_datamodel | egrep --color "<state>|<vm_name>|<vm_id>|<deployment_name>"
<snip>
<state>SERVICE_ACTIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229</vm_name>
                    <state>VM_ALIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_c3_0_3e0db133-c13b-4e3d-ac14-</vm_name>
                    <state>VM_ALIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d</vm_name>
                    <state>VM_SHUTOFF_STATE</state>
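
The state check can be automated once the filtered esc_datamodel output is captured as text. This sketch assumes that the raw CLI output carries the vm_name and state tags that the egrep filter selects on; vm_states is a hypothetical helper name.

```python
import re


def vm_states(filtered_xml):
    """Map vm_name -> VM state from the egrep-filtered esc_datamodel output."""
    states = {}
    current = None
    for line in filtered_xml.splitlines():
        name = re.search(r"<vm_name>(.*?)</vm_name>", line)
        if name:
            current = name.group(1)
            continue
        # Only VM_* states belong to a VM; SERVICE_* states describe the deployment.
        state = re.search(r"<state>(VM_\w+)</state>", line)
        if state and current:
            states[current] = state.group(1)
            current = None
    return states


sample = """\
<state>SERVICE_ACTIVE_STATE</state>
    <vm_name>VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229</vm_name>
    <state>VM_ALIVE_STATE</state>
    <vm_name>VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d</vm_name>
    <state>VM_SHUTOFF_STATE</state>
"""
states = vm_states(sample)
sf_vm = "VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d"
print(states[sf_vm] == "VM_SHUTOFF_STATE")
```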

Case 2. Compute/OSD-Compute Node Hosts CF/ESC/EM/UAS

Migrate CF Card to Standby State

Log in to the StarOS VNF and identify the card that corresponds to the CF VM. Use the UUID of the CF VM identified from the section "Identify the VMs Hosted in the Compute/OSD-Compute Node", and find the card that corresponds to the UUID:

[local]VNF2# show card hardware
Tuesday May 08 16:49:42 UTC 2018
<snip>
Card 2:
  Card Type               : Control Function Virtual Card
  CPU Packages            : 8 [#0, #1, #2, #3, #4, #5, #6, #7]
  CPU Nodes               : 1
  CPU Cores/Threads       : 8
  Memory                  : 16384M (qvpc-di-large)
  UUID/Serial Number      : F9C0763A-4A4F-4BBD-AF51-BC7545774BE2
<snip>

Check the status of the card:

[local]VNF2# show card table
Tuesday May 08 16:52:53 UTC 2018
Slot         Card Type                               Oper State     SPOF  Attach
-----------  --------------------------------------  -------------  ----  ------
 1: CFC      Control Function Virtual Card           Standby        -
 2: CFC      Control Function Virtual Card           Active         No          
 3: FC       4-Port Service Function Virtual Card    Active         No         
 4: FC       4-Port Service Function Virtual Card    Active         No         
 5: FC       4-Port Service Function Virtual Card    Active         No         
 6: FC       4-Port Service Function Virtual Card    Active         No         
 7: FC       4-Port Service Function Virtual Card    Active         No         
 8: FC       4-Port Service Function Virtual Card    Active         No         
 9: FC       4-Port Service Function Virtual Card    Active         No         
10: FC       4-Port Service Function Virtual Card    Standby        -

If the card is in the active state, move the card to standby state:

[local]VNF2# card migrate from 2 to 1

Shutdown CF and EM VM from ESC

[admin@VNF2-esc-esc-0 ~]$ cd /opt/cisco/esc/esc-confd/esc-cli
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli get esc_datamodel | egrep --color "<state>|<vm_name>|<vm_id>|<deployment_name>"
<snip>
<state>SERVICE_ACTIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229</vm_name>
                    <state>VM_ALIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_c3_0_3e0db133-c13b-4e3d-ac14-</vm_name>
                    <state>VM_ALIVE_STATE</state>
<deployment_name>VNF2-DEPLOYMENT-em</deployment_name>
                  <vm_id>507d67c2-1d00-4321-b9d1-da879af524f8</vm_id>
                  <vm_id>dc168a6a-4aeb-4e81-abd9-91d7568b5f7c</vm_id>
                  <vm_id>9ffec58b-4b9d-4072-b944-5413bf7fcf07</vm_id>
                <state>SERVICE_ACTIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_XXXX_0_c8d98f0f-d874-45d0-af75-88a2d6fa82ea</vm_name>
                    <state>VM_ALIVE_STATE</state>
<snip>

Stop the CF and EM VMs one by one with the use of their VM Names. (VM Names noted from section "Identify the VMs Hosted in the Compute/OSD-Compute Node"):

[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli vm-action STOP VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229

[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli vm-action STOP VNF2-DEPLOYM_XXXX_0_c8d98f0f-d874-45d0-af75-88a2d6fa82ea

After they stop, the VMs must enter the SHUTOFF state:

[admin@VNF2-esc-esc-0 ~]$ cd /opt/cisco/esc/esc-confd/esc-cli
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli get esc_datamodel | egrep --color "<state>|<vm_name>|<vm_id>|<deployment_name>"
<snip>
<state>SERVICE_ACTIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229</vm_name>
                    <state>VM_SHUTOFF_STATE</state>
                    <vm_name>VNF2-DEPLOYM_c3_0_3e0db133-c13b-4e3d-ac14-</vm_name>
                    <state>VM_ALIVE_STATE</state>
<deployment_name>VNF2-DEPLOYMENT-em</deployment_name>
                  <vm_id>507d67c2-1d00-4321-b9d1-da879af524f8</vm_id>
                  <vm_id>dc168a6a-4aeb-4e81-abd9-91d7568b5f7c</vm_id>
                  <vm_id>9ffec58b-4b9d-4072-b944-5413bf7fcf07</vm_id>
                <state>SERVICE_ACTIVE_STATE</state>
                    <vm_name>VNF2-DEPLOYM_XXXX_0_c8d98f0f-d874-45d0-af75-88a2d6fa82ea</vm_name>
                    <state>VM_SHUTOFF_STATE</state>
<snip>

Migrate ESC to Standby Mode

[admin@VNF2-esc-esc-0 esc-cli]$ escadm status
0 ESC status=0 ESC Master Healthy


[admin@VNF2-esc-esc-0 ~]$ sudo service keepalived stop
Stopping keepalived:                                       [  OK  ]

[admin@VNF2-esc-esc-0 ~]$ escadm status
1 ESC status=0 In SWITCHING_TO_STOP state. Please check status after a while.

[admin@VNF2-esc-esc-0 ~]$ sudo reboot
Broadcast message from admin@vnf1-esc-esc-0.novalocal
       (/dev/pts/0) at 13:32 ...
The system is going down for reboot NOW!

Note: If the faulty component is to be replaced on an OSD-Compute node, put Ceph into maintenance mode on the server before you proceed with the component replacement.

[admin@osd-compute-0 ~]$ sudo ceph osd set norebalance
set norebalance

[admin@osd-compute-0 ~]$ sudo ceph osd set noout
set noout

[admin@osd-compute-0 ~]$ sudo ceph status
    cluster eb2bb192-b1c9-11e6-9205-525400330666
     health HEALTH_WARN
            noout,norebalance,sortbitwise,require_jewel_osds flag(s) set
     monmap e1: 3 mons at {tb3-ultram-pod1-controller-0=11.118.0.40:6789/0,tb3-ultram-pod1-controller-1=11.118.0.41:6789/0,tb3-ultram-pod1-controller-2=11.118.0.42:6789/0}
            election epoch 58, quorum 0,1,2 tb3-ultram-pod1-controller-0,tb3-ultram-pod1-controller-1,tb3-ultram-pod1-controller-2
     osdmap e194: 12 osds: 12 up, 12 in
            flags noout,norebalance,sortbitwise,require_jewel_osds
      pgmap v584865: 704 pgs, 6 pools, 531 GB data, 344 kobjects
            1585 GB used, 11808 GB / 13393 GB avail
                 704 active+clean
  client io 463 kB/s rd, 14903 kB/s wr, 263 op/s rd, 542 op/s wr
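
Before the server is powered off, it is worth confirming that both maintenance flags appear in the ceph status output. The following is a minimal sketch, assuming the status text has been captured in the format shown above; ceph_flags is a hypothetical helper name.

```python
def ceph_flags(status_text):
    """Return the set of OSD flags listed on the `flags` line of `ceph status`."""
    for line in status_text.splitlines():
        words = line.split()
        # The osdmap section carries a line of the form: flags a,b,c
        if len(words) == 2 and words[0] == "flags":
            return set(words[1].split(","))
    return set()


REQUIRED = {"noout", "norebalance"}

sample = """\
     health HEALTH_WARN
            noout,norebalance,sortbitwise,require_jewel_osds flag(s) set
     osdmap e194: 12 osds: 12 up, 12 in
            flags noout,norebalance,sortbitwise,require_jewel_osds
"""
flags = ceph_flags(sample)
missing = REQUIRED - flags
print("maintenance flags set" if not missing else f"missing flags: {sorted(missing)}")
```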

Replace the Faulty Component from the Compute/OSD-Compute Node

Power off the specified server. For the steps to replace a faulty component on the UCS C240 M4 server, refer to:

Replacing the Server Components

Restore the VMs

Case 1. Compute Node Hosts only SF VM

SF VM Recovery from ESC

The SF VM is in the ERROR state in the nova list:

[stack@director  ~]$ nova list |grep VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d
| 49ac5f22-469e-4b84-badc-031083db0533 | VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d     | ERROR  | -          | NOSTATE     |

Recover the SF VM from the ESC:

[admin@VNF2-esc-esc-0 ~]$ sudo /opt/cisco/esc/esc-confd/esc-cli/esc_nc_cli recovery-vm-action DO VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d
[sudo] password for admin: 

Recovery VM Action
/opt/cisco/esc/confd/bin/netconf-console --port=830 --host=127.0.0.1 --user=admin --privKeyFile=/root/.ssh/confd_id_dsa --privKeyType=dsa --rpc=/tmp/esc_nc_cli.ZpRCGiieuW
<?xml version="1.0" encoding="UTF-8"?>
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="1">
  <ok/>
</rpc-reply>

Monitor the yangesc.log:

[admin@VNF2-esc-esc-0 ~]$ tail -f /var/log/esc/yangesc.log
…
14:59:50,112 07-Nov-2017 WARN  Type: VM_RECOVERY_COMPLETE
14:59:50,112 07-Nov-2017 WARN  Status: SUCCESS
14:59:50,112 07-Nov-2017 WARN  Status Code: 200
14:59:50,112 07-Nov-2017 WARN  Status Msg: Recovery: Successfully recovered VM [VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d].
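
When the log is long, the success notification for a specific VM can be located with a small script instead of by eye. This is only a sketch: it assumes the notification is logged as the Type, Status, Status Code and Status Msg sequence shown above, and recovery_succeeded is a hypothetical helper name.

```python
import re


def recovery_succeeded(log_text, vm_name):
    """True if the log shows VM_RECOVERY_COMPLETE with Status SUCCESS for vm_name."""
    saw_complete = saw_success = False
    for line in log_text.splitlines():
        if "Type: VM_RECOVERY_COMPLETE" in line:
            saw_complete, saw_success = True, False
        elif saw_complete and re.search(r"Status: SUCCESS\b", line):
            saw_success = True
        elif saw_complete and saw_success and "Status Msg:" in line:
            if f"[{vm_name}]" in line:
                return True
            # The notification was for a different VM; keep scanning.
            saw_complete = saw_success = False
    return False


sample = """\
14:59:50,112 07-Nov-2017 WARN  Type: VM_RECOVERY_COMPLETE
14:59:50,112 07-Nov-2017 WARN  Status: SUCCESS
14:59:50,112 07-Nov-2017 WARN  Status Code: 200
14:59:50,112 07-Nov-2017 WARN  Status Msg: Recovery: Successfully recovered VM [VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d].
"""
ok = recovery_succeeded(sample, "VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d")
print(ok)
```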

Ensure that the SF card comes up as a standby SF in the VNF.

Case 2. Compute/OSD-Compute Node Hosts CF, ESC, EM and UAS

Recovery of UAS VM

Check the status of the UAS VM in the nova list and delete it:

[stack@director ~]$ nova list | grep VNF2-UAS-uas-0
| 307a704c-a17c-4cdc-8e7a-3d6e7e4332fa | VNF2-UAS-uas-0                                                 | ACTIVE | -          | Running     | VNF2-UAS-uas-orchestration=172.168.11.10; VNF2-UAS-uas-management=172.168.10.3
[stack@tb5-ospd ~]$ nova delete VNF2-UAS-uas-0
Request to delete server VNF2-UAS-uas-0 has been accepted.

In order to recover the autovnf-uas VM, run the uas-check script to check the state. It must report an error. Then run the script again with the --fix option in order to recreate the missing UAS VM:

[stack@director ~]$ cd /opt/cisco/usp/uas-installer/scripts/
[stack@director scripts]$ ./uas-check.py auto-vnf VNF2-UAS
2017-12-08 12:38:05,446 - INFO: Check of AutoVNF cluster started
2017-12-08 12:38:07,925 - INFO: Instance 'vnf1-UAS-uas-0' status is 'ERROR'
2017-12-08 12:38:07,925 - INFO: Check completed, AutoVNF cluster has recoverable errors

[stack@director scripts]$ ./uas-check.py auto-vnf VNF2-UAS --fix
2017-11-22 14:01:07,215 - INFO: Check of AutoVNF cluster started
2017-11-22 14:01:09,575 - INFO: Instance 'VNF2-UAS-uas-0' status is 'ERROR'
2017-11-22 14:01:09,575 - INFO: Check completed, AutoVNF cluster has recoverable errors
2017-11-22 14:01:09,778 - INFO: Removing instance 'VNF2-UAS-uas-0'
2017-11-22 14:01:13,568 - INFO: Removed instance 'VNF2-UAS-uas-0'
2017-11-22 14:01:13,568 - INFO: Creating instance 'VNF2-UAS-uas-0' and attaching volume 'VNF2-UAS-uas-vol-0'
2017-11-22 14:01:49,525 - INFO: Created instance 'VNF2-UAS-uas-0'

VNF2-autovnf-uas-0#show uas
uas version 1.0.1-1
uas state ha-active
uas ha-vip 172.17.181.101
INSTANCE IP   STATE  ROLE
-----------------------------------
172.17.180.6  alive  CONFD-SLAVE
172.17.180.7  alive  CONFD-MASTER
172.17.180.9  alive  NA

Note: If uas-check.py --fix fails, you might need to copy this file and run the script again.

[stack@director ~]$ mkdir -p /opt/cisco/usp/apps/auto-it/common/uas-deploy/
[stack@director ~]$ cp /opt/cisco/usp/uas-installer/common/uas-deploy/userdata-uas.txt /opt/cisco/usp/apps/auto-it/common/uas-deploy/

Recovery of ESC VM

Check the status of the ESC VM from the nova list and delete it:

[stack@director scripts]$ nova list |grep ESC-1
| c566efbf-1274-4588-a2d8-0682e17b0d41 | VNF2-ESC-ESC-1                                                 | ACTIVE | -          | Running     | VNF2-UAS-uas-orchestration=172.168.11.14; VNF2-UAS-uas-management=172.168.10.4                                                                                                 |
[stack@director scripts]$ nova delete VNF2-ESC-ESC-1
Request to delete server VNF2-ESC-ESC-1 has been accepted.

From AutoVNF-UAS, find the ESC deployment transaction. In the log for that transaction, find the bootvm.py command line used to create the ESC instance:

ubuntu@VNF2-uas-uas-0:~$ sudo -i
root@VNF2-uas-uas-0:~# confd_cli -u admin -C
Welcome to the ConfD CLI    
admin connected from 127.0.0.1 using console on VNF2-uas-uas-0
VNF2-uas-uas-0#show transaction
TX ID                                 TX TYPE          DEPLOYMENT ID    TIMESTAMP                         STATUS
-----------------------------------------------------------------------------------------------------------------------------
35eefc4a-d4a9-11e7-bb72-fa163ef8df2b  vnf-deployment   VNF2-DEPLOYMENT  2017-11-29T02:01:27.750692-00:00  deployment-success
73d9c540-d4a8-11e7-bb72-fa163ef8df2b  vnfm-deployment  VNF2-ESC         2017-11-29T01:56:02.133663-00:00  deployment-success


VNF2-uas-uas-0#show logs 73d9c540-d4a8-11e7-bb72-fa163ef8df2b | display xml
<config xmlns="http://tail-f.com/ns/config/1.0">
  <logs xmlns="http://www.cisco.com/usp/nfv/usp-autovnf-oper">
    <tx-id>73d9c540-d4a8-11e7-bb72-fa163ef8df2b</tx-id>
    <log>2017-11-29 01:56:02,142 - VNFM Deployment RPC triggered for deployment: VNF2-ESC, deactivate: 0
2017-11-29 01:56:02,179 - Notify deployment
..
2017-11-29 01:57:30,385 - Creating VNFM 'VNF2-ESC-ESC-1' with [python //opt/cisco/vnf-staging/bootvm.py VNF2-ESC-ESC-1 --flavor VNF2-ESC-ESC-flavor --image 3fe6b197-961b-4651-af22-dfd910436689 --net VNF2-UAS-uas-management --gateway_ip 172.168.10.1 --net VNF2-UAS-uas-orchestration --os_auth_url http://10.1.2.5:5000/v2.0 --os_tenant_name core --os_username ****** --os_password ****** --bs_os_auth_url http://10.1.2.5:5000/v2.0 --bs_os_tenant_name core --bs_os_username ****** --bs_os_password ****** --esc_ui_startup false --esc_params_file /tmp/esc_params.cfg --encrypt_key ****** --user_pass ****** --user_confd_pass ****** --kad_vif eth0 --kad_vip 172.168.10.7 --ipaddr 172.168.10.6 dhcp --ha_node_list 172.168.10.3 172.168.10.6 --file root:0755:/opt/cisco/esc/esc-scripts/esc_volume_em_staging.sh:/opt/cisco/usp/uas/autovnf/vnfms/esc-scripts/esc_volume_em_staging.sh --file root:0755:/opt/cisco/esc/esc-scripts/esc_vpc_chassis_id.py:/opt/cisco/usp/uas/autovnf/vnfms/esc-scripts/esc_vpc_chassis_id.py --file root:0755:/opt/cisco/esc/esc-scripts/esc-vpc-di-internal-keys.sh:/opt/cisco/usp/uas/autovnf/vnfms/esc-scripts/esc-vpc-di-internal-keys.sh

Save the bootvm.py line to a shell script file (esc.sh) and replace all the masked username (******) and password (******) values with the correct information (typically core/<PASSWORD>). You also need to remove the --encrypt_key option. For user_pass and user_confd_pass, use the format username:password (for example, admin:<PASSWORD>).
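
The edits to the logged command are mechanical, so they can be sketched as a small transformation. The example below is illustrative only: build_esc_command is a hypothetical helper, the credentials are placeholders, and the admin user for user_pass and user_confd_pass follows the example format given above. It assumes that each masked option is followed by exactly one masked value.

```python
def build_esc_command(logged, os_user, os_password, admin_password):
    """Fill in masked values and drop --encrypt_key from a logged bootvm.py command."""
    replacements = {
        "--os_username": os_user,
        "--bs_os_username": os_user,
        "--os_password": os_password,
        "--bs_os_password": os_password,
        "--user_pass": f"admin:{admin_password}",
        "--user_confd_pass": f"admin:{admin_password}",
    }
    tokens = logged.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--encrypt_key":
            i += 2  # skip the option and its masked value
            continue
        out.append(tok)
        if tok in replacements:
            out.append(replacements[tok])
            i += 2  # the masked value is replaced, not copied
            continue
        i += 1
    return " ".join(out)


# Shortened stand-in for the logged command; values are placeholders.
logged = ("python ./bootvm.py VNF2-ESC-ESC-1 --os_username ****** --os_password ****** "
          "--encrypt_key ****** --user_pass ****** --kad_vif eth0")
command = build_esc_command(logged, "core", "EXAMPLE_PW", "EXAMPLE_ADMIN_PW")
print(command)
```

The resulting line is what would be saved to esc.sh; review it by hand before it is run, because the full logged command carries more options than this shortened sample.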

Find the URL for bootvm.py in the running-config and use wget to download the bootvm.py file to the autovnf-uas VM. In this case, 10.1.2.3 is the IP address of the Auto-IT VM:

root@VNF2-uas-uas-0:~# confd_cli -u admin -C
Welcome to the ConfD CLI
admin connected from 127.0.0.1 using console on VNF2-uas-uas-0
VNF2-uas-uas-0#show running-config autovnf-vnfm:vnfm
…
configs bootvm
  value http://10.1.2.3:80/bundles/5.1.7-2007/vnfm-bundle/bootvm-2_3_2_155.py
!

root@VNF2-uas-uas-0:~# wget http://10.1.2.3:80/bundles/5.1.7-2007/vnfm-bundle/bootvm-2_3_2_155.py
--2017-12-01 20:25:52--  http://10.1.2.3/bundles/5.1.7-2007/vnfm-bundle/bootvm-2_3_2_155.py
Connecting to 10.1.2.3:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127771 (125K) [text/x-python]
Saving to: ‘bootvm-2_3_2_155.py’
100%[=====================================================================================>] 127,771  --.-K/s   in 0.001s
2017-12-01 20:25:52 (173 MB/s) - ‘bootvm-2_3_2_155.py’ saved [127771/127771]

Create a /tmp/esc_params.cfg file:

root@VNF2-uas-uas-0:~# echo "openstack.endpoint=publicURL" > /tmp/esc_params.cfg

Execute the shell script in order to deploy ESC from the UAS node:

root@VNF2-uas-uas-0:~# /bin/sh esc.sh
+ python ./bootvm.py VNF2-ESC-ESC-1 --flavor VNF2-ESC-ESC-flavor --image 3fe6b197-961b-4651-af22-dfd910436689
 --net VNF2-UAS-uas-management --gateway_ip 172.168.10.1 --net VNF2-UAS-uas-orchestration --os_auth_url 
http://10.1.2.5:5000/v2.0 --os_tenant_name core --os_username core --os_password <PASSWORD> --bs_os_auth_url 
http://10.1.2.5:5000/v2.0 --bs_os_tenant_name core --bs_os_username core --bs_os_password <PASSWORD> 
--esc_ui_startup false --esc_params_file /tmp/esc_params.cfg --user_pass admin:<PASSWORD> --user_confd_pass 
admin:<PASSWORD> --kad_vif eth0 --kad_vip 172.168.10.7 --ipaddr 172.168.10.6 dhcp --ha_node_list 172.168.10.3
172.168.10.6 --file root:0755:/opt/cisco/esc/esc-scripts/esc_volume_em_staging.sh:/opt/cisco/usp/uas/autovnf/vnfms/esc-scripts/esc_volume_em_staging.sh 
--file root:0755:/opt/cisco/esc/esc-scripts/esc_vpc_chassis_id.py:/opt/cisco/usp/uas/autovnf/vnfms/esc-scripts/esc_vpc_chassis_id.py 
--file root:0755:/opt/cisco/esc/esc-scripts/esc-vpc-di-internal-keys.sh:/opt/cisco/usp/uas/autovnf/vnfms/esc-scripts/esc-vpc-di-internal-keys.sh

ubuntu@VNF2-uas-uas-0:~$ ssh admin@172.168.11.14
…
   ####################################################################
   #   ESC on VNF2-esc-esc-1.novalocal is in BACKUP state.
   ####################################################################

[admin@VNF2-esc-esc-1 ~]$ escadm status
0 ESC status=0 ESC Backup Healthy

[admin@VNF2-esc-esc-1 ~]$ health.sh
============== ESC HA (BACKUP) ===================================================
ESC HEALTH PASSED

Recover CF and EM VMs from ESC

Check the status of the CF and EM VMs from the nova list. They must be in the ERROR state:

[stack@director ~]$ source corerc
[stack@director ~]$ nova list --field name,host,status |grep -i err   
| 507d67c2-1d00-4321-b9d1-da879af524f8 | VNF2-DEPLOYM_XXXX_0_c8d98f0f-d874-45d0-af75-88a2d6fa82ea | None                                 | ERROR|
| f9c0763a-4a4f-4bbd-af51-bc7545774be2 | VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229     |None                                 | ERROR

Log in to the ESC Master and run recovery-vm-action for each impacted EM and CF VM. Be patient: ESC schedules the recovery action, and it might not start for a few minutes. Monitor the yangesc.log:

sudo /opt/cisco/esc/esc-confd/esc-cli/esc_nc_cli recovery-vm-action DO <VM-Name>

[admin@VNF2-esc-esc-0 ~]$ sudo /opt/cisco/esc/esc-confd/esc-cli/esc_nc_cli recovery-vm-action DO VNF2-DEPLOYMENT-_VNF2-D_0_a6843886-77b4-4f38-b941-74eb527113a8
[sudo] password for admin: 

Recovery VM Action
/opt/cisco/esc/confd/bin/netconf-console --port=830 --host=127.0.0.1 --user=admin --privKeyFile=/root/.ssh/confd_id_dsa --privKeyType=dsa --rpc=/tmp/esc_nc_cli.ZpRCGiieuW
<?xml version="1.0" encoding="UTF-8"?>
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="1">
  <ok/>
</rpc-reply>

[admin@VNF2-esc-esc-0 ~]$ tail -f /var/log/esc/yangesc.log
…
14:59:50,112 07-Nov-2017 WARN  Type: VM_RECOVERY_COMPLETE
14:59:50,112 07-Nov-2017 WARN  Status: SUCCESS
14:59:50,112 07-Nov-2017 WARN  Status Code: 200
14:59:50,112 07-Nov-2017 WARN  Status Msg: Recovery: Successfully recovered VM [VNF2-DEPLOYMENT-_VNF2-D_0_a6843886-77b4-4f38-b941-74eb527113a8]

ubuntu@VNF2vnfddeploymentem-1:~$ /opt/cisco/ncs/current/bin/ncs_cli -u admin -C
admin connected from 172.17.180.6 using ssh on VNF2vnfddeploymentem-1
admin@scm# show ems
EM            VNFM
ID  SLA  SCM  PROXY
---------------------
2   up   up   up
3   up   up   up

Log in to the StarOS VNF and verify that the CF card is in the standby state.

Handle ESC Recovery Failure

In cases where ESC fails to start the VM due to an unexpected state, Cisco recommends that you perform an ESC switchover by rebooting the Master ESC. The ESC switchover takes about a minute. Run the script "health.sh" on the new Master ESC in order to check that the status is up. The new Master ESC then starts the VM and fixes the VM state. This recovery task takes up to 5 minutes to complete.

You can monitor /var/log/esc/yangesc.log and /var/log/esc/escmanager.log. If the VM is not recovered after 5-7 minutes, you need to recover the impacted VM(s) manually.

Auto-Deploy Configuration Update

From the AutoDeploy VM, edit autodeploy.cfg and replace the old compute server with the new one. Then run load replace in confd_cli. This step is required for successful deployment deactivation later:

root@auto-deploy-iso-2007-uas-0:/home/ubuntu# confd_cli -u admin -C
Welcome to the ConfD CLI
admin connected from 127.0.0.1 using console on auto-deploy-iso-2007-uas-0
auto-deploy-iso-2007-uas-0#config
Entering configuration mode terminal
auto-deploy-iso-2007-uas-0(config)#load replace autodeploy.cfg
Loading.     14.63 KiB parsed in 0.42 sec (34.16 KiB/sec)

auto-deploy-iso-2007-uas-0(config)#commit
Commit complete.
auto-deploy-iso-2007-uas-0(config)#end

Restart uas-confd and autodeploy services after the configuration change:

root@auto-deploy-iso-2007-uas-0:~# service uas-confd restart
uas-confd stop/waiting
uas-confd start/running, process 14078

root@auto-deploy-iso-2007-uas-0:~# service uas-confd status
uas-confd start/running, process 14078

root@auto-deploy-iso-2007-uas-0:~# service autodeploy restart
autodeploy stop/waiting
autodeploy start/running, process 14017
root@auto-deploy-iso-2007-uas-0:~# service autodeploy status
autodeploy start/running, process 14017

Component RMA - Controller Node

Pre-Check

From OSPD, log in to the controller and verify that pcs is in a good state: all three controllers are Online and Galera shows all three controllers as Master.

Note: A healthy cluster requires 2 active controllers, so verify that the two controllers that remain are Online and active.

[heat-admin@pod1-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod1-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Dec  4 00:46:10 2017                        Last change: Wed Nov 29 01:20:52 2017 by hacluster via crmd on pod1-controller-0
3 nodes and 22 resources configured

Online: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]

Full list of resources:
 ip-11.118.0.42  (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 ip-11.119.0.47  (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 ip-11.120.0.49  (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 ip-192.200.0.102          (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 ip-11.120.0.47  (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ pod1-controller-2 ]
     Slaves: [ pod1-controller-0 pod1-controller-1 ]
 ip-10.84.123.35            (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 openstack-cinder-volume          (systemd:openstack-cinder-volume):            Started pod1-controller-2
 my-ipmilan-for-pod1-controller-0        (stonith:fence_ipmilan):  Started pod1-controller-0
 my-ipmilan-for-pod1-controller-1        (stonith:fence_ipmilan):  Started pod1-controller-0
 my-ipmilan-for-pod1-controller-2        (stonith:fence_ipmilan):  Started pod1-controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
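
The two pre-check conditions, all controllers Online and all controllers listed as Galera Masters, can be verified from the captured pcs status text. This is a sketch under the assumption that the output follows the layout shown above; online_nodes and galera_masters are hypothetical helper names.

```python
import re


def online_nodes(pcs_status):
    """Return the node names on the `Online: [ ... ]` line."""
    m = re.search(r"^Online: \[(.*?)\]", pcs_status, re.M)
    return set(m.group(1).split()) if m else set()


def galera_masters(pcs_status):
    """Return the node names listed as Masters under the galera-master set."""
    m = re.search(r"galera-master \[galera\]\s*\n\s*Masters: \[(.*?)\]", pcs_status)
    return set(m.group(1).split()) if m else set()


sample = """\
Online: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ pod1-controller-2 ]
"""
expected = {"pod1-controller-0", "pod1-controller-1", "pod1-controller-2"}
healthy = online_nodes(sample) == expected and galera_masters(sample) == expected
print(healthy)
```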

Move Controller Cluster to Maintenance Mode

Put the pcs cluster on the controller that is to be serviced into standby:

[heat-admin@pod1-controller-0 ~]$ sudo pcs cluster standby

Check the pcs status again and ensure that the pcs cluster is stopped on this node:

[heat-admin@pod1-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod1-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Dec  4 00:48:24 2017                        Last change: Mon Dec  4 00:48:18 2017 by root via crm_attribute on pod1-controller-0
3 nodes and 22 resources configured

Node pod1-controller-0: standby

Online: [ pod1-controller-1 pod1-controller-2 ]

Full list of resources:
 ip-11.118.0.42  (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 ip-11.119.0.47  (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 ip-11.120.0.49  (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 ip-192.200.0.102          (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ pod1-controller-1 pod1-controller-2 ]
     Stopped: [ pod1-controller-0 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ pod1-controller-1 pod1-controller-2 ]
     Slaves: [ pod1-controller-0 ]
 ip-11.120.0.47  (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ pod1-controller-2 ]
     Slaves: [ pod1-controller-1 ]
     Stopped: [ pod1-controller-0 ]
 ip-10.84.123.35            (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 openstack-cinder-volume          (systemd:openstack-cinder-volume):            Started pod1-controller-2
 my-ipmilan-for-pod1-controller-0        (stonith:fence_ipmilan):  Started pod1-controller-1
 my-ipmilan-for-pod1-controller-1        (stonith:fence_ipmilan):  Started pod1-controller-1
 my-ipmilan-for-pod1-controller-2        (stonith:fence_ipmilan):  Started pod1-controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

In addition, pcs status on the other two controllers should show this node as standby.
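The standby check can also be scripted against captured pcs status text. This is a minimal sketch and not part of the original procedure. The function names standby_nodes and online_nodes are hypothetical, and the parsing assumes the output layout shown in this section (pacemaker 1.1.x); other pcs versions may format these lines differently.

```python
import re

def standby_nodes(pcs_status_text):
    """Return the set of node names reported as standby in pcs status text."""
    nodes = set()
    for line in pcs_status_text.splitlines():
        # Matches lines such as "Node pod1-controller-0: standby"
        m = re.match(r"\s*Node\s+(\S+):\s+standby\b", line)
        if m:
            nodes.add(m.group(1))
    return nodes

def online_nodes(pcs_status_text):
    """Return the set of node names on the 'Online: [ ... ]' line."""
    m = re.search(r"^\s*Online:\s*\[(.*?)\]", pcs_status_text, re.MULTILINE)
    return set(m.group(1).split()) if m else set()

# Excerpt copied from the pcs status output in this section.
sample = """\
Node pod1-controller-0: standby

Online: [ pod1-controller-1 pod1-controller-2 ]
"""

print(sorted(standby_nodes(sample)))  # ['pod1-controller-0']
print(sorted(online_nodes(sample)))   # ['pod1-controller-1', 'pod1-controller-2']
```

The same text could be captured on each of the other two controllers and passed through standby_nodes to confirm that they agree.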

Replace the Faulty Component from the Controller Node

Power off the specified server. For the steps to replace a faulty component on a UCS C240 M4 server, refer to:

Replacing the Server Components

Power On the Server

Power on the server and verify that it comes up:

[stack@tb5-ospd ~]$ source stackrc
[stack@tb5-ospd ~]$ nova list |grep pod1-controller-0
| 1ca946b8-52e5-4add-b94c-4d4b8a15a975 | pod1-controller-0  | ACTIVE | -          | Running     | ctlplane=192.200.0.112 |
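The row returned by nova list can be checked programmatically. This is a minimal sketch, assuming the default nova list column order (ID, Name, Status, Task State, Power State, Networks). The helper names parse_nova_row and server_is_up are hypothetical.

```python
def parse_nova_row(line):
    """Split one '|'-delimited row of nova list output into named fields."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # Assumed column order of the default nova list table.
    keys = ["id", "name", "status", "task_state", "power_state", "networks"]
    return dict(zip(keys, cells))

def server_is_up(line):
    """True when the row reports Status ACTIVE and Power State Running."""
    row = parse_nova_row(line)
    return row.get("status") == "ACTIVE" and row.get("power_state") == "Running"

# Row copied from the nova list output above.
row = ("| 1ca946b8-52e5-4add-b94c-4d4b8a15a975 | pod1-controller-0  | ACTIVE "
       "| -          | Running     | ctlplane=192.200.0.112 |")

print(parse_nova_row(row)["name"])  # pod1-controller-0
print(server_is_up(row))            # True
```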

Log in to the impacted controller and remove standby mode with the unstandby command. Verify that the controller comes online with the cluster and that Galera shows all three controllers as Master. This might take a few minutes:

[heat-admin@pod1-controller-0 ~]$ sudo pcs cluster unstandby

[heat-admin@pod1-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod1-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Dec  4 01:08:10 2017                        Last change: Mon Dec  4 01:04:21 2017 by root via crm_attribute on pod1-controller-0
3 nodes and 22 resources configured

Online: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]

Full list of resources:
 ip-11.118.0.42  (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 ip-11.119.0.47  (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 ip-11.120.0.49  (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 ip-192.200.0.102          (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
Master/Slave Set: galera-master [galera]
     Masters: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 ip-11.120.0.47  (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ pod1-controller-2 ]
     Slaves: [ pod1-controller-0 pod1-controller-1 ]
 ip-10.84.123.35            (ocf::heartbeat:IPaddr2):           Started pod1-controller-1
 openstack-cinder-volume          (systemd:openstack-cinder-volume):            Started pod1-controller-2
 my-ipmilan-for-pod1-controller-0        (stonith:fence_ipmilan):  Started pod1-controller-1
 my-ipmilan-for-pod1-controller-1        (stonith:fence_ipmilan):  Started pod1-controller-1
 my-ipmilan-for-pod1-controller-2        (stonith:fence_ipmilan):  Started pod1-controller-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
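Because the Galera recovery can take a few minutes, the check can be repeated until all three controllers appear as Masters. This is a minimal sketch, not part of the original procedure. The names galera_masters and wait_for_galera are hypothetical, and fetch_status stands for any callable that returns the current pcs status text (for example, a wrapper around sudo pcs status). The parsing assumes the output layout shown in this section.

```python
import re
import time

def galera_masters(pcs_status_text):
    """Return the nodes listed as Masters under the galera-master set."""
    lines = pcs_status_text.splitlines()
    for i, line in enumerate(lines):
        if "galera-master" not in line:
            continue
        # Read only the indented Masters/Slaves/Stopped lines of this set.
        for follow in lines[i + 1:]:
            m = re.match(r"\s+(Masters|Slaves|Stopped):\s*\[(.*?)\]", follow)
            if m is None:
                break  # left the galera block
            if m.group(1) == "Masters":
                return set(m.group(2).split())
        break
    return set()

def wait_for_galera(fetch_status, controllers, attempts=30, delay=10.0):
    """Poll fetch_status() until every controller is a Galera master."""
    for _ in range(attempts):
        if galera_masters(fetch_status()) == set(controllers):
            return True
        time.sleep(delay)
    return False

CONTROLLERS = ["pod1-controller-0", "pod1-controller-1", "pod1-controller-2"]

# Excerpts copied from the pcs status outputs in this section.
standby_text = """\
Master/Slave Set: galera-master [galera]
     Masters: [ pod1-controller-1 pod1-controller-2 ]
     Slaves: [ pod1-controller-0 ]
 ip-11.120.0.47  (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ pod1-controller-2 ]
"""
recovered_text = """\
Master/Slave Set: galera-master [galera]
     Masters: [ pod1-controller-0 pod1-controller-1 pod1-controller-2 ]
 ip-11.120.0.47  (ocf::heartbeat:IPaddr2):           Started pod1-controller-2
"""

# Simulate a cluster that recovers on the second poll.
snapshots = iter([standby_text, recovered_text])
print(wait_for_galera(lambda: next(snapshots), CONTROLLERS, attempts=2, delay=0))  # True
```

The attempts and delay defaults are arbitrary illustrative values; adjust them to the recovery time observed in the deployment.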

Check that monitor services such as Ceph are in a healthy state:

[heat-admin@pod1-controller-0 ~]$ sudo ceph -s
    cluster eb2bb192-b1c9-11e6-9205-525400330666
     health HEALTH_OK
     monmap e1: 3 mons at {pod1-controller-0=11.118.0.10:6789/0,pod1-controller-1=11.118.0.11:6789/0,pod1-controller-2=11.118.0.12:6789/0}
            election epoch 70, quorum 0,1,2 pod1-controller-0,pod1-controller-1,pod1-controller-2
     osdmap e218: 12 osds: 12 up, 12 in
            flags sortbitwise,require_jewel_osds
      pgmap v2080888: 704 pgs, 6 pools, 714 GB data, 237 kobjects
            2142 GB used, 11251 GB / 13393 GB avail
                 704 active+clean
  client io 11797 kB/s wr, 0 op/s rd, 57 op/s wr
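The health line and OSD counts can be extracted from the ceph -s text for an automated check. This is a minimal sketch that assumes the plain-text layout shown above (Ceph Jewel); later Ceph releases format this output differently. The names ceph_summary and ceph_healthy are hypothetical.

```python
import re

def ceph_summary(ceph_status_text):
    """Extract the health string and OSD counts from ceph -s text."""
    health = re.search(r"health\s+(HEALTH_\w+)", ceph_status_text)
    osds = re.search(r"(\d+)\s+osds:\s+(\d+)\s+up,\s+(\d+)\s+in",
                     ceph_status_text)
    return {
        "health": health.group(1) if health else None,
        "osds_total": int(osds.group(1)) if osds else None,
        "osds_up": int(osds.group(2)) if osds else None,
        "osds_in": int(osds.group(3)) if osds else None,
    }

def ceph_healthy(ceph_status_text):
    """True when health is HEALTH_OK and every OSD is both up and in."""
    s = ceph_summary(ceph_status_text)
    return (s["health"] == "HEALTH_OK"
            and s["osds_total"] is not None
            and s["osds_total"] == s["osds_up"] == s["osds_in"])

# Excerpt copied from the ceph -s output above.
sample = """\
     health HEALTH_OK
     osdmap e218: 12 osds: 12 up, 12 in
"""

print(ceph_summary(sample)["osds_total"])  # 12
print(ceph_healthy(sample))                # True
```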

Revision History

Revision	Publish Date	Comments
1.0	02-Jul-2018	Initial Release

Contributed by Cisco Engineers

Prashanth Shetty
Cisco Advanced Services
Padmaraj Ramanoudjam
Cisco Advanced Services

This Document Applies to These Products

Ultra Packet Core
