Replace HA Hosts
To replace an HA host machine, perform the following procedure.
Note This procedure assumes that the IP addresses and VIPs do not change.
Only hosts with "Deployed role: Standby" can be replaced.
Step 1 Stop the DCNM on the standby host (no IP change).
Step 2 Stop the DCNM on the active host (no IP change).
Step 3 Take a backup of the Standby DCNM.
Step 4 Take a local copy of the ha-properties file from the /root/packaged-files/properties/ path.
Step 5 On the new host that replaces the old host, configure the IP addresses on eth0 and eth1 to be identical to those of the old host.
Step 6 If the host is a virtual machine, configure the MAC address to be identical to that of the old host, so that the new host does not need new licenses.
Step 7 On the new host which will join the HA setup, run the HA setup script, just like in the normal HA setup procedure.
Step 8 Restart the DCNM on the active host, then restart the DCNM on the standby host.
Troubleshooting DCNM Native HA
When the Cisco DCNM Native HA setup is in an uncertain state, stop both hosts and resolve the problem. Start only one host, and ensure that it is fully functional and that the device data is correct before you bring up the second host as Standby.
Note Throughout this troubleshooting procedure, dcnm1 is considered the Active host and dcnm2 is considered the Secondary host.
Recovering DCNM when both hosts are Powered Down
Perform the following to troubleshoot the DCNM Native HA setup when both the hosts are powered down.
Step 1 Power on dcnm1.
Step 2 Wait for all the applications to be operational.
Use the appmgr status all command to check the status of the applications.
Log on to DCNM and verify that it is fully functional and that the device data is correct. If so, power on dcnm2 as the Secondary host and end the troubleshooting procedure.
Step 3 If dcnm1 fails to bring up all the applications, or if the device data is incorrect, use the appmgr stop all command to stop the applications.
Wait for all the applications to stop.
Step 4 Power on dcnm2.
Wait for all the applications to be operational.
Step 5 Use the appmgr status all command to check the status of the applications.
Log on to DCNM and verify that it is fully functional and that the device data is correct. If so, power on dcnm1 as the Secondary host and end the troubleshooting procedure.
Step 6 If dcnm2 fails to bring up all the applications, or if the device data is incorrect, use the appmgr stop all command to stop the applications.
Step 7 Restore both hosts from backup.
Recovering from Split-Brain syndrome
Perform the following to recover Cisco DCNM from the split-brain syndrome.
Step 1 Stop both the Active and Standby Cisco DCNM hosts.
Use the appmgr stop all command to stop the applications.
Step 2 Wait for all the applications to stop.
Use the appmgr status all command to check the status of the applications.
Resolve the communication problem between the two hosts that caused the split-brain syndrome.
Step 3 Ping the peer host eth1 IP address from both hosts and make sure it is reachable.
Step 4 Start all the applications on dcnm1. Wait for all the applications to be operational.
Use the appmgr status all command to check the status of the applications.
dcnm1# appmgr status all
Step 5 Log on to dcnm1 and verify that it is fully functional and that all the data is correct.
If all the data is correct, proceed to Step 7.
If data loss is seen, proceed to Step 6.
Step 6 Use the appmgr stop all command to stop the applications on dcnm1.
Step 7 Start all the applications on dcnm2. Wait for all the applications to be operational.
Use the appmgr status all command to check the status of the applications.
dcnm2# appmgr status all
Step 8 Log on to DCNM and verify that it is fully functional.
Check that the device data is correct. If so, power on dcnm1 as the Secondary host and end the troubleshooting procedure.
Step 9 If data loss is seen on dcnm2, stop all the applications.
Use the appmgr stop all command to stop the applications.
Step 10 Restore both hosts from backup.
Checking Cisco DCNM Native HA Status
Perform the following to determine the status of the Cisco DCNM Native HA.
Step 1 Log in to the Cisco DCNM Web Client.
Step 2 Navigate to Web Client > Administration > Native HA.
Step 3 Check the HA Status.
The following table describes each Native HA status value.
| HA Status | Description |
|---|---|
| OK | The Native HA is operational. Both hosts in the Native HA setup are synchronized. |
| Stopped | The Standby host is not operational. The database is not synchronized. |
| Failed | The Active host is unable to synchronize with the Standby host. Check the log file for more information. The log file is located at /usr/local/cisco/dcm/fm/logs/fms_ha.log |
| Not Ready | The Standby host is not set up or not configured. |
Verifying if the Active and Standby Hosts are Operational
Perform the following to determine if the hosts are operational.
Step 1 Check the HA role on the host.
Step 2 Use the appmgr show ha-role command to view the current role of the host.
Step 3 Check the VIP by using the ip address command.
On the Active host, both eth0 and eth1 must have two IP addresses configured, with the VIP assigned as the secondary IP address. On the Standby host, eth0 and eth1 must each have only one IP address.
Step 4 Check the DCNM Java process.
Use the ps -ef | grep java command to verify the Java process associated with DCNM.
dcnm1# ps -ef | grep java
The results must show one Java process, appended with standalone-san.xml.
dcnm2# ps -ef | grep java
There must not be any Java process appended with standalone-san.xml.
Step 5 Check the heartbeat of the DCNM hosts.
dcnm1# /etc/init.d/heartbeat status
dcnm2# /etc/init.d/heartbeat status
Step 6 Check if the database engine PostgreSQL is operational.
dcnm1# /etc/init.d/postgresql-9.4 status
dcnm2# /etc/init.d/postgresql-9.4 status
Step 7 Check the HA cluster information.
dcnm1# cl_status listnodes
dcnm2# cl_status listnodes
The two hostnames of the HA cluster will be displayed.
Step 8 Check the HA heartbeat status.
dcnm1# cl_status nodestatus <hostname>
dcnm2# cl_status nodestatus <hostname>
If this command returns “active”, the heartbeat on the host is OK.
If the command returns “dead”, the heartbeat on the host is not running or not recognized.
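The address and process checks in Steps 3 and 4 can be scripted. The following is a minimal sketch, not part of DCNM: the function names count_ipv4, guess_role, and count_dcnm_java are hypothetical, and the sketch assumes that the iproute2 one-line output format (ip -o -4 addr show) is available on the host. Each function reads command output on standard input, so it can be tried against saved output before it is run on a live host.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of DCNM.

# Count the IPv4 addresses on one interface.
# Reads `ip -o -4 addr show` output on stdin; $1 is the interface name.
count_ipv4() {
    awk -v dev="$1" '$2 == dev && $3 == "inet" { n++ } END { print n + 0 }'
}

# Guess the HA role from the address counts on eth0 ($1) and eth1 ($2).
# Two addresses on each interface suggests Active (VIP present);
# one address on each interface suggests Standby.
guess_role() {
    if [ "$1" -eq 2 ] && [ "$2" -eq 2 ]; then
        echo active
    elif [ "$1" -eq 1 ] && [ "$2" -eq 1 ]; then
        echo standby
    else
        echo unexpected
    fi
}

# Count the DCNM Java processes in `ps -ef` output read from stdin.
# Lines that belong to a grep command itself are ignored.
count_dcnm_java() {
    awk '/standalone-san\.xml/ && !/grep/ { n++ } END { print n + 0 }'
}
```

For example, guess_role "$(ip -o -4 addr show | count_ipv4 eth0)" "$(ip -o -4 addr show | count_ipv4 eth1)" prints the guessed role, and ps -ef | count_dcnm_java is expected to print 1 on the Active host and 0 on the Standby host. Treat the result as a hint only; the appmgr show ha-role command remains the reference for the role.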
Verifying HA Database Synchronization
Perform the following to verify that database synchronization between the two hosts is in progress.
When running DCNM Native HA, the databases on both hosts must be operational, one host as Active and the other host as Standby. Any changes made in the Active database must synchronize with the Standby database in real time.
To verify that the database is synchronizing, use the ps -ef | grep post command.
dcnm1# ps -ef | grep post
postgres: wal sender process postgres 172.23.244.222(40826) streaming 0/9A846C04
dcnm2# ps -ef | grep post
postgres: wal receiver process streaming 0/9A84E00
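The check above can be reduced to a single word per host. The following is a minimal sketch, assuming that the process titles match the PostgreSQL 9.4 output shown above ("wal sender process" and "wal receiver process"); the function name replication_role is hypothetical.

```shell
#!/bin/sh
# Sketch only: replication_role is a hypothetical helper, not a DCNM command.

# Reads `ps -ef` output on stdin and prints which streaming-replication
# process is present: "sender", "receiver", or "none".
# If both titles appear, "sender" is reported.
replication_role() {
    awk '
        /postgres: wal sender process/   && /streaming/ { s = 1 }
        /postgres: wal receiver process/ && /streaming/ { r = 1 }
        END {
            if (s) { print "sender" }
            else if (r) { print "receiver" }
            else { print "none" }
        }'
}
```

Running ps -ef | replication_role is expected to print sender on the Active host and receiver on the Standby host while synchronization is in progress. A result of none on either host suggests that streaming replication is not running.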
Resolving HA Status Failure condition
Perform the following to resolve an HA status failure.
Step 1 Log on to the Cisco DCNM Web UI.
Step 2 Navigate to Administration > Native HA.
Click the Test icon.
Check if there are errors. Click Detailed Logs for more information.
Step 3 Check the log file at the following location:
/usr/local/cisco/dcm/fm/logs/fms_ha.log
The log messages should indicate why the HA status is Failed.
Step 4 Verify that the Standby host is operational.
See Verifying if the Active and Standby Hosts are Operational, for more information. Check if any applications are not operational.
Generally, the HA status shows Failed because the Standby database is down or because the database connection is rejected.
If the connection to the Standby database is rejected, check the file located at:
/usr/local/cisco/dcm/db/data/pg_hba.conf
The configuration file must contain entries for all the IP addresses that the ip address command lists on the Active host.
If it does not, we recommend that you contact Technical Support for further assistance.
Step 5 If Standby database is completely down, see Bringing up Database on Standby Host.
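The pg_hba.conf comparison in Step 4 can be sketched as a small shell helper. This is an illustration under stated assumptions, not a DCNM tool: the function name missing_hba_entries is hypothetical, and it assumes that each entry carries the address as a whitespace-separated field, with or without a /prefix suffix. The helper only reports missing addresses; it does not edit the file.

```shell
#!/bin/sh
# Sketch only: missing_hba_entries is a hypothetical helper.

# Reads IP addresses on stdin (one per line) and prints each address that
# has no matching field in the pg_hba.conf-style file named by $1.
# Comment lines are ignored; a trailing /prefix on a field is stripped
# before the comparison.
missing_hba_entries() {
    hba_file=$1
    while IFS= read -r addr; do
        [ -n "$addr" ] || continue
        if ! awk -v a="$addr" '
            /^[ \t]*#/ { next }
            {
                for (i = 1; i <= NF; i++) {
                    f = $i
                    sub(/\/.*/, "", f)
                    if (f == a) { found = 1 }
                }
            }
            END { exit !found }' "$hba_file"; then
            echo "$addr"
        fi
    done
}
```

On the Active host, the address list could be produced with ip -o -4 addr show | awk '{print $4}' | cut -d/ -f1 and piped into missing_hba_entries /usr/local/cisco/dcm/db/data/pg_hba.conf. Empty output means that every address has an entry.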
Bringing up Database on Standby Host
Normally, the database must be running on both the Active and Standby hosts, regardless of whether DCNM is operational or stopped. If the database is down, the most likely cause is an initial database synchronization failure.
Perform the following to bring up the database on the Standby host.
Step 1 Start the Standby database by using the /etc/init.d/postgresql-9.4 start command.
Step 2 If the command returns the message PostgreSQL 9.4 started successfully, the Standby database is OK. The HA status shows OK within a few minutes.
If the database does not start successfully, the database files may be corrupted. This condition occurs due to an initial synchronization failure. In that case, navigate to the following directory:
/usr/local/cisco/dcm/db/replication
Step 3 Check for the file pgsql-standby-backup.tgz.
If the file exists, perform the following to restore the database files and start the database again:
a. Enter the ps -ef | grep post command and ensure that the Postgres process is not running.
b. If the Postgres process is running, stop it by using the kill <pid> command.
c. Remove all the database files by using the following commands:
cd /usr/local/cisco/dcm/db
d. Restore the database files from the backup by using the tar xzf replication/pgsql-standby-backup.tgz data command.
e. Restart the database by using the /etc/init.d/postgresql-9.4 start command.
Check if the database has started successfully.
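Before removing any database files, it may be worth confirming that the backup archive is present and readable. The following is a minimal sketch of such a pre-check, assuming that the archive contains a top-level data entry (as the tar xzf command in step d implies); the function name check_standby_backup is hypothetical. The sketch only lists the archive and does not remove or extract anything.

```shell
#!/bin/sh
# Sketch only: check_standby_backup is a hypothetical helper.

# Returns 0 if pgsql-standby-backup.tgz exists in directory $1 and lists a
# top-level "data" entry; otherwise prints a reason and returns 1.
check_standby_backup() {
    archive="$1/pgsql-standby-backup.tgz"
    if [ ! -f "$archive" ]; then
        echo "missing: $archive"
        return 1
    fi
    # List the archive without extracting it and look for data or data/...
    if ! tar tzf "$archive" | grep -Eq '^(\./)?data(/|$)'; then
        echo "no data directory in: $archive"
        return 1
    fi
    echo "ok: $archive"
}
```

For example, check_standby_backup /usr/local/cisco/dcm/db/replication prints ok followed by the archive path when the restore in steps a through e can proceed.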