Setting Up Geographical Disaster Recovery
Revised: April 4, 2014, OL-28573-02
Introduction
There may be times when Prime Central or the Cisco Prime applications integrated with Prime Central go down. To minimize downtime for these applications, you can implement geographical disaster recovery in your network. Let’s begin with an overview of the geographical disaster recovery process and a definition of three of its key components.
After you complete the procedures detailed later in this section, Prime Central and the integrated Cisco Prime applications reside on both a primary and a standby server. When one or more of these applications go down on the primary server, you receive a system-generated email notifying you of the problem. To deal with the problem, you first initiate a switchover to the standby server. Switchover is the manual switch from one server to a redundant or standby server. By initiating a switchover, you can continue to manage and monitor your network while you determine what is wrong with the primary server. You can also initiate a switchover if you need to perform routine maintenance on the primary server, such as upgrading hardware or installing patches. If you resolve the problem, you switch back to the primary server.
However, if all attempts to bring the primary server back into a working state fail and it has become completely unreachable, you then initiate a failover. Failover is essentially the same operation as switchover; the difference is that failover indicates a major problem with the primary server that will take some time to resolve. When you initiate a failover, the standby server effectively becomes the new primary server and takes over all network management and monitoring tasks. Before you initiate a failover for a server, make sure that you have done everything possible to bring that server back into operation, because switching to the standby server could introduce other problems.
Finally, when the server previously acting as the primary server is up and running again, you initiate a failback, which is the process that reinstates that server as the primary server.
In this section, we will cover the following topics:
- How to prepare your network for geographical disaster recovery
- How to configure monitoring frequency and email notifications
- How to initiate switchover, failover, and failback
- How to integrate Cisco Prime applications with the standby server
- Best practices for geographical disaster recovery implementation, as well as troubleshooting information
Preparing Your Network for Geographical Disaster Recovery
To prepare your network for geographical disaster recovery, you will need to configure the following:
- Prime Central
- Prime Central Fault Management
- Application and database replication monitors, which keep tabs on:
– The Prime Central portal
– The integration layer
– The current status of the database
– Database replication between the primary and standby nodes
Typically, the Prime Central portal, integration layer, and database are located on the same server. You can also choose to place these components on multiple servers. If you want the integration layer to reside on its own server, you will need to complete the following procedures twice—once on the portal server and once on the integration layer server.
Note
- If your environment contains multiple integration layer servers, you will need to complete these procedures for each of those servers.
- Prime Central and Prime Central Fault Management patches must be applied separately on both the primary and standby machines. After you apply the patch on the primary machine, initiate a switchover, and then apply the patch on the new primary machine.
Configuring Prime Central for Geographical Disaster Recovery
Step 1
Install Prime Central 1.4 onto the server that will act as the primary server.
During installation, in the Embedded DB Information window, make sure that:
- You select the Enable backups on the database check box.
- You specify the desired archive log location. By default, the archive log location is set to /export/home/oracle/arch.
It is critical that you enable backups on the Oracle database. Geographical disaster recovery setup will fail unless you do so.
See “Installing Prime Central” in the Cisco Prime Central 1.4 Quick Start Guide for detailed installation instructions.
Step 2
Set up Prime Central for geographical disaster recovery on the server that will act as the standby server.
a.
Copy and unzip disaster_recovery_v1.4.zip into a folder on the standby server.
In this example, we will use the /root/disaster_recovery folder. When setting up geographical disaster recovery in your environment, substitute the folder you actually use.
b.
Ensure that the Prime Central database and installer binaries reside in the /root/disaster_recovery/scripts/main/installer folder. Specifically, ensure that these files are present:
– p10404530_112030_Linux-x86-64_1of7.zip
– p10404530_112030_Linux-x86-64_2of7.zip
– primecentral_v1.4.bin
c.
Ensure that the primary server is reachable via SSH by the following means:
– IP address
– Hostname (without domain name, if applicable)
– Hostname with domain name (FQDN)
d.
Navigate to the disaster recovery distribution package folder:
# cd /root/disaster_recovery
e.
Enter the following commands to run the setup script:
# chmod +x pc_standby_setup.sh
# ./pc_standby_setup.sh
f.
When prompted by the setup script, enter the necessary information for the primary (active) and standby server (such as IP address, hostname and root password).
Note
The hostname of the primary server must exactly match the one that was used when Prime Central was installed on it.
g.
After Prime Central has been set up on the standby server, complete the following tasks:
– As the user primeusr, generate a Secure Shell (SSH) key for the primeusr user on both the primary and standby servers by entering the following commands:
# /usr/bin/ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
# chmod 600 ~/.ssh/id_rsa
– (On the primary server only) Share the primary server's public key with the standby server so that SSH connections created dynamically between the servers do not prompt for a password.
As the user primeusr, enter the following commands:
# rsync -av ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
# /usr/bin/ssh primeusr@standby-server-hostname "cat ~/.ssh/id_rsa.pub" >> ~/.ssh/authorized_keys
# rsync -av ~/.ssh/authorized_keys primeusr@standby-server-hostname:~/.ssh/
where standby-server-hostname is the hostname of the standby server.
– Verify that SSH is working and does not prompt for authentication or a password (using both the hostname and IP address) on both the primary and standby servers.
– If you are prompted to continue connecting, enter yes. (The prompt should appear only the first time you use SSH to connect to the node.)
h.
Integrate the relevant Cisco Prime applications, such as Prime Network and Prime Optical, with the standby server. Refer to Integrating Applications with the Standby Server for instructions.
Note
The Prime Central services (portalctl and itgctl) must be running on the standby server during Domain Manager integration.
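Before you run the setup script, it can help to confirm that the files listed in Step 2b are actually in place. The following is a minimal sketch of such a check; check_installer_files is a hypothetical helper name and is not part of the disaster recovery package.

```shell
# Sketch only: check_installer_files is a hypothetical helper, not a Cisco tool.
# It reports any file from Step 2b that is missing from the given folder.
check_installer_files() {
    dir=$1
    missing=0
    for f in \
        p10404530_112030_Linux-x86-64_1of7.zip \
        p10404530_112030_Linux-x86-64_2of7.zip \
        primecentral_v1.4.bin
    do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Return 0 only when every file was found.
    return "$missing"
}
```

For the example folder used in this procedure, you would call it as check_installer_files /root/disaster_recovery/scripts/main/installer and proceed only if it succeeds.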
Configuring Prime Central Fault Management for Geographical Disaster Recovery
You can choose to not set up Fault Management for geographical disaster recovery at this time. However, keep in mind that you may not be able to do so later.
Step 1
Install Fault Management on the Fault Management primary server.
See “Installing Prime Central Fault Management” in the Cisco Prime Central 1.4 Quick Start Guide for detailed instructions.
Step 2
SSH to the primary Prime Central server as the user primeusr and run the list command to determine Fault Management’s instance ID value (in this example, 5).
You will need this for Step 4f.
Step 3
On the standby Prime Central server, verify that the portal and integration layer are up and running.
Step 4
Set up geographical disaster recovery for Fault Management on the standby server.
a.
Copy and unzip the distribution package zip file into the /root/disaster_recovery folder.
b.
Copy FM1.4Build.tar.gz to the /root/disaster_recovery/scripts/main/installer folder.
c.
Untar FM1.4Build.tar.gz:
# tar -zxf FM1.4Build.tar.gz
# chmod 755 primefm_v1.4.bin
d.
Ensure that the Fault Management binary (primefm_v1.4.bin) resides in the /root/disaster_recovery/scripts/main/installer/Disk1/InstData/VM folder.
e.
Enter the following commands to make the setup script executable:
# cd /root/disaster_recovery/scripts/main
# chmod +x fm_standby_setup.sh
f.
Verify that the primary and standby Fault Management servers, as well as the primary and standby Prime Central servers, can be reached by hostname.
g.
Enter the following command to run the setup script:
# ./fm_standby_setup.sh
h.
When prompted by the setup script, enter the necessary information for the primary and standby server (such as IP address and root password).
i.
After Fault Management has been set up on the standby server, complete the following tasks:
– As the user primeusr, generate a Secure Shell (SSH) key for the primeusr user on both the primary and standby servers by entering the following commands:
# /usr/bin/ssh-keygen -t rsa -N "" -b 2047 -f ~/.ssh/id_rsa
# chmod 600 ~/.ssh/id_rsa
– (On the primary server only) Share the primary server's public key with the standby server so that SSH connections created dynamically between the servers do not prompt for a password.
As the user primeusr, enter the following commands:
# rsync -av ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
# /usr/bin/ssh primeusr@standby-server-hostname "cat ~/.ssh/id_rsa.pub" >> ~/.ssh/authorized_keys
# rsync -av ~/.ssh/authorized_keys primeusr@standby-server-hostname:~/.ssh/
where standby-server-hostname is the hostname of the standby server.
– Verify that SSH is working and does not prompt for authentication or a password (using both the hostname and IP address).
– If you are prompted to continue connecting, enter yes. (The prompt should appear only the first time you use SSH to connect to the server.)
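The SSH verification described above can be scripted. The sketch below relies on the standard OpenSSH options BatchMode (fail rather than prompt) and ConnectTimeout; check_passwordless_ssh is a hypothetical helper name, and the targets you pass are whatever hostnames and IP addresses apply in your environment.

```shell
# Sketch only: check_passwordless_ssh is a hypothetical helper.
# BatchMode=yes makes ssh fail instead of prompting for a password;
# ConnectTimeout=5 bounds the wait on an unreachable host.
check_passwordless_ssh() {
    status=0
    for target in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "primeusr@$target" true; then
            echo "OK   $target"
        else
            echo "FAIL $target"
            status=1
        fi
    done
    # Non-zero if any target could not be reached without a prompt.
    return "$status"
}
```

Run it as primeusr with both the hostname and the IP address of the other server as arguments. Because BatchMode also suppresses the host-key confirmation prompt, a server you have never connected to is reported as FAIL until you have accepted its key once.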
Configuring Monitoring Frequency and Email Notifications
Even though the values you configure while completing this procedure are automatically synchronized between the primary and standby servers, we recommend that you verify this has taken place. Note that the first time you initiate file synchronization, it can take 5 to 10 minutes for the process to start.
Step 1
Configure the monitoring frequency value.
By default, this value is set to 5 minutes for application, file synchronization, and data replication monitoring. If you want to keep this as is, proceed to Step 2. If you want to set another value, do the following:
a.
Log into the primary server as the root user.
b.
Enter the following commands:
# cd primeusr-home-directory/local/scripts/
c.
Open primeusr-home-directory/local/cron/app_mon/conf/frequency_conf and set the desired monitoring frequency value (in minutes) for the following parameters:
– FREQUENCY (application)
– RSYNC_FREQUENCY (file synchronization)
– DBMON_FREQUENCY (data replication)
d.
Enter the following commands:
# cd primeusr-home-directory/local/scripts/
Step 2
Open primeusr-home-directory/local/disaster_recovery/scripts/cron/app_mon/conf/email_config and specify the users that will be notified via email whenever an event occurs, as well as the email address from which these messages will be sent.
A sample email_config file looks like this:
EMAIL_IDS=email_id1@example.com email_id2@example.com
(recipients of application monitoring messages)
DB_EMAIL_IDS=email_id1@example.com email_id2@example.com
(recipients of database monitoring messages)
SENDER_EMAIL=source@example.com
(sender of application monitoring messages)
DB_SENDER_EMAIL=source@example.com
(sender of database monitoring messages)
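A quick sanity check of the file can catch formatting mistakes before the first notification is due. The sketch below assumes the file holds plain KEY=VALUE lines as in the sample above (without the parenthetical annotations); conf_value, count_words, and check_email_config are hypothetical helper names. The same conf_value helper could read FREQUENCY from frequency_conf, assuming that file uses the same layout, which this document does not show.

```shell
# Sketch only: all three helpers are hypothetical, not part of Prime Central.

# conf_value FILE KEY: print the value assigned to KEY in a KEY=VALUE file.
# The ^ anchor keeps EMAIL_IDS from matching DB_EMAIL_IDS.
conf_value() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# count_words STRING: print the number of space-separated entries in STRING.
count_words() {
    set -- $1
    echo "$#"
}

# check_email_config FILE: require at least one recipient and exactly one sender.
check_email_config() {
    recipients=$(conf_value "$1" EMAIL_IDS)
    sender=$(conf_value "$1" SENDER_EMAIL)
    if [ "$(count_words "$recipients")" -lt 1 ]; then
        echo "EMAIL_IDS lists no recipients" >&2
        return 1
    fi
    if [ "$(count_words "$sender")" -ne 1 ]; then
        echo "SENDER_EMAIL must hold exactly one address" >&2
        return 1
    fi
}
```

The check covers only the application monitoring keys; the DB_ keys can be checked the same way by passing DB_EMAIL_IDS and DB_SENDER_EMAIL to conf_value.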
Note the following:
- The Application Monitor uses the sendmail email client. Verify that sendmail has been properly configured to send notification emails.
- Ensure that each recipient’s email address is separated by a space.
- You can specify an email alias, provided that it is a valid email address.
- Only one sender email address can be configured at any given time.
- After setup, check your email client’s junk folder for any failover-related messages. If necessary, mark such messages as safe so that future notifications reach your inbox.
- If the primary server, standby server, or VM belongs to a lab where DNS configuration is not available, open /etc/hosts and add the following entry:
server-or-VM-IP-address domain-name
- If a new customized rules file or impact policy is added, the operator should add a corresponding entry to primeusr-home-directory/local/disaster_recovery/rsync/fm/include.txt.
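For the lab case without DNS mentioned in the list above, you can check whether a name is already present in /etc/hosts before adding an entry. This is a minimal sketch; hosts_has_entry is a hypothetical helper name.

```shell
# Sketch only: hosts_has_entry is a hypothetical helper.
# hosts_has_entry NAME [FILE]: succeed if FILE (default /etc/hosts)
# lists NAME in any column after the IP address.
hosts_has_entry() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        { sub(/#.*/, "") }                       # drop comments
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit found ? 0 : 1 }
    ' "$file"
}
```

Calling hosts_has_entry with the domain name of the primary server, the standby server, or the VM tells you whether the entry still needs to be added.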
Initiating Switchover
If you are using a database other than the one that comes with Prime Central, complete the procedure described in Geographical Disaster Recovery for External Databases.
Step 1
As the root user, log into the standby server.
Step 2
Enter the following commands:
# cd primeusr-home-directory/local/disaster_recovery/scripts/main
# ./primedr switch (for Prime Central)
# ./fmdr switch (for Fault Management)
Note
By default, the primeusr home directory is /opt/primecentral. You can specify another directory, if necessary.
Initiating Failover
If you are using an external database other than the one that comes with Prime Central, complete the procedure described in Geographical Disaster Recovery for External Databases.
Step 1
As the root user, log into the standby server.
Step 2
Enter the following commands:
# cd primeusr-home-directory/local/disaster_recovery/scripts/main
# ./primedr fail (for Prime Central)
# ./fmdr fail (for Fault Management)
Note
By default, the primeusr home directory is /opt/primecentral. You can specify another directory, if necessary.
Initiating Failback
When a server that was previously brought into failover is operational again, and all of the geographical disaster recovery installation data present before the server failed (which includes the entire database and all of the files in the primeusr home directory) has been fully restored, it is ready for failback. Complete the following procedure on that server to reinstate it as the primary server.
Note
If the server’s data has been completely erased, you will first need to complete the steps described in Configuring Prime Central for Geographical Disaster Recovery and Configuring Prime Central Fault Management for Geographical Disaster Recovery. For Prime Central, after you enter the ./pc_standby_setup.sh command, the installer will detect that you are reinstalling Prime Central on the server you are reinstating as the primary server. When prompted, enter Y to proceed with the installation.
Note
If you are using an external database other than the one that comes with Prime Central, complete the procedure described in Geographical Disaster Recovery for External Databases on the server.
Step 1
As the root user, log into the server you want to reinstate as the primary server.
Step 2
Enter the following commands:
# cd primeusr-home-directory/local/disaster_recovery/scripts/main
# ./primedr failback (for Prime Central)
# ./fmdr failback (for Fault Management)
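The switchover, failover, and failback procedures differ only in the script (primedr or fmdr) and the action keyword (switch, fail, or failback). The sketch below assembles the command line for review without running it; dr_command is a hypothetical helper name, and /opt/primecentral is the default primeusr home directory noted earlier.

```shell
# Sketch only: dr_command is a hypothetical helper. It prints the command
# for a given component and action; it does not execute anything.
# Usage: dr_command COMPONENT ACTION [PRIMEUSR_HOME]
#   COMPONENT: pc (Prime Central) or fm (Fault Management)
#   ACTION:    switch, fail, or failback
dr_command() {
    case $1 in
        pc) script=primedr ;;
        fm) script=fmdr ;;
        *)  echo "unknown component: $1" >&2; return 1 ;;
    esac
    case $2 in
        switch|fail|failback) ;;
        *)  echo "unknown action: $2" >&2; return 1 ;;
    esac
    home=${3:-/opt/primecentral}
    echo "cd $home/local/disaster_recovery/scripts/main && ./$script $2"
}
```

Printing the command first gives you a chance to confirm that you are on the intended server and using the intended action before you run it as the root user.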
Geographical Disaster Recovery for External Databases
Complete the following steps if you have initiated a switchover, failover, or failback on a server where an external database other than the one that comes with Prime Central is installed.
Step 1
If the server on which Cisco Prime applications were previously installed has had all of its data wiped, you will need to reinstall them. Complete the procedures described in Configuring Prime Central for Geographical Disaster Recovery and Configuring Prime Central Fault Management for Geographical Disaster Recovery.
Step 2
On the server where the Prime Central portal is installed, verify that the portal is running by entering the portalctl status command. If the resulting status is stopped, restart the portal by entering the portalctl start command.
Step 3
On the server where the Prime Central integration layer is installed, verify that the integration layer (Core, JMS, or both) is running by entering the itgctl status command. If the resulting status is stopped, restart the integration layer by entering the itgctl start command.
Step 4
On the server currently acting as the standby server, stop application monitoring and file synchronization:
a.
Log in as the root user.
b.
Enter the following commands:
c.
Log in as the user primeusr.
d.
Enter:
Step 5
On the server currently acting as the primary (active) server, start application monitoring and file synchronization:
a.
Log in as the root user.
b.
Enter the following commands:
c.
Log in as the user primeusr.
d.
Enter:
Integrating Applications with the Standby Server
Complete the following procedure to integrate applications such as Prime Network and Prime Optical with the standby server.
Step 1
Determine the instance ID of the application on the primary server you want to integrate. To do so, run the list command as the user primeusr.
Say you want to integrate Prime Network. In the following example, Prime Network’s instance ID value is 8.
Step 2
Enter one of the following commands, depending on whether you want to specify the necessary values when prompted by the script (in Interactive mode) or all at once (in Non-interactive mode):
# ./DMIntegrator.sh -r DMIntegrator.prop
# ./DMIntegrator.sh -d DMIntegrator.prop server db-service-name db-user db-password port dm_id
where:
– server is the IP address of the standby Prime Central database server
– db-service-name is the standby Prime Central database service name (for an embedded Oracle database, the default is primedb)
– db-user is the username of the standby Prime Central database user
– db-password is the password of the standby Prime Central database user
– port is the port number of the standby Prime Central database
– dm_id is the instance ID value you obtained in Step 1
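In non-interactive mode all six values are positional, so a missing or misplaced value is easy to overlook. The sketch below checks the argument count and prints the resulting command line for review. dmintegrator_cmd is a hypothetical helper name, and the requirement that port and dm_id be numeric is an assumption based on the examples in this section.

```shell
# Sketch only: dmintegrator_cmd is a hypothetical helper. It validates the
# six positional values and prints the command; it does not run it.
# Usage: dmintegrator_cmd SERVER DB_SERVICE DB_USER DB_PASSWORD PORT DM_ID
dmintegrator_cmd() {
    if [ "$#" -ne 6 ]; then
        echo "expected: server db-service-name db-user db-password port dm_id" >&2
        return 1
    fi
    # Assumption: port and dm_id are plain non-negative integers.
    case $5 in
        ''|*[!0-9]*) echo "port must be numeric: $5" >&2; return 1 ;;
    esac
    case $6 in
        ''|*[!0-9]*) echo "dm_id must be numeric: $6" >&2; return 1 ;;
    esac
    printf './DMIntegrator.sh -d DMIntegrator.prop %s %s %s %s %s %s\n' "$@"
}
```

For example, dmintegrator_cmd 192.0.2.20 primedb db-user db-password 1521 8 prints the full command, where the address, credentials, and port are placeholders for your own values.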
Geographical Disaster Recovery Implementation: Best Practices and Troubleshooting Information
Log file locations:
- During geographical disaster recovery setup, refer to the log files located in the logs folder (which resides in the same folder as the setup scripts).
- While geographical disaster recovery is taking place, refer to the logs here: primeusr-home-directory/local/disaster_recovery/logs
Problem Prerequisites check failed.
Solution Make sure the server meets all required specifications in terms of:
- Hard disk space
- RAM and swap space
- RHEL versions
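To compare a server against the required specifications, you first need its actual figures. The sketch below collects them from standard Linux sources (/proc/meminfo, df, and /etc/redhat-release). meminfo_kb and report_prereqs are hypothetical helper names, and the required values themselves are not encoded here; take them from the installation documentation.

```shell
# Sketch only: both helpers are hypothetical, not part of Prime Central.

# meminfo_kb FIELD [FILE]: print FIELD (for example MemTotal or SwapTotal)
# in kB from a /proc/meminfo-style file.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# report_prereqs [PATH]: print RAM, swap, free disk space on PATH
# (default /), and the OS release string.
report_prereqs() {
    echo "RAM (kB):  $(meminfo_kb MemTotal)"
    echo "Swap (kB): $(meminfo_kb SwapTotal)"
    echo "Free disk on ${1:-/} (kB): $(df -Pk "${1:-/}" | awk 'NR == 2 { print $4 }')"
    echo "OS release: $(cat /etc/redhat-release 2>/dev/null || echo unknown)"
}
```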
Problem Environment setup failed.
Solution Do the following:
- Make sure you run the script as the root user.
- Verify whether the Expect Unix tool is installed. If not, check the log files in the logs folder for any errors that occurred.
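Both checks in the list above can be made from the shell before you rerun the setup script. This is a minimal sketch using the standard id and command -v utilities; have_tool and check_setup_env are hypothetical helper names.

```shell
# Sketch only: both helpers are hypothetical.

# have_tool NAME: succeed if the shell can resolve NAME as a command.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# check_setup_env: succeed only when running as root with expect available.
check_setup_env() {
    if [ "$(id -u)" -ne 0 ]; then
        echo "run the setup script as the root user" >&2
        return 1
    fi
    if ! have_tool expect; then
        echo "the expect tool is not installed" >&2
        return 1
    fi
}
```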
Problem Oracle database standby setup failed.
Solution Make sure that database backups and archive logs are enabled on the primary server.
1. Log in to the primary server as the user primeusr.
2. Enter the following command:
If any errors occur, please refer to the Cisco Prime Central 1.4 User Guide for more troubleshooting information.
If backups are already enabled but the Oracle database standby setup failed:
1. Log in to the database server as the user oracle.
2. Enter the following command:
3. Check the log file named PCoracleADG.ksh_*.log for any errors, where * refers to the latest available log file.
4. Once the errors have been identified and resolved, clean up the server and run the geographical disaster recovery setup process again.
Problem Application Monitor always reports that the File Sync operation failed.
Solution Make sure that trust has been established between the primary and standby servers for the user primeusr. If you need to establish trust, perform the tasks described in Step 2g of Configuring Prime Central for Geographical Disaster Recovery.
Problem Prime Central and Fault Management failback reports that the File Sync operation failed.
Solution Make sure that trust has been established between the primary and standby servers for the user primeusr. If you need to establish trust, perform the tasks described in Step 2g of Configuring Prime Central for Geographical Disaster Recovery.
Problem Prime Central installation on the standby server failed.
Solution Verify that:
- The standby server meets all prerequisites.
- You entered the right passwords when prompted by the installer.
- The server is configured for DNS or the /etc/hosts file has the right information.
Refer to the Cisco Prime Central 1.4 Quick Start Guide for installation troubleshooting information.
Problem Application integration on the standby server failed.
Solution Make sure that the application’s ID value specified during the integration process is valid. To check, perform the steps described in the application’s integration guide.
Problem Rsync failed.
Solution Make sure that both the primary and standby nodes can communicate via SSH without authentication (the process is described in Step 2g of Configuring Prime Central for Geographical Disaster Recovery). To view more details on the Rsync failure, navigate to primeusr-home-directory/local/disaster_recovery/logs and open app_mon.log.
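When app_mon.log is large, filtering it can be quicker than reading it in full. The sketch below assumes that the relevant entries contain the word rsync, which is an assumption about the log format rather than something this document specifies; rsync_log_lines is a hypothetical helper name.

```shell
# Sketch only: rsync_log_lines is a hypothetical helper.
# rsync_log_lines LOGFILE [COUNT]: print the last COUNT lines (default 20)
# of LOGFILE that mention rsync, ignoring case.
rsync_log_lines() {
    grep -i 'rsync' "$1" | tail -n "${2:-20}"
}
```

You would point it at primeusr-home-directory/local/disaster_recovery/logs/app_mon.log, substituting your actual primeusr home directory.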
Problem Information for the integration layer is not found in the standby database.
Solution Run the following commands:
# cd /root/disaster_recovery/scripts/main/