Backup and Restore
Backups
Backups represent a snapshot of the system, including the database, configuration, and system applications. Backups can be created manually, scheduled to run automatically, or created automatically when the system is upgraded. Backups can be stored on the local file system or at a remote network location. There is no direct requirement for VMware snapshots. For examples of backup maintenance commands and output, refer to the topics on Scheduling and Create a Backup.
If the Cisco Unified Communications Domain Manager 10.6(1) node is not recoverable, due to, for example, a hardware failure, a new node can be deployed and an existing backup restored to return the node to service.
- Backup Destinations
- Create Space for a Backup or Restore
- Adding More Space to Accommodate a Large Restore
- Backup Passphrase
- Setting up the Backup Passphrase on a New Environment
- Reassign Current Drives (Backup and DBroot)
- Create a Backup
- Restore the Backup
- Restore a Backup in a Clustered Environment
- Example of a Successful Restore
- Maintaining Backups
- Exporting Backups
- Scheduling
- DR Failover
- DR Failover and Recovery
- High Availability Disaster Recovery
- Troubleshooting
Backup Destinations
Backups can be made to the local file system or a remote destination.
- Display available backup destinations with backup list.
- Add a new backup destination with backup add <location-name> <URL>.
Local backups are stored on a separate backup volume and the localbackups destination is pre-configured. If the backup volume is too small, it can be increased in size.
Note: If the localbackups destination is removed or renamed, an ISO file upgrade will no longer function. Therefore, it is imperative that this destination is not removed.
Example:
backup add myserverbackup sftp://user@server/path
Backups to sftp require ssh key-based authentication to be set up.
If a common remote backup point is to be used by all nodes in the cluster, the backup destination needs to be added to each node. This can be automated by using cluster remote execution, for example:
cluster run all backup add myserverbackup sftp://user@server/path
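To confirm that the destination is visible on every node, the same remote-execution mechanism can be used; this is a sketch, assuming the destination has been added on all nodes as above:
cluster run all backup list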
Create Space for a Backup or Restore
If a No space left on device message is received during a backup or a restore, carry out the following steps:
Adding More Space to Accommodate a Large Restore
Backup Passphrase
System backups are encrypted. The encryption key is initially set as the platform user's password as set in the installation wizard. It is recommended that this be changed after installation. This can be done by running backup passphrase.
The following example shows the console output:
platform@masternode:~$ backup passphrase
Please enter current backup passphrase
Password:
Please enter new backup passphrase
Password:
Please re-enter new backup passphrase
Password:
Backup passphrase successfully changed
This passphrase must be kept, because restoring the backup to a new system requires the same passphrase.
To restore on the new system, run the above command and enter the passphrase that was used when the backup was created.
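For example, the dialogue on the new system looks much like the one above (a sketch only; the host name is illustrative, and the new passphrase entered must match the one used when the backup was created):
platform@newnode:~$ backup passphrase
Please enter current backup passphrase
Password:
Please enter new backup passphrase
Password:
Please re-enter new backup passphrase
Password:
Backup passphrase successfully changed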
Setting up the Backup Passphrase on a New Environment
To set the backup passphrase to restore on a different environment:
Reassign Current Drives (Backup and DBroot)
Create a Backup
Backups can be created using backup create <destination>, for example:
backup create localbackups or backup create myserverbackup (where myserverbackup is a previously added remote destination).
An example of the console output is shown below:
platform@myhost:~$ backup create localbackup
... collecting data
... preparing mongo data backup
... Backing up database <name> ............
... Backing up database <name>_FILES ..
... Backing up database <name>_EXPORT .
... Not backing up database local
... Backing up database PLATFORM ..................
... Backing up database admin .
... creating backup 98% completed
... verifying backup
Backup was successfully created at file:///backups/a0b26a1a267a1582e2aa0258a4fa85b75d4b09bb
WARNING: Backup maintenance of this location is not scheduled
schedule add localbackup-maintain backup clean localbackup keep 5
platform@myhost:~$ backup list localbackup
localbackup:
URI: file:///backups
Backups: 2014-06-18 16:26
Backups contain all application data.
Backups can be scheduled to run automatically - refer to the schedule command to automate backups.
For example:
- schedule add mybackups backup create myserverbackup
- schedule time mybackups 2 0
- schedule enable mybackups
A cluster-wide backup can be created using the command cluster run all backup create myserverbackup after creating the destination with, for example, cluster run all backup add myserverbackup sftp://user@server/path. Generally, it is not recommended that a cluster-wide backup be scheduled from a single node, since failure of that node could result in missing backups. Rather, schedule a backup per node as shown above.
If a common network URI is used as backup destination across the cluster, each node's backup will be uniquely identified by its UID in the remote backup directory.
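If the same destination and per-node schedule are required throughout the cluster, cluster remote execution can be used to configure them in one pass. This is a sketch only, assuming schedule commands can be run through cluster run in the same way as the backup commands above; the job name nodebackup is illustrative:
cluster run all backup add myserverbackup sftp://user@server/path
cluster run all schedule add nodebackup backup create myserverbackup
cluster run all schedule time nodebackup 2 0
cluster run all schedule enable nodebackup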
Restore the Backup
Restore a Backup in a Clustered Environment
In a clustered environment, the system can tolerate server failures and keep data intact, because when a server fails, an automatic failover occurs. As long as all services keep running and data remains accessible, a backup restore is only necessary in very specific scenarios.
Restoring a backup in a cluster would only be necessary in the following cases:
Example of a Successful Restore
platform@Restore:~$ backup restore localbackup 2015-02-26 00:22
Services will be restarted during the restore. Do you wish to continue? y
Application voss-deviceapi processes stopped.
Stopping Application while performing database restore
--- Restore, ip=172.29.41.240, role=webproxy,application,database, loc=jhb
Application nginx processes stopped.
System restore starting from file:///backups/93d19980b574ed743d9b000a7595e42cad6a6d6b (1424910132)
Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Thu Feb 26 00:22:12 2015
Successfully restored to /backups/appdata/restore_temp_1427441507, moving to /backups/appdata
Removing temporary files in /backups/appdata/restore_temp_1427441507
local|admin
Dropping database PLATFORM before restoring
MongoDB shell version: 2.6.1
connecting to: 127.0.0.1:27020/PLATFORM
[object Object]
Repairing database PLATFORM before restoring
MongoDB shell version: 2.6.1
connecting to: 127.0.0.1:27020/PLATFORM
[object Object]
Dropping database VOSS_FILES before restoring
MongoDB shell version: 2.6.1
connecting to: 127.0.0.1:27020/VOSS_FILES
[object Object]
Repairing database VOSS_FILES before restoring
MongoDB shell version: 2.6.1
connecting to: 127.0.0.1:27020/VOSS_FILES
[object Object]
Dropping database VOSS before restoring
MongoDB shell version: 2.6.1
connecting to: 127.0.0.1:27020/VOSS
[object Object]
Repairing database VOSS before restoring
MongoDB shell version: 2.6.1
connecting to: 127.0.0.1:27020/VOSS
[object Object]
Trying with oplogReplay
restore successfull
{'172.29.41.240': (200, '\n')}
Starting Application after performing database restore
--- Restore, ip=172.29.41.240, role=webproxy,application,database, loc=jhb
Application services:firewall processes stopped.
Application nginx processes started.
Restarting services
Application processes stopped.
Application processes started.
System settings have changed, please reboot using 'system reboot'
Restored successfully
You have new mail in /var/mail/platform
Maintaining Backups
A complete list of backups on a location can be displayed using backup list <location>.
Backups can be deleted using the following commands:
- backup clean <location> keep <N> will delete older copies so that only N copies are kept.
- backup clean <location> before <yyyy-mm-dd [HH:MM]> will delete copies older than the specified date; a one-off example follows the scheduled maintenance commands below.
By default, there is no regular maintenance of backups, and a scheduled job should be created to perform this maintenance, for example:
- schedule add backuprotate backup clean localbackups keep 5
- schedule time backuprotate 3 0
- schedule enable backuprotate
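For a one-off cleanup rather than a scheduled job, the before form shown above can be used directly; for example (the date is illustrative):
backup clean localbackups before 2014-01-01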
Exporting Backups
Backups are encrypted and may consist of multiple files on the backup destination.
If a backup is to be exported to another system, it must be exported with the command:
backup export <location> <destination-URI> <yyyy-mm-dd [HH:MM]>
For example:
backup export localbackups destination-location 2014-04-30 11:16
In turn, the backup can be imported on the remote server using backup import <source-URI>.
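As a minimal end-to-end sketch, assuming an sftp location that is reachable from both systems (the URI and timestamp are illustrative):
backup export localbackups sftp://user@otherserver/exports 2014-04-30 11:16
On the other system, the exported backup can then be imported with:
backup import sftp://user@otherserver/exports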
Scheduling
Any CLI command can be scheduled to run automatically, including but not restricted to backups and security upgrades.
By default there is no backup maintenance scheduled. Backup maintenance can be scheduled with the number of copies to be kept.
The automated job schedule format is as follows:
- schedule add <job-name> <user-command>
- schedule time <job-name> <hour> <minute>
- schedule time <job-name> every <N> hours
- Alternatively, the job can be scheduled to run every week on Monday with schedule time <job-name> weekly 1, where 0 is Sunday, 1 is Monday, 2 is Tuesday, 3 is Wednesday, 4 is Thursday, 5 is Friday and 6 is Saturday.
- schedule enable <job-name>
Example:
schedule add mybackups backup create localbackups
schedule time mybackups 2 0
schedule time mybackups weekly 0
schedule enable mybackups
Among the tasks that can be scheduled are the following (a combined example follows the list):
- Backup creation, e.g. schedule add backupme backup create localbackup
- Backup maintenance, e.g. schedule add backupclean backup clean localbackup keep 5
- Health reports, e.g. schedule add reports diag report
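For example, a weekly health report run on Mondays at 03:00 could be set up as follows (a sketch using the syntax above; the job name reports is illustrative):
schedule add reports diag report
schedule time reports 3 0
schedule time reports weekly 1
schedule enable reports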
DR Failover
The Cisco Unified Communications Domain Manager 10.6(1) system makes use of database replication facilities during normal operation. During a failover, if 50% or more of the service resources are lost, the system will no longer function without manual intervention. In this case, the following process should be followed.
DR Failover and Recovery
DR Failover and Recovery Scenarios
A number of failover scenarios and recovery steps are shown. In each case, a node topology is assumed, a node failure scenario is indicated, and a set of recovery steps is provided.
Cisco Unified Communications Domain Manager System Recovery is supported from the following failover scenarios:
- Loss of a non-primary node in the Primary site
- Loss of a non-primary server in the DR site
- Loss of the Primary Database Server
- Loss of a Primary Site
- Loss of a DR Site
- In the event of a network failure or a temporary network outage affecting a single node, the node will be inaccessible and the cluster will respond in the same way as if the node had failed. If network connectivity is then restored, no action is required, because the node will again start communicating with the other nodes in the cluster, provided no changes were made to that node during the outage window.
- In a clustered deployment, the deployment would typically span two different datacentres, for example “Virginia” and “Seattle”. These can be thought of as a primary site and a DR (Disaster Recovery) site in case of a failure in the primary site. These two datacentres can even exist on the same physical hardware; the separation of the cluster is simply into two sets of three nodes.
When datacentres are defined during installation, the nodes of a cluster may or may not be in the same physical location. The cluster is designed to communicate across all nodes, regardless of their physical location.
Scenario: Loss of a Non-primary Node in the Primary Site
- The administrator deployed the cluster into a Primary and DR site.
- The cluster is deployed according to the Cisco Unified Communications Domain Manager Planning and Install Guide.
- The example is a typical cluster deployment: 6 nodes, where 4 nodes are database servers and 2 nodes are proxy servers. The design is preferably split over 2 physical data centers.
Data Centre: jhb
application : AS01[172.29.42.100] AS02[172.29.42.101]
webproxy : PS01[172.29.42.102] AS01[172.29.42.100] AS02[172.29.42.101]
database : AS01[172.29.42.100] AS02[172.29.42.101]
Data Centre: cpt
application : AS03[172.29.21.100] AS04[172.29.21.101]
webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
database : AS03[172.29.21.100] AS04[172.29.21.101]
Node Failure
- Normal operations continue where the cluster is processing requests and transactions are committed successfully, up to the point where the loss of a node is experienced. In this scenario, AS02[172.29.42.101] failed while transactions were running.
- Examine the cluster status by running cluster status to determine the failed state:
platform@AS01:~$ cluster status
Data Centre: unknown
application : unknown_172.29.42.101[172.29.42.101] (not responding)
webproxy : unknown_172.29.42.101[172.29.42.101] (not responding)
database : unknown_172.29.42.101[172.29.42.101] (not responding)
Data Centre: jhb
application : AS01[172.29.42.100]
webproxy : PS01[172.29.42.102] AS01[172.29.42.100]
database : AS01[172.29.42.100]
Data Centre: cpt
application : AS03[172.29.21.100] AS04[172.29.21.101]
webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
database : AS03[172.29.21.100] AS04[172.29.21.101]
- At this point, all transactions that are currently in flight are lost and will not recover.
- The lost transactions have to be rerun.
- With the database server AS02[172.29.42.101] still down, replaying the failed transactions is successful.
Recovery steps if the lost server is unrecoverable (a consolidated command sketch follows these steps):
- A new unified node needs to be deployed. Ensure the server name, IP information and datacentre name are the same as on the server that was lost.
- Run cluster del 172.29.42.101, because this server no longer exists.
- Delete all database weights (database weight del <ip>), for example database weight del 172.29.42.101.
- Run cluster provision before the new server is added.
- Switch on the newly installed server.
- If the node will be a unified, application or web proxy node, run cluster prepnode on it.
- Run cluster add <ip> of the new unified server to add it to the existing cluster.
- Add database weights so that the weights are distributed throughout the cluster (database weight add <ip> <weight>).
- Run cluster provision to join the new unified node to the cluster communications.
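As a consolidated sketch of the steps above, run from a surviving node (the replacement node reuses the lost AS02 IP, and the weights shown are the example values listed later in this chapter):
cluster del 172.29.42.101
database weight del 172.29.42.100
database weight del 172.29.42.101
database weight del 172.29.21.100
database weight del 172.29.21.101
cluster provision
Deploy and switch on the replacement node, run cluster prepnode on it, and then continue from an existing node:
cluster add 172.29.42.101
database weight add 172.29.42.100 50
database weight add 172.29.42.101 40
database weight add 172.29.21.100 10
database weight add 172.29.21.101 20
cluster provision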
Scenario: Loss of a Non-primary Server in the DR Site
- The administrator deployed the cluster into a Primary and DR site.
- The cluster is deployed according to the Cisco Unified Communications Domain Manager Planning and Install Guide.
- The example is a typical cluster deployment: 6 nodes, where 4 nodes are database servers and 2 nodes are proxy servers. The design is preferably split over 2 physical data centers.
Node Failure
- Normal operations continue where the cluster is processing requests and transactions are committed successfully, up to the point where the loss of a node in the DR site is experienced. In this scenario, AS04[172.29.21.101] failed while transactions were running.
- Examine the cluster status by running cluster status to determine the failed state:
Data Centre: unknown
application : unknown_172.29.21.101[172.29.21.101] (not responding)
webproxy : unknown_172.29.21.101[172.29.21.101] (not responding)
database : unknown_172.29.21.101[172.29.21.101] (not responding)
Data Centre: jhb
application : AS01[172.29.42.100] AS02[172.29.42.101]
webproxy : PS01[172.29.42.102] AS01[172.29.42.100] AS02[172.29.42.101]
database : AS01[172.29.42.100] AS02[172.29.42.101]
Data Centre: cpt
application : AS03[172.29.21.100]
webproxy : PS02[172.29.21.102] AS03[172.29.21.100]
database : AS03[172.29.21.100]
- At this point, all transactions that are currently in flight are lost and will not recover.
- The lost transactions have to be rerun.
- With the database server AS04[172.29.21.101] still down, replaying the failed transactions is successful.
Recovery steps if the lost server is unrecoverable:
- A new unified node needs to be deployed. Ensure the server name, IP information and datacentre name are the same as on the server that was lost.
- Run cluster del 172.29.21.101, because this server no longer exists.
- Delete all database weights (database weight del <ip>), for example database weight del 172.29.21.101.
- Run cluster provision before the new server is added.
- Switch on the newly installed server.
- If the node will be a unified, application or web proxy node, run cluster prepnode on it.
- Run cluster add <ip> of the new unified server to add it to the existing cluster.
- Add database weights so that the weights are distributed throughout the cluster (database weight add <ip> <weight>).
- Run cluster provision to join the new unified node to the cluster communications.
Scenario: Loss of the Primary Database Server
- The administrator deployed the cluster into a Primary and DR site.
- The cluster is deployed according to the Cisco Unified Communications Domain Manager Planning and Install Guide.
- The example is a typical cluster deployment: 6 nodes, where 4 nodes are database servers and 2 nodes are proxy servers. The design is preferably split over 2 physical data centers.
Node Failure
- Normal operations continue where the cluster is processing requests and transactions are committed successfully, up to the point where the loss of the primary database server is experienced. In this scenario, AS01[172.29.42.100] failed while transactions were running.
- Examine the cluster status by running cluster status to determine the failed state:
Data Centre: unknown
application : unknown_172.29.42.100[172.29.42.100] (not responding)
webproxy : unknown_172.29.42.100[172.29.42.100] (not responding)
database : unknown_172.29.42.100[172.29.42.100] (not responding)
Data Centre: jhb
application : AS02[172.29.42.101]
webproxy : PS01[172.29.42.102] AS02[172.29.42.101]
database : AS02[172.29.42.101]
Data Centre: cpt
application : AS03[172.29.21.100] AS04[172.29.21.101]
webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
database : AS03[172.29.21.100] AS04[172.29.21.101]
- The loss of the Primary database server will cause an election, and the node with the highest weighting still running will become primary. The election itself may take 10-30 seconds.
- Check the weights set in the cluster configuration: database weight list
platform@AS01:~$ database weight list
172.29.21.100: weight: 10
172.29.21.101: weight: 20
172.29.42.100: weight: 50
172.29.42.101: weight: 40
- The primary node 172.29.42.100 failed, and therefore node 172.29.42.101 will become the primary node after the election.
- To find the primary database, run database primary.
platform@AS02:~$ database primary
172.29.42.101
- At this point, all transactions that are currently in flight are lost and will not recover.
- The lost transactions have to be rerun.
- With the database server AS01[172.29.42.100] still down, replaying the failed transactions is successful.
Recovery steps if the lost server is unrecoverable:
- A new unified node needs to be deployed. Ensure the server name, IP information and datacentre name are the same as on the server that was lost.
- Run cluster del 172.29.42.100, because this server no longer exists.
- Delete all database weights (database weight del <ip>), for example database weight del 172.29.42.101.
- Run cluster provision before the new server is added.
- Switch on the newly installed server.
- If the node will be a unified, application or web proxy node, run cluster prepnode on it.
- Run cluster add <ip> of the new unified server to add it to the existing cluster.
- Add database weights so that the weights are distributed throughout the cluster (database weight add <ip> <weight>).
- Run cluster provision to join the new unified node to the cluster communications.
Scenario: Loss of a Primary Site
- The administrator deployed the cluster into a Primary and DR site.
- The cluster is deployed according to the Cisco Unified Communications Domain Manager Planning and Install Guide.
- The example is a typical cluster deployment: 6 nodes, where 4 nodes are database servers and 2 nodes are proxy servers. The design is preferably split over 2 physical data centers.
- The cluster might also be in two geographically dispersed areas. The cluster has to be installed with two different site names or data centre names. In this scenario, a portion of the cluster is in Johannesburg and the other is in Cape Town, South Africa:
Data Centre: jhb
application : AS01[172.29.42.100] AS02[172.29.42.101]
webproxy : PS01[172.29.42.102] AS01[172.29.42.100] AS02[172.29.42.101]
database : AS01[172.29.42.100] AS02[172.29.42.101]
Data Centre: cpt
application : AS03[172.29.21.100] AS04[172.29.21.101]
webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
database : AS03[172.29.21.100] AS04[172.29.21.101]
Primary site failure
- Normal operations continue where the cluster is processing requests and transactions are committed successfully, up to the point where the loss of the Primary site is experienced. In this scenario, AS01[172.29.42.100], AS02[172.29.42.101] and PS01[172.29.42.102] failed while transactions were running.
- At this point, all transactions that are currently in flight are lost and will not recover.
- Examine the cluster status by running cluster status to determine the failed state:
Data Centre: unknown
application : unknown_172.29.42.100[172.29.42.100] (not responding) unknown_172.29.42.101[172.29.42.101] (not responding)
webproxy : unknown_172.29.42.100[172.29.42.100] (not responding) unknown_172.29.42.101[172.29.42.101] (not responding) unknown_172.29.42.102[172.29.42.102] (not responding)
database : unknown_172.29.42.100[172.29.42.100] (not responding) unknown_172.29.42.101[172.29.42.101] (not responding)
Data Centre: jhb
application :
webproxy :
database :
Data Centre: cpt
application : AS03[172.29.21.100] AS04[172.29.21.101]
webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
database : AS03[172.29.21.100] AS04[172.29.21.101]
- The cluster will not be operational, and manual intervention is needed to recover if a continued flow of transactions is required with a minimum of downtime.
- If the lost nodes are unrecoverable, carry out the following recovery steps.
Recovery Steps (a consolidated command sketch follows these steps):
- Run cluster del <ip> for each failed node from the existing half of the cluster.
- Remove all database weights from the cluster: database weight del <ip>
- At this point, you have the option to provision only half the cluster for a faster uptime of your DR site. Only the DR site will be operational after the provision.
- If you choose to bring the full cluster back up, you need to redeploy the primary site nodes if they are definitely unrecoverable.
- Run cluster provision before the new servers are added.
- Deploy 3 nodes: 2 as unified nodes and 1 as a proxy node.
- If a node will be a unified, application or web proxy node, run cluster prepnode on it.
- After the redeployment, run cluster add <ip> for each new node so that it becomes part of the cluster.
- Add the database weights back, using database weight add <ip> <weight>
- Run cluster provision primary to ensure that a primary is selected for the provisioning stage.
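As a consolidated sketch of the steps above for this topology, run from a surviving DR-site node (the weights shown are the example values used earlier in this chapter):
cluster del 172.29.42.100
cluster del 172.29.42.101
cluster del 172.29.42.102
database weight del 172.29.42.100
database weight del 172.29.42.101
database weight del 172.29.21.100
database weight del 172.29.21.101
cluster provision
Redeploy AS01, AS02 and PS01, run cluster prepnode on each, and then continue from an existing node:
cluster add 172.29.42.100
cluster add 172.29.42.101
cluster add 172.29.42.102
database weight add 172.29.42.100 50
database weight add 172.29.42.101 40
database weight add 172.29.21.100 10
database weight add 172.29.21.101 20
cluster provision primary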
Scenario: Loss of a DR Site
- The administrator deployed the cluster into a Primary and DR site.
- The cluster is deployed according to the Cisco Unified Communications Domain Manager Planning and Install Guide.
- The example is a typical cluster deployment: 6 nodes, where 4 nodes are database servers and 2 nodes are proxy servers. The design is preferably split over 2 physical data centers.
- The cluster might also be in two geographically dispersed areas. The cluster has to be installed with two different site names or data centre names. In this scenario, a portion of the cluster is in Johannesburg and the other is in Cape Town, South Africa:
Data Centre: jhb
application : AS01[172.29.42.100] AS02[172.29.42.101]
webproxy : PS01[172.29.42.102] AS01[172.29.42.100] AS02[172.29.42.101]
database : AS01[172.29.42.100] AS02[172.29.42.101]
Data Centre: cpt
application : AS03[172.29.21.100] AS04[172.29.21.101]
webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
database : AS03[172.29.21.100] AS04[172.29.21.101]
DR site failure
- Normal operations continue where the cluster is processing requests and transactions are committed successfully, up to the point where the loss of the DR site is experienced. In this scenario, AS03[172.29.21.100], AS04[172.29.21.101] and PS02[172.29.21.102] failed while transactions were running.
- At this point, all transactions that are currently in flight are lost and will not recover. The lost transactions have to be rerun.
- With the DR site still down, replaying the failed transactions is successful.
- Examine the cluster status by running cluster status to determine the failed state:
Data Centre: unknown
application : unknown_172.29.21.100[172.29.21.100] (not responding) unknown_172.29.21.101[172.29.21.101] (not responding)
webproxy : unknown_172.29.21.100[172.29.21.100] (not responding) unknown_172.29.21.101[172.29.21.101] (not responding) unknown_172.29.21.102[172.29.21.102] (not responding)
database : unknown_172.29.21.100[172.29.21.100] (not responding) unknown_172.29.21.101[172.29.21.101] (not responding)
Data Centre: jhb
application : AS01[172.29.42.100] AS02[172.29.42.101]
webproxy : PS01[172.29.42.102] AS01[172.29.42.100] AS02[172.29.42.101]
database : AS01[172.29.42.100] AS02[172.29.42.101]
Data Centre: cpt
application :
webproxy :
database :
- The cluster will be operational, but only on the Primary Site.
- If the lost nodes are unrecoverable, follow the recovery steps below.
Recovery Steps
- Run cluster del <ip> for each failed node from the existing half of the cluster.
- Remove all database weights from the cluster: database weight del <ip>
- Run cluster provision before a new server is added.
- If you choose to bring the full cluster back up, you need to redeploy the DR site nodes if the nodes are unrecoverable.
- Deploy 3 nodes: 2 as unified nodes and 1 as a proxy node.
- If a node will be a unified, application or web proxy node, run cluster prepnode on it.
- After the redeployment, run cluster add <ip> for each new node so that it becomes part of the cluster.
- Add the database weights back, using database weight add <ip> <weight>
- Run cluster provision primary to ensure that a primary is selected for the provisioning stage.
High Availability Disaster Recovery
High Availability Overview
High Availability (HA) is an approach to IT system design and configuration that ensures Cisco Unified Communications Domain Manager is operational and accessible during a specified time frame. This is achieved using redundant hardware and resources. If there is a failure, an automatic failover will occur to the secondary database node.
Default High Availability Disaster Recovery Scenario
Cisco Unified Communications Domain Manager 10.6(1) supports High Availability and Disaster Recovery using off-the-shelf VMware tools.
High Availability is implemented using VMware HA clusters, with data accessed via a central storage facility (SAN). VMware monitors the primary server, and should it fail, another instance of the VM is automatically started on a different hardware instance. Since data is shared on the SAN, the new HA instance will have access to the full dataset.
Disaster Recovery is implemented by streaming data updates to a separate DR instance that remains powered on. If the primary server fails, the DR instance can take over operation. The switch-over to the DR instance is scripted, but must be invoked manually.
- DNS can be used effectively to provide hostname abstraction of underlying IP addresses. In such a case, a DNS update will allow existing UC elements to seamlessly interact with the new DR instance.
- If DNS is not available, and the UC elements cannot be configured with the IP address of the DR instance, it is necessary for the DR instance to assume the primary IP address. In such a case, the DR and the primary IP addresses can be swapped using the CLI interface. Standard networking practices should be employed to ensure that the IP address is correctly routed, e.g. a stretched layer-2 vLAN, and to ensure that the Primary and DR instances are not operated with the same IP address.
- Since the HA instance is started automatically if the primary instance fails, a slight interruption in service is expected, including the VMware polling latency in determining that the primary server has failed and the startup delay of the HA instance. This delay is around 3 minutes.
- If data is corrupted on the SAN, the HA instance will start with the same corrupt code and data instances.
- Since VMware is checking only for VM liveness, it is not able to check that the primary instance is functionally active.
- Data updates are transported to the DR instance. If data updates cannot be shipped by the primary instance, SNMP traps are generated to inform administrators of the problem. However, if this is not fixed in a timely manner, the DR instance can fall out of sync, and the delay could result in data loss between the primary and DR instances. Database updates are shipped every 3 minutes or every 16 MB of data, whichever comes first.
HA and DR Scenario with Cisco VMDC Geo-Redundancy Architecture
High Availability and Disaster Recovery instances can be geo-relocated at will within the capabilities of the underlying network architecture.
For example, it is feasible to extend a VMware High Availability cluster geographically using high speed data links and layer-2 stretched vLANs.
Disaster Recovery as implemented by the Cisco Unified Communications Domain Manager 10.6(1) system lends itself to geographical separation with streaming data replication to a second powered-on instance.
Interaction with other UC elements must be considered within the capabilities of the network, using either DNS for seamless transition, or IP reconfiguration either within the UC elements or the Cisco Unified Communications Domain Manager 10.6(1) system.
Configuring a HA System Platform on VMWare
This is an optional step; however, for production servers it is highly recommended that they run in an HA deployment configuration. This can be done by the client, but should be checked by a system representative.
Troubleshooting
'No Space Left on Device' Error
You receive the following error message while backing up or restoring Cisco Unified Communications Domain Manager 10.6(1) on a virtual machine: 'No Space Left on Device.' To resolve this, create a new virtual disk on the node with the primary database and then reassign the Cisco Unified Communications Domain Manager 10.6(1) data to the new disk, so that there is enough space to perform the backup or restore operation.
Loss of the whole cluster and redeploying new servers
The high level redeploy and backup restore steps are as follows:
- Redeploy the cluster.
- Store the backup you want to restore in a different location.
- Recreate the remote backup destination on the primary node using backup add <loc-name> <URI>.
- Copy the saved backup into the new UID folder on the remote backup server.
- Run backup list.
For example:
pxetest:
URI: sftp://sftpusr:********@172.29.42.249/AS03
Backups: 1 backups have been created - most recently 2014-08-21 10:24
A backup restore can now be run on the primary.
The example console output below shows the steps and process:
Identifying the database primary:
platform@AS01:~$ database primary
172.29.42.100
Listing the backups:
platform@AS01:~$ backup list
localbackup:
URI: file:///backups
Backups: 2 backups have been created - most recently 2014-08-21 17:59
pxetest:
URI: sftp://sftpusr:********@172.29.42.249/AS01
Backups: 2 backups have been created - most recently 2014-08-21 12:54
You have new mail in /var/mail/platform
Restoring the backup:
platform@AS01:~$ backup restore pxetest 2014-08-21 12:54 Services will be restarted during the restore. Do you wish to continue? y Application <name>-deviceapi processes stopped. Stopping Application while performing database restore ---------- AS02, ip=172.29.42.101, role=webproxy,application,database, loc=cpt Stopping nginx:proxy ---------- AS01, ip=172.29.42.100, role=webproxy,application,database, loc=cpt Application nginx processes stopped. ---------- AS02, ip=172.29.42.101, role=webproxy,application,database, loc=cpt Application nginx processes stopped. ---------- AS04, ip=172.29.21.191, role=webproxy,application,database, loc=jhb Application nginx processes stopped. ---------- AS03, ip=172.29.21.190, role=webproxy,application,database, loc=jhb Application nginx processes stopped. System restore starting from sftp://sftpusr:sftpusr@172.29.42.249/AS01/ba1e37deff1309edcc2595bf46c6bfc2a99ca164 Local and Remote metadata are synchronized, no sync needed. Last full backup date: Thu Aug 21 12:54:25 2014 Successfully restored to /backups/appdata/restore_temp_1408699183, moving to /backups/appdata Removing temporary files in /backups/appdata/restore_temp_1408699183 local Dropping database <name>_FILES before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/<name>_FILES [object Object] Repairing database <name>_FILES before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/<name>_FILES [object Object] Dropping database PLATFORM before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/PLATFORM [object Object] Repairing database PLATFORM before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/PLATFORM [object Object] Dropping database <name> before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/<name> [object Object] Repairing database <name> before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/<name> [object Object] Dropping database <name>_LOCKING before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/<name>_LOCKING [object Object] Repairing database <name>_LOCKING before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/<name>_LOCKING [object Object] Dropping database admin before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/admin [object Object] Repairing database admin before restoring MongoDB shell version: 2.6.1 connecting to: 127.0.0.1:27020/admin [object Object] Trying with oplogReplay Trying without oplogReplay restore successfull Restarting services Application processes stopped. Application processes started. System settings have changed, please reboot using ’system reboot’