As shown in the figure
below, all P-GWs and P-CSCFs will direct traffic to the secondary site in the
event of a complete outage of the primary site. Failover time depends on the
failure detection timers on the P-GW and P-CSCF, and on the time the database
replica set takes to elect a new Master database at the secondary site.
Figure 1. Outage of
In order for Site A to
be considered “ready for service” after an outage, all three tiers (Policy
Director, Application Layer and Persistence Layer) must be operational.
At the Persistence
(database replica set) level, MongoDB uses an operations log (oplog) to keep a
rolling record of all operations that modify the data stored in the database.
Any database operations applied on the Primary node are recorded on its oplog.
Secondary members can then copy and apply those operations in an asynchronous
process. All replica set members contain a copy of the oplog, which allows them
to maintain the current state of the database. Any member can import oplog
entries from any other member. Once the oplog is full, newer operations
overwrite older ones.
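The rolling-overwrite behavior described above can be sketched as a fixed-capacity log. This is a simplified model for illustration only, not MongoDB's actual capped-collection implementation; the class and method names are hypothetical:

```python
from collections import deque

class RollingOplog:
    """Simplified model of a capped oplog: a fixed-size rolling record
    of write operations. Once full, newer entries evict the oldest."""

    def __init__(self, capacity):
        self._entries = deque(maxlen=capacity)

    def record(self, optime, op):
        # The primary records every data-modifying operation.
        self._entries.append((optime, op))

    def oldest_optime(self):
        # Operations older than this have been overwritten and can no
        # longer be replayed by a lagging secondary.
        return self._entries[0][0] if self._entries else None

    def entries_since(self, optime):
        # What a secondary would copy and apply asynchronously.
        return [(t, op) for t, op in self._entries if t > optime]

oplog = RollingOplog(capacity=3)
for t in range(5):
    oplog.record(t, f"write-{t}")
# Capacity is 3, so optimes 0 and 1 have been overwritten.
print(oplog.oldest_optime())   # 2
print(oplog.entries_since(2))  # [(3, 'write-3'), (4, 'write-4')]
```

A secondary that stopped at optime 2 can still catch up from this log; one that stopped at optime 0 cannot, which is the distinction driving the two recovery scenarios below.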
When the replica
members at Site A come back up after an outage and the connectivity between
Sites A and B is restored, there are two possible recovery scenarios:
1. The oplog at Site B has enough history to fully resynchronize the whole
replica set; that is, the oplog was not overwritten during the outage. In this
scenario, the database instances at Site A go into “Recovering” state once
connectivity to Site B is restored. By default, when one of those instances
catches up to within 10 seconds of the latest oplog entry of the current
primary at Site B, the set holds an election to allow the higher-priority node
at Site A to become primary again.
2. The oplog at Site B does not have enough history to fully resynchronize the
whole replica set (the outage lasted longer than the system can support without
overwriting data in the oplog). In this scenario, the database instances at
Site A go into “Startup2” state and remain in that state until a complete
resynchronization is forced manually, as they are too stale to catch up with
the current primary. A “too stale to catch up” message appears in mongodb.log
or in the errmsg field when running rs.status(). For more
information on manual resynchronization,
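The two scenarios reduce to a single question: is Site A's last applied operation still covered by Site B's oplog? A hedged sketch of that decision follows; the function name, arguments, and returned state strings are illustrative, not an actual CPS or MongoDB API:

```python
def recovery_state(site_a_last_optime, site_b_oldest_oplog_optime):
    """Decide how a returning Site A member can recover.

    If Site A's last applied operation is still present in Site B's
    oplog, the member can replay the missing operations and enters
    RECOVERING; otherwise it is "too stale to catch up" and stays in
    STARTUP2 until a complete resynchronization is forced manually.
    """
    if site_a_last_optime >= site_b_oldest_oplog_optime:
        return "RECOVERING"  # incremental catch-up via oplog replay
    return "STARTUP2"        # oplog history lost; full resync needed

# Short outage: Site B's oplog still reaches back past optime 100.
print(recovery_state(site_a_last_optime=100,
                     site_b_oldest_oplog_optime=80))   # RECOVERING

# Long outage: everything up to optime 150 has been overwritten.
print(recovery_state(site_a_last_optime=100,
                     site_b_oldest_oplog_optime=150))  # STARTUP2
```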
During a complete
resynchronization, all the data is removed from the database instances at Site
A and restored from Site B by cloning the Site B session database. All Read and
Write operations will continue to use Site B during this operation.
Recovery time, the hold
time for automatic recovery, and similar parameters depend on TPS, latency,
and oplog size. For optimum values, contact your Cisco Technical
Representative.
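The longest outage the system can absorb without a full resynchronization can be roughly estimated from the oplog size and the write rate. This is a back-of-the-envelope sketch only; the average entry size below is an assumed illustrative figure, so confirm actual sizing with your Cisco Technical Representative:

```python
def oplog_window_seconds(oplog_size_bytes, tps, avg_entry_bytes):
    """Rough oplog retention window: how long until the capped oplog
    wraps and starts overwriting, i.e. the longest outage that still
    allows incremental catch-up rather than a full resync."""
    bytes_per_second = tps * avg_entry_bytes
    return oplog_size_bytes / bytes_per_second

# Example: 10 GB oplog, 2000 session writes/s, ~1 KB per oplog entry.
window = oplog_window_seconds(10 * 1024**3, tps=2000, avg_entry_bytes=1024)
print(f"{window / 3600:.1f} hours")  # ~1.5 hours
```

Higher TPS or larger oplog entries shrink this window, which is why recovery behavior after a long outage depends on traffic load.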
In CPS Release 7.5.0
and later releases, at the Policy Director level, an automated mechanism
checks the availability of the Master database within the local site. When the
Master database is not available, the Policy Director processes are stopped
and do not process any incoming messages (Gx/Rx).
- This check runs at Site A (the primary site).
- The check runs every 5 seconds (currently not configurable) and determines
whether the Master Sessions database is at Site A.
- The databases that the script monitors (Sessions, SPR, Balance) are
configurable. By default, only the Sessions database is monitored.
- If the Master database is not available at Site A, the two Policy Director
processes (Load Balancers) of Site A are stopped, or remain stopped if
recovering from a complete outage (as described in this section).
- In the case of two replica sets, if the Master database of either replica
set is not available at Site A, the two Policy Director processes (Load
Balancers) of Site A are stopped, or remain stopped if recovering from a
complete outage, and the Master database of the second replica set fails over
from Site A to Site B.
The checks described
above prevent cross-site communication for read/write operations. Once the
site has recovered, P-GWs and P-CSCFs start directing new sessions to Site A
again.
For existing sessions,
P-GWs will continue to send traffic to Site B until a message for the session
(RAR) is received from Site A. That will happen, for example, when a new call
is made and the Rx AAR for the new session is sent by the P-CSCF to Site A.
Also, for existing Rx sessions, the P-CSCF will continue to send the traffic to