February 7, 2005
Products Affected
BTS10200 R3.5.X
Problem Description
An unusually high number of database 3 alarms, errors in Deferror queue, are seen due to a configuration issue with the Event (EVT) process. An example of an alarm is as follows:
ID=1107072602939 TYPE=DATABASE NUMBER=3 DESCRIPTION=There are errors in EMS database DefError queue; contact DBA SEVERITY=CRITICAL ALARM_STATUS=ON TIMESTAMP=2005-01-31 22:40:02 ORIGIN=DATABASE THREAD=UNSPECIFIED THREAD COMPONENT_ID=EM01 DW1=ems-1 DW2=optical1 DW3=1_transactions_in_error
Background
The Event Process runs on both the active and secondary EMS. Every 30 minutes, these two processes delete old entries from the Oracle tables EVENT_LOG and ALARM_LOG. However, the Event Process on the primary side is deleting entries on the active DB while the Event Process on the standby is deleting entries on the standby DB. The problem arises when both deletions, on active and on standby, occur within 2-3 seconds of each other and cause Oracle Replication to generate conflicting transaction errors.
Problem Symptoms
The dbadm tool can be used to display replication error transactions (Deferror) as follows:
-
Login to the EMS as root
-
su - oracle
-
dbadm -r get_deferror
Oracle error ORA-01403: no data found is the typical output from get_deferror when the error transactions are due to the EVT configuration issue described in this field notice.
-
dbadm -r get_defcall | more
This will show the REP_DELETE operation on EVENT_LOG or ALARM_LOG from replication error queue.
Workaround/Solution
Modify the Platform Programs table in the EMS to indicate that EVT should only be started when the EM01 goes ACTIVE.
On both EMS's:
. /opt/ems/bin/Runtime.sh [ENTER]
mysql -u root -pmc68000 -h `hostname` -P 3307 -C Ems [ENTER]
mysql> UPDATE PlatformPrograms set manage='Y' WHERE id='EVT';[enter]
mysql>exit
kill `ps -ef | grep -v grep | grep EVT | awk '{print $2}'`
on both EMS's:
pkill smg3
Verify on primary the evt is still running
ps -ef|grep -i evt
Verify on secondary evt is not running
ps -ef|grep -i evt
This notice applies to RLS 3.x customers only and it is being tracked by defect CSCeg42249.
Additional Information
Checking Deferrors
In Oracle, there is a tool called dbadm which will be used for this procedure. When executing the procedure you will have output generated for specific Oracle queues as depicted below. The dbadm tool should never be used to truncate anything other than the deferror queue. The command dbadm -A truncate_def should never be used on a live system as this command truncates not only the deferror queue but also the deftran and deftrandest queues.
On the standby EMS:
su - oracle dbadm - C rep optical2:test-ems-2:/opt/orahome$ dbadm -C rep OPTICAL2::Deftrandest is empty? YES OPTICAL2::dba_repcatlog is empty? YES OPTICAL2::Deferror is empty? NO <==== OPTICAL2::Deftran is empty? NO OPTICAL2::Has no broken job? YES OPTICAL2::JQ Lock is empty? YES OPTICAL1::Deftrandest is empty? YES OPTICAL1::dba_repcatlog is empty? YES OPTICAL1::Deferror is empty? YES OPTICAL1::Deftran is empty? YES OPTICAL1::Has no broken job? YES OPTICAL1::JQ Lock is empty? YES
These queues will be defined as the following:
Deferror - Contains the ID of each transaction that could not be applied. You can use this ID to locate the queued calls associated with this transaction. These calls are stored in the DEFCALL view. In other words, deferror is used for storing the ID of each replication transaction that could not be applied because of data conflict. This queue should always be empty.
Deftran - Is used for recording all deferred transactions in the deferred transactions queue on the current BTS system. It is normal to see that this queue is not empty.
Deftrandest - Lists the destinations for each deferred transaction in the deferred transactions queue on the current BTS system. It is normal to see that this queue is not empty.
Dba_repcatlog - This is an internal debugging queue that should be empty. If it is not, then Cisco needs to be called.
Has no broken job - This is an internal debugging queue that should be empty. If it is not, then Cisco needs to be called.
JQ Lock - This is an internal debugging queue that should be empty. If it is not, then Cisco needs to be called.
Example #1 --- Not OK to truncate deferror queue
On the standby EMS:
su - oracle dbadm -C rep optical2:test-ems-2:/opt/orahome$ dbadm -C rep OPTICAL2::Deftrandest is empty? YES OPTICAL2::dba_repcatlog is empty? YES OPTICAL2::Deferror is empty? NO <==== OPTICAL2::Deftran is empty? NO OPTICAL2::Has no broken job? YES OPTICAL2::JQ Lock is empty? YES OPTICAL1::Deftrandest is empty? YES OPTICAL1::dba_repcatlog is empty? YES OPTICAL1::Deferror is empty? YES OPTICAL1::Deftran is empty? YES OPTICAL1::Has no broken job? YES OPTICAL1::JQ Lock is empty? YES
Notice the deferrors present.
Go to the side that has the errors, in this case optical2/standby ems
Do the following on side with deferrors
su - oracle optical2:test-ems-2:/opt/oracle/admin/rep$ dbadm -r get_deferror Connected.
List of replication error transactions with start_time:
TRAN_ID FROM_DB START_TIME ERROR_MSG --------------- -------- ---------------- -------------------------------------- 6.95.2954 OPTICAL1 05Jan29 14:44:12 ORA-01403: no data found
This gives the transaction id
Do the following and look for that transaction id
optical2:test-ems-2:/opt/oracle/admin/rep$ dbadm -r get_defcall_order Connected.
Replication transaction operations (calls) in the order of start_time:
CALLNO TRAN_ID PACKAGENAME PROCNAME START_TIME ------- ---------- ------------------------- --------------- ----------------- 0 6.95.2954 SECURITYLEVELS$RP REP_UPDATE 005Jan29 14:44:12
Notice that this deferror is not for a event or alarm table. At this point stop and open a case with Cisco.
Example #2 --- OK to truncate deferror queue
On the standby EMS:
su - oracle dbadm -C rep optical2:secems05:/opt/orahome$ dbadm -C rep OPTICAL2::Deftrandest is empty? YES OPTICAL2::dba_repcatlog is empty? YES OPTICAL2::Deferror is empty? NO <==== OPTICAL2::Deftran is empty? NO OPTICAL2::Has no broken job? YES OPTICAL2::JQ Lock is empty? YES OPTICAL1::Deftrandest is empty? YES OPTICAL1::dba_repcatlog is empty? YES OPTICAL1::Deferror is empty? YES OPTICAL1::Deftran is empty? YES OPTICAL1::Has no broken job? YES OPTICAL1::JQ Lock is empty? YES
In this example there are deferrors on side B, this is denoted by OPTICAL2(side B).
Go to side b and issue the following commands
su - oracle dbadm -r get_deferror Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001 optical2:secems05:/opt/orahome$ dbadm -r get_deferror Connected.
List of replication error transactions with start_time:
TRAN_ID FROM_DB START_TIME ERROR_MSG --------------- -------- ---------------- -------------------------------------- 4.29.671 OPTICAL1 04Nov24 15:13:44 ORA-01403: no data found 2.64.671 OPTICAL1 04Nov24 15:13:49 ORA-01403: no data found
This gives the transaction id
Do the following and look for that transaction id
optical2:secems05:/opt/orahome$ dbadm -r get_defcall_order Connected.
Replication transaction operations (calls) in the order of start_time:
CALLNO TRAN_ID PACKAGENAME PROCNAME START_TIME ------- ---------- ------------------------- --------------- ----------------- 0 4.29.671 EVENT_LOG$RP REP_DELETE 004Nov24 15:13:44 0 2.64.671 EVENT_LOG$RP REP_DELETE 004Nov24 15:13:49
Both of these transactions are for events, therefore you can truncate deferror with the following commands. Login to the side that is reporting deferrors, in this case OPTICAL2(side B)
su - oracle dbadm -A truncate_deferror optical2:secems05:/opt/orahome$ dbadm -A truncate_deferror ***************************************************************** You are about to execute the following process: ==> Truncate replication Deferred Error Queue tables. database: optical2 hostname: secems05 ***************************************************************** Do you want to continue? [y/n] y INFO: Truncating replication deferred error queue tables... INFO: Truncating table SYSTEM.DEF$_ERROR INFO: Truncating table SYSTEM.DEF$_AQERROR INFO: Deferred error queue tables are truncated.
Once the deferrors are truncated, issue the following commands to check the queue
optical2:secems05:/opt/orahome$ dbadm -C rep OPTICAL2::Deftrandest is empty? YES OPTICAL2::dba_repcatlog is empty? YES OPTICAL2::Deferror is empty? YES <==== OPTICAL2::Deftran is empty? YES OPTICAL2::Has no broken job? YES OPTICAL2::JQ Lock is empty? YES OPTICAL1::Deftrandest is empty? YES OPTICAL1::dba_repcatlog is empty? YES OPTICAL1::Deferror is empty? YES OPTICAL1::Deftran is empty? YES OPTICAL1::Has no broken job? YES OPTICAL1::JQ Lock is empty? YES
DDTS
To follow the bug ID link below and see detailed bug information, you must be a registered user and you must be logged in.
|
DDTS |
Description |
|---|---|
|
CSCeg42249 (registered customers only) |
EVT Process causes Deferrors in Oracle DB |
For More Information
If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC) by one of the following methods:
Receive Email Notification For New Field Notices
Product Alert Tool - Set up a profile to receive email updates about reliability, safety, network security, and end-of-sale issues for the Cisco products you specify.
