
Cisco MGX 8230 Software

CSCea28215 Causes Processor Switch Module (PXM1) to Reset After 466 or More Days


July 11, 2003


Products Affected

  • PXM1

  • PXM1-4-155

  • PXM1-2-T3E3

  • PXM1-1-622

Problem Description

The processor switch module (PXM) becomes unresponsive after 466 or more days of uptime. Users are unable to use the change card (cc) command to get to the card, and statistics collection slows down or stops. The standby PXM may not receive updates. In some instances, added ports may not register on the PXM.

Note: For the purposes of this field notice, the term PXM refers to all the modules listed in the Products Affected section.

Background

This anomaly is the result of mishandling a system clock rollover. Depending on which tasks have activity scheduled beyond the point at which the tick counter passes 0xF0000000, different symptoms will manifest.

The operating system simultaneously maintains two independent data structures that are used to keep an accurate representation of the time. One of these is a 32-bit counter that reflects the number of ticks (one tick equals 10 msec) since the card was reset. This counter rolls over every 2^32/100 seconds, which is equal to 497.1 days. By itself, the clock rollover is handled correctly.
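
For reference, the arithmetic behind these figures is as follows:

2^32 ticks / 100 ticks per second = 42,949,672.96 seconds, or approximately 497.1 days (full counter rollover)
0xF0000000 ticks / 100 ticks per second = 40,265,318.4 seconds, or approximately 466.0 days (the threshold referenced in this field notice)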

As part of its real-time preemptive multitasking environment, the operating system allows tasks to schedule delays. This feature is used extensively by MGX application software. The issue described above arises when a delay is scheduled to expire after the 32-bit counter rolls over. Due to a bug in the operating system, when this situation occurs the clock will jump ahead by the amount of the scheduled delay. Depending on the value of the delay, this jump can be very large; values as large as 31 days are used in application software.

A discontinuous jump in the time can have far-reaching and severe consequences. Since this jump is local to the card, inconsistencies arise between the card and external entities, such as statistics collectors. More dangerous is the fact that tasks with regularly scheduled activities will have their timers expire at unpredictable times, often immediately. As a result, activities which should occur at regularly spaced intervals will occur in rapid-fire sequence. This failure of the timer infrastructure leads to excessive CPU utilization and message flooding, resulting in severe degradation of the node.
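
The following minimal C sketch is not the actual operating system code; it is a simplified illustration, using hypothetical values, of why a delay that spans the 32-bit tick rollover is hazardous: the expiry tick computed for the delay wraps to a small number, so any bookkeeping that ignores the wraparound misbehaves.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical values: a tick count just past the 0xF0000000 threshold,
       and a 31-day delay expressed in 10 msec ticks (31 days * 86400 s * 100). */
    uint32_t now   = 0xF8000000u;
    uint32_t delay = 31u * 86400u * 100u;   /* 267,840,000 ticks */

    /* The expiry tick computed for the delay wraps around the 32-bit counter. */
    uint32_t expiry = now + delay;

    printf("now    = 0x%08X\n", (unsigned)now);
    printf("expiry = 0x%08X (wrapped)\n", (unsigned)expiry);

    /* Bookkeeping that ignores the wraparound treats the timer as already
       expired, because the wrapped expiry is numerically smaller than the
       current tick count. */
    if (now >= expiry)
        printf("wrap-unaware check: timer appears to have expired already\n");

    return 0;
}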

Problem Symptoms

The symptoms vary. Users may not be able to cc to the card. In some instances, statistics tasks may not run. Database anomalies may occur. The system date may change, often jumping to a date one month ahead.

Logical core dumps for Statistics Collection Manager (SCM) sequence mismatches are created for almost all slots.

Core dumps can be observed by executing the command core. See the following example:

Slot   Reset Reason                            Dump Time
----------------------------------------------------------------------
0      Logical Error - SCM Sequence Mismatch   SAT FEB 22 15:53:56 2003
1      Logical Error - SCM Sequence Mismatch   SAT FEB 22 15:57:31 2003
2      Logical Error - SCM Sequence Mismatch   SAT FEB 22 15:59:23 2003
3      Logical Error - SCM Sequence Mismatch   SAT FEB 22 15:43:57 2003
4      Logical Error - SCM Sequence Mismatch   SAT FEB 22 15:51:37 2003
5      Logical Error - SCM Sequence Mismatch   SAT FEB 22 15:52:40 2003

Workaround/Solution

To evaluate whether your PXM1 is subject to this anomaly, determine how close the card is to the 466-day threshold by taking the following steps.

  1. Log into the shell by using the shellcon command.

  2. Execute the sysClkTickGet command.

  3. The response will appear in both decimal and hexadecimal notation.

mgx11.1.7.PXM.a > shellcon

-> sysClkTickGet
sysClkTickGet
value = 132375064 = 0x7e3e218

Interpreting the hexadecimal value:

  • If the hexadecimal value is greater than or equal to 0xF0000000, then the card is immediately vulnerable to the problem and a switchover should be scheduled as soon as possible. Note that even if the counter has exceeded 0xF0000000, the MGX's connection database may not yet be affected. See the section titled Secondary Checks for further information.

  • If the value is less than 0xF0000000, the number of days of uptime can be computed by entering the decimal value followed by /100/60/60/24 at the shellconn prompt (a worked example appears below).

The result of this computation is the number of days the card has been up. Before the card has been up for 466 days, a redundant switchover should be induced. Note that when the standby card becomes active, its clock does not reset but continues incrementing, so two consecutive switchovers are necessary to reset the clocks on both PXMs.
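
As a worked example, using the sample output shown above: 132375064 / 100 / 60 / 60 / 24 is approximately 15.3 days of uptime, which is well below the 466-day threshold.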

Note that if the problem has already manifested itself, submitting a return merchandise authorization (RMA) for a replacement PXM card set will not help since the MGX will need to be recovered manually. Waiting for an RMA to arrive will simply delay the recovery process. Instead of opening an RMA, use the Recovery Procedure detailed below.

Follow this procedure to reset both clocks. This activity should be done before the 466-day threshold is reached, and only if the problem has not manifested itself.

  1. Reset the standby PXM1.

  2. Wait for the standby PXM1 to return to the standby state. The clock on the standby PXM1 is reset back to its initial value after the reset.

  3. Switch the active and standby PXM1s by using the switchcc command.

  4. Wait for the formerly active PXM1 to come to the standby state.

The formerly active PXM1 re-initializes its counter after the switchover, thus delaying the onset of the 466-day wraparound.

Note that any switchover on the PXM1 results in a re-initialization of the formerly active PXM1's clock.

Solution:

This anomaly will require changes in the operating system kernel as well as in PXM1 software. Monitor CSCea28215 (registered customers only) for resolution of this issue.

Preventative Procedures:

It is recommended that the customer execute switchcc on each PXM1 once a year to reset the system clock. This activity should be repeated annually until software that resolves this anomaly is loaded.

Secondary Checks:

Prior to executing the Recovery Procedure below, it is recommended that the Cisco Technical Assistance Center (TAC) be contacted to access the node and run checks to determine whether the MGX connection database is corrupted. Having the TAC perform this additional check may mean the difference between experiencing a payload traffic outage and achieving a graceful recovery.

Recovery Procedure:

If the 466-day clock problem occurs, the MGX switch must be rebuilt. RMAs will not resolve or avoid this traffic-intrusive measure; waiting for replacement PXM1s to arrive only delays implementation of the recovery procedure, so RMAs should be avoided. The following recovery procedure should be executed through the MGX console port.

  1. Save the switch configuration. This step may be skipped if the configuration has been saved recently.

  2. Clear the MGX configuration.

    Warning: From this point until the end of this procedure, the switch will not be passing traffic.

  3. Reset both PXM1s. In single-PXM1 nodes, reset the single PXM1.

  4. Wait until the PXM1s come back to active and standby, if redundant.

  5. Restore the saved configuration to the switch. From this point forward the MGX should be passing traffic normally.

  6. Verify access to, and responsiveness of, the command line interface. Verify that traffic is passing normally.

  7. Monitor bug ID CSCea28215 (registered customers only) for resolution.

  8. Until software is released that resolves the counter wraparound problem, schedule switchcc at regular intervals so that the 466-day threshold is never crossed.

DDTS

To follow the bug ID link below and see detailed bug information, you must be a registered user and you must be logged in.

DDTS                                     Description

CSCea28215 (registered customers only)   VxWorks timer failing after 466 days

CSCeb32489 (registered customers only)   VxWorks timer fails after 466/496 days

Additional Information

Note that this 466-day counter issue also applies to other MGX8800 modules, including:

  • PXM1E (All models)

  • PXM45 (All models)

  • AXSM (All models)

  • AXSM/E (All models)

  • AXSM-XG (All models)

  • FRSM12

The workaround for these cards is the same as the PXM1 workaround:

In order to avoid this issue, a redundant switchover or, failing that, a reset must occur before the card has been up continuously for 466 days. The card's uptime can be determined from the VxWorks shell by using the following procedure:

  • From shellconn, execute the command sysClkTickGet()

  • The value will be displayed in both decimal and hexadecimal. If the hexadecimal value is greater than or equal to 0xF0000000, the card is immediately vulnerable to the problem and a switchover should be scheduled as soon as possible.

  • If the value is less than 0xF0000000, the number of days of uptime can be computed by entering the decimal value followed by /100/60/60/24 at the shellconn prompt.

Before the card has been up for 466 days, a redundant switchover should be induced. Note that when the standby card becomes active, its clock continues incrementing, so two consecutive switchovers may be necessary to reset both clocks.
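
For convenience, the conversion and threshold test described above can be expressed in a few lines of C. This is an illustrative sketch only; the constant names are hypothetical and the code is not part of the switch software.

#include <stdio.h>
#include <stdint.h>

#define TICKS_PER_SECOND  100u          /* one tick = 10 msec */
#define SECONDS_PER_DAY   86400u
#define DANGER_THRESHOLD  0xF0000000u   /* roughly 466 days' worth of ticks */

int main(void)
{
    /* Replace this with the decimal value reported by sysClkTickGet. */
    uint32_t ticks = 132375064u;

    double days = (double)ticks / TICKS_PER_SECOND / SECONDS_PER_DAY;
    printf("uptime: approximately %.1f days\n", days);

    if (ticks >= DANGER_THRESHOLD) {
        printf("counter has passed 0xF0000000; schedule a switchover immediately\n");
    } else {
        double remaining = ((double)DANGER_THRESHOLD - (double)ticks)
                           / TICKS_PER_SECOND / SECONDS_PER_DAY;
        printf("approximately %.1f days remain before the 466-day threshold\n",
               remaining);
    }
    return 0;
}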

To summarize, here is the decision tree for implementing the workaround:

  • For each active controller or affected switch module (SM):

      • Calculate the uptime for the active controller card.

      • If it is approaching or exceeding 466 days, switch to the redundant card.

      • Calculate the uptime for the newly active controller card.

      • If it is approaching or exceeding 466 days, execute switchcc again.

This hierarchical approach is acceptable because a standby card cannot be up longer than its associated active card.

For More Information

If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC).

Receive Email Notification For New Field Notices

Product Alert Tool - Set up a profile to receive email updates about reliability, safety, network security, and end-of-sale issues for the Cisco products you specify.