When a restartable service fails, it is restarted on the same supervisor. If the
new instance of
the service determines that the previous instance was abnormally terminated by
the operating system, the service then determines whether a persistent context
exists. The initialization of the new instance attempts to read the persistent
context to build a run-time context that makes the new instance appear like the
previous one. After the initialization is complete, the service resumes the
tasks that it was performing when it stopped. During the restart and
initialization of the new instance, other services are unaware of the service
failure. Any messages that are sent by other services to the failed service are
available from the MTS when the service resumes.
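
The stateful initialization described above can be illustrated with a minimal Python sketch. It is not Cisco NX-OS code: the names `RuntimeContext`, `initialize`, and `resume`, the dictionary used as a persistent context, and the choice to process pending tasks before queued messages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RuntimeContext:
    # State the service needs in order to continue where it left off.
    pending_tasks: list = field(default_factory=list)
    restored: bool = False


def initialize(previous_exit_abnormal: bool,
               persistent_context: Optional[dict]) -> RuntimeContext:
    """Build the run-time context for a new service instance (hypothetical)."""
    if previous_exit_abnormal and persistent_context is not None:
        # Stateful path: rebuild the run-time context from the saved one.
        return RuntimeContext(
            pending_tasks=list(persistent_context.get("pending_tasks", [])),
            restored=True,
        )
    # No usable persistent context: start with an empty run-time context.
    return RuntimeContext()


def resume(ctx: RuntimeContext, queued_messages: list) -> list:
    """Return work items in the order this sketch would process them."""
    # Assumption: pending tasks first, then messages held while the
    # service was down.
    return ctx.pending_tasks + list(queued_messages)


saved = {"pending_tasks": ["update-route-1"]}
ctx = initialize(previous_exit_abnormal=True, persistent_context=saved)
print(resume(ctx, ["msg-from-peer"]))  # prints ['update-route-1', 'msg-from-peer']
```

In this sketch, messages sent during the outage are not lost; they are handed to the new instance once it resumes, which mirrors the behavior attributed to the MTS.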
Whether the new instance survives the stateful initialization depends on the
cause of the failure of the previous instance. If the service does not survive a
few subsequent restart attempts, the restart is considered to have failed. In
this case, the System Manager executes the action specified by the service’s HA
policy, forcing a stateless restart, no restart, or a supervisor switchover or
reset.
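
The escalation logic can be sketched as a bounded retry loop followed by a policy-defined fallback. This is a simplified illustration, not the System Manager's implementation: `HaAction`, `recover`, and the default of three attempts are assumptions (the text says only "a few" attempts).

```python
from enum import Enum
from typing import Callable


class HaAction(Enum):
    # Fallback actions named in the text; the enum itself is hypothetical.
    STATELESS_RESTART = "stateless restart"
    NO_RESTART = "no restart"
    SUPERVISOR_SWITCHOVER = "supervisor switchover"
    SUPERVISOR_RESET = "supervisor reset"


def recover(try_stateful_restart: Callable[[], bool],
            policy_action: HaAction,
            max_attempts: int = 3) -> str:
    """Attempt stateful restarts; fall back to the policy action if all fail.

    max_attempts=3 is an assumed value, not a documented NX-OS setting.
    """
    for _ in range(max_attempts):
        if try_stateful_restart():
            return "stateful restart"
    # Every attempt failed, so the restart is treated as failed and the
    # action from the service's HA policy is applied instead.
    return policy_action.value


# A service whose stateful restart never succeeds escalates to the policy action.
print(recover(lambda: False, HaAction.SUPERVISOR_SWITCHOVER))  # prints supervisor switchover
```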
During a successful stateful restart, there is no delay while the system reaches
a consistent state. Stateful restarts reduce the system recovery time after a
failure.
The events before, during, and after a stateful restart are as follows:
- The running services make a checkpoint of their run-time state information to
  the PSS.
- The System Manager monitors the health of the running services that use
  heartbeats.
- The System Manager restarts a service instantly when it crashes or hangs.
- After restarting, the service recovers its state information from the PSS and
  resumes all pending transactions.
- If the service does not resume a stable operation after multiple restarts, the
  System Manager initiates a reset or switchover of the supervisor.
- Cisco NX-OS will collect the process stack and core for debugging purposes
  with an option to transfer core files to a remote location.
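
The checkpoint, heartbeat, and recovery steps in the list above can be sketched as a toy model. Everything here is hypothetical: `Pss` is an in-memory stand-in for the persistent storage service, `Service` and `supervise` are illustrative names, and the 5-second heartbeat timeout is an assumed value rather than an NX-OS default.

```python
class Pss:
    """In-memory stand-in for the persistent storage service (hypothetical)."""

    def __init__(self):
        self._store = {}

    def checkpoint(self, service, state):
        # Store a copy so later changes to the live state do not leak in.
        self._store[service] = dict(state)

    def restore(self, service):
        return dict(self._store.get(service, {}))


class Service:
    def __init__(self, name, pss):
        self.name = name
        self.pss = pss
        # Recover whatever a previous instance checkpointed, if anything.
        self.state = pss.restore(name)
        self.last_heartbeat = 0.0

    def handle(self, key, value, now):
        # Update run-time state, checkpoint it, and record a heartbeat.
        self.state[key] = value
        self.pss.checkpoint(self.name, self.state)
        self.last_heartbeat = now


def supervise(service, pss, now, timeout=5.0):
    """Return the instance that should be running after a health check."""
    if now - service.last_heartbeat <= timeout:
        return service  # Heartbeat is recent enough: leave the service alone.
    # Heartbeat missed: start a new instance that restores state from the PSS.
    replacement = Service(service.name, pss)
    replacement.last_heartbeat = now
    return replacement


pss = Pss()
svc = Service("routing", pss)
svc.handle("route-count", 42, now=1.0)
svc = supervise(svc, pss, now=10.0)  # heartbeat is 9 s old, so restart
print(svc.state)  # prints {'route-count': 42}
```

Because the state was checkpointed before the missed heartbeat, the replacement instance starts with the same data as the one it replaces.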
When a stateful restart occurs, Cisco NX-OS sends a syslog message of level
LOG_ERR. If SNMP traps are enabled, the SNMP agent sends a trap.