Table Of Contents
Process Monitoring and Troubleshooting
System Manager
Watchdog System Monitor
Deadlock detections
Hang detection
Core Dumps
follow Command
show processes Commands
show processes boot Command
show processes startup Command
show processes failover Command
show processes blocked Command
Redundancy and Process Restartability
Process States
Synchronous Message Passing
Blocked Processes and Process States
Process Monitoring
Process Monitoring Commands
CPU Usage
Troubleshooting a Process Reload
Troubleshooting a Process Block
Process Monitoring and Troubleshooting
This chapter includes the following sections:
•
System Manager
•
Watchdog System Monitor
•
Core Dumps
•
follow Command
•
show processes Commands
•
Redundancy and Process Restartability
•
Process States
•
Process Monitoring
•
CPU Usage
•
Troubleshooting a Process Reload
•
Troubleshooting a Process Block
The Cisco IOS XR software is built on a modular system of processes. A process is a group of threads that share virtual address (memory) space. Each process provides specific functionality for the system and runs in a protected memory space to ensure that problems with one process cannot impact the entire system. Multiple instances of a process can run on a single node, and multiple threads of execution can run on each process instance.
Threads are units of execution, each with an execution context that includes a stack and registers. A thread is in effect a "sub-process" managed by the parent, responsible for executing a subportion of the overall process. For example, Open Shortest Path First (OSPF) has a thread that handles "hello" receipt and transmission. A thread may only run when the parent process is allocated runtime by the system scheduler. A process with threads is a multi-threaded process.
Under normal operating conditions, processes are managed automatically by the Cisco IOS XR software. Processes are started, stopped, or restarted as required by the running configuration of the router. In addition, processes are checkpointed to optimize performance during process restart and automatic switchover. For more information on processes, see Cisco IOS XR System Management Configuration Guide.
System Manager
Each process is assigned a job ID (JID) when started. The JID does not change when a process is started, stopped, then restarted. Each process is also assigned a process ID (PID) when started, but this PID changes each time the process is stopped and restarted.
The System Manager (sysmgr) is the fundamental process and the foundation of the system. The sysmgr is responsible for monitoring, starting, stopping, and restarting almost all processes on the system. Whether a process is restarted is predefined (respawn flag on or off) and honored by sysmgr. The sysmgr is the parent of all processes started at boot-up and by configuration. Two sysmgr instances run on each node, providing hot-standby, process-level redundancy. Each active process is registered with the SysDB, and after the active sysmgr instance starts a process, the sysmgr is notified when that process is running. If the active sysmgr instance fails, the standby instance takes over the active role and a new standby instance is created.
The sysmgr running on the line card (LC) handles all the system management duties like process creation, re-spawning, and core-dumping relevant to that node.
The sysmgr itself is started on bootup by the initialization process. Once the sysmgr is started, initialization hands over the ownership of all processes started by initialization to sysmgr and exits.
Watchdog System Monitor
The Watchdog System Monitor (wdsysmon) keeps historical data on processes and posts this information to a fault detector dynamic link library (DLL), which can then be queried by manageability applications. For more information on wdsysmon, see the "Watchdog System Monitor" section in Chapter 8 "Troubleshooting Memory."
Once a minute, wdsysmon polls the kernel for process data. This data is stored in a database maintained by the fm_fd_wdsysmon.dll fault detector, which is loaded by wdsysmon.
Deadlock detections
Wdsysmon can attempt to find deadlocks because thread state is returned with the process data. Wdsysmon specifically looks for mutex deadlocks and local Inter-Process Communication (IPC) hangs. Only local IPC deadlocks can be detected. If deadlocks are detected, debugging information is collected in disk0:/wdsysmon_debug.
Deadlocked processes can be stopped and restarted manually using the process restart command.
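To illustrate what a mutex deadlock looks like in thread-state data, the following minimal sketch searches a "waits-for" mapping for a cycle. It is not the algorithm used by wdsysmon, which this chapter does not describe; the function name, the thread labels, and the assumption that each blocked thread waits on exactly one mutex owner are all hypothetical.

```python
from typing import Dict, List, Optional

def find_deadlock_cycle(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of threads waiting on each other, or None.

    waits_for maps a blocked thread to the thread that owns the mutex
    it is waiting for (assumption: one awaited owner per blocked thread).
    """
    for start in waits_for:
        path: List[str] = []
        seen = set()
        node = start
        # Follow the chain of owners until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            # The walk returned to a thread already on the path: a cycle.
            return path[path.index(node):]
    return None

# Hypothetical thread labels, for illustration only.
snapshot = {"bgp:1": "bgp:2", "bgp:2": "bgp:1", "ospf:3": "bgp:1"}
cycle = find_deadlock_cycle(snapshot)
print(cycle)
```

In this sketch, ospf:3 is blocked but is not part of the cycle; only the two bgp threads hold each other up.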
Hang detection
When an event manager is created in the system, the event manager library registers it with wdsysmon. Wdsysmon expects to periodically hear a "pulse" from every registered event manager in the system. When the pulse from an event manager is missing, wdsysmon runs a debug script that shows exactly what the thread that created the event manager is doing.
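The pulse mechanism can be pictured as a simple heartbeat table. The sketch below is only an illustration of that idea under stated assumptions: the class name, the event manager names, and the 30-second timeout are hypothetical, and the real interval used by wdsysmon is not given in this chapter.

```python
import time
from typing import Dict, List, Optional

class PulseMonitor:
    """Track the most recent pulse time of each registered event manager."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s  # hypothetical value, chosen by the caller
        self.last_pulse: Dict[str, float] = {}

    def pulse(self, name: str, now: Optional[float] = None) -> None:
        """Register an event manager, or refresh its last-seen time."""
        self.last_pulse[name] = time.monotonic() if now is None else now

    def missing(self, now: Optional[float] = None) -> List[str]:
        """Return the event managers whose pulse is overdue."""
        now = time.monotonic() if now is None else now
        return sorted(name for name, seen in self.last_pulse.items()
                      if now - seen > self.timeout_s)

monitor = PulseMonitor(timeout_s=30.0)
monitor.pulse("bgp_evm", now=0.0)    # hypothetical event manager names
monitor.pulse("ospf_evm", now=0.0)
monitor.pulse("bgp_evm", now=25.0)   # bgp_evm keeps pulsing, ospf_evm does not
late = monitor.missing(now=40.0)
print(late)
```

Passing explicit timestamps keeps the example deterministic; a live monitor would rely on the monotonic clock default.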
Core Dumps
When a process is abnormally terminated, a core dump file is written to a designated destination. A core dump contains the following information:
•
register information
•
thread status information
•
process status information
•
selected memory segments.
Use the show exception command to display the configured core dump settings. The output from the show exception command displays the core dump settings configured with the following commands:
•
exception filepath
•
exception dump-tftp-route
•
exception kernel memory
•
exception pakmem
•
exception sparse
•
exception sprsize
The following example shows the core dump settings.
RP/0/RP0/CPU0:router# show exception
Choice 1 path = harddisk:/coredump compress = on filename = <process_name.time>
Choice 2 path = tftp://223.255.254.254/users/xyz compress = on filename =
Exception path for choice 3 is not configured or removed
Choice fallback one path = harddisk:/dumper compress = on filename = <process_name>
Choice fallback two path = disk1:/dumper compress = on filename = <process_name>
Choice fallback three path = disk0:/dumper compress = on filename = <process_name>
Kernel dump not configured
Tftp route for kernel core dump not configured
Dumper packet memory in core dump enabled
Dumper will switch to sparse core dump automatically at size 300MB
Coredumps can be generated manually using the dumpcore command. There are two types of core dumps that can be manually run:
•
running—does not impact services
•
suspended—suspends a process while generating the core dump
The show context command shows the core dump information for the last 10 core dumps.
follow Command
The follow command is used to unobtrusively debug a live process or live thread in a process. The follow command is particularly useful for:
•
process deadlock, livelock, or mutex conditions
•
high CPU use conditions
•
examining the contents of a memory location or a variable in a process to determine the cause of a corruption issue
•
investigating issues where a process or thread is stuck in a loop.
A livelock condition is where two or more processes continually change their state in response to changes in the other processes.
The following actions can be specified with the follow command:
•
Follow all live threads of a given process or a given thread of a process and print stack trace in a format similar to core dump output
•
Follow a process in a loop for a given number of iterations
•
Set a delay between two iterations while invoking the command
•
Set the priority at which this process should run while this command is being executed
•
Dump memory from a given virtual memory location for a given size
•
Display register values and status information of the target process
•
Take a snapshot of the execution path of a thread asynchronously to investigate performance-related issues - this can be done by specifying a high number of iterations with a zero delay
The following example shows the live threads of process 929034375.
RP/0/RP0/CPU0:router# follow process 929034375
Attaching to process pid = 929034375 (pkg/bin/bgp)
No tid specified, following all threads
DLL Loaded by this process
-------------------------------
DLL path Text addr. Text size Data addr. Data size Version
/pkg/lib/libsysmgr.dll 0xfc122000 0x0000df0c 0xfc0c2b14 0x000004ac 0
/pkg/lib/libcerrno.dll 0xfc130000 0x00002f04 0xfc133000 0x00000128 0
/pkg/lib/libcerr_dll_tbl.dll 0xfc134000 0x00004914 0xfc133128 0x00000148 0
/pkg/lib/libltrace.dll 0xfc139000 0x00007adc 0xfc133270 0x00000148 0
/pkg/lib/libinfra.dll 0xfc141000 0x00033c90 0xfc1333b8 0x00000bbc 0
/pkg/lib/cerrno/libinfra_error.dll 0xfc1121dc 0x00000cd8 0xfc175000 0x000000a8 0
/pkg/lib/libios.dll 0xfc176000 0x0002dab0 0xfc1a4000 0x00002000 0
/pkg/lib/cerrno/libevent_manager_error.dll 0xfc1a6000 0x00000e88 0xfc133f74 0x00
/pkg/lib/libc.dll 0xfc1a7000 0x00079d70 0xfc221000 0x00002000 0
/pkg/lib/libsyslog.dll 0xfc223000 0x000054e0 0xfc1750a8 0x00000328 0
/pkg/lib/libplatform.dll 0xfc229000 0x0000c25c 0xfc236000 0x00002000 0
/pkg/lib/libbackplane.dll 0xfc243000 0x000013a8 0xfc1755b8 0x000000a8 0
/pkg/lib/cerrno/libpkgfs_error.dll 0xfc245000 0x00000efc 0xfc175660 0x00000088 0
/pkg/lib/libnodeid.dll 0xfc246000 0x0000a588 0xfc1756e8 0x00000248 0
/pkg/lib/libdebug.dll 0xfc29b000 0x0000fdbc 0xfc2ab000 0x00000570 0
/pkg/lib/cerrno/libdebug_error.dll 0xfc294244 0x00000db0 0xfc175c68 0x000000e8 0
/pkg/lib/lib_procfs_util.dll 0xfc2ac000 0x00004f20 0xfc175d50 0x000002a8 0
/pkg/lib/libinst_debug.dll 0xfc375000 0x0000357c 0xfc36d608 0x000006fc 0
/pkg/lib/libpackage.dll 0xfc3c8000 0x00041ad0 0xfc40a000 0x00000db4 0
/pkg/lib/libwd_evm.dll 0xfc40b000 0x00003dc4 0xfc36dd04 0x00000168 0
------------------------------
Current process = "pkg/bin/bgp", PID = 929034375 TID = 1
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #1 0xfc14ecb8 [msg_receivev]
trace_back: #2 0xfc14eac4 [msg_receive]
trace_back: #3 0xfc151f98 [event_dispatch]
trace_back: #4 0xfc152154 [event_block]
trace_back: #5 0xfd8e16a0 [bgp_event_loop]
trace_back: #6 0x48230db8 [<N/A>]
trace_back: #7 0x48201080 [<N/A>]
Current process = "pkg/bin/bgp", PID = 929034375 TID = 2
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #1 0xfc14ecb8 [msg_receivev]
trace_back: #2 0xfc14eac4 [msg_receive]
trace_back: #3 0xfc151f98 [event_dispatch]
trace_back: #4 0xfc152154 [event_block]
trace_back: #5 0xfc50efd8 [chk_evm_thread]
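When the follow output is captured to a file, grouping the stack frames by thread makes repeated iterations easier to compare. The following is a minimal sketch that assumes the line layout shown in the example above; the function name is hypothetical and the patterns may need adjusting for other releases.

```python
import re
from collections import defaultdict
from typing import Dict, List

HEADER = re.compile(r'Current process = "[^"]+", PID = \d+ TID = (?P<tid>\d+)')
FRAME = re.compile(r'trace_back: #\d+ 0x[0-9a-f]+ \[(?P<sym>[^\]]+)\]')

def stacks_by_thread(text: str) -> Dict[int, List[str]]:
    """Return {TID: [symbol of frame #0, symbol of frame #1, ...]}."""
    stacks: Dict[int, List[str]] = defaultdict(list)
    tid = None
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            tid = int(header.group("tid"))
            continue
        frame = FRAME.search(line)
        if frame and tid is not None:
            stacks[tid].append(frame.group("sym"))
    return dict(stacks)

# A shortened excerpt of the output shown above.
sample = '''Current process = "pkg/bin/bgp", PID = 929034375 TID = 1
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #5 0xfd8e16a0 [bgp_event_loop]
trace_back: #6 0x48230db8 [<N/A>]
Current process = "pkg/bin/bgp", PID = 929034375 TID = 2
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #5 0xfc50efd8 [chk_evm_thread]
'''
stacks = stacks_by_thread(sample)
print(stacks)
```

A thread whose top frame is the same across many iterations (here, both threads sit in MsgReceivev) is waiting rather than spinning.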
show processes Commands
The following show processes commands are used to display process information:
•
show processes boot Command
•
show processes startup Command
•
show processes failover Command
•
show processes blocked Command
show processes boot Command
The show processes boot command displays process boot information. Use the command output to check the following:
•
How long the processes took to start
•
The order in which the processes started
•
Whether any process was delayed, which indicates a boot failure or other boot problem
•
Whether the processes started within the time constraints set by the system
RP/0/RP0/CPU0:router# show processes boot location 0/rp1/cpu0
Band Name Finished %Idle JID Ready Last Process
----- -------------- -------- -------- -------- ------- ---------------------
1.0 MBI 22.830 65.130% 62 22.830 insthelper
40.0 ARB 129.225 92.080% 154 106.395 dsc
90.0 ADMIN 185.140 5.950% 175 55.915 fabricq_mgr
100.0 INFRA 207.372 25.040% 165 22.232 fib_mgr
150.0 STANDBY 231.605 13.840% 104 24.233 arp
999.0 FINAL 237.942 1.590% 234 6.337 ipv6_rump
Started Level JID Inst Ready Process
--------- ------ -------- ---- ------- -------------------------------
0.000s 0.05 80 1 0.000 wd-mbi
0.000s 1.00 57 1 0.000 dllmgr
0.000s 2.00 71 1 0.000 pkgfs
0.000s 3.00 56 1 0.000 devc-conaux
0.000s 3.00 73 1 0.000 devc-pty
0.000s 6.00 70 1 0.000 pipe
Last process started: 6d19h after boot. Total: 174
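In the sample output above, the Ready value of each band equals the difference between its Finished time and that of the previous band, so the band with the largest Ready value is where most boot time was spent. The following sketch picks that band out of captured output; it assumes the seven-column band layout shown above, and the function name is hypothetical.

```python
from typing import Dict, List

def parse_bands(text: str) -> List[Dict[str, object]]:
    """Parse the band summary rows of 'show processes boot' output."""
    bands = []
    for line in text.splitlines():
        parts = line.split()
        # A band row has 7 fields and starts with a numeric band level.
        if len(parts) == 7 and parts[0].replace(".", "", 1).isdigit():
            bands.append({
                "name": parts[1],
                "finished": float(parts[2]),
                "ready": float(parts[5]),
                "last_process": parts[6],
            })
    return bands

sample = """Band Name Finished %Idle JID Ready Last Process
1.0 MBI 22.830 65.130% 62 22.830 insthelper
40.0 ARB 129.225 92.080% 154 106.395 dsc
90.0 ADMIN 185.140 5.950% 175 55.915 fabricq_mgr
100.0 INFRA 207.372 25.040% 165 22.232 fib_mgr
150.0 STANDBY 231.605 13.840% 104 24.233 arp
999.0 FINAL 237.942 1.590% 234 6.337 ipv6_rump
"""
bands = parse_bands(sample)
slowest = max(bands, key=lambda band: band["ready"])
print(slowest["name"], slowest["ready"], slowest["last_process"])
```

Whether a given band duration is acceptable depends on the platform; the sketch only ranks the bands and does not apply any threshold.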
show processes startup Command
The show processes startup command displays process data for processes created at startup. Use the command output to check the following:
•
Whether the listed processes, including their state, start time, restart status, placement, and mandatory status, are as expected
•
How long the processes took to start
•
The order in which the processes started
•
Whether any process was delayed, which indicates a boot failure or other boot problem
•
Whether the processes started within the time constraints set by the system
RP/0/RP0/CPU0:router# show processes startup
JID LAST STARTED STATE RE- PLACE- MANDA- NAME(IID) ARGS
-------------------------------------------------------------------------------
81 07/05/2006 14:46:37.514 Run 1 M wd-mbi(1)
57 07/05/2006 14:46:37.514 Run 1 M dllmgr(1) -r 60
72 07/05/2006 14:46:37.514 Run 1 M pkgfs(1)
56 07/05/2006 14:46:37.514 Run 1 M devc-conaux(1) -
h -d librs232.dll -m libconaux.dll -u libst16550.dll
74 07/05/2006 14:46:37.514 Run 1 M devc-pty(1) -n 3
55 Not configured None 0 M clock_chip(1) -r
71 07/05/2006 14:46:37.514 Run 1 M pipe(1)
65 07/05/2006 14:46:37.514 Run 1 M mqueue(1)
64 Not configured None 0 M cat(1) /etc/motd
73 Not configured None 0 M platform_dllmap(
77 07/05/2006 14:46:37.514 Run 1 M shmwin_svr(1)
60 07/05/2006 14:46:37.514 Run 1 M devf-scrp(1) -e
0xf0000038 -m /bootflash: -s 0xfc000000,64m -r -t4 -b10
66 Not configured None 0 M nname(1)
69 07/05/2006 14:46:37.514 Run 1 M pci_bus_mgr(1) -
288 07/05/2006 14:47:02.799 Run 1 M qsm(1)
68 07/05/2006 14:46:37.514 Run 1 M obflmgr(1)
70 07/05/2006 14:46:37.514 Run 1 M pcmciad(1) -m /d
ev/pcmcia -d libpcmcia_gt64260disc_hba -p "cardmgrd -f"
-------------------------------------------------------------------------------
show processes failover Command
The show processes failover command displays process failover information. The command output displays information on how long it took processes to start after a failover (node reboot). Check if there were any delays.
show processes blocked Command
The show processes blocked command displays details about reply, send, and mutex blocked processes.
Because any process can be temporarily blocked, run the show processes blocked command twice in succession at each polling interval and on each node. If a process appears as blocked in both iterations, run the command a third time to confirm that the process remains blocked.
The polling interval must be long enough to show a sustained blocked state. For example, a fully equipped Cisco CRS-1 8-Slot Line Card Chassis has 10 nodes (2 RPs and 8 LCs), so it requires a minimum of 20 requests for each interval.
The show processes blocked command output always displays processes in the Reply state as blocked.
RP/0/RP0/CPU0:router# show processes blocked
Jid Pid Tid Name State Blocked-on
65546 8202 1 ksh Reply 8200 devc-conaux
51 36890 2 attachd Reply 32791 eth_server
51 36890 3 attachd Reply 12301 mqueue
75 36893 5 qnet Reply 32791 eth_server
75 36893 6 qnet Reply 32791 eth_server
75 36893 7 qnet Reply 32791 eth_server
75 36893 8 qnet Reply 32791 eth_server
50 36899 2 attach_server Reply 12301 mqueue
334 172108 1 tftp_server Reply 12301 mqueue
247 290991 2 lpts_fm Reply 184404 lpts_pa
65750 644260054 1 exec Reply 1 kernel
65752 662208728 1 more Reply 12299 pipe
367 2642149 5 mpls_ldp Reply 2642153 lspv_server
65770 662208746 1 show_processes Reply 1 kernel
369 659755249 3 te_control Reply 2642153 lspv_server
This output is normal for these processes. For example, the line:
65770 662208746 1 show_processes Reply 1 kernel
is a direct result of executing the show processes blocked command. The process ID (PID) changes each time the command is run.
If a vital system process or fundamental application controlling connectivity (for example, routing protocols or Multiprotocol Label Switching Label Distribution Protocol [MPLS LDP]) appears blocked in the Reply, Send, Mutex, or Condvar state, do the following:
•
Collect data from the follow job or follow process command. See the "follow Command" section for more information on these commands.
•
Use the dumpcore running job-id location node-id command on the affected process. The output of the dumpcore is located in harddisk:/dumper, unless the location has been configured using the exception choice command.
Note
Some processes are dangerous to restart. It is recommended that you involve your technical representative and follow the advice from Cisco Technical Support. For contact information for Cisco Technical Support, see the "Obtaining Technical Assistance" section in the Preface.
RP/0/RP0/CPU0:router# show processes blocked
Jid Pid Tid Name State Blocked-on
65546 8202 1 ksh Reply 8200 devc-conaux
51 36890 2 attachd Reply 32791 eth_server
51 36890 3 attachd Reply 12301 mqueue
75 36893 5 qnet Reply 32791 eth_server
75 36893 6 qnet Reply 32791 eth_server
75 36893 7 qnet Reply 32791 eth_server
75 36893 8 qnet Reply 32791 eth_server
50 36899 2 attach_server Reply 12301 mqueue
334 172108 1 tftp_server Reply 12301 mqueue
247 290991 2 lpts_fm Reply 184404 lpts_pa
65750 644260054 1 exec Reply 1 kernel
65752 655270104 1 config Reply 286888 devc-vty
367 2642149 5 mpls_ldp Reply 2642153 lspv_server
65772 655229164 1 exec Reply 1 kernel
65773 656842989 1 more Reply 12299 pipe
65774 656842990 1 show_processes Reply 1 kernel
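Comparing two consecutive captures by hand is error-prone, so the comparison can be scripted. The sketch below keeps only the threads that appear with the same Jid, Pid, and Tid in both captures. It assumes the seven-column row layout shown above; rows with a different Blocked-on layout (for example, Mutex rows) are skipped, and the function names are hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[int, int, int]  # (Jid, Pid, Tid)

def parse_blocked(text: str) -> Dict[Key, Tuple[str, str, str]]:
    """Map (Jid, Pid, Tid) to (name, state, blocked-on name) for each data row."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have 7 fields; the header row has 6 and is skipped.
        if len(parts) == 7 and parts[0].isdigit():
            jid, pid, tid, name, state, _on_pid, on_name = parts
            rows[(int(jid), int(pid), int(tid))] = (name, state, on_name)
    return rows

def sustained(first: str, second: str) -> Dict[Key, Tuple[str, str, str]]:
    """Return the threads that are blocked in both captures."""
    a, b = parse_blocked(first), parse_blocked(second)
    return {key: a[key] for key in a.keys() & b.keys()}

# Shortened excerpts of the two captures shown above.
first = """Jid Pid Tid Name State Blocked-on
51 36890 2 attachd Reply 32791 eth_server
367 2642149 5 mpls_ldp Reply 2642153 lspv_server
65770 662208746 1 show_processes Reply 1 kernel
"""
second = """Jid Pid Tid Name State Blocked-on
51 36890 2 attachd Reply 32791 eth_server
367 2642149 5 mpls_ldp Reply 2642153 lspv_server
65774 656842990 1 show_processes Reply 1 kernel
"""
still_blocked = sustained(first, second)
print(sorted(entry[0] for entry in still_blocked.values()))
```

The show_processes entry drops out because its PID differs between runs. An entry that remains is not necessarily a fault; it still has to be judged against what is normal for the router.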
Redundancy and Process Restartability
On systems using Cisco IOS XR software, applications primarily use a combination of two fatal error recovery methods: process restartability and process (application) redundancy.
Process restart is typically used as the first level of failure recovery. If the checkpointed data is not corrupted, the crashed process can recover after it is restarted. If multiple restarts of a mandatory process fail, or if peer processes cannot recover from the restart of a crashed process, the standby card becomes active.
For a non-mandatory process, if the maximum number of respawns per minute is reached, the sysmgr stops restarting the process and the application must be restarted manually.
Each process not triggered by configuration is, by default, started as a "mandatory" process (critical for the router to function). If a mandatory process crashes five times within a five-minute window, an RP switchover is triggered if the standby RP is ready. The show processes all command lists all processes and their state, including the mandatory flag. The mandatory flag can be switched off. Use the process mandatory {on | off} {executable-name | job-id} [location node-id] command to switch the mandatory flag on and off.
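The "five crashes within a five-minute window" rule can be modeled as a sliding-window counter. The sketch below only illustrates that rule; how sysmgr actually accounts for crashes is not described in this chapter, and the class name and the treatment of the window boundary are assumptions.

```python
from collections import deque
from typing import Deque

class CrashWindow:
    """Count crashes of one process inside a sliding time window."""

    def __init__(self, max_crashes: int = 5, window_s: float = 300.0) -> None:
        self.max_crashes = max_crashes
        self.window_s = window_s
        self.times: Deque[float] = deque()

    def record_crash(self, t: float) -> bool:
        """Record a crash at time t (seconds); return True at the threshold."""
        self.times.append(t)
        # Drop crashes that have aged out of the window (boundary handling
        # is an assumption of this sketch).
        while self.times and t - self.times[0] >= self.window_s:
            self.times.popleft()
        return len(self.times) >= self.max_crashes

fast = CrashWindow()
fast_results = [fast.record_crash(t) for t in (0, 60, 120, 180, 240)]
slow = CrashWindow()
slow_results = [slow.record_crash(t) for t in (0, 100, 200, 300, 400)]
print(fast_results[-1], slow_results[-1])
```

In the first sequence all five crashes fall inside 300 seconds, so the threshold is reached on the fifth crash; in the second, older crashes age out and the threshold is never reached.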
Process States
Within the Cisco IOS XR software there are servers that provide the services and clients that use the services. A specific process can have a number of threads that provide the same service. Another process can have a number of clients that may require a specific service at any point in time. Access to the servers is not always available, and if a client requests access to a service it will wait for the server to be free. When this happens the client is blocked. The client may be blocked because it is waiting for a resource such as a mutex, or because the server has not replied.
In the following example, the show processes ospf command is used to check the status of the threads in the ospf process.
RP/0/RP0/CPU0:router# show processes ospf
Executable path: /disk0/hfr-rout-3.4.0/bin/ospf
Max. spawns per minute: 12
Last started: Wed Nov 8 15:45:59 2006
Started on config: cfg/gl/ipv4-ospf/proc/100/ord_f/default/ord_a/routerid
core: TEXT SHAREDMEM MAINMEM
startup_path: /pkg/startup/ospf.startup
Process cpu time: 2.648 user, 0.186 kernel, 2.834 total
JID TID Stack pri state HR:MM:SS:MSEC NAME
272 1 60K 10 Receive 0:00:00:0563 ospf
272 2 60K 10 Receive 0:00:00:0017 ospf
272 3 60K 10 Receive 0:00:00:0035 ospf
272 4 60K 10 Receive 0:00:02:0029 ospf
272 5 60K 10 Receive 0:00:00:0003 ospf
272 6 60K 10 Condvar 0:00:00:0001 ospf
272 7 60K 10 Receive 0:00:00:0000 ospf
-------------------------------------------------------------------------------
The ospf process is given a Job ID of 272. This Job ID never changes on a running router. Within the ospf process there are 7 threads, each with its own Thread ID (TID). For each thread, the stack space, the priority, and the thread state are listed. Table 7-1 lists the thread states.
Synchronous Message Passing
The message passing life cycle is as follows:
1.
A server creates a message channel.
2.
A client connects to a channel of a server (analogous to a POSIX open).
3.
A client sends a message to a server (MsgSend) and waits for a reply and blocks.
4.
The server receives (MsgReceive) a message from a client, processes the message and replies to the client.
5.
The client unblocks and processes the reply from the server.
This blocking client-server model is synchronous message passing. This means the client sends a message and blocks. The server receives the message, processes it, replies back to the client, and then the client unblocks. The specific details are as follows.
1.
Server is waiting in RECEIVE state
2.
Client sends a message to the server and becomes BLOCKED
3.
Server receives the message and unblocks (if waiting in receive state)
4.
Client moves to the REPLY state
5.
Server moves to the RUNNING state
6.
Server processes the message
7.
Server replies to the client
8.
Client unblocks
Use the show processes command to display the states the client and servers are in. Table 7-1 lists the thread states.
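The blocking client-server exchange described above can be imitated with two threads and two queues. This is only a conceptual sketch in Python, not the operating system's MsgSend and MsgReceive implementation; the function names and the uppercase "service" are hypothetical.

```python
import queue
import threading

channel: "queue.Queue" = queue.Queue()  # stands in for the server's channel

def server() -> None:
    # RECEIVE: block until a client sends a message.
    message, reply_box = channel.get()
    # RUNNING: process the message, then reply to the client.
    reply_box.put(message.upper())

def msg_send(message: str) -> str:
    reply_box: "queue.Queue" = queue.Queue(maxsize=1)
    channel.put((message, reply_box))   # client sends and is now blocked
    return reply_box.get(timeout=5)     # REPLY: wait for the server's answer

server_thread = threading.Thread(target=server)
server_thread.start()
result = msg_send("hello")              # returns only after the server replies
server_thread.join()
print(result)
```

The client cannot continue past msg_send until the server has replied, which is why a server that never replies leaves its clients in the Reply state.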
Blocked Processes and Process States
Use the show processes blocked command to display the processes that are in blocked state.
Synchronous message passing enables you to track the life cycle of inter-process communication between the different threads. At any point in time a thread can be in a specific state. A blocked state can be a symptom of a problem, but a thread in the blocked state does not by itself indicate a problem; blocked threads are normal. The show processes blocked command is often a good starting point for troubleshooting operating system-type problems. If there is a problem, for example high CPU usage, use the show processes blocked command to determine whether anything looks abnormal compared with the output on your router when it is functioning normally. That normal output is the baseline to use as a comparison when troubleshooting process life cycles.
At any point in time a thread can be in a particular state. Table 7-1 lists the thread states.
Table 7-1 Thread States
If the State is: | The Thread is:
DEAD | Dead. The kernel is waiting to release the thread's resources.
RUNNING | Actively running on a CPU.
READY | Not running on a CPU but is ready to run.
STOPPED | Suspended (SIGSTOP signal).
SEND | Waiting for a server to receive a message.
RECEIVE | Waiting for a client to send a message.
REPLY | Waiting for a server to reply to a message.
STACK | Waiting for more stack to be allocated.
WAITPAGE | Waiting for the process manager to resolve a page fault.
SIGSUSPEND | Waiting for a signal.
SIGWAITINFO | Waiting for a signal.
NANOSLEEP | Sleeping for a period of time.
MUTEX | Waiting to acquire a mutex.
CONDVAR | Waiting for a condition variable to be signaled.
JOIN | Waiting for the completion of another thread.
INTR | Waiting for an interrupt.
SEM | Waiting to acquire a semaphore.
Process Monitoring
Significant events of the sysmgr are stored in /tmp/sysmgr.log. The log is a wrapping buffer and is useful for troubleshooting. Use the show processes aborts location node-id all command or the show sysmgr trace verbose | include PROC_ABORT command to display an overview of abnormally terminated processes.
Because the sysmgr already monitors all processes on the system, monitoring vital processes with external management tools is not strictly required. However, you can use the show fault manager metric process pid location node-id command to check critical processes on a regular basis (for example, twice each day). The command output provides information including the abort behavior of the particular process and the reason for it.
The following example shows OSPF critical process details. Check the number of times the process ended abnormally and the number of abnormal ends within the past time periods.
RP/0/RP0/CPU0:router# show fault manager metric process ospf location
=====================================
job id: 269, node name: 0/RP0/CPU0
process name: ospf, instance: 1
--------------------------------
last event type: process start
recent start time: Wed Jul 5 15:17:48 2006
recent normal end time: n/a
recent abnormal end time: n/a
number of times started: 1
number of times ended normally: 0
number of times ended abnormally: 2
most recent 10 process start times:
--------------------------
--------------------------
most recent 10 process end times and types:
cumulative process available time: 162 hours 20 minutes 51 seconds 452 milliseco
cumulative process unavailable time: 0 hours 0 minutes 0 seconds 0 milliseconds
process availability: 1.000000000
number of abnormal ends within the past 60 minutes (since reload): 0
number of abnormal ends within the past 24 hours (since reload): 0
number of abnormal ends within the past 30 days (since reload): 2
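The process availability value in this output is consistent with available time divided by total observed time. That formula is an assumption of this sketch, since the chapter does not define the field; the function name is hypothetical.

```python
from datetime import timedelta

def availability(available: timedelta, unavailable: timedelta) -> float:
    """Fraction of observed time that the process was available (assumed formula)."""
    total = available + unavailable
    if total == timedelta(0):
        raise ValueError("no observation time")
    return available / total  # timedelta / timedelta yields a float

# Values taken from the sample output above.
up = timedelta(hours=162, minutes=20, seconds=51, milliseconds=452)
down = timedelta(0)
from_sample = availability(up, down)

# Hypothetical values, for comparison only.
degraded = availability(timedelta(hours=99), timedelta(hours=1))
print(from_sample, degraded)
```

With zero unavailable time the result is 1.0, matching the sample; the hypothetical second case gives roughly 0.99.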
The vital system processes are: qnet, gsp, qsm, redcon, netio, ifmgr, fgid_aggregator, fgid_server, fgid_allocator, fsdb_server, fsdb_aserver, fabricq_mgr, fia_driver, shelfmgr, and lrd on the RP and fabricq_mgr, ingressq, egressq, pse_driver, fia_driver, cpuctrl, and pla_server on the line card.
It is also important to regularly check if critical or vital processes are in a blocked state. See the "show processes blocked Command" section for information on checking if processes are in the blocked state.
Process Monitoring Commands
Use the following commands to monitor processes:
•
top command—Displays real-time CPU usage statistics on the system. See the "top Command" section on page 1-10.
•
show processes pidin command—Displays raw output of all processes, including their state.
•
show processes blocked command—Displays details about reply, send, and mutex blocked processes. See the "show processes blocked Command" section
Note
You can also use the monitor processes and monitor threads commands to determine the top processes and threads based on CPU usage.
CPU Usage
Wdsysmon continuously monitors the system to ensure that no high priority thread is starving low priority threads and provides a procedure to recover from high-priority CPU usage. When a process is determined to be a CPU-hog, it is terminated and a coredump of the process is captured and stored on the configured device (exception choice) to aid debugging. For information on troubleshooting high CPU usage, see the "Troubleshooting a Process Reload" section.
When wdsysmon detects a CPU-hog condition a syslog message is generated. Follow the recommended action for the following syslog messages:
Message: %HA-HA_WD-6-CPU_HOG_1 CPU hog: cpu [dec]'s sched count is [dec].
RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_1 : CPU hog: cpu 1's sched count is 0.
Wdsysmon has detected a CPU starvation situation. This is a potentially high priority process spinning in a tight loop. The `sched count' is the number of times the wdsysmon ticker thread has been scheduled since the last time the wdsysmon watcher thread ran.
Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting a Process Reload" section for information on checking system status.
Message: %HA-HA_WD-6-CPU_HOG_2 CPU hog: cpu [dec]'s ticker last ran [dec].[dec] seconds ago.
RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_2 : CPU hog: cpu 1's ticker last ran 3.965 seconds ago.
Wdsysmon has detected a CPU starvation situation. This is a potentially high priority process spinning in a tight loop.
Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting a Process Reload" section for information on checking system status.
Message: %HA-HA_WD-6-CPU_HOG_3 Rolling average of scheduling times: [dec].[dec].
RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_3 : Rolling average of scheduling times: 0.201.
Wdsysmon has detected a CPU starvation situation. This is a potentially high priority process spinning in a tight loop. A high value for the rolling average indicates that a periodic process is not being scheduled.
Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting a Process Reload" section for information on checking system status.
Message: %HA-HA_WD-6-CPU_HOG_4 Process [chars] pid [dec] tid [dec] prio [dec] using [dec]% is the top user of CPU
RP/0/RP0/CPU0:Dec 22 16:16:35.813 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_4 : Process wd_test pid 409794 tid 2 prio 14 using 99% is the top user of CPU.
This message is displayed after the CPU hog detector trips. It shows the percentage of CPU used by the busiest thread in the top user of CPU. See the "Troubleshooting a Process Reload" section for information on checking system status.
The show watchdog trace command displays additional information about the potential CPU hog. If there is a persistent CPU hog (a hog that lasts for more than 30 seconds) the node will be reset. There will be a log such as the following just before the reset:
RP/0/RP0/CPU0:Dec 20 10:36:08.990 : wdsysmon[367]: %HA-HA_WD-1-CURRENT_STATE : Persistent Hog detected for more than 30 seconds
If the hog is persistent and the node is reset, contact Cisco Technical Support. For contact information for Cisco Technical Support, see the "Obtaining Technical Assistance" section in the Preface. Copy the error message exactly as it appears on the console or in the system log and provide the representative with the gathered information.
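When gathering information for a support case, the CPU_HOG_4 message can be extracted from a saved log to identify the offending process and thread. The following sketch assumes the message layout shown above; the function name is hypothetical, and the pattern tolerates the line wrapping that sometimes appears in captured logs.

```python
import re
from typing import Dict, Optional

HOG4 = re.compile(
    r"%HA-HA_WD-6-CPU_HOG_4\s*:\s*Process\s+(?P<name>\S+)\s+pid\s+(?P<pid>\d+)"
    r"\s+tid\s+(?P<tid>\d+)\s+prio\s+(?P<prio>\d+)\s+using\s+(?P<pct>\d+)%"
)

def parse_cpu_hog(log_text: str) -> Optional[Dict[str, object]]:
    """Return the fields of the first CPU_HOG_4 message, or None if absent."""
    match = HOG4.search(log_text)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "pid": int(match.group("pid")),
        "tid": int(match.group("tid")),
        "prio": int(match.group("prio")),
        "cpu_percent": int(match.group("pct")),
    }

# The sample message shown above, wrapped as it may appear in a capture.
log = ("RP/0/RP0/CPU0:Dec 22 16:16:35.813 : wdsysmon[331]: "
       "%HA-HA_WD-6-CPU_HOG_4 : Process wd_test\n"
       "pid 409794 tid 2 prio 14 using 99% is the top user of CPU.")
hog = parse_cpu_hog(log)
print(hog)
```

The extracted PID and TID can then be used with the follow command to inspect what the thread is executing.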
Troubleshooting a Process Reload
To troubleshoot a process reload, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Technical Assistance" section in the Preface.
Collect the following information before contacting Cisco Technical Support.
•
show context all location all command output
•
show version command output
•
show dll command output
•
show log command output
•
Collect core dumps. See the "show context Command" section on page 1-10 for information on how to collect core dumps.
Troubleshooting a Process Block
To troubleshoot a blocked process, perform the following procedure.
SUMMARY STEPS
1.
show processes blocked location node-id
2.
follow job job-id location node-id
3.
process restart job-id location node-id
DETAILED STEPS
| |
Command or Action
|
Purpose
|
Step 1
|
show processes blocked location node-id
Example:
RP/0/RP0/CPU0:router# show processes blocked location 0/1/cpu0
|
Use the show processes blocked command several times and compare the output to determine if any processes are blocked for a long period of time.
• The Name column shows the name of the blocked process.
• The Blocked-on column shows the process ID and name of the process that the blocked process is waiting on.
• If the State column is Mutex, a thread in the process waits for another thread. In this case, the Blocked-on column shows the process ID and thread ID instead of the process ID.
• A blocked process can be blocked by another process. If a process is being blocked by another process, you need to track the chain of blocking and find the root of blocking processes. Proceed to Step 2 to track the chain of blocking.
|
Step 2
|
follow job job-id location node-id
Example:
RP/0/RP0/CPU0:router# follow job 234 location 0/1/cpu0
|
Tracks the root process that is blocking other processes. The follow command shows which part of the code the process is executing each time it is sampled.
|
Step 3
|
process restart job-id location node-id
Example:
RP/0/RP0/CPU0:router# process restart 234 location 0/1/cpu0
|
Restarts the process.
If the problem is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Technical Assistance" section in the Preface.
Collect the following information for Cisco Technical Support:
• show processes blocked location node-id command output
• follow job job-id location node-id command output
• show version command output
• show dll command output
• show configuration command output
• show logging command output
• content of the file: disk0:/wdsysmon_debug/debug_env. number (if it exists)
|
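Tracking the chain of blocking described in Step 1 amounts to following Blocked-on references until reaching a process that is not itself blocked. The sketch below works at process granularity only and assumes the PID pairs have already been extracted from the show processes blocked output; the function name is hypothetical, and the second mapping entry is invented to show a chain longer than one hop.

```python
from typing import Dict, List

def blocking_chain(blocked_on: Dict[int, int], pid: int) -> List[int]:
    """Follow Blocked-on references from pid to the root blocking process."""
    chain = [pid]
    seen = {pid}
    while chain[-1] in blocked_on:
        nxt = blocked_on[chain[-1]]
        chain.append(nxt)
        if nxt in seen:
            break  # the chain loops back on itself (possible deadlock)
        seen.add(nxt)
    return chain

blocked_on = {
    290991: 184404,   # lpts_fm blocked on lpts_pa (from the sample output)
    184404: 12301,    # hypothetical: lpts_pa blocked on mqueue
}
chain = blocking_chain(blocked_on, 290991)
print(chain)
```

The last PID in the chain is the candidate for the follow command in Step 2. A chain whose last element repeats an earlier one indicates a loop rather than a single root.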