Process Monitoring and Troubleshooting
This chapter includes the following sections:
• System Manager
• Watchdog System Monitor
• Core Dumps
• follow Command
• show processes Commands
• Redundancy and Process Restartability
• Process States
• Process Monitoring
• Monitoring CPU Usage and Using Syslog Messages
• Troubleshooting High CPU Utilization and Process Timeouts
• Troubleshooting a Process Restart
The Cisco IOS XR software is built on a modular system of processes. A process is a group of threads that share virtual address (memory) space. Each process provides specific functionality for the system and runs in a protected memory space to ensure that problems with one process cannot impact the entire system. Multiple instances of a process can run on a single node, and multiple threads of execution can run on each process instance.
Threads are units of execution, each with an execution context that includes a stack and registers. A thread is in effect a "sub-process" managed by the parent process and is responsible for executing one portion of the overall process. For example, Open Shortest Path First (OSPF) has a thread that handles "hello" receipt and transmission. A thread can run only when the parent process is allocated runtime by the system scheduler. A process with more than one thread is a multi-threaded process.
Under normal operating conditions, processes are managed automatically by the Cisco IOS XR software. Processes are started, stopped, or restarted as required by the running configuration of the router. In addition, processes are checkpointed to optimize performance during process restart and automatic switchover. For more information on processes, see Cisco IOS XR System Management Configuration Guide for the Cisco CRS-1 Router.
System Manager
Each process is assigned a job ID (JID) when started. The JID does not change when a process is started, stopped, then restarted. Each process is also assigned a process ID (PID) when started, but this PID changes each time the process is stopped and restarted.
The System Manager (sysmgr) is the fundamental process and the foundation of the system. The sysmgr is responsible for monitoring, starting, stopping, and restarting almost all processes on the system. The restart behavior of each process is predefined (respawn flag on or off) and honored by sysmgr. The sysmgr is the parent of all processes started on boot-up and by configuration. Two sysmgr instances run on each node, providing hot-standby, process-level redundancy. Each active process is registered with SysDB, and after the active sysmgr instance starts a process, sysmgr is notified when that process is running. If the active sysmgr instance dies, the standby instance takes over the active role and a new standby instance is created.
The sysmgr running on the line card (LC) handles all the system management duties like process creation, re-spawning, and core-dumping relevant to that node.
The sysmgr itself is started on bootup by the initialization process. Once the sysmgr is started, initialization hands over the ownership of all processes started by initialization to sysmgr and exits.
Watchdog System Monitor
The Watchdog System Monitor (wdsysmon) keeps historical data on processes and posts this information to a fault detector dynamic link library (DLL), which can then be queried by manageability applications. Once per minute, wdsysmon polls the kernel for process data. This data is stored in a database maintained by the fm_fd_wdsysmon.dll fault detector, which is loaded by wdsysmon.
For more information on wdsysmon and memory thresholds, see the "Watchdog System Monitor" section in Chapter 9 "Troubleshooting Memory."
Deadlock detections
Wdsysmon can attempt to find deadlocks because thread state is returned with the process data. Wdsysmon specifically looks for mutex deadlocks and local Inter-Process Communication (IPC) hangs. Only local IPC deadlocks can be detected. If deadlocks are detected, debugging information is collected in disk0:/wdsysmon_debug.
Deadlocked processes can be stopped and restarted manually using the process restart command.
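A mutex deadlock of the kind described above can be pictured as a cycle in a wait-for graph, where each blocked thread points to the thread that holds the mutex it needs. The following Python sketch illustrates that idea only. It is not the wdsysmon implementation, and the `find_deadlock` helper, the input format, and the thread names are hypothetical.

```python
from typing import Dict, List, Optional


def find_deadlock(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of threads that wait on each other, or None.

    waits_for maps a blocked thread to the thread that owns the mutex it
    is waiting for (hypothetical input, not a wdsysmon data format).
    """
    for start in waits_for:
        path: List[str] = []
        seen = set()
        current = start
        # Follow the chain of owners until it ends or revisits a thread.
        while current in waits_for and current not in seen:
            seen.add(current)
            path.append(current)
            current = waits_for[current]
        if current in seen:
            # Trim any lead-in so that only the cycle itself is returned.
            return path[path.index(current):]
    return None


# bgp:1 and bgp:2 each wait for a mutex that the other one holds.
deadlocked = find_deadlock({"bgp:1": "bgp:2", "bgp:2": "bgp:1", "ospf:3": "bgp:1"})
# ospf:3 waits for ospf:4, which is not blocked, so there is no cycle.
healthy = find_deadlock({"ospf:3": "ospf:4"})
```

A chain that ends at an unblocked thread is ordinary waiting. Only a chain that loops back on itself can never make progress.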
Hang detection
When an event manager is created in the system, the event manager library registers the event with wdsysmon. Wdsysmon expects to periodically hear a "pulse" from every registered event manager in the system. When an event manager is missing, wdsysmon runs a debug script that shows exactly what the thread that created the event manager is doing.
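The pulse mechanism can be sketched as a table of last-seen timestamps that is checked against a timeout. The Python example below is a simplified illustration under stated assumptions. The `PulseMonitor` class, the 30-second timeout, and the event manager names are hypothetical, and the actual pulse interval used by wdsysmon is not given in this chapter.

```python
from typing import Dict, List


class PulseMonitor:
    """Track the last pulse time of each registered event manager."""

    def __init__(self, timeout_s: float) -> None:
        # Assumed value; the real interval is not stated in the text.
        self.timeout_s = timeout_s
        self.last_pulse: Dict[str, float] = {}

    def register(self, name: str, now: float) -> None:
        """Start tracking an event manager from the given time."""
        self.last_pulse[name] = now

    def pulse(self, name: str, now: float) -> None:
        """Record a pulse received from an event manager."""
        self.last_pulse[name] = now

    def missing(self, now: float) -> List[str]:
        """Return the event managers whose pulse is overdue."""
        return sorted(
            name for name, seen in self.last_pulse.items()
            if now - seen > self.timeout_s
        )


monitor = PulseMonitor(timeout_s=30.0)
monitor.register("bgp_evm", now=0.0)
monitor.register("ospf_evm", now=0.0)
monitor.pulse("bgp_evm", now=25.0)   # only bgp_evm reports in
overdue = monitor.missing(now=40.0)  # ospf_evm has been silent for 40 seconds
```

In this sketch an overdue entry only gets reported. On the router, a missing event manager triggers the debug script described above.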
Core Dumps
When a process is abnormally terminated, a core dump file is written to a designated destination. A core dump contains the following information:
• register information
• thread status information
• process status information
• selected memory segments
Use the show exception command to display the configured core dump settings. The output from the show exception command displays the core dump settings configured with the following commands:
• exception filepath
• exception dump-tftp-route
• exception kernel memory
• exception pakmem
• exception sparse
• exception sprsize
The following example shows the core dump settings.
RP/0/RP0/CPU0:router# show exception
Choice 1 path = harddisk:/coredump compress = on filename = <process_name.time>
Choice 2 path = tftp://223.255.254.254/users/xyz compress = on filename =
Exception path for choice 3 is not configured or removed
Choice fallback one path = harddisk:/dumper compress = on filename = <process_name>
Choice fallback two path = disk1:/dumper compress = on filename = <process_name>
Choice fallback three path = disk0:/dumper compress = on filename = <process_name>
Kernel dump not configured
Tftp route for kernel core dump not configured
Dumper packet memory in core dump enabled
Dumper will switch to sparse core dump automatically at size 300MB
Core dumps can be generated manually using the dumpcore command. Two types of core dump can be run manually:
• running: does not impact services
• suspended: suspends a process while generating the core dump
The show context command shows the core dump information for the last 10 core dumps.
follow Command
The follow command is used to unobtrusively debug a live process or live thread in a process. The follow command is particularly useful for:
• process deadlock, livelock, or mutex conditions
• high CPU use conditions
• examining the contents of a memory location or a variable in a process to determine the cause of a corruption issue
• investigating issues where a process or thread is stuck in a loop
A livelock is a condition in which two or more processes continually change their state in response to changes in the other processes, so that none of them makes progress.
The following actions can be specified with the follow command:
• Follow all live threads of a given process, or a given thread of a process, and print a stack trace in a format similar to core dump output
• Follow a process in a loop for a given number of iterations
• Set a delay between two iterations while invoking the command
• Set the priority at which this process should run while this command is being executed
• Dump memory from a given virtual memory location for a given size
• Display register values and status information of the target process
• Take a snapshot of the execution path of a thread asynchronously to investigate performance-related issues; this can be done by specifying a high number of iterations with a zero delay
Caution
If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)
The following example shows the live threads of process 929034375.
RP/0/RP0/CPU0:router# follow process 929034375
Attaching to process pid = 929034375 (pkg/bin/bgp)
No tid specified, following all threads
DLL Loaded by this process
-------------------------------
DLL path Text addr. Text size Data addr. Data size Version
/pkg/lib/libsysmgr.dll 0xfc122000 0x0000df0c 0xfc0c2b14 0x000004ac 0
/pkg/lib/libcerrno.dll 0xfc130000 0x00002f04 0xfc133000 0x00000128 0
/pkg/lib/libcerr_dll_tbl.dll 0xfc134000 0x00004914 0xfc133128 0x00000148 0
/pkg/lib/libltrace.dll 0xfc139000 0x00007adc 0xfc133270 0x00000148 0
/pkg/lib/libinfra.dll 0xfc141000 0x00033c90 0xfc1333b8 0x00000bbc 0
/pkg/lib/cerrno/libinfra_error.dll 0xfc1121dc 0x00000cd8 0xfc175000 0x000000a8 0
/pkg/lib/libios.dll 0xfc176000 0x0002dab0 0xfc1a4000 0x00002000 0
/pkg/lib/cerrno/libevent_manager_error.dll 0xfc1a6000 0x00000e88 0xfc133f74 0x00
/pkg/lib/libc.dll 0xfc1a7000 0x00079d70 0xfc221000 0x00002000 0
/pkg/lib/libsyslog.dll 0xfc223000 0x000054e0 0xfc1750a8 0x00000328 0
/pkg/lib/libplatform.dll 0xfc229000 0x0000c25c 0xfc236000 0x00002000 0
/pkg/lib/libbackplane.dll 0xfc243000 0x000013a8 0xfc1755b8 0x000000a8 0
/pkg/lib/cerrno/libpkgfs_error.dll 0xfc245000 0x00000efc 0xfc175660 0x00000088 0
/pkg/lib/libnodeid.dll 0xfc246000 0x0000a588 0xfc1756e8 0x00000248 0
/pkg/lib/libdebug.dll 0xfc29b000 0x0000fdbc 0xfc2ab000 0x00000570 0
/pkg/lib/cerrno/libdebug_error.dll 0xfc294244 0x00000db0 0xfc175c68 0x000000e8 0
/pkg/lib/lib_procfs_util.dll 0xfc2ac000 0x00004f20 0xfc175d50 0x000002a8 0
/pkg/lib/libinst_debug.dll 0xfc375000 0x0000357c 0xfc36d608 0x000006fc 0
/pkg/lib/libpackage.dll 0xfc3c8000 0x00041ad0 0xfc40a000 0x00000db4 0
/pkg/lib/libwd_evm.dll 0xfc40b000 0x00003dc4 0xfc36dd04 0x00000168 0
------------------------------
Current process = "pkg/bin/bgp", PID = 929034375 TID = 1
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #1 0xfc14ecb8 [msg_receivev]
trace_back: #2 0xfc14eac4 [msg_receive]
trace_back: #3 0xfc151f98 [event_dispatch]
trace_back: #4 0xfc152154 [event_block]
trace_back: #5 0xfd8e16a0 [bgp_event_loop]
trace_back: #6 0x48230db8 [<N/A>]
trace_back: #7 0x48201080 [<N/A>]
Current process = "pkg/bin/bgp", PID = 929034375 TID = 2
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #1 0xfc14ecb8 [msg_receivev]
trace_back: #2 0xfc14eac4 [msg_receive]
trace_back: #3 0xfc151f98 [event_dispatch]
trace_back: #4 0xfc152154 [event_block]
trace_back: #5 0xfc50efd8 [chk_evm_thread]
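When follow output is collected over many iterations, it can help to reduce it to one list of function names per thread. The Python sketch below parses the "Current process" and "trace_back" lines shown in the example above. The `parse_follow` helper is hypothetical, and the regular expressions are based only on the line layout in this example, so other releases may format the output differently.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Patterns derived from the sample output above (an assumption, not a spec).
THREAD_RE = re.compile(r"PID = (\d+) TID = (\d+)")
FRAME_RE = re.compile(r"trace_back: #(\d+) (0x[0-9a-f]+) \[(.+)\]")


def parse_follow(text: str) -> Dict[int, List[str]]:
    """Map each thread ID to its stack of function names, innermost first."""
    stacks: Dict[int, List[str]] = defaultdict(list)
    tid = None
    for line in text.splitlines():
        thread = THREAD_RE.search(line)
        if thread:
            tid = int(thread.group(2))
            continue
        frame = FRAME_RE.search(line)
        if frame and tid is not None:
            stacks[tid].append(frame.group(3))
    return dict(stacks)


sample = """\
Current process = "pkg/bin/bgp", PID = 929034375 TID = 1
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #1 0xfc14ecb8 [msg_receivev]
Current process = "pkg/bin/bgp", PID = 929034375 TID = 2
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #5 0xfc50efd8 [chk_evm_thread]
"""
stacks = parse_follow(sample)
```

Comparing the parsed stacks from successive iterations shows whether a thread is parked in the same call, as both threads are in MsgReceivev here, or is moving through different functions.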
show processes Commands
The following show processes commands are used to display process information:
• show processes boot Command
• show processes startup Command
• show processes failover Command
• show processes blocked Command
show processes boot Command
The show processes boot command displays process boot information. Use the command output to check the following:
• How long the processes took to start
• The order in which the processes started
• Whether a process was delayed, which can indicate a boot failure or boot problems
• Whether the processes started within the time constraints set by the system
RP/0/RP0/CPU0:router# show processes boot location 0/rp1/cpu0
Band Name Finished %Idle JID Ready Last Process
----- -------------- -------- -------- -------- ------- ---------------------
1.0 MBI 22.830 65.130% 62 22.830 insthelper
40.0 ARB 129.225 92.080% 154 106.395 dsc
90.0 ADMIN 185.140 5.950% 175 55.915 fabricq_mgr
100.0 INFRA 207.372 25.040% 165 22.232 fib_mgr
150.0 STANDBY 231.605 13.840% 104 24.233 arp
999.0 FINAL 237.942 1.590% 234 6.337 ipv6_rump
Started Level JID Inst Ready Process
--------- ------ -------- ---- ------- -------------------------------
0.000s 0.05 80 1 0.000 wd-mbi
0.000s 1.00 57 1 0.000 dllmgr
0.000s 2.00 71 1 0.000 pkgfs
0.000s 3.00 56 1 0.000 devc-conaux
0.000s 3.00 73 1 0.000 devc-pty
0.000s 6.00 70 1 0.000 pipe
Last process started: 6d19h after boot. Total: 174
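To check for delays, the band table can be reduced to the band that took longest to become ready. The Python sketch below parses the band rows from the example above. The `parse_bands` helper is hypothetical. In this sample the Ready value appears to equal the time elapsed since the previous band finished, but that reading is inferred from the numbers and is not a documented definition.

```python
from typing import List, NamedTuple


class Band(NamedTuple):
    band: float
    name: str
    finished_s: float
    ready_s: float


def parse_bands(text: str) -> List[Band]:
    """Extract band rows: Band, Name, Finished, %Idle, JID, Ready, [Last Process]."""
    bands = []
    for line in text.splitlines():
        fields = line.split()
        # Band rows have at least six columns and a %Idle value in column 4.
        if len(fields) >= 6 and fields[3].endswith("%"):
            bands.append(Band(float(fields[0]), fields[1],
                              float(fields[2]), float(fields[5])))
    return bands


sample = """\
Band Name Finished %Idle JID Ready Last Process
1.0 MBI 22.830 65.130% 62 22.830 insthelper
40.0 ARB 129.225 92.080% 154 106.395 dsc
90.0 ADMIN 185.140 5.950% 175 55.915 fabricq_mgr
100.0 INFRA 207.372 25.040% 165 22.232 fib_mgr
"""
bands = parse_bands(sample)
# The band with the largest Ready value is the first place to look for a delay.
slowest = max(bands, key=lambda b: b.ready_s)
```

The same parser also accepts the band rows of the show processes failover output, including rows where the Last Process column is empty.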
show processes startup Command
The show processes startup command displays process data for processes created at startup. Use the command output to check the following:
• Whether the listed processes, including their state, start time, restart status, placement, and mandatory status, are as expected
• How long the processes took to start
• The order in which the processes started
• Whether a process was delayed, which can indicate a boot failure or boot problems
• Whether the processes started within the time constraints set by the system
RP/0/RP0/CPU0:router# show processes startup
JID LAST STARTED STATE RE- PLACE- MANDA- NAME(IID) ARGS
-------------------------------------------------------------------------------
81 07/05/2006 14:46:37.514 Run 1 M wd-mbi(1)
57 07/05/2006 14:46:37.514 Run 1 M dllmgr(1) -r 60
72 07/05/2006 14:46:37.514 Run 1 M pkgfs(1)
56 07/05/2006 14:46:37.514 Run 1 M devc-conaux(1) -
h -d librs232.dll -m libconaux.dll -u libst16550.dll
74 07/05/2006 14:46:37.514 Run 1 M devc-pty(1) -n 3
55 Not configured None 0 M clock_chip(1) -r
71 07/05/2006 14:46:37.514 Run 1 M pipe(1)
65 07/05/2006 14:46:37.514 Run 1 M mqueue(1)
64 Not configured None 0 M cat(1) /etc/motd
73 Not configured None 0 M platform_dllmap(
77 07/05/2006 14:46:37.514 Run 1 M shmwin_svr(1)
60 07/05/2006 14:46:37.514 Run 1 M devf-scrp(1) -e
0xf0000038 -m /bootflash: -s 0xfc000000,64m -r -t4 -b10
66 Not configured None 0 M nname(1)
69 07/05/2006 14:46:37.514 Run 1 M pci_bus_mgr(1) -
288 07/05/2006 14:47:02.799 Run 1 M qsm(1)
68 07/05/2006 14:46:37.514 Run 1 M obflmgr(1)
-------------------------------------------------------------------------------
show processes failover Command
The show processes failover command displays process failover information. The command output displays information on how long it took processes to start after a failover (node reboot). Check if there were any delays.
RP/0/RP0/CPU0:router# show processes failover
Thu May 3 11:16:05.562 EST EDT
Band Name Finished %Idle JID Ready Last Process
----- -------------- -------- -------- -------- ------- ---------------------
40.0 ARB 0.000 0.000% 0 0.000 NONE
90.0 ADMIN 0.000 0.000% 0 0.000 NONE
100.0 INFRA 0.000 0.000% 0 0.000 NONE
121.0 FT_ADMIN 0.056 0.000% 315 0.056 qsm
122.0 FT_INFRA 1.819 0.000% 195 1.763 ifmgr
123.0 FT_IP_ARM 2.097 0.000% 232 0.278 ipv6_arm
124.0 FT_ISIS 2.280 0.000% 248 0.183 isis
125.0 FT_PRE_IP 2.303 0.000% 0 0.023 NONE
126.0 FT_IP 6.345 0.000% 219 4.042 ipv4_local
127.0 FT_LPTS 8.041 0.000% 262 1.696 lpts_pa
128.0 FT_PRE_OSPF 9.889 0.000% 260 1.848
129.0 FT_OSPF 17.944 37.410% 291 8.055 ospf
130.0 FT_MPLS 23.602 0.000% 272 5.658 mpls_lsd
131.0 FT_BGP_START 26.366 0.000% 326 2.764 rsvp
132.0 FT_MULTICAST 32.940 0.000% 277 6.574 mrib
133.0 FT_CLI 35.357 0.000% 361 2.417
134.0 FT_FINAL 44.772 0.000% 322 9.415
150.0 ACTIVE 70.322 0.000% 224 25.550 ntpd
999.0 FINAL 79.011 0.000% 273 8.689 mpls_rid_helper
Go active Level Band Name JID Inst Avail Process
--------- ------ -------------- -------- ---- ------- -------------------------------
0.002s 22.00 FT_ADMIN 315 1 0.049 qsm
0.027s 38.00 FT_ADMIN 52 1 0.000 bcm_process
0.028s 40.00 FT_ADMIN 155 1 0.012 dsc
0.028s 85.00 FT_ADMIN 332 1 0.000 shelfmgr
0.030s 85.00 FT_ADMIN 333 1 0.000 shelfmgr_partner
0.031s 120.00 FT_ADMIN 184 1 0.000 fab_svr
0.061s 23.00 FT_INFRA 145 1 0.000 correlatord
0.064s 23.00 FT_INFRA 352 1 0.000 syslogd
0.065s 23.00 FT_INFRA 79 1 0.000 syslogd_helper
0.066s 38.00 FT_INFRA 297 1 0.000 packet
0.067s 40.00 FT_INFRA 379 2 0.000 chkpt_proxy
0.070s 40.00 FT_INFRA 380 3 0.000 chkpt_proxy
0.072s 40.00 FT_INFRA 381 4 0.000 chkpt_proxy
85.006s 0.00 368 1 2.094 udp_snmpd
85.310s 0.00 373 1 25.730 vrrp
85.743s 0.00 376 1 22.666 xmlagent
7d07h 0.00 136 1 0.479 chdlc_ma
Last process started: 7d07h after switch over. Total: 78
show processes blocked Command
The show processes blocked command displays details about reply, send, and mutex blocked processes.
Because a process can be blocked temporarily, run the show processes blocked command twice in succession at each polling interval and on each node. If a process is displayed as blocked in both iterations, run the command a third time to confirm that the process is still blocked.
The polling interval must be long enough to show a sustained blocked state. For example, a fully equipped Cisco CRS-1 8-Slot Line Card Chassis requires a minimum of 20 requests for each interval: two runs of the command on each of its 10 nodes (2 RPs and 8 LCs).
The show processes blocked command output always displays processes in the Reply state as blocked.
RP/0/RP0/CPU0:router# show processes blocked
Wed May 2 11:44:12.360 EST EDT
Jid Pid Tid Name State Blocked-on
65546 8202 1 ksh Reply 8200 devc-conaux
52 36889 2 attachd Reply 32791 eth_server
52 36889 3 attachd Reply 12301 mqueue
77 36891 6 qnet Reply 32791 eth_server
77 36891 7 qnet Reply 32791 eth_server
77 36891 8 qnet Reply 32791 eth_server
77 36891 9 qnet Reply 32791 eth_server
51 36897 2 attach_server Reply 12301 mqueue
376 139341 1 tftp_server Reply 12301 mqueue
364 143438 6 sysdb_mc Reply 135244 gsp
268 221354 2 lpts_fm Reply 204855 lpts_pa
65725 13291709 1 exec Reply 1 kernel
65784 23720184 1 exec Reply 331975 devc-vty
65786 27287802 1 exec Reply 331975 devc-vty
65788 23589116 1 attach Reply 8200 devc-conaux
65788 23589116 2 attach Reply 12301 mqueue
65790 27316478 1 exec Reply 1 kernel
65792 27328768 1 exec Reply 331975 devc-vty
65793 27726081 1 more Reply 12299 pipe
350 27418882 2 snmpd Reply 143438 sysdb_mc
385 27418886 1 udp_snmpd Reply 221353 udp
65800 27726088 1 show_processes Reply 1 kernel
This output is normal for these processes. For example, the line:
65800 27726088 1 show_processes Reply 1 kernel
is a direct result of executing the show processes blocked command. The process ID (PID) changes each time the command is run.
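Running the command repeatedly and comparing the results by eye is error-prone on a busy node. The Python sketch below intersects two snapshots so that only threads blocked in both remain. The `parse_blocked` helper is hypothetical, the first snapshot reuses rows from the example later in this section, and the show_processes row in the second snapshot uses made-up JID and PID values to mimic a new invocation of the command.

```python
from typing import Set, Tuple

# jid, pid, tid, name, state, blocked-on
Row = Tuple[int, int, int, str, str, str]


def parse_blocked(text: str) -> Set[Row]:
    """Parse the data rows of show processes blocked output."""
    rows = set()
    for line in text.splitlines():
        f = line.split()
        # Data rows start with a numeric JID; the header row does not.
        if len(f) >= 7 and f[0].isdigit():
            rows.add((int(f[0]), int(f[1]), int(f[2]), f[3], f[4], " ".join(f[5:])))
    return rows


first = """\
Jid Pid Tid Name State Blocked-on
367 2642149 5 mpls_ldp Reply 2642153 lspv_server
65774 656842990 1 show_processes Reply 1 kernel
"""
# Second run: the show_processes row is a new process (hypothetical values).
second = """\
Jid Pid Tid Name State Blocked-on
367 2642149 5 mpls_ldp Reply 2642153 lspv_server
65776 656843010 1 show_processes Reply 1 kernel
"""
# Threads that appear unchanged in both snapshots.
persistent = parse_blocked(first) & parse_blocked(second)
```

Because the PID is part of each row, short-lived entries such as show_processes drop out of the intersection on their own. Long-lived entries that are normally blocked in the Reply state, such as ksh waiting on devc-conaux, still appear and must be recognized as normal.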
If a vital system process or a fundamental application controlling connectivity (for example, routing protocols or Multiprotocol Label Switching Label Distribution Protocol [MPLS LDP]) appears blocked in the Reply, Send, Mutex, or Condvar state, do the following:
• Collect data from the follow job or follow process command. See the "follow Command" section for more information on these commands.
Caution
If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)
• Use the dumpcore running job-id location node-id command on the affected process. The output of the dumpcore is located in harddisk:/dumper, unless the location has been configured using the exception choice command.
Example:
RP/0/RP0/CPU0:router# show processes blocked
Jid Pid Tid Name State Blocked-on
65546 8202 1 ksh Reply 8200 devc-conaux
51 36890 2 attachd Reply 32791 eth_server
51 36890 3 attachd Reply 12301 mqueue
75 36893 5 qnet Reply 32791 eth_server
75 36893 6 qnet Reply 32791 eth_server
75 36893 7 qnet Reply 32791 eth_server
75 36893 8 qnet Reply 32791 eth_server
50 36899 2 attach_server Reply 12301 mqueue
334 172108 1 tftp_server Reply 12301 mqueue
247 290991 2 lpts_fm Reply 184404 lpts_pa
65750 644260054 1 exec Reply 1 kernel
65752 655270104 1 config Reply 286888 devc-vty
367 2642149 5 mpls_ldp Reply 2642153 lspv_server
65772 655229164 1 exec Reply 1 kernel
65773 656842989 1 more Reply 12299 pipe
65774 656842990 1 show_processes Reply 1 kernel
Note
To troubleshoot a blocked process, use the procedure in the "Troubleshooting a Process Block" section.
Redundancy and Process Restartability
On systems using Cisco IOS XR software, applications primarily use a combination of two fatal error recovery methods: process restartability and process (application) redundancy.
Process restart is typically used as the first level of failure recovery. If the checkpointed data is not corrupted, the crashed process can recover after it is restarted. If multiple restarts of a mandatory process fail, or if peer processes cannot recover from the restart of a crashed process, the standby card becomes active.
For a non-mandatory process, if the maximum number of respawns per minute is reached, sysmgr stops restarting the process and the application must be restarted manually.
Each process not triggered by configuration is, by default, started as a "mandatory" process (critical for the router to function). If a mandatory process crashes five times within a five-minute window, an RP switchover is triggered if the standby RP is ready. The show processes all command lists all processes and their states, including the mandatory flag. The mandatory flag can be switched off. Use the process mandatory {on | off} {executable-name | job-id} [location node-id] command to switch the mandatory flag on and off.
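The rule of five crashes within a five-minute window can be modeled as a sliding window over crash timestamps. The Python sketch below is an illustration only. The `CrashWindow` class is hypothetical, and the exact boundary handling (whether a crash exactly 300 seconds old still counts) is an assumption, because the text does not specify it.

```python
from collections import deque
from typing import Deque


class CrashWindow:
    """Count crashes of one mandatory process inside a sliding time window."""

    def __init__(self, max_crashes: int = 5, window_s: float = 300.0) -> None:
        self.max_crashes = max_crashes
        self.window_s = window_s
        self.crashes: Deque[float] = deque()

    def record_crash(self, now: float) -> bool:
        """Record a crash; return True when the threshold is reached."""
        self.crashes.append(now)
        # Discard crashes that have fallen out of the window.
        while self.crashes and now - self.crashes[0] > self.window_s:
            self.crashes.popleft()
        return len(self.crashes) >= self.max_crashes


# Five crashes one minute apart all fall inside one five-minute window.
window = CrashWindow()
results = [window.record_crash(t) for t in (0, 60, 120, 180, 240)]

# Five crashes 100 seconds apart never put five inside the same window.
slow = CrashWindow()
slow_results = [slow.record_crash(t) for t in (0, 100, 200, 300, 400)]
```

In the first sequence the fifth crash reaches the threshold, which on the router would trigger an RP switchover if the standby RP is ready. In the second sequence the oldest crash ages out before the fifth one arrives.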
Process States
Within the Cisco IOS XR software there are servers that provide services and clients that use them. A specific process can have a number of threads that provide the same service. Another process can have a number of clients that may require a specific service at any point in time. Access to the servers is not always available, and if a client requests access to a service it waits for the server to be free. When this happens the client is blocked. The client may be blocked because it is waiting for a resource such as a mutex, or because the server has not replied.
In the following example, the show processes ospf command is used to check the status of the threads in the ospf process.
RP/0/RP0/CPU0:router# show processes ospf
Executable path: /disk0/hfr-rout-3.4.0/bin/ospf
Max. spawns per minute: 12
Last started: Wed Nov 8 15:45:59 2006
Started on config: cfg/gl/ipv4-ospf/proc/100/ord_f/default/ord_a/routerid
core: TEXT SHAREDMEM MAINMEM
startup_path: /pkg/startup/ospf.startup
Process cpu time: 2.648 user, 0.186 kernel, 2.834 total
JID TID Stack pri state HR:MM:SS:MSEC NAME
272 1 60K 10 Receive 0:00:00:0563 ospf
272 2 60K 10 Receive 0:00:00:0017 ospf
272 3 60K 10 Receive 0:00:00:0035 ospf
272 4 60K 10 Receive 0:00:02:0029 ospf
272 5 60K 10 Receive 0:00:00:0003 ospf
272 6 60K 10 Condvar 0:00:00:0001 ospf
272 7 60K 10 Receive 0:00:00:0000 ospf
-------------------------------------------------------------------------------
The ospf process is assigned a Job ID of 272. This Job ID never changes on a running router. Within the ospf process there are 7 threads, each with its own Thread ID (TID). For each thread, the output lists the stack space, the priority, and the thread state. Table 8-1 lists the thread states.
The full command output also shows the PID (299228 in this example; the field is not included in the excerpt above). This number changes each time the process is restarted. The Respawn count indicates how many times the process has restarted, and the Process state should show the RUN state.
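For a process with many threads, a count of threads per state is often easier to read than the full table. The Python sketch below tallies the state column of the thread rows from the example above. The `thread_states` helper is hypothetical and assumes the seven-column row layout shown in this example.

```python
from collections import Counter

sample = """\
JID TID Stack pri state HR:MM:SS:MSEC NAME
272 1 60K 10 Receive 0:00:00:0563 ospf
272 2 60K 10 Receive 0:00:00:0017 ospf
272 3 60K 10 Receive 0:00:00:0035 ospf
272 4 60K 10 Receive 0:00:02:0029 ospf
272 5 60K 10 Receive 0:00:00:0003 ospf
272 6 60K 10 Condvar 0:00:00:0001 ospf
272 7 60K 10 Receive 0:00:00:0000 ospf
"""


def thread_states(text: str) -> Counter:
    """Count threads per state in show processes <name> output."""
    states: Counter = Counter()
    for line in text.splitlines():
        f = line.split()
        # Thread rows have seven columns and numeric JID and TID values.
        if len(f) == 7 and f[0].isdigit() and f[1].isdigit():
            states[f[4]] += 1
    return states


states = thread_states(sample)
```

Here six threads are in the Receive state, waiting for a client to send a message, and one is waiting on a condition variable. For this example that is an idle, healthy picture.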
Synchronous Message Passing
The message passing life cycle is as follows:
1. A server creates a message channel.
2. A client connects to a channel of a server (analogous to POSIX open).
3. A client sends a message to a server (MsgSend), waits for a reply, and blocks.
4. The server receives (MsgReceive) a message from a client, processes the message, and replies to the client.
5. The client unblocks and processes the reply from the server.
This blocking client-server model is synchronous message passing. This means the client sends a message and blocks. The server receives the message, processes it, replies back to the client, and then the client unblocks. The specific details are as follows.
1. Server is waiting in RECEIVE state
2. Client sends a message to the server and becomes BLOCKED
3. Server receives the message and unblocks (if waiting in receive state)
4. Client moves to the REPLY state
5. Server moves to the RUNNING state
6. Server processes the message
7. Server replies to the client
8. Client unblocks
Use the show processes command to display the states the client and servers are in. Table 8-1 lists the thread states.
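The blocking client-server exchange described above can be imitated with two threads and two queues. The Python sketch below is an analogy only. It uses Python queues rather than the operating system's MsgSend and MsgReceive calls, and the `server` and `client_send` functions and the recorded state names are hypothetical stand-ins for the states in the list above.

```python
import queue
import threading

channel = queue.Queue()  # stands in for the server's message channel
states = []              # client state transitions, in order


def server() -> None:
    # The server waits here, as in the RECEIVE state, until a message arrives.
    message, reply_box = channel.get()
    # The server is now running: process the message and reply to the client.
    reply_box.put(message.upper())


def client_send(message: str) -> str:
    reply_box = queue.Queue(maxsize=1)
    # In this sketch put() returns at once; on the router the client stays
    # send-blocked until the server has received the message.
    states.append("SEND")
    channel.put((message, reply_box))
    # The client now blocks, as in the REPLY state, until the server replies.
    states.append("REPLY")
    reply = reply_box.get()
    states.append("RUNNING")  # the client is unblocked
    return reply


worker = threading.Thread(target=server)
worker.start()
reply = client_send("hello")
worker.join()
```

If the server thread never replied, the client would stay in the call to `reply_box.get()` indefinitely, which is the situation that appears as a thread in the Reply state in show processes blocked output.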
Blocked Processes and Process States
Use the show processes blocked command to display the processes that are in blocked state.
Synchronous message passing enables you to track the life cycle of inter-process communication between the different threads. At any point in time a thread can be in a specific state. A blocked state can be a symptom of a problem, but a thread in the blocked state does not necessarily indicate a problem; blocked threads are normal. The show processes blocked command is often a good starting point for troubleshooting operating system-type problems. Run the command while the router is functioning normally to establish a baseline of what is normal for your router. If there is a problem, for example high CPU utilization, run the command again and compare the output with the baseline to determine whether anything looks abnormal.
Table 8-1 lists the thread states.
Table 8-1 Thread States
If the State is: | The Thread is:
DEAD | Dead. The kernel is waiting to release the thread's resources.
RUNNING | Actively running on a CPU.
READY | Not running on a CPU but is ready to run.
STOPPED | Suspended (SIGSTOP signal).
SEND | Waiting for a server to receive a message.
RECEIVE | Waiting for a client to send a message.
REPLY | Waiting for a server to reply to a message.
STACK | Waiting for more stack to be allocated.
WAITPAGE | Waiting for the process manager to resolve a page fault.
SIGSUSPEND | Waiting for a signal.
SIGWAITINFO | Waiting for a signal.
NANOSLEEP | Sleeping for a period of time.
MUTEX | Waiting to acquire a mutex.
CONDVAR | Waiting for a condition variable to be signaled.
JOIN | Waiting for the completion of another thread.
INTR | Waiting for an interrupt.
SEM | Waiting to acquire a semaphore.
To troubleshoot a blocked process, use the procedure in the "Troubleshooting a Process Block" section.
Process Monitoring
Significant events of the sysmgr are stored in /tmp/sysmgr.log. The log is a wrapping buffer and is useful for troubleshooting. Use the show processes aborts location node-id all command or the show sysmgr trace verbose | include PROC_ABORT command to display an overview of abnormally terminated processes.
Because sysmgr already monitors all processes on the system, monitoring vital processes with external management tools is not strictly required. However, you can use the show fault manager metric process pid location node-id command to check critical processes on a regular basis (for example, twice each day). The command output includes the abort behavior of the particular process and the reason for it.
The following example shows OSPF critical process details. Check the number of times the process ended abnormally and the number of abnormal ends within the past time periods.
RP/0/RP0/CPU0:router# show fault manager metric process ospf location
=====================================
job id: 269, node name: 0/RP0/CPU0
process name: ospf, instance: 1
--------------------------------
last event type: process start
recent start time: Wed Jul 5 15:17:48 2006
recent normal end time: n/a
recent abnormal end time: n/a
number of times started: 1
number of times ended normally: 0
number of times ended abnormally: 2
most recent 10 process start times:
--------------------------
--------------------------
most recent 10 process end times and types:
cumulative process available time: 162 hours 20 minutes 51 seconds 452 milliseco
cumulative process unavailable time: 0 hours 0 minutes 0 seconds 0 milliseconds
process availability: 1.000000000
number of abnormal ends within the past 60 minutes (since reload): 0
number of abnormal ends within the past 24 hours (since reload): 0
number of abnormal ends within the past 30 days (since reload): 2
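The process availability value in this output is consistent with the ratio of available time to total tracked time. The Python sketch below works through that arithmetic with the figures from the example. The formula is an assumption inferred from the sample values, not a documented definition, and the `availability` helper is hypothetical.

```python
from datetime import timedelta


def availability(up: timedelta, down: timedelta) -> float:
    """Return up / (up + down), or 0.0 when no time has been tracked."""
    total = up + down
    return up / total if total else 0.0


# Cumulative times taken from the sample output above.
available = timedelta(hours=162, minutes=20, seconds=51, milliseconds=452)
unavailable = timedelta(0)
ratio = availability(available, unavailable)

# A hypothetical process that was down for 1 hour out of 100.
degraded = availability(timedelta(hours=99), timedelta(hours=1))
```

With zero unavailable time the ratio is exactly 1.0, matching the 1.000000000 shown in the output. The second call shows how even an hour of downtime in 100 hours lowers the figure to about 0.99.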
The vital system processes are: qnet, gsp, qsm, redcon, netio, ifmgr, fgid_aggregator, fgid_server, fgid_allocator, fsdb_server, fsdb_aserver, fabricq_mgr, fia_driver, shelfmgr, and lrd on the RP; and fabricq_mgr, ingressq, egressq, pse_driver, fia_driver, cpuctrl, and pla_server on the line card.
It is also important to regularly check if critical or vital processes are in a blocked state. See the "show processes blocked Command" section for information on checking if processes are in the blocked state.
Process Monitoring Commands
Use the following commands to monitor processes:
• top command: Displays real-time CPU usage statistics on the system. See the "top Command" section.
• show processes pidin command: Displays raw output of all processes, including their state.
• show processes blocked command: Displays details about reply, send, and mutex blocked processes. See the "show processes blocked Command" section.
You can also use the monitor processes and monitor threads commands to determine the top processes and threads based on CPU usage.
Tip
The top processes command displays almost real-time CPU and memory utilization, and updates several times per minute. The show processes cpu command displays data that has been collected for all process IDs over the past 1-minute, 5-minute, and 15-minute intervals. Both methods provide valuable information.
Monitoring CPU Usage and Using Syslog Messages
Wdsysmon continuously monitors the system to ensure that no high priority thread is waiting and provides a procedure to recover from high-priority CPU usage. When a process is determined to be a CPU-hog, it is terminated and a coredump of the process is captured and stored on the configured device (exception choice) to aid debugging. For information on troubleshooting high CPU usage, see the "Troubleshooting High CPU Utilization and Process Timeouts" section.
When wdsysmon detects a CPU-hog condition a syslog message is generated. Follow the recommended action for the following syslog messages:
Message: %HA-HA_WD-6-CPU_HOG_1 CPU hog: cpu [dec]'s sched count is [dec].
RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_1 : CPU hog: cpu 1's sched count is 0.
Wdsysmon has detected a CPU starvation situation. This is a potentially high priority process spinning in a tight loop. The "sched count" is the number of times the wdsysmon ticker thread has been scheduled since the last time the wdsysmon watcher thread ran.
Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting High CPU Utilization and Process Timeouts" section for information on checking system status.
Message: %HA-HA_WD-6-CPU_HOG_2 CPU hog: cpu [dec]'s ticker last ran [dec].[dec] seconds ago.
RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_2 : CPU hog: cpu 1's ticker last ran 3.965 seconds ago.
Wdsysmon has detected a CPU starvation situation. This is a potentially high priority process spinning in a tight loop.
Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting High CPU Utilization and Process Timeouts" section for information on checking system status.
Message: %HA-HA_WD-6-CPU_HOG_3 Rolling average of scheduling times: [dec].[dec].
RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_3 : Rolling average of scheduling times: 0.201.
Wdsysmon has detected a CPU starvation situation. This is a potentially high priority process spinning in a tight loop. A high value for the rolling average indicates that a periodic process is not being scheduled.
Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting High CPU Utilization and Process Timeouts" section for information on checking system status.
Message: %HA-HA_WD-6-CPU_HOG_4 Process [chars] pid [dec] tid [dec] prio [dec] using [dec]% is the top user of CPU
RP/0/RP0/CPU0:Dec 22 16:16:35.813 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_4 : Process wd_test pid 409794 tid 2 prio 14 using 99% is the top user of CPU.
This message is displayed after the CPU hog detector trips. It shows the percentage of CPU used by the busiest thread in the top user of CPU. See the "Troubleshooting High CPU Utilization and Process Timeouts" section for information on checking system status.
The show watchdog trace command displays additional information about the potential CPU hog. If there is a persistent CPU hog (a hog that lasts for more than 30 seconds), the node is reset. A log message such as the following appears just before the reset:
RP/0/RP0/CPU0:Dec 20 10:36:08.990 : wdsysmon[367]: %HA-HA_WD-1-CURRENT_STATE : Persistent Hog detected for more than 30 seconds
If the hog is persistent and the node is reset, contact Cisco Technical Support. For contact information for Cisco Technical Support, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface. Copy the error message exactly as it appears on the console or in the system log and provide the representative with the gathered information.
Note
For more information on wdsysmon and memory thresholds, see the "Watchdog System Monitor" section in Chapter 9 "Troubleshooting Memory."
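When these messages are collected from a saved log, it can help to pull the numeric fields out programmatically. The following is a minimal Python sketch, assuming each message sits on a single line in the formats quoted above. The function name parse_cpu_hog and the dictionary keys are hypothetical names chosen for illustration; they are not part of Cisco IOS XR.

```python
import re

# One pattern per message format quoted in this section.
CPU_HOG_PATTERNS = {
    "CPU_HOG_1": re.compile(r"cpu (?P<cpu>\d+)'s sched count is (?P<sched_count>\d+)"),
    "CPU_HOG_2": re.compile(
        r"cpu (?P<cpu>\d+)'s ticker last ran (?P<seconds>\d+\.\d+) seconds ago"
    ),
    "CPU_HOG_3": re.compile(
        r"Rolling average of scheduling times: (?P<average>\d+\.\d+)"
    ),
    "CPU_HOG_4": re.compile(
        r"Process (?P<name>\S+) pid (?P<pid>\d+) tid (?P<tid>\d+) "
        r"prio (?P<prio>\d+) using (?P<percent>\d+)%"
    ),
}
# Matches the mnemonic, for example %HA-HA_WD-6-CPU_HOG_2.
MNEMONIC = re.compile(r"%HA-HA_WD-\d-(CPU_HOG_\d)")


def parse_cpu_hog(line):
    """Return the fields of a wdsysmon CPU hog message, or None if no match."""
    mnemonic = MNEMONIC.search(line)
    if mnemonic is None:
        return None
    pattern = CPU_HOG_PATTERNS.get(mnemonic.group(1))
    if pattern is None:
        return None
    fields = pattern.search(line)
    if fields is None:
        return None
    # Field values are kept as strings exactly as they appear in the log.
    return {"message": mnemonic.group(1), **fields.groupdict()}
```

For the CPU_HOG_4 sample above, the sketch reports the process name wd_test, pid 409794, tid 2, priority 14, and 99 percent.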
Troubleshooting High CPU Utilization and Process Timeouts
This section describes how to troubleshoot common problems caused by high CPU utilization, which in some cases also lead to process timeouts. It includes the following topics:
•
General Guidelines for Troubleshooting CPU Utilization Problems
•
Troubleshooting a Process Block
•
Troubleshooting a Process Crash on Line Cards
•
Troubleshooting a Memory Leak
•
Troubleshooting a Hardware Failure
•
Troubleshooting SNMP Timeouts
•
Troubleshooting Communication Among Multiple Processes
General Guidelines for Troubleshooting CPU Utilization Problems
Optimal CPU utilization is vital for the routers to function properly. In general, the following cases can cause high CPU utilization:
•
Normal conditions—One or more processes might be using a large percentage (or all) of the available CPU due to the following reasons:
–
Routing table convergence calculations (until the routing table converges)
–
SNMP polling
–
Any query that requires a large amount of CPU
–
Communication among multiple processes
•
Abnormal conditions—A process might be using excessive CPU due to the following reasons:
–
Process (thread) loop
–
Memory leak
–
Process blocking due to a bug or hardware problem that leaves other processes waiting for a reply (loop)
There is no single definition of "high CPU utilization." Utilization depends on many factors, including the number of clients served and the current configuration on the router. The following example illustrates one approach to troubleshooting utilization. (Details of the commands are provided in the sections that follow.)
Example:
You run the top processes command. (It shows the top ten processes in terms of CPU usage.)
From the output of the command, you notice that the top two processes use more CPU than the next eight. This might indicate a problem.
You continue by considering the context of this CPU usage. You notice that the top process is OSPF, so you run commands to show whether packet drops are occurring on the connections that use OSPF. If there are OSPF packet drops, there might be a problem with OSPF that needs attention.
You continue by troubleshooting OSPF. After correcting the OSPF problem, you can rerun the top processes command to verify that the CPU usage by OSPF has been reduced.
Using show process and top processes Commands
To troubleshoot high CPU utilization due to one of the above reasons, use the following commands:
•
show processes cpu | exclude 0% 0% 0%—Displays all processes currently using the CPU. The sample output displays high percentages. Run this command multiple times.
•
top processes—Displays the processes with the most CPU usage.
The top processes command displays almost real-time CPU and memory utilization, and updates several times per minute. The show processes cpu command displays data that has been collected for all process IDs over the past 1-, 5-, and 15-minute intervals. Both methods provide valuable information.
•
show processes blocked location location-id (Run this command multiple times)
•
show process process_name location location-id
•
follow process process-id location location-id
Caution  If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)
The following example shows the processes using the CPU.
RP/0/RP0/CPU0:router# show processes cpu | exclude 0% 0% 0%
CPU utilization for one minute: 100%; five minutes: 100%; fifteen minutes: 100%
PID 1Min 5Min 15Min Process
24615 98% 97% 97% syslog_dev <--!!!
RP/0/RP0/CPU0:CIPC2-VAN#show process block loc 0/0/cpu0
Jid Pid Tid Name State Blocked-on
54 8202 1 ksh Reply 8199 devc-ser8250
51 20502 2 attachd Reply 20500 eth_server
51 20502 3 attachd Reply 8204 mqueue
72 20503 6 qnet Reply 20500 eth_server
72 20503 7 qnet Reply 20500 eth_server
72 20503 8 qnet Reply 20500 eth_server
72 20503 9 qnet Reply 20500 eth_server
52 20507 1 ksh-aux Reply 8199 devc-ser8250
50 20508 2 attach_server Reply 8204 mqueue
216 24610 1 reddrv_listener Reply 20500 eth_server
246 90234 1 spa_xge_v2 Reply 24615 syslog_dev <--!!!
246 90234 5 spa_xge_v2 Mutex 90234-01 #1
RP/0/RP0/CPU0:CIPC2-VAN#show process block loc 0/0/cpu0
Jid Pid Tid Name State Blocked-on
54 8202 1 ksh Reply 8199 devc-ser8250
51 20502 2 attachd Reply 20500 eth_server
51 20502 3 attachd Reply 8204 mqueue
72 20503 6 qnet Reply 20500 eth_server
72 20503 7 qnet Reply 20500 eth_server
72 20503 8 qnet Reply 20500 eth_server
72 20503 9 qnet Reply 20500 eth_server
52 20507 1 ksh-aux Reply 8199 devc-ser8250
50 20508 2 attach_server Reply 8204 mqueue
216 24610 1 reddrv_listener Reply 20500 eth_server
246 90234 1 spa_xge_v2 Reply 24615 syslog_dev <--still blocking!!!
246 90234 5 spa_xge_v2 Mutex 90234-01 #1
RP/0/RP0/CPU0:CIPC2-VAN#show process syslog_dev loc 0/0/cpu0
Tue Sep 11 17:22:51.182 UTC
Executable path: /bootflash/hfr-base-3.4.1/bin/syslog_dev
Max. spawns per minute: 12
Last started: Fri Jun 22 14:15:01 2007
core: TEXT SHAREDMEM MAINMEM
startup_path: /pkg/startup/syslog_dev.startup
Process cpu time: 1283052.366 user, 0.291 kernel, 1283052.657 total
JID TID Stack pri state HR:MM:SS:MSEC NAME
262 1 12K 10 Ready 1549:26:59:0925 syslog_dev <---take a look at the CPU time spent by this process!!!
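Because the show processes cpu command should be run several times, comparing the captures by hand can be tedious. The following Python sketch, assuming each capture is saved as plain text in the column layout shown above, reports the processes whose 1-minute usage stays at or above a threshold in every capture. The function names and the 90 percent default threshold are assumptions made for this illustration, not Cisco recommendations.

```python
import re

# Matches rows such as: 24615 98% 97% 97% syslog_dev
ROW = re.compile(r"^\s*(\d+)\s+(\d+)%\s+(\d+)%\s+(\d+)%\s+(\S+)")


def parse_show_processes_cpu(output):
    """Return {pid: (name, one_min, five_min, fifteen_min)} for each process row."""
    rows = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            pid, one, five, fifteen, name = match.groups()
            rows[int(pid)] = (name, int(one), int(five), int(fifteen))
    return rows


def persistent_hogs(samples, threshold=90):
    """Return {pid: name} for processes at or above threshold in every sample."""
    parsed = [parse_show_processes_cpu(sample) for sample in samples]
    if not parsed:
        return {}
    # Only PIDs present in every capture can be persistent.
    common = set(parsed[0])
    for rows in parsed[1:]:
        common &= set(rows)
    return {
        pid: parsed[-1][pid][0]
        for pid in sorted(common)
        if all(rows[pid][1] >= threshold for rows in parsed)
    }
```

Given two captures that both contain the syslog_dev row shown above, persistent_hogs returns {24615: "syslog_dev"}.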
The following example shows the processes with the most CPU usage.
RP/0/RP0/CPU0:router# top processes
247 processes; 930 threads; 4804 channels, 6683 fds
CPU states: 98.5% idle, 0.6% user, 0.8% kernel
Memory: 4096M total, 3095M avail, page size 4K
JID TIDS Chans FDs Tmrs MEM HH:MM:SS CPU NAME
1 33 250 197 1 0 437:50:44 0.82% procnto-600-smp-cisco-i
333 9 32 21 16 1M 0:11:28 0.26% sysdb_svr_admin
180 21 132 40 11 6M 0:36:37 0.16% gsp
332 7 161 19 11 1M 0:10:33 0.12% sysdb_mc
376 7 31 62 13 6M 0:06:49 0.04% mpls_ldp
159 1 5 14 2 756K 0:00:48 0.04% envmon_mon
344 5 6 47 2 1M 0:01:39 0.02% top_procs
341 35 26 62 6 728K 0:01:02 0.02% tcp
276 3 9 15 2 548K 0:00:06 0.00% oir_daemon
62 1 6 9 1 204K 0:00:07 0.00% i2c_server
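A saved copy of the top processes output can be reduced to the overall CPU split and the per-process CPU column. The following Python sketch assumes the layout shown above (a "CPU states:" summary line followed by rows ending in a CPU percentage and a process name). The function name parse_top is hypothetical.

```python
import re

CPU_STATES = re.compile(
    r"CPU states:\s*([\d.]+)% idle,\s*([\d.]+)% user,\s*([\d.]+)% kernel"
)
# Column order: JID TIDS Chans FDs Tmrs MEM HH:MM:SS CPU NAME
PROC_ROW = re.compile(
    r"^\s*(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+\S+\s+\d+:\d+:\d+\s+([\d.]+)%\s+(\S+)"
)


def parse_top(output):
    """Return (states, procs): the CPU summary and a list of (jid, name, cpu%)."""
    states = None
    summary = CPU_STATES.search(output)
    if summary:
        idle, user, kernel = (float(value) for value in summary.groups())
        states = {"idle": idle, "user": user, "kernel": kernel}
    procs = []
    for line in output.splitlines():
        row = PROC_ROW.match(line)
        if row:
            procs.append((int(row.group(1)), row.group(3), float(row.group(2))))
    return states, procs
```

In the sample above the CPU is 98.5 percent idle, so no process stands out; a low idle value would direct attention to the first rows of the list.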
Troubleshooting a Process Block
To troubleshoot a blocked process, perform the following procedure.
SUMMARY STEPS
1.
show processes blocked location node-id
2.
follow job job-id location node-id
3.
process restart job-id location node-id
DETAILED STEPS
| |
Command or Action
|
Purpose
|
Step 1
|
show processes blocked location node-id
Example:
RP/0/RP0/CPU0:router# show processes blocked
location 0/0/cpu0
|
Use the show processes blocked command several times (three times at 5-second intervals) and compare the output to determine if any processes are blocked for a long period of time. The process can be blocked continuously or for a few seconds. A process is blocked while it is waiting for a response from another process.
• The Name column shows the name of the blocked process.
• The Blocked-on column shows the process ID and name of the process that the blocked process is waiting on.
• If the State column is Mutex, a thread in the process is waiting for another thread. In this case, the Blocked-on column shows the process ID and thread ID instead of the process ID and name.
• A blocked process can be blocked by another process. If a process is being blocked by another process, you need to track the chain of blocking and find the root of blocking processes. Proceed to Step 2 to track the chain of blocking.
|
Step 2
|
follow job job-id location node-id
Example:
RP/0/RP0/CPU0:router# follow job 24615 location
0/0/cpu0
|
Tracks the root process blocking other processes. The follow command periodically samples the process and shows which part of the code it is executing.
Caution  If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)
|
Step 3
|
process restart job-id location node-id
Example:
RP/0/RP0/CPU0:router# process restart 234
location 0/1/cpu0
|
Restarts the process. If the problem is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.
Collect the following information for Cisco Technical Support:
• show processes blocked location node-id command output
• follow job job-id location node-id command output
Caution  If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)
• show version command output
• show dll command output
• show configuration command output
• show logging command output
• Contents of the file disk0:/wdsysmon_debug/debug_env.number (if it exists)
|
The following example shows CPU usage on a designated node and identifies the process that is consuming the CPU:
RP/0/RP0/CPU0:router# show processes cpu location 0/0/cpu0 | exc 0% 0% 0%
CPU utilization for one minute: 100%; five minutes: 100%; fifteen minutes: 100%
PID 1Min 5Min 15Min Process
24615 98% 97% 97% syslog_dev <--!!!
The following example shows the details for blocked processes on a designated node:
RP/0/RP0/CPU0:router# show processes blocked location 0/0/cpu0
Jid Pid Tid Name State Blocked-on
54 8202 1 ksh Reply 8199 devc-ser8250
51 20502 2 attachd Reply 20500 eth_server
51 20502 3 attachd Reply 8204 mqueue
72 20503 6 qnet Reply 20500 eth_server
72 20503 7 qnet Reply 20500 eth_server
72 20503 8 qnet Reply 20500 eth_server
72 20503 9 qnet Reply 20500 eth_server
52 20507 1 ksh-aux Reply 8199 devc-ser8250
50 20508 2 attach_server Reply 8204 mqueue
216 24610 1 reddrv_listener Reply 20500 eth_server
246 90234 1 spa_xge_v2 Reply 24615 syslog_dev <--!!!
246 90234 5 spa_xge_v2 Mutex 90234-01 #1
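Tracking the chain of blocking described in Step 1 can be sketched in code. The following Python example, assuming the show processes blocked output is saved as text in the column layout shown above, maps each blocked PID to the processes it waits on and then walks that map from a starting PID until it reaches processes that are not themselves blocked. The function names are hypothetical. Mutex rows are skipped on purpose, because they describe a wait on a thread inside the same process rather than on another process.

```python
import re
from collections import defaultdict

# Jid Pid Tid Name State Blocked-on-pid Blocked-on-name
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\S+)")


def parse_blocked(output):
    """Map each blocked PID to the set of (pid, name) pairs it is waiting on."""
    waits_on = defaultdict(set)
    for line in output.splitlines():
        match = ROW.match(line)
        if match:  # Mutex rows (Blocked-on like 90234-01 #1) do not match
            _jid, pid, _tid, _name, _state, on_pid, on_name = match.groups()
            waits_on[int(pid)].add((int(on_pid), on_name))
    return dict(waits_on)


def root_blockers(waits_on, start_pid):
    """Follow the blocking chain from start_pid; return the unblocked roots."""
    roots, seen, stack = set(), set(), [start_pid]
    while stack:
        pid = stack.pop()
        if pid in seen:  # guards against circular waits
            continue
        seen.add(pid)
        for on_pid, on_name in waits_on.get(pid, ()):
            if on_pid in waits_on:
                stack.append(on_pid)  # the blocker is itself blocked: keep walking
            else:
                roots.add((on_pid, on_name))
    return sorted(roots)
```

Starting from spa_xge_v2 (PID 90234) in the output above, the walk ends at syslog_dev (PID 24615), the same process that the CPU usage example marks as consuming the CPU.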
The following example uses the follow process command to gather the loaded DLLs and periodic stack traces for the blocking process:
RP/0/RP0/CPU0:router# follow process 24615 location 0/0/cpu0
Tue Sep 11 17:21:26.205 UTC
Attaching to process pid = 24615 (pkg/bin/syslog_dev)
No tid specified, following all threads
DLL Loaded by this process
-------------------------------
DLL path Text addr. Text size Data addr. Data size Version
/pkg/lib/libsysmgr.dll 0xfc124000 0x00010f9c 0xfc087a28 0x000005cc 0
/pkg/lib/libcerrno.dll 0xfc135000 0x00002f9c 0xfc1126ac 0x00000128 0
/pkg/lib/libcerr_dll_tbl.dll 0xfc138000 0x000049e0 0xfc1127d4 0x00000148 0
/pkg/lib/libltrace.dll 0xfc13d000 0x00008a60 0xfc11291c 0x000002c8 0
/pkg/lib/libinfra.dll 0xfc146000 0x00034e60 0xfc17b000 0x00002000 0
/pkg/lib/cerrno/libinfra_error.dll 0xfc1141dc 0x00000cd8 0xfc112be4 0x000000a8 0
/pkg/lib/libios.dll 0xfc17d000 0x0002cc34 0xfc1aa000 0x00002000 0
/pkg/lib/cerrno/libevent_manager_error.dll 0xfc1ac000 0x00000e88 0xfc112c8c 0x00000088 0
/pkg/lib/libc.dll 0xfc1ad000 0x0007b118 0xfc229000 0x00002000 0
/pkg/lib/libplatform.dll 0xfc23f000 0x0000c738 0xfc24c000 0x00002000 0
/pkg/lib/libnodeid.dll 0xfc24e000 0x0000a730 0xfc23a3f8 0x00000248 0
/pkg/lib/libdebug.dll 0xfc25c000 0x00010038 0xfc23a7cc 0x00000550 0
/pkg/lib/cerrno/libdebug_error.dll 0xfc26c038 0x00000db0 0xfc23ad1c 0x000000e8 0
/pkg/lib/lib_procfs_util.dll 0xfc26d000 0x00004fb8 0xfc272000 0x000002a8 0
/pkg/lib/libsyslog.dll 0xfc28f000 0x0000564c 0xfc2724c0 0x00000328 0
/pkg/lib/libbackplane.dll 0xfc295000 0x000013f0 0xfc2727e8 0x000000a8 0
/pkg/lib/cerrno/libsysmgr_error.dll 0xfc4c9000 0x00000f94 0xfc2fba04 0x00000088 0
/pkg/lib/libsysdb.dll 0xfc4d9000 0x0004a000 0xfc523000 0x00001000 0
/pkg/lib/cerrno/libsysdb_error_v1v2.dll 0xfc524000 0x00002000 0xfc526000 0x00001000 0
/pkg/lib/cerrno/libsysdb_error_v2only.dll 0xfc527000 0x00003000 0xfc52a000 0x00001000 0
/pkg/lib/cerrno/libsysdb_error_callback.dll 0xfc52b000 0x00002000 0xfc52d000 0x00001000 0
/pkg/lib/cerrno/libsysdb_error_distrib.dll 0xfc52e000 0x00002000 0xfc530000 0x00001000 0
/pkg/lib/libsysdbutils.dll 0xfc531000 0x0000d000 0xfc53e000 0x00001000 0
------------------------------
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1
trace_back: #0 0xfc1f6044 [strlen]
trace_back: #1 0x482002c8 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
------------------------------
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1
trace_back: #0 0xfc1f6130 [strncat]
trace_back: #1 0x482002a4 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
------------------------------
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1
trace_back: #0 0xfc1f6044 [strlen]
trace_back: #1 0x482002c8 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
------------------------------
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1
trace_back: #0 0xfc1f6044 [strlen]
trace_back: #1 0x482002c8 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
------------------------------
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1
trace_back: #0 0xfc1f6130 [strncat]
trace_back: #1 0x482002a4 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
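Each block in the follow output is one sample of the call stack, so counting the innermost frame (#0) across samples gives a rough picture of where the process spends its time. The following Python sketch assumes the trace_back line format shown above; the function name top_frame_counts is hypothetical.

```python
import re
from collections import Counter

# Captures the symbol name of frame #0 in each sampled stack.
FRAME0 = re.compile(r"trace_back: #0 0x[0-9a-fA-F]+ \[(.+?)\]")


def top_frame_counts(follow_output):
    """Count how often each function is the innermost frame across samples."""
    return Counter(FRAME0.findall(follow_output))
```

For the five samples above, the count is 3 for strlen and 2 for strncat, which suggests that syslog_dev is spending its time in string handling under the same caller chain.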
Troubleshooting a Process Crash on Line Cards
To troubleshoot a process crash on the line card, perform the following steps.
SUMMARY STEPS
1.
Identify the process that crashed (PI or PD) from the crash log. In either case, the stack traces obtained from the crash need to be decoded to identify the location in the code where the process crashed.
2.
show install active
3.
show version
4.
show log
5.
show exception
DETAILED STEPS
| |
Command or Action
|
Purpose
|
Step 1
|
show install active node-id
Example:
RP/0/RP0/CPU0:router# show install active
location 0/0/CPU0
Node 0/0/CPU0 [RP] [SDR: Owner]
Boot Device: mem:
Boot Image:
/c12k-os-mbi-3.7.0.26I/mbiprp-rp.vm
Active Packages:
mem:c12k-mpls-3.7.0.26I
mem:c12k-mini-3.7.0.26I
|
Use the show install active command to collect information about the list of installed software for each node.
|
Step 2
|
show version
Example:
RP/0/RP0/CPU0:router# show version | begin
0/0/CPU0
|
Displays the workspace (directory) and the build server where the image was built.
|
Step 3
|
show log
Example:
RP/0/RP0/CPU0:router# show log
|
Provides a background on what was going on at the time of the crash. You can find syslog messages from the dumper process at the time of the crash. This provides a list of dynamic libraries which were loaded by the process and the addresses where they were mapped. This is required to decode the program counters in the stack trace which are a part of a DLL. Also, the location where the core dump has been saved is available.
The core dump is the .Z file mentioned in the log.
|
Step 4
|
show exception
|
If you do not have the log, use the show exception command to find out where the core dump is saved.
If the problem is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.
|
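Decoding a program counter that belongs to a DLL starts with finding which DLL's text segment contains the address and computing the offset into it. The following Python sketch illustrates that range check. It assumes you have already copied the text address and text size for each DLL from the log; the two entries in DLLS are taken from the follow output earlier in this chapter purely as sample data, and the names locate_pc and DLLS are hypothetical.

```python
def locate_pc(pc, dll_map):
    """Return (dll_path, offset) for the DLL whose text segment contains pc.

    dll_map maps a DLL path to (text_addr, text_size). Returns None when the
    address is not inside any listed DLL.
    """
    for path, (text_addr, text_size) in dll_map.items():
        if text_addr <= pc < text_addr + text_size:
            return path, pc - text_addr
    return None


# Sample data: text address and text size columns from a DLL listing.
DLLS = {
    "/pkg/lib/libinfra.dll": (0xFC146000, 0x00034E60),
    "/pkg/lib/libc.dll": (0xFC1AD000, 0x0007B118),
}
```

With this sample data, locate_pc(0xFC1F6044, DLLS) returns ("/pkg/lib/libc.dll", 0x49044), and an address such as 0x482002C8 returns None because it is outside both listed DLLs.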
Troubleshooting a Memory Leak
To troubleshoot a memory leak, use the show processes {files} [job-id] [detail] command to display detailed information about open files and open communication channels. The job-id argument displays the job identifier information for only the associated process instance.
The following example shows output from the show processes command with the files and detail keywords:
RP/0/RP0/CPU0:router# show processes files 351 detail
Sun Jan 21 04:35:18.451 EDT
Jid: 351 Total open files: 352 Name: tacacsd
-------------------------------------------------------------
File Descriptor Process Name
--------------- ------------
Troubleshooting a Hardware Failure
A hardware failure can have a major impact on the normal operation of the CPU. If a problem is detected, messages can be obtained from the syslog or you can get a node name from the output of the show processes command with the blocked keyword.
Troubleshooting SNMP Timeouts
This section explains how to troubleshoot a typical SNMP timeout scenario.
The service provider typically initiates an SNMP query by means of an SNMP server in the network operations center. When you set up an SNMP query on an SNMP server, you set the parameters of the query, including a timer. If the timer expires before the server receives the query results, this means the query has timed out. If the requested SNMP query involves a large amount of data from the Cisco CRS-1 (and this is generally true for SNMP queries), the Cisco CRS-1 might experience very high CPU utilization as it searches for the data. The Cisco CRS-1 might not be able to complete the query and data transfer before the timer on the SNMP server expires.
To correct a problem with SNMP timeouts, set the timer on the SNMP server to a higher value.
Process timeouts can also occur if communication among multiple processes causes high CPU utilization. For information about this scenario, see the "Troubleshooting Communication Among Multiple Processes" section.
Troubleshooting Communication Among Multiple Processes
If communication among multiple processes is causing high CPU utilization, you must stop the request process (for example, Simple Network Management Protocol [SNMP], Internet Control Message Protocol [ICMP], or TCP).
To check the communication blocks among multiple processes, use the show processes command. Use the blocked keyword (multiple times) to display details about the blocked process. Use the cpu keyword (multiple times) to display the CPU usage for each process.
The following example shows the details about the blocked processes from the show processes command with the blocked keyword:
RP/0/RP0/CPU0:router# show processes blocked
Jid Pid Tid Name State Blocked-on
65546 8202 1 ksh Reply 8200 devc-conaux
51 36889 2 attachd Reply 32790 eth_server
51 36889 3 attachd Reply 12301 mqueue
74 36892 5 qnet Reply 32790 eth_server
74 36892 6 qnet Reply 32790 eth_server
74 36892 7 qnet Reply 32790 eth_server
74 36892 8 qnet Reply 32790 eth_server
50 36898 2 attach_server Reply 12301 mqueue
361 118859 1 tftp_server Reply 12301 mqueue
259 139360 2 locald Reply 233685 tacacsd
261 155823 2 lpts_fm Reply 139336 lpts_pa
65717 51572917 1 exec Reply 1 kernel
370 233689 1 udp_snmpd Reply 155811 udp
65759 51589343 1 more Reply 12299 pipe
65774 51589358 1 show_processes Reply 1 kernel
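When many threads are blocked, a tally of how many threads wait on each process can show where communication is concentrated. The following Python sketch assumes the whitespace-separated column layout shown above; the function name blockers_by_thread_count is hypothetical. A high count is not a fault by itself, because some servers normally have many clients waiting on them, so compare the tally across repeated runs of the command.

```python
from collections import Counter


def blockers_by_thread_count(output):
    """Return [((pid, name), waiting_threads), ...] sorted by count, highest first."""
    counts = Counter()
    for line in output.splitlines():
        fields = line.split()
        # Data rows: Jid Pid Tid Name State Blocked-on-pid Blocked-on-name
        if len(fields) >= 7 and fields[0].isdigit() and fields[5].isdigit():
            counts[(int(fields[5]), fields[6])] += 1
    return counts.most_common()
```

For the output above, the first two entries are eth_server (PID 32790) with five waiting threads and mqueue (PID 12301) with three.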
The following example shows the CPU usage for each process from the show processes command with the cpu keyword:
RP/0/RP0/CPU0:router# show processes cpu | exclude 0% 0% 0%
PID 1Min 5Min 15Min Process
32790 4% 4% 3% eth_server
114740 9% 8% 8% sysdb_svr_shared <--!!!
118857 11% 10% 10% gsp <--!!!
Troubleshooting a Process Restart
A process restart does not typically cause a major problem for the network, nor does it typically cause a loss of traffic. However, it can be helpful to troubleshoot the event, to find out why the process crashed, and whether you need to consider taking any action to locate or correct a problem.
To troubleshoot a process restart, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.
Collect the following information before contacting Cisco Technical Support.
•
show context all location all command output
•
show version command output
•
show dll command output
•
show log command output
•
Collect core dumps. See the "show context Command" section on page 1-11 for information on how to collect core dumps.