Cisco IOS XR Troubleshooting Guide for the Cisco CRS-1 Router
Process Monitoring and Troubleshooting

Process Monitoring and Troubleshooting

Table Of Contents

Process Monitoring and Troubleshooting

System Manager

Watchdog System Monitor

Deadlock Detection

Hang Detection

Core Dumps

follow Command

show processes Commands

show processes boot Command

show processes startup Command

show processes failover Command

show processes blocked Command

Example:

Redundancy and Process Restartability

Process States

Synchronous Message Passing

Blocked Processes and Process States

Process Monitoring

Process Monitoring Commands

Monitoring CPU Usage and Using Syslog Messages

Troubleshooting High CPU Utilization and Process Timeouts

General Guidelines for Troubleshooting CPU Utilization Problems

Using show process and top processes Commands

Troubleshooting a Process Block

Troubleshooting a Process Crash on Line Cards

Troubleshooting a Memory Leak

Troubleshooting a Hardware Failure

Troubleshooting SNMP Timeouts

Troubleshooting Communication Among Multiple Processes

Troubleshooting a Process Restart


Process Monitoring and Troubleshooting


This chapter includes the following sections:

System Manager

Watchdog System Monitor

Core Dumps

follow Command

show processes Commands

Redundancy and Process Restartability

Process States

Process Monitoring

Monitoring CPU Usage and Using Syslog Messages

Troubleshooting High CPU Utilization and Process Timeouts

Troubleshooting a Process Restart

The Cisco IOS XR software is built on a modular system of processes. A process is a group of threads that share virtual address (memory) space. Each process provides specific functionality for the system and runs in a protected memory space to ensure that problems with one process cannot impact the entire system. Multiple instances of a process can run on a single node, and multiple threads of execution can run on each process instance.

Threads are units of execution, each with an execution context that includes a stack and registers. A thread is in effect a "sub-process" managed by the parent, responsible for executing a subportion of the overall process. For example, Open Shortest Path First (OSPF) has a thread that handles "hello" receipt and transmission. A thread may run only when the parent process is allocated runtime by the system scheduler. A process with threads is a multithreaded process.
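The process-and-thread model described above can be sketched with ordinary threads sharing one address space. This is a minimal Python illustration, not Cisco code; the "hello" thread and the neighbor address are only an analogy to the OSPF example.

```python
# Illustrative sketch: a "process" as a group of threads sharing memory,
# with one thread dedicated to "hello" handling, as in the OSPF example.
import queue
import threading

hello_q = queue.Queue()   # shared memory: every thread sees the same queue
received = []             # shared list, filled by the hello thread

def hello_receiver():
    # Runs a subportion of the overall work, like an OSPF hello thread.
    while True:
        msg = hello_q.get()
        if msg is None:   # shutdown sentinel from the parent
            break
        received.append(msg)

t = threading.Thread(target=hello_receiver, name="ospf-hello")
t.start()
hello_q.put("hello from neighbor 10.0.0.2")   # hypothetical neighbor
hello_q.put(None)
t.join()
```

Because both threads share the same address space, the main thread can read the list that the hello thread filled in, with no message copying between protected memory spaces.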

Under normal operating conditions, processes are managed automatically by the Cisco IOS XR software. Processes are started, stopped, or restarted as required by the running configuration of the router. In addition, processes are checkpointed to optimize performance during process restart and automatic switchover. For more information on processes, see Cisco IOS XR System Management Configuration Guide for the Cisco CRS-1 Router.

System Manager

Each process is assigned a job ID (JID) when started. The JID does not change when a process is started, stopped, then restarted. Each process is also assigned a process ID (PID) when started, but this PID changes each time the process is stopped and restarted.

The System Manager (sysmgr) is the fundamental process and the foundation of the system. The sysmgr is responsible for monitoring, starting, stopping, and restarting almost all processes on the system. Whether a process is restarted is predefined (respawn flag on or off) and honored by the sysmgr. The sysmgr is the parent of all processes started at boot and by configuration. Two sysmgr instances run on each node, providing hot-standby, process-level redundancy. Each active process is registered with the SysDB, and once a process is started by the active sysmgr, the sysmgr is notified when it is running. If the active sysmgr process dies, the standby instance takes over the active role and a new standby instance is spawned.

The sysmgr running on the line card (LC) handles all the system management duties like process creation, re-spawning, and core-dumping relevant to that node.

The sysmgr itself is started on bootup by the initialization process. Once the sysmgr is started, initialization hands over the ownership of all processes started by initialization to sysmgr and exits.
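The respawn behavior described above can be sketched as a small supervisor loop. This is a hypothetical Python sketch, not the real sysmgr; the function names and the restart cap are illustrative.

```python
# Illustrative supervisor loop (not the real sysmgr): restart a crashed
# process only while its respawn flag is on and the restart cap allows it.
def supervise(run_once, respawn, max_restarts):
    restarts = 0
    while True:
        exited_normally = run_once()   # run the process to completion
        if exited_normally or not respawn or restarts >= max_restarts:
            return restarts
        restarts += 1                  # abnormal exit: re-spawn the process

# A process that crashes twice, then exits normally.
outcomes = iter([False, False, True])
restart_count = supervise(lambda: next(outcomes), respawn=True, max_restarts=12)
```

With the respawn flag off, the same crash sequence would end after the first abnormal exit, which is the "respawn flag honored by sysmgr" behavior noted above.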

Watchdog System Monitor

The Watchdog System Monitor (wdsysmon) keeps historical data on processes and posts this information to a fault detector dynamic link library (DLL), which can then be queried by manageability applications. Once per minute, wdsysmon polls the kernel for process data. This data is stored in a database maintained by the fm_fd_wdsysmon.dll fault detector, which is loaded by wdsysmon.

For more information on wdsysmon and memory thresholds, see the "Watchdog System Monitor" section in Chapter 9 "Troubleshooting Memory."

Deadlock Detection

Wdsysmon can attempt to find deadlocks because thread state is returned with the process data. Wdsysmon specifically looks for mutex deadlocks and local Inter-Process Communication (IPC) hangs. Only local IPC deadlocks can be detected. If deadlocks are detected, debugging information is collected in disk0:/wdsysmon_debug.

Deadlocked processes can be stopped and restarted manually using the process restart command.
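A mutex deadlock of the kind wdsysmon looks for is a cycle in the "wait-for" relationship among threads. The following sketch is illustrative only (the thread names and the data structure are assumptions, not wdsysmon internals):

```python
# Illustrative sketch: a mutex deadlock detected as a cycle in a thread
# "wait-for" graph. waits_on maps each blocked thread to the thread that
# holds the mutex it wants.
def find_deadlock(waits_on):
    for start in waits_on:
        seen = set()
        t = start
        while t in waits_on:        # follow the chain of waiters
            if t in seen:
                return sorted(seen) # revisited a thread: a cycle exists
            seen.add(t)
            t = waits_on[t]
    return None                     # every chain ends at a running thread

# T1 waits on T2's mutex while T2 waits on T1's: a classic deadlock.
deadlock = find_deadlock({"T1": "T2", "T2": "T1", "T3": "T1"})
```

A chain that ends at a thread holding no contested mutex is not a deadlock, which is why a blocked thread on its own is not evidence of a problem.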

Hang Detection

When an event manager is created in the system, the event manager library registers the event with wdsysmon. Wdsysmon expects to periodically hear a "pulse" from every registered event manager in the system. When an event manager is missing, wdsysmon runs a debug script that shows exactly what the thread that created the event manager is doing.
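The pulse-based hang check above amounts to flagging any registered event manager whose last heartbeat is older than a timeout. A minimal sketch (the names, timestamps, and timeout value are assumptions for illustration):

```python
# Illustrative sketch: a watchdog expecting a periodic "pulse" from every
# registered event manager, flagging the ones that have gone quiet.
def missing_pulses(last_pulse, now, timeout):
    # Return event managers whose last pulse is older than the timeout.
    return sorted(name for name, t in last_pulse.items() if now - t > timeout)

# Hypothetical registrations with the time of each manager's last pulse.
last_pulse = {"ospf": 100.0, "bgp": 40.0, "isis": 99.0}
hung = missing_pulses(last_pulse, now=101.0, timeout=30.0)
```

In the real system, a missing pulse triggers a debug script against the owning thread rather than a simple list, but the detection criterion is the same kind of staleness test.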

Core Dumps

When a process is abnormally terminated, a core dump file is written to a designated destination. A core dump contains the following information:

register information

thread status information

process status information

selected memory segments

Use the show exception command to display the configured core dump settings. The output from the show exception command displays the core dump settings configured with the following commands:

exception filepath

exception dump-tftp-route

exception kernel memory

exception pakmem

exception sparse

exception sprsize

The following example shows the core dump settings.

RP/0/RP0/CPU0:router# show exception 
 
   
Choice  1 path = harddisk:/coredump compress = on filename = <process_name.time>
Choice  2 path = tftp://223.255.254.254/users/xyz compress = on filename = 
<process_name.time>
Exception path for choice 3 is not configured or removed
Choice fallback one path = harddisk:/dumper compress = on filename = <process_name>
Choice fallback two path = disk1:/dumper compress = on filename = <process_name>
Choice fallback three path = disk0:/dumper compress = on filename = <process_name>
Kernel dump not configured
Tftp route for kernel core dump not configured
Dumper packet memory in core dump enabled
Sparse core dump enabled
Dumper will switch to sparse core dump automatically at size 300MB
 
   

Core dumps can be generated manually using the dumpcore command. There are two types of core dumps that can be manually run:

running—does not impact services

suspended—suspends a process while generating the core dump

The show context command shows the core dump information for the last 10 core dumps.

follow Command

The follow command is used to unobtrusively debug a live process or live thread in a process. The follow command is particularly useful for:

process deadlock, livelock, or mutex conditions

high CPU use conditions

examining the contents of a memory location or a variable in a process to determine the cause of a corruption issue

investigating issues where a process or thread is stuck in a loop.

A livelock condition is where two or more processes continually change their state in response to changes in the other processes.

The following actions can be specified with the follow command:

Follow all live threads of a given process or a given thread of a process and print stack trace in a format similar to core dump output

Follow a process in a loop for a given number of iterations

Set a delay between two iterations while invoking the command

Set the priority at which this process should run while this command is being executed

Dump memory from a given virtual memory location for a given size

Display register values and status information of the target process

Take a snapshot of the execution path of a thread asynchronously to investigate performance-related issues (specify a high number of iterations with a zero delay)


Caution If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)

The following example shows the live threads of process 929034375.

RP/0/RP0/CPU0:router# follow process 929034375 
 
   
Attaching to process pid = 929034375 (pkg/bin/bgp)
No tid specified, following all threads
 
   
DLL Loaded by this process 
-------------------------------
 
   
DLL path                 Text addr. Text size  Data addr. Data size  Version
/pkg/lib/libsysmgr.dll   0xfc122000 0x0000df0c 0xfc0c2b14 0x000004ac        0
/pkg/lib/libcerrno.dll   0xfc130000 0x00002f04 0xfc133000 0x00000128        0
/pkg/lib/libcerr_dll_tbl.dll 0xfc134000 0x00004914 0xfc133128 0x00000148       0
/pkg/lib/libltrace.dll   0xfc139000 0x00007adc 0xfc133270 0x00000148        0
/pkg/lib/libinfra.dll    0xfc141000 0x00033c90 0xfc1333b8 0x00000bbc        0
/pkg/lib/cerrno/libinfra_error.dll 0xfc1121dc 0x00000cd8 0xfc175000 0x000000a8 0
/pkg/lib/libios.dll      0xfc176000 0x0002dab0 0xfc1a4000 0x00002000        0
/pkg/lib/cerrno/libevent_manager_error.dll 0xfc1a6000 0x00000e88 0xfc133f74 0x00
/pkg/lib/libc.dll        0xfc1a7000 0x00079d70 0xfc221000 0x00002000        0
/pkg/lib/libsyslog.dll   0xfc223000 0x000054e0 0xfc1750a8 0x00000328        0
/pkg/lib/libplatform.dll 0xfc229000 0x0000c25c 0xfc236000 0x00002000        0
/pkg/lib/libbackplane.dll 0xfc243000 0x000013a8 0xfc1755b8 0x000000a8        0
/pkg/lib/cerrno/libpkgfs_error.dll 0xfc245000 0x00000efc 0xfc175660 0x00000088 0
/pkg/lib/libnodeid.dll   0xfc246000 0x0000a588 0xfc1756e8 0x00000248        0
/pkg/lib/libdebug.dll    0xfc29b000 0x0000fdbc 0xfc2ab000 0x00000570        0
/pkg/lib/cerrno/libdebug_error.dll 0xfc294244 0x00000db0 0xfc175c68 0x000000e8 0
/pkg/lib/lib_procfs_util.dll 0xfc2ac000 0x00004f20 0xfc175d50 0x000002a8       0
/pkg/lib/libinst_debug.dll 0xfc375000 0x0000357c 0xfc36d608 0x000006fc        0
/pkg/lib/libpackage.dll  0xfc3c8000 0x00041ad0 0xfc40a000 0x00000db4        0
/pkg/lib/libwd_evm.dll   0xfc40b000 0x00003dc4 0xfc36dd04 0x00000168        0
.
.
.
Iteration 1 of 5
------------------------------
 
   
Current process = "pkg/bin/bgp", PID = 929034375 TID = 1 
 
   
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #1 0xfc14ecb8 [msg_receivev]
trace_back: #2 0xfc14eac4 [msg_receive]
trace_back: #3 0xfc151f98 [event_dispatch]
trace_back: #4 0xfc152154 [event_block]
trace_back: #5 0xfd8e16a0 [bgp_event_loop]
trace_back: #6 0x48230db8 [<N/A>]
trace_back: #7 0x48201080 [<N/A>]
 
   
ENDOFSTACKTRACE
 
   
 
   
Current process = "pkg/bin/bgp", PID = 929034375 TID = 2 
 
   
trace_back: #0 0xfc164210 [MsgReceivev]
trace_back: #1 0xfc14ecb8 [msg_receivev]
trace_back: #2 0xfc14eac4 [msg_receive]
trace_back: #3 0xfc151f98 [event_dispatch]
trace_back: #4 0xfc152154 [event_block]
trace_back: #5 0xfc50efd8 [chk_evm_thread]
 
   
ENDOFSTACKTRACE
.
.
.

show processes Commands

The following show processes commands are used to display process information:

show processes boot Command

show processes startup Command

show processes failover Command

show processes blocked Command

show processes boot Command

The show processes boot command displays process boot information. Use the command output to check the following:

How long the processes took to start

The order in which the processes started

Whether a process was delayed, indicating a boot failure or boot problems

Whether the processes started within the time constraints set by the system

RP/0/RP0/CPU0:router# show processes boot location 0/rp1/cpu0 
 
   
 Band Name           Finished    %Idle      JID   Ready Last Process
----- -------------- -------- -------- -------- ------- ---------------------
  1.0 MBI              22.830  65.130%       62  22.830 insthelper
 40.0 ARB             129.225  92.080%      154 106.395 dsc
 90.0 ADMIN           185.140   5.950%      175  55.915 fabricq_mgr
100.0 INFRA           207.372  25.040%      165  22.232 fib_mgr
150.0 STANDBY         231.605  13.840%      104  24.233 arp
999.0 FINAL           237.942   1.590%      234   6.337 ipv6_rump
 
   
  Started  Level      JID Inst   Ready Process
--------- ------ -------- ---- ------- -------------------------------
   0.000s   0.05       80    1   0.000 wd-mbi
   0.000s   1.00       57    1   0.000 dllmgr
   0.000s   2.00       71    1   0.000 pkgfs
   0.000s   3.00       56    1   0.000 devc-conaux
   0.000s   3.00       73    1   0.000 devc-pty
   0.000s   6.00       70    1   0.000 pipe
.
.
.
Last process started:    6d19h after boot. Total: 174
 
   

show processes startup Command

The show processes startup command displays process data for processes created at startup. Use the command output to check the following:

Whether the listed processes, including their state, start time, restart status, placement, and mandatory status, are as expected

How long the processes took to start

The order in which the processes started

Whether a process was delayed, indicating a boot failure or boot problems

Whether the processes started within the time constraints set by the system

RP/0/RP0/CPU0:router# show processes startup 
 
   
JID    LAST STARTED            STATE    RE-     PLACE-  MANDA-  NAME(IID) ARGS
                                        START   MENT    TORY
-------------------------------------------------------------------------------
81     07/05/2006 14:46:37.514 Run      1               M       wd-mbi(1)  
57     07/05/2006 14:46:37.514 Run      1               M       dllmgr(1) -r 60 
-u 30
72     07/05/2006 14:46:37.514 Run      1               M       pkgfs(1)  
56     07/05/2006 14:46:37.514 Run      1               M       devc-conaux(1) -
h -d librs232.dll -m libconaux.dll -u libst16550.dll
74     07/05/2006 14:46:37.514 Run      1               M       devc-pty(1) -n 3
2
55     Not configured          None     0               M       clock_chip(1) -r
 -b
71     07/05/2006 14:46:37.514 Run      1               M       pipe(1)  
65     07/05/2006 14:46:37.514 Run      1               M       mqueue(1)  
64     Not configured          None     0               M       cat(1) /etc/motd
73     Not configured          None     0               M       platform_dllmap(
1)  
77     07/05/2006 14:46:37.514 Run      1               M       shmwin_svr(1)  
60     07/05/2006 14:46:37.514 Run      1               M       devf-scrp(1) -e 
0xf0000038 -m /bootflash: -s 0xfc000000,64m -r -t4 -b10
66     Not configured          None     0               M       nname(1)  
69     07/05/2006 14:46:37.514 Run      1               M       pci_bus_mgr(1) -
o
288    07/05/2006 14:47:02.799 Run      1               M       qsm(1)  
68     07/05/2006 14:46:37.514 Run      1               M       obflmgr(1)  
.
.
.
-------------------------------------------------------------------------------
Total pcbs: 198

show processes failover Command

The show processes failover command displays process failover information. The command output displays information on how long it took processes to start after a failover (node reboot). Check if there were any delays.

RP/0/RP0/CPU0:router# show processes failover 
 
   
Thu May  3 11:16:05.562 EST EDT
Band Name           Finished    %Idle      JID   Ready Last Process
----- -------------- -------- -------- -------- ------- ---------------------
40.0 ARB               0.000   0.000%        0   0.000 NONE
90.0 ADMIN             0.000   0.000%        0   0.000 NONE
100.0 INFRA             0.000   0.000%        0   0.000 NONE
121.0 FT_ADMIN          0.056   0.000%      315   0.056 qsm
122.0 FT_INFRA          1.819   0.000%      195   1.763 ifmgr
123.0 FT_IP_ARM         2.097   0.000%      232   0.278 ipv6_arm
124.0 FT_ISIS           2.280   0.000%      248   0.183 isis
125.0 FT_PRE_IP         2.303   0.000%        0   0.023 NONE
126.0 FT_IP             6.345   0.000%      219   4.042 ipv4_local
127.0 FT_LPTS           8.041   0.000%      262   1.696 lpts_pa
128.0 FT_PRE_OSPF       9.889   0.000%      260   1.848  
loopback_caps_partner
129.0 FT_OSPF          17.944  37.410%      291   8.055 ospf
130.0 FT_MPLS          23.602   0.000%      272   5.658 mpls_lsd
131.0 FT_BGP_START     26.366   0.000%      326   2.764 rsvp
132.0 FT_MULTICAST     32.940   0.000%      277   6.574 mrib
133.0 FT_CLI           35.357   0.000%      361   2.417  
tty_session_startup
134.0 FT_FINAL         44.772   0.000%      322   9.415  
rip_policy_reg_agent
150.0 ACTIVE           70.322   0.000%      224  25.550 ntpd
999.0 FINAL            79.011   0.000%      273   8.689 mpls_rid_helper
 
   
Go active  Level Band Name           JID Inst   Avail Process
--------- ------ -------------- -------- ---- ------- -------------------------------
    0.002s  22.00 FT_ADMIN            315    1   0.049 qsm
    0.027s  38.00 FT_ADMIN             52    1   0.000 bcm_process
    0.028s  40.00 FT_ADMIN            155    1   0.012 dsc
    0.028s  85.00 FT_ADMIN            332    1   0.000 shelfmgr
    0.030s  85.00 FT_ADMIN            333    1   0.000 shelfmgr_partner
    0.031s 120.00 FT_ADMIN            184    1   0.000 fab_svr
    0.061s  23.00 FT_INFRA            145    1   0.000 correlatord
    0.064s  23.00 FT_INFRA            352    1   0.000 syslogd
    0.065s  23.00 FT_INFRA             79    1   0.000 syslogd_helper
    0.066s  38.00 FT_INFRA            297    1   0.000 packet
    0.067s  40.00 FT_INFRA            379    2   0.000 chkpt_proxy
    0.070s  40.00 FT_INFRA            380    3   0.000 chkpt_proxy
    0.072s  40.00 FT_INFRA            381    4   0.000 chkpt_proxy
.
.
.
85.006s   0.00      368    1   2.094 udp_snmpd
   85.310s   0.00      373    1  25.730 vrrp
   85.743s   0.00      376    1  22.666 xmlagent
     7d07h   0.00      136    1   0.479 chdlc_ma
 
   
Last process started:    7d07h after switch over. Total: 78
 
   

show processes blocked Command

The show processes blocked command displays details about reply, send, and mutex blocked processes.

Because any process can be temporarily blocked, it is recommended to run the show processes blocked command twice in succession for each node. If a process is displayed as blocked in both iterations, run the command a third time to confirm that the process is truly blocked.

The polling interval should not be too short; it must be long enough to show a sustained blocked state. For example, a fully equipped Cisco CRS-1 8-Slot Line Card Chassis (2 RPs and 8 LCs) requires a minimum of 20 requests for each interval.

The show processes blocked command output always displays processes in the Reply state as blocked.

RP/0/RP0/CPU0:router# show processes blocked 
 
   
Wed May  2 11:44:12.360 EST EDT
  Jid       Pid Tid                 Name State  Blocked-on
65546      8202   1                  ksh Reply    8200  devc-conaux
   52     36889   2              attachd Reply   32791  eth_server
   52     36889   3              attachd Reply   12301  mqueue
   77     36891   6                 qnet Reply   32791  eth_server
   77     36891   7                 qnet Reply   32791  eth_server
   77     36891   8                 qnet Reply   32791  eth_server
   77     36891   9                 qnet Reply   32791  eth_server
   51     36897   2        attach_server Reply   12301  mqueue
  376    139341   1          tftp_server Reply   12301  mqueue
  364    143438   6             sysdb_mc Reply  135244  gsp 
  268    221354   2              lpts_fm Reply  204855  lpts_pa
65725  13291709   1                 exec Reply       1  kernel
65784  23720184   1                 exec Reply  331975  devc-vty
65786  27287802   1                 exec Reply  331975  devc-vty
65788  23589116   1               attach Reply    8200  devc-conaux
65788  23589116   2               attach Reply   12301  mqueue
65790  27316478   1                 exec Reply       1  kernel
65792  27328768   1                 exec Reply  331975  devc-vty
65793  27726081   1                 more Reply   12299  pipe
  350  27418882   2                snmpd Reply  143438  sysdb_mc 
  385  27418886   1            udp_snmpd Reply  221353  udp
65800  27726088   1       show_processes Reply       1  kernel
 
   

For these processes, this output is normal. For example, the line:

65770 27726088    1       show_processes Reply       1  kernel
 
   

is a direct result of executing the show processes blocked command. Each time the command is run, the process ID (PID) changes.
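The run-the-command-twice procedure above boils down to keeping only the processes that appear blocked in both consecutive samples, comparing by JID, process name, and blocked-on target rather than by PID (which changes). A rough Python sketch of that comparison, using abbreviated output in the format shown (the sample data is illustrative):

```python
# Illustrative sketch: keep only processes blocked in two consecutive
# samples of "show processes blocked". The parsing is an approximation.
def parse_blocked(output):
    rows = set()
    for line in output.strip().splitlines()[1:]:   # skip the header row
        f = line.split()
        if len(f) >= 7:
            rows.add((f[0], f[3], f[6]))           # Jid, Name, target name
    return rows

first = """  Jid       Pid Tid             Name State  Blocked-on
  364    143438   6         sysdb_mc Reply  135244  gsp
65800  27726088   1   show_processes Reply       1  kernel"""
second = """  Jid       Pid Tid             Name State  Blocked-on
  364    143440   6         sysdb_mc Reply  135260  gsp"""

sustained = parse_blocked(first) & parse_blocked(second)   # PIDs may differ
```

Transient entries such as the show_processes line drop out of the intersection because they do not appear in the second sample.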

If a vital system process or fundamental application controlling connectivity (for example, routing protocols or Multiprotocol Label Switching Label Distribution Protocol [MPLS LDP]) appears blocked in the Reply, Sent, Mutex, or Condvar state, do the following:

Collect data from the follow job or follow process command. See the "follow Command" section for more information on these commands.


Caution If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)

Use the dumpcore running job-id location node-id command on the affected process. The output of the dumpcore command is written to harddisk:/dumper, unless another location has been configured using the exception choice command.


Caution Some processes are dangerous to restart. It is recommended that you involve your technical representative and follow the advice from Cisco Technical Support. For contact information for Cisco Technical Support, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.

Example:

RP/0/RP0/CPU0:router# show processes blocked 
 
   
  Jid       Pid Tid                 Name State  Blocked-on
65546      8202   1                  ksh Reply    8200  devc-conaux
   51     36890   2              attachd Reply   32791  eth_server
   51     36890   3              attachd Reply   12301  mqueue
   75     36893   5                 qnet Reply   32791  eth_server
   75     36893   6                 qnet Reply   32791  eth_server
   75     36893   7                 qnet Reply   32791  eth_server
   75     36893   8                 qnet Reply   32791  eth_server
   50     36899   2        attach_server Reply   12301  mqueue
  334    172108   1          tftp_server Reply   12301  mqueue
  247    290991   2              lpts_fm Reply  184404  lpts_pa
65750 644260054   1                 exec Reply       1  kernel
65752 655270104   1               config Reply  286888  devc-vty
  367   2642149   5             mpls_ldp Reply 2642153  lspv_server
65772 655229164   1                 exec Reply       1  kernel
65773 656842989   1                 more Reply   12299  pipe
65774 656842990   1       show_processes Reply       1  kernel
 
   

Note To troubleshoot a blocked process, use the procedure in the "Troubleshooting a Process Block" section.


Redundancy and Process Restartability

On systems using Cisco IOS XR software, applications primarily use a combination of two fatal error recovery methods: process restartability and process (application) redundancy.

Process restart is typically used as the first level of failure recovery. If the checkpointed data is not corrupted, the crashed process can recover after it is restarted. If multiple restarts of a mandatory process fail, or if peer processes cannot recover after a crashed process restarts, the standby card becomes active.

For a non-mandatory process, if the maximum number of respawns per minute is reached, the sysmgr stops restarting the process and the application must be restarted manually.

Each process not triggered by configuration is, by default, started as a mandatory process (critical for the router to function). If a mandatory process crashes five times within a five-minute window, an RP switchover is triggered if the standby RP is ready. The show processes all command lists all processes and their states, including the mandatory flag. The mandatory flag can be switched off; use the process mandatory {on | off} {executable-name | job-id} [location node-id] command to switch the flag on or off.
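The five-crashes-in-five-minutes rule above is a sliding-window count. A minimal sketch (the timestamps are hypothetical; the limit and window match the rule stated above):

```python
# Illustrative sketch: trigger an RP switchover when `limit` crashes of a
# mandatory process fall inside the last `window` seconds.
def should_switchover(crash_times, now, limit=5, window=300.0):
    recent = [t for t in crash_times if now - t <= window]
    return len(recent) >= limit

# Five crashes within 240 seconds: inside the five-minute window.
crashes = [10.0, 70.0, 130.0, 190.0, 250.0]
trigger = should_switchover(crashes, now=260.0)
```

The same five crashes spread over a longer period would not trigger a switchover, because older crashes age out of the window.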

Process States

Within the Cisco IOS XR software there are servers that provide services and clients that use those services. A specific process can have a number of threads that provide the same service, and another process can have a number of clients that may require a specific service at any point in time. Access to the servers is not always immediate; if a client requests a service that is busy, it waits for the server to become free. While waiting, the client is blocked, either because it is waiting for a resource such as a mutex or because the server has not yet replied.

In the following example, the show process ospf command is used to check the status of the threads in the ospf process.

RP/0/RP0/CPU0:router# show processes ospf 
 
   
                  Job Id: 250
                     PID: 299228
         Executable path: /disk0/hfr-rout-3.4.0/bin/ospf
              Instance #: 1
              Version ID: 00.00.0000
                 Respawn: ON
           Respawn count: 1
  Max. spawns per minute: 12
            Last started: Wed Nov  8 15:45:59 2006
           Process state: Run
           Package state: Normal
       Started on config: cfg/gl/ipv4-ospf/proc/100/ord_f/default/ord_a/routerid
                    core: TEXT SHAREDMEM MAINMEM 
               Max. core: 0
               Placement: ON
            startup_path: /pkg/startup/ospf.startup
                   Ready: 3.356s
               Available: 7.363s
        Process cpu time: 2.648 user, 0.186 kernel, 2.834 total
JID    TID  Stack pri state        HR:MM:SS:MSEC NAME
272    1      60K  10 Receive       0:00:00:0563 ospf
272    2      60K  10 Receive       0:00:00:0017 ospf
272    3      60K  10 Receive       0:00:00:0035 ospf
272    4      60K  10 Receive       0:00:02:0029 ospf
272    5      60K  10 Receive       0:00:00:0003 ospf
272    6      60K  10 Condvar       0:00:00:0001 ospf
272    7      60K  10 Receive       0:00:00:0000 ospf
-------------------------------------------------------------------------------
 
   

The ospf process is given a Job ID (JID) of 250. This JID never changes on a running router. Within the ospf process there are 7 threads, each with its own thread ID (TID). For each thread, the output lists the stack space, priority, and state. Table 8-1 lists the thread states.

The PID is 299228. This number changes each time the process is restarted. The Respawn count indicates how many times the process has restarted, and the Process state should show Run.

Synchronous Message Passing

The message passing life cycle is as follows:

1. A server creates a message channel.

2. A client connects to a channel of a server (analogous to the POSIX open call).

3. A client sends a message to a server (MsgSend) and waits for a reply and blocks.

4. The server receives (MsgReceive) a message from a client, processes the message and replies to the client.

5. The client unblocks and processes the reply from the server.

This blocking client-server model is synchronous message passing. This means the client sends a message and blocks. The server receives the message, processes it, replies back to the client, and then the client unblocks. The specific details are as follows.

1. Server is waiting in RECEIVE state

2. Client sends a message to the server and becomes BLOCKED

3. Server receives the message and unblocks (if waiting in receive state)

4. Client moves to the REPLY state

5. Server moves to the RUNNING state

6. Server processes the message

7. Server replies to the client

8. Client unblocks

Use the show processes command to display the states the client and servers are in. Table 8-1 lists the thread states.
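The synchronous cycle above can be modeled with ordinary threads and queues. This is an illustrative Python sketch; the real primitives are the kernel's MsgSend/MsgReceive/MsgReply calls, not Python queues.

```python
# Illustrative model of synchronous message passing: the client blocks on
# the reply queue from send until the server has received, processed, and
# replied, mirroring the SEND/RECEIVE/REPLY states above.
import queue
import threading

channel = queue.Queue()   # the server's message channel
reply_q = queue.Queue()   # carries the server's reply back

def server():
    msg = channel.get()          # RECEIVE: block until a client sends
    reply_q.put(msg.upper())     # process the message, then reply

srv = threading.Thread(target=server)
srv.start()                      # server waits in RECEIVE state
channel.put("ping")              # client sends ...
answer = reply_q.get()           # ... and is REPLY-blocked until served
srv.join()                       # client has unblocked; server is done
```

The key property is that the client cannot proceed past the receive on the reply queue, which is why a client of a busy or stuck server shows up in the Reply blocked state.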

Blocked Processes and Process States

Use the show processes blocked command to display the processes that are in blocked state.

Synchronous message passing enables you to track the life cycle of inter-process communication between different threads. At any point in time, a thread can be in a specific state. A blocked state can be a symptom of a problem, but a thread in a blocked state is not by itself a problem; blocked threads are normal. The show processes blocked command is often a good starting point for troubleshooting operating system-type problems. If there is a problem, for example high CPU utilization, use the show processes blocked command to determine whether anything looks abnormal (that is, anything that is not normal for your functioning router). This provides a baseline for comparison when troubleshooting process life cycles.

At any point in time, a thread can be in a particular state. Table 8-1 lists the thread states.

Table 8-1 Thread States 

If the state is:  The thread is:
----------------  ------------------------------------------------------------
DEAD              Dead. The kernel is waiting to release the thread's resources.
RUNNING           Actively running on a CPU.
READY             Not running on a CPU, but ready to run.
STOPPED           Suspended (SIGSTOP signal).
SEND              Waiting for a server to receive a message.
RECEIVE           Waiting for a client to send a message.
REPLY             Waiting for a server to reply to a message.
STACK             Waiting for more stack to be allocated.
WAITPAGE          Waiting for the process manager to resolve a page fault.
SIGSUSPEND        Waiting for a signal.
SIGWAITINFO       Waiting for a signal.
NANOSLEEP         Sleeping for a period of time.
MUTEX             Waiting to acquire a mutex.
CONDVAR           Waiting for a condition variable to be signaled.
JOIN              Waiting for the completion of another thread.
INTR              Waiting for an interrupt.
SEM               Waiting to acquire a semaphore.


To troubleshoot a blocked process, use the procedure in the "Troubleshooting a Process Block" section.

Process Monitoring

Significant events of the sysmgr are stored in /tmp/sysmgr.log. The log is a wrapping buffer and is useful for troubleshooting. Use the show processes aborts location node-id all command or the show sysmgr trace verbose | include PROC_ABORT command to display an overview of abnormally terminated processes.

Because the sysmgr already monitors all processes on the system, it is not strictly necessary to monitor vital processes with external management tools. However, you can use the show fault manager metric process pid location node-id command to check critical processes on a regular basis (for example, twice each day). The command output includes information such as the abort behavior and abort reason for the particular process.

The following example shows OSPF critical process details. Check the number of times the process ended abnormally and the number of abnormal ends within the past time periods.

RP/0/RP0/CPU0:router# show fault manager metric process ospf location 
 
   
=====================================
job id: 269, node name: 0/RP0/CPU0
process name: ospf, instance: 1
--------------------------------
last event type: process start
recent start time: Wed Jul  5 15:17:48 2006
recent normal end time: n/a
recent abnormal end time: n/a
number of times started: 1
number of times ended normally: 0
number of times ended abnormally: 2
most recent 10 process start times:
--------------------------
Wed Jul  5 15:17:48 2006
--------------------------
 
   
most recent 10 process end times and types:
 
   
cumulative process available time: 162 hours 20 minutes 51 seconds 452 milliseco
nds
cumulative process unavailable time: 0 hours 0 minutes 0 seconds 0 milliseconds
process availability:  1.000000000
number of abnormal ends within the past 60 minutes (since reload): 0
number of abnormal ends within the past 24 hours (since reload): 0
number of abnormal ends within the past 30 days (since reload): 2
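The counters in this output lend themselves to scripted checks. The following Python sketch (a hypothetical off-box helper, not part of Cisco IOS XR) extracts the abnormal-end counters from output like the example above:

```python
import re

def abnormal_end_counts(output):
    """Extract the 'number of abnormal ends' counters from
    'show fault manager metric process' output."""
    counts = {}
    pattern = re.compile(
        r"number of abnormal ends within the past ([\w ]+) \(since reload\): (\d+)")
    for period, count in pattern.findall(output):
        counts[period] = int(count)
    return counts

sample = """\
number of abnormal ends within the past 60 minutes (since reload): 0
number of abnormal ends within the past 24 hours (since reload): 0
number of abnormal ends within the past 30 days (since reload): 2
"""
print(abnormal_end_counts(sample))
```

A nonzero count for a recent period is the signal to investigate further with the blocked-process checks described later in this chapter.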
 
   

The vital system processes are: qnet, gsp, qsm, redcon, netio, ifmgr, fgid_aggregator, fgid_server, fgid_allocator, fsdb_server, fsdb_aserver, fabricq_mgr, fia_driver, shelfmgr, and lrd on the RP and fabricq_mgr, ingressq, egressq, pse_driver, fia_driver, cpuctrl, and pla_server on the line card.

It is also important to regularly check if critical or vital processes are in a blocked state. See the "show processes blocked Command" section for information on checking if processes are in the blocked state.

Process Monitoring Commands

Use the following commands to monitor processes:

top command—Displays real-time CPU usage statistics on the system. See the "top Command" section on page 1-10.

show processes pidin command—Displays raw output of all processes, including their state.

show processes blocked command—Displays details about reply, send, and mutex blocked processes. See the "show processes blocked Command" section

You can also use the monitor processes and monitor threads commands to determine the top processes and threads based on CPU usage.


Tip The top processes command displays almost real-time CPU and memory utilization and updates several times per minute. The show processes cpu command displays data collected for all process IDs over the past one-, five-, and 15-minute intervals. Both methods provide valuable information.
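The blocked-process check described above can be scripted off the router. As an illustration only (the field positions are assumed from the sample outputs in this chapter), this Python sketch compares two captures of show processes blocked output and reports threads blocked in both, which are candidates for a persistent block:

```python
def blocked_set(output):
    """Return (pid, tid, name) tuples for blocked threads parsed
    from 'show processes blocked' output."""
    entries = set()
    for line in output.strip().splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == "Jid":
            continue  # skip the header line and short/garbled lines
        entries.add((fields[1], fields[2], fields[3]))  # Pid, Tid, Name
    return entries

sample1 = """\
  Jid       Pid Tid                 Name State  Blocked-on
   54      8202   1                  ksh Reply    8199  devc-ser8250
  246     90234   1           spa_xge_v2 Reply   24615  syslog_dev
"""
sample2 = """\
  Jid       Pid Tid                 Name State  Blocked-on
  246     90234   1           spa_xge_v2 Reply   24615  syslog_dev
"""
# Threads present in both captures are candidates for a persistent block.
persistent = blocked_set(sample1) & blocked_set(sample2)
print(persistent)
```

Threads such as ksh blocked on the console driver are expected; a forwarding or management process that stays in the intersection across several captures warrants the procedure in the "Troubleshooting a Process Block" section.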


Monitoring CPU Usage and Using Syslog Messages

Wdsysmon continuously monitors the system to ensure that no high-priority thread is starved of CPU, and provides a procedure to recover from high-priority CPU usage. When a process is determined to be a CPU hog, it is terminated, and a core dump of the process is captured and stored on the configured device (exception choice) to aid debugging. For information on troubleshooting high CPU usage, see the "Troubleshooting High CPU Utilization and Process Timeouts" section.

When wdsysmon detects a CPU-hog condition a syslog message is generated. Follow the recommended action for the following syslog messages:

Message: %HA-HA_WD-6-CPU_HOG_1 CPU hog: cpu [dec]'s sched count is [dec].

RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_1 : CPU hog: cpu 
1's sched count is 0. 
 
   

Wdsysmon has detected a CPU starvation situation, most likely a high-priority process spinning in a tight loop. The 'sched count' is the number of times the wdsysmon ticker thread has been scheduled since the last time the wdsysmon watcher thread ran.

Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting High CPU Utilization and Process Timeouts" section for information on checking system status.

Message: %HA-HA_WD-6-CPU_HOG_2 CPU hog: cpu [dec]'s ticker last ran [dec].[dec] seconds ago.

RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_2 : CPU hog: cpu 
1's ticker last ran 3.965 seconds ago. 
 
   

Wdsysmon has detected a CPU starvation situation, most likely a high-priority process spinning in a tight loop.

Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting High CPU Utilization and Process Timeouts" section for information on checking system status.

Message: %HA-HA_WD-6-CPU_HOG_3 Rolling average of scheduling times: [dec].[dec].

RP/0/RP0/CPU0:Dec 22 16:16:34.791 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_3 : Rolling average 
of scheduling times: 0.201.  
 
   

Wdsysmon has detected a CPU starvation situation, most likely a high-priority process spinning in a tight loop. A high value for the rolling average indicates that a periodic process is not being scheduled.

Check the system status, including the saved log for evidence of a high priority CPU hog. See the "Troubleshooting High CPU Utilization and Process Timeouts" section for information on checking system status.

Message: %HA-HA_WD-6-CPU_HOG_4 Process [chars] pid [dec] tid [dec] prio [dec] using [dec]% is the top user of CPU

RP/0/RP0/CPU0:Dec 22 16:16:35.813 : wdsysmon[331]: %HA-HA_WD-6-CPU_HOG_4 : Process wd_test 
pid 409794 tid 2 prio 14 using 99% is the top user of CPU.
 
   

This message is displayed after the CPU hog detector trips. It shows the percentage of CPU used by the busiest thread in the top user of CPU. See the "Troubleshooting High CPU Utilization and Process Timeouts" section for information on checking system status.

The show watchdog trace command displays additional information about the potential CPU hog. If there is a persistent CPU hog (a hog that lasts for more than 30 seconds) the node will be reset. There will be a log such as the following just before the reset:

RP/0/RP0/CPU0:Dec 20 10:36:08.990 : wdsysmon[367]: %HA-HA_WD-1-CURRENT_STATE : Persistent 
Hog detected for more than 30 seconds
 
   

If the hog is persistent and the node is reset, contact Cisco Technical Support. For contact information for Cisco Technical Support, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface. Copy the error message exactly as it appears on the console or in the system log and provide the representative with the gathered information.
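Because the wdsysmon messages follow a fixed format, an external log collector can extract the offending process automatically. A minimal sketch, with the regular expression assumed from the CPU_HOG_4 message format shown above:

```python
import re

# Pattern assumed from the %HA-HA_WD-6-CPU_HOG_4 message format.
HOG4 = re.compile(
    r"%HA-HA_WD-6-CPU_HOG_4 : Process (\S+) "
    r"pid (\d+) tid (\d+) prio (\d+) using (\d+)% is the top user of CPU")

def parse_cpu_hog(line):
    """Return (process, pid, cpu_percent) from a CPU_HOG_4 syslog line,
    or None if the line does not match."""
    m = HOG4.search(line)
    if not m:
        return None
    name, pid, tid, prio, pct = m.groups()
    return (name, int(pid), int(pct))

line = ("RP/0/RP0/CPU0:Dec 22 16:16:35.813 : wdsysmon[331]: "
        "%HA-HA_WD-6-CPU_HOG_4 : Process wd_test "
        "pid 409794 tid 2 prio 14 using 99% is the top user of CPU.")
print(parse_cpu_hog(line))
```

A collector built this way can alert on the process name and CPU percentage before the 30-second persistent-hog reset described above occurs.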


Note For more information on wdsysmon and memory thresholds, see the "Watchdog System Monitor" section in Chapter 9 "Troubleshooting Memory."


Troubleshooting High CPU Utilization and Process Timeouts

This section describes how to troubleshoot common problems that occur due to high CPU utilization and that, in some cases, cause process timeouts. It includes the following topics:

General Guidelines for Troubleshooting CPU Utilization Problems

Troubleshooting a Process Block

Troubleshooting a Process Crash on Line Cards

Troubleshooting a Memory Leak

Troubleshooting a Hardware Failure

Troubleshooting SNMP Timeouts

Troubleshooting Communication Among Multiple Processes

General Guidelines for Troubleshooting CPU Utilization Problems

Optimal CPU utilization is vital for the routers to function properly. In general, the following cases can cause high CPU utilization:

Normal conditions—One or more processes might be using a large percentage (or all) of the available CPU for the following reasons:

Routing table convergence calculations (until the routing table converges)

SNMP polling

Any query that requires a large amount of CPU

Communication among multiple processes

Abnormal conditions—A process might be using excessive CPU for the following reasons:

Process (thread) loop

Memory leak

Process blocking, due to a software bug or hardware problem, that leaves other processes waiting for a reply (loop)

There is no single definition of "high CPU utilization." Utilization depends on many factors, including the number of clients served and the current configuration on the router. The following example illustrates one approach to troubleshooting utilization. (Details of the commands are provided in the sections that follow.)

Example:

You run the top processes command. (It shows the top ten processes in terms of CPU usage.)

From the output of the command, you notice that the top two processes use far more CPU than the next eight. This might indicate a problem.

You continue by considering the context of this CPU usage. You notice that the top process is OSPF, so you run commands to show whether there are packet drops occurring on the connections that use OSPF. If there are OSPF packet drops, there might be a problem with OSPF that needs attention.

You continue by troubleshooting OSPF. After correcting the OSPF problem, you can rerun the top processes command to verify that the CPU usage by the OSPF has been reduced.

Using show process and top processes Commands

To troubleshoot high CPU utilization due to one of the above reasons, use the following commands:

show processes cpu | exclude 0% 0% 0%—Displays all processes currently using the CPU; the exclude filter suppresses idle (0%) entries, so only nonzero percentages remain. Run this command multiple times.

top processes—Displays the processes with the most CPU usage.

The top processes command displays almost real-time CPU and memory utilization and updates several times per minute. The show processes cpu command displays data collected for all process IDs over the past one-, five-, and 15-minute intervals. Both methods provide valuable information.

show processes blocked location location-id (Run this command multiple times)

show process process_name location location-id

follow process process-id location location-id


Caution If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)

The following example shows the processes using the CPU.

RP/0/RP0/CPU0:router# show processes cpu | exclude 0%      0%       0% 
 
   
CPU utilization for one minute: 100%; five minutes: 100%; fifteen minutes: 100%
 
   
PID    1Min    5Min    15Min Process
24615   98%     97%      97% syslog_dev  <--!!!
65647    1%      1%       1% bfd_agent
RP/0/RP0/CPU0:CIPC2-VAN#
RP/0/RP0/CPU0:CIPC2-VAN#show process block loc 0/0/cpu0
  Jid       Pid Tid                 Name State  Blocked-on
   54      8202   1                  ksh Reply    8199  devc-ser8250
   51     20502   2              attachd Reply   20500  eth_server
   51     20502   3              attachd Reply    8204  mqueue
   72     20503   6                 qnet Reply   20500  eth_server
   72     20503   7                 qnet Reply   20500  eth_server
   72     20503   8                 qnet Reply   20500  eth_server
   72     20503   9                 qnet Reply   20500  eth_server
   52     20507   1              ksh-aux Reply    8199  devc-ser8250
   50     20508   2        attach_server Reply    8204  mqueue
  216     24610   1      reddrv_listener Reply   20500  eth_server
  246     90234   1           spa_xge_v2 Reply   24615  syslog_dev  <--!!!
  246     90234   5           spa_xge_v2 Mutex   90234-01 #1 
 
   
RP/0/RP0/CPU0:CIPC2-VAN#show process block loc 0/0/cpu0
  Jid       Pid Tid                 Name State  Blocked-on
   54      8202   1                  ksh Reply    8199  devc-ser8250
   51     20502   2              attachd Reply   20500  eth_server
   51     20502   3              attachd Reply    8204  mqueue
   72     20503   6                 qnet Reply   20500  eth_server
   72     20503   7                 qnet Reply   20500  eth_server
   72     20503   8                 qnet Reply   20500  eth_server
   72     20503   9                 qnet Reply   20500  eth_server
   52     20507   1              ksh-aux Reply    8199  devc-ser8250
   50     20508   2        attach_server Reply    8204  mqueue
  216     24610   1      reddrv_listener Reply   20500  eth_server
  246     90234   1           spa_xge_v2 Reply   24615  syslog_dev  <--still blocking!!!
  246     90234   5           spa_xge_v2 Mutex   90234-01 #1 
RP/0/RP0/CPU0:CIPC2-VAN#
 
   
RP/0/RP0/CPU0:CIPC2-VAN#show process syslog_dev loc 0/0/cpu0                   
Tue Sep 11 17:22:51.182 UTC 
                  Job Id: 262
                     PID: 24615
         Executable path: /bootflash/hfr-base-3.4.1/bin/syslog_dev
              Instance #: 1
              Version ID: 00.00.0000
                 Respawn: ON
           Respawn count: 1
  Max. spawns per minute: 12
            Last started: Fri Jun 22 14:15:01 2007
           Process state: Run
           Package state: Normal
                    core: TEXT SHAREDMEM MAINMEM 
               Max. core: 0
                   Level: 40.90
           MaintModeProc: ON
            startup_path: /pkg/startup/syslog_dev.startup
                   Ready: 0.999s
        Process cpu time: 1283052.366 user, 0.291 kernel, 1283052.657 total
JID    TID  Stack pri state        HR:MM:SS:MSEC NAME
262    1      12K  10 Ready      1549:26:59:0925 syslog_dev   <--- note the CPU time spent by this process!!!
 
   

The following example shows the processes with the most CPU usage.

RP/0/RP0/CPU0:router# top processes 
 
   
Computing times...
247 processes; 930 threads; 4804 channels, 6683 fds
CPU states: 98.5% idle, 0.6% user, 0.8% kernel
Memory: 4096M total, 3095M avail, page size 4K
 
   
      JID TIDS Chans   FDs Tmrs   MEM   HH:MM:SS   CPU  NAME
        1   33  250   197    1      0  437:50:44  0.82% procnto-600-smp-cisco-i
      333    9   32    21   16     1M    0:11:28  0.26% sysdb_svr_admin
      180   21  132    40   11     6M    0:36:37  0.16% gsp
      332    7  161    19   11     1M    0:10:33  0.12% sysdb_mc
      376    7   31    62   13     6M    0:06:49  0.04% mpls_ldp
      159    1    5    14    2   756K    0:00:48  0.04% envmon_mon
      344    5    6    47    2     1M    0:01:39  0.02% top_procs
      341   35   26    62    6   728K    0:01:02  0.02% tcp
      276    3    9    15    2   548K    0:00:06  0.00% oir_daemon
       62    1    6     9    1   204K    0:00:07  0.00% i2c_server
 
   

Troubleshooting a Process Block

To troubleshoot a blocked process, perform the following procedure.

SUMMARY STEPS

1. show processes blocked location node-id

2. follow job job-id location node-id

3. process restart job-id location node-id

DETAILED STEPS

 
Command or Action
Purpose

Step 1 

show processes blocked location node-id
Example:

RP/0/RP0/CPU0:router# show processes blocked location 0/0/cpu0

Use the show processes blocked command several times (for example, three times at 5-second intervals) and compare the output to determine whether any processes are blocked for a long period of time. The process can be blocked continuously or for a few seconds. A process is blocked while it is waiting for a response from another process.

The Name column shows the name of the blocked process.

The Blocked-on column shows the process ID and name of the process that it is blocked on.

If the State column shows Mutex, a thread in the process is waiting for another thread. In this case, the Blocked-on column shows the process ID and thread ID of the thread holding the mutex instead of a process ID and name.

A blocked process can itself be blocked by another process. If a process is blocked by another process, trace the chain of blocking to find the root blocking process. Proceed to Step 2 to track the chain of blocking.

Step 2 

follow job job-id location node-id 
Example:

RP/0/RP0/CPU0:router# follow job 24615 location 0/0/cpu0

Tracks down the root process that is blocking other processes. The follow command samples the process periodically and shows which part of the code the process is executing.


Caution If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)

Step 3 

process restart job-id location node-id
Example:

RP/0/RP0/CPU0:router# process restart 234 location 0/1/cpu0

Restarts the process. If the problem is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.

Collect the following information for Cisco Technical Support:

show processes blocked location node-id command output

follow job job-id location node-id command output


Caution If your system is running Release 3.8.0, 3.8.1, 3.8.2, or 3.9.0 software, you should not run the follow process and follow job commands, because these can cause a kernel crash at the target node. Therefore, for these software releases, you should use other available commands for troubleshooting and call Cisco Technical Support if the problem is not resolved. (This crash behavior does not occur for releases other than the ones listed.)

show version command output

show dll command output

show configuration command output

show logging command output

content of the file disk0:/wdsysmon_debug/debug_env.number (if it exists)

The following example shows the details for a blocked process for CPU usage:

RP/0/RP0/CPU0:router# show processes cpu location 0/0/cpu0 | exc 0%      0%       0%
 
   
CPU utilization for one minute: 100%; five minutes: 100%; fifteen minutes: 100%
 
PID    1Min    5Min    15Min Process
24615   98%     97%      97% syslog_dev  <--!!!
65647    1%      1%       1% bfd_agent
 
   

The following example shows the details for active processes from a designated node:

RP/0/RP0/CPU0:router# show processes blocked location 0/0/cpu0
 
   
Jid       Pid Tid                 Name State  Blocked-on
   54      8202   1                  ksh Reply    8199  devc-ser8250
   51     20502   2              attachd Reply   20500  eth_server
   51     20502   3              attachd Reply    8204  mqueue
   72     20503   6                 qnet Reply   20500  eth_server
   72     20503   7                 qnet Reply   20500  eth_server
   72     20503   8                 qnet Reply   20500  eth_server
   72     20503   9                 qnet Reply   20500  eth_server
   52     20507   1              ksh-aux Reply    8199  devc-ser8250
   50     20508   2        attach_server Reply    8204  mqueue
  216     24610   1      reddrv_listener Reply   20500  eth_server
  246     90234   1           spa_xge_v2 Reply   24615  syslog_dev  <--!!!
  246     90234   5           spa_xge_v2 Mutex   90234-01 #1 
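Walking the chain of blocking by hand means following each Blocked-on entry until you reach a process that is not itself Reply-blocked; that process is the root blocker. A hypothetical Python sketch of that walk, with field positions assumed from the output above:

```python
def root_blocker(output, start_pid):
    """Follow the Blocked-on chain from start_pid and return the PID of
    the root blocker (a PID that is not itself Reply-blocked)."""
    blocked_on = {}
    for line in output.strip().splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[4] == "Reply":
            blocked_on[fields[1]] = fields[5]  # pid -> pid it waits on
    pid, seen = start_pid, set()
    while pid in blocked_on and pid not in seen:
        seen.add(pid)  # guard against cycles in the blocking chain
        pid = blocked_on[pid]
    return pid

sample = """\
  Jid       Pid Tid                 Name State  Blocked-on
  246     90234   1           spa_xge_v2 Reply   24615  syslog_dev
"""
# spa_xge_v2 (pid 90234) is blocked on syslog_dev (pid 24615), which is
# not itself blocked, so 24615 is the root blocker.
print(root_blocker(sample, "90234"))
```

In the example output above, this walk points at syslog_dev (pid 24615), which is why that PID is the target of the follow command in the next example.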
 
   

The following example gathers information about the dump core on the blocked process:

RP/0/RP0/CPU0:router# follow process 24615 location 0/0/cpu0
 
   
Tue Sep 11 17:21:26.205 UTC 
 
   
Attaching to process pid = 24615 (pkg/bin/syslog_dev)
No tid specified, following all threads
 
   
DLL Loaded by this process 
-------------------------------
 
   
DLL path                 Text addr. Text size  Data addr. Data size  Version
/pkg/lib/libsysmgr.dll   0xfc124000 0x00010f9c 0xfc087a28 0x000005cc        0
/pkg/lib/libcerrno.dll   0xfc135000 0x00002f9c 0xfc1126ac 0x00000128        0
/pkg/lib/libcerr_dll_tbl.dll 0xfc138000 0x000049e0 0xfc1127d4 0x00000148        0
/pkg/lib/libltrace.dll   0xfc13d000 0x00008a60 0xfc11291c 0x000002c8        0
/pkg/lib/libinfra.dll    0xfc146000 0x00034e60 0xfc17b000 0x00002000        0
/pkg/lib/cerrno/libinfra_error.dll 0xfc1141dc 0x00000cd8 0xfc112be4 0x000000a8        0
/pkg/lib/libios.dll      0xfc17d000 0x0002cc34 0xfc1aa000 0x00002000        0
/pkg/lib/cerrno/libevent_manager_error.dll 0xfc1ac000 0x00000e88 0xfc112c8c 0x00000088        
0
/pkg/lib/libc.dll        0xfc1ad000 0x0007b118 0xfc229000 0x00002000        0
/pkg/lib/libplatform.dll 0xfc23f000 0x0000c738 0xfc24c000 0x00002000        0
/pkg/lib/libnodeid.dll   0xfc24e000 0x0000a730 0xfc23a3f8 0x00000248        0
/pkg/lib/libdebug.dll    0xfc25c000 0x00010038 0xfc23a7cc 0x00000550        0
/pkg/lib/cerrno/libdebug_error.dll 0xfc26c038 0x00000db0 0xfc23ad1c 0x000000e8        0
/pkg/lib/lib_procfs_util.dll 0xfc26d000 0x00004fb8 0xfc272000 0x000002a8        0
/pkg/lib/libsyslog.dll   0xfc28f000 0x0000564c 0xfc2724c0 0x00000328        0
/pkg/lib/libbackplane.dll 0xfc295000 0x000013f0 0xfc2727e8 0x000000a8        0
/pkg/lib/cerrno/libsysmgr_error.dll 0xfc4c9000 0x00000f94 0xfc2fba04 0x00000088        0
/pkg/lib/libsysdb.dll    0xfc4d9000 0x0004a000 0xfc523000 0x00001000        0
/pkg/lib/cerrno/libsysdb_error_v1v2.dll 0xfc524000 0x00002000 0xfc526000 0x00001000        
0
/pkg/lib/cerrno/libsysdb_error_v2only.dll 0xfc527000 0x00003000 0xfc52a000 0x00001000        
0
/pkg/lib/cerrno/libsysdb_error_callback.dll 0xfc52b000 0x00002000 0xfc52d000 0x00001000        
0
/pkg/lib/cerrno/libsysdb_error_distrib.dll 0xfc52e000 0x00002000 0xfc530000 0x00001000        
0
/pkg/lib/libsysdbutils.dll 0xfc531000 0x0000d000 0xfc53e000 0x00001000        0
 
   
Iteration 1 of 5
------------------------------
 
   
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1 
 
   
trace_back: #0 0xfc1f6044 [strlen]
trace_back: #1 0x482002c8 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
 
   
ENDOFSTACKTRACE
 
   
 
   
Iteration 2 of 5
------------------------------
 
   
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1 
 
   
trace_back: #0 0xfc1f6130 [strncat]
trace_back: #1 0x482002a4 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
 
   
ENDOFSTACKTRACE
 
   
 
   
Iteration 3 of 5
------------------------------
 
   
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1 
 
   
trace_back: #0 0xfc1f6044 [strlen]
trace_back: #1 0x482002c8 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
 
   
ENDOFSTACKTRACE
 
   
 
   
Iteration 4 of 5
------------------------------
 
   
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1 
 
   
trace_back: #0 0xfc1f6044 [strlen]
trace_back: #1 0x482002c8 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
 
   
ENDOFSTACKTRACE
 
   
 
   
Iteration 5 of 5
------------------------------
 
   
Current process = "pkg/bin/syslog_dev", PID = 24615 TID = 1 
 
   
trace_back: #0 0xfc1f6130 [strncat]
trace_back: #1 0x482002a4 [<N/A>]
trace_back: #2 0x48200504 [<N/A>]
trace_back: #3 0xfc1e4408 [_resmgr_io_handler]
trace_back: #4 0xfc1e40b0 [_resmgr_handler]
trace_back: #5 0xfc15aa28 [_eventmgr_resmgr_handler]
trace_back: #6 0xfc159e04 [_event_message_handler]
trace_back: #7 0xfc159d54 [_event_message_handler]
trace_back: #8 0xfc1575a4 [event_dispatch]
trace_back: #9 0x482007bc [<N/A>]
 
   
ENDOFSTACKTRACE
 
   

Troubleshooting a Process Crash on Line Cards

To troubleshoot a process crash on the line card, perform the following steps.

SUMMARY STEPS

1. Identify the process that crashed (platform independent [PI] or platform dependent [PD]) from the crash log. In either case, the stack traces obtained from the crash need to be decoded to identify the location in the code where the process crashed.

2. show install active

3. show version

4. show log

5. show exception

DETAILED STEPS

 
Command or Action
Purpose

Step 1 

show install active location node-id
Example:

RP/0/RP0/CPU0:router# show install active location 0/0/CPU0

Node 0/0/CPU0 [RP] [SDR: Owner]

Boot Device: mem:

Boot Image: /c12k-os-mbi-3.7.0.26I/mbiprp-rp.vm

Active Packages:

mem:c12k-mpls-3.7.0.26I

mem:c12k-mini-3.7.0.26I

Use the show install active command to collect information about the list of installed software for each node.

Step 2 

show version
Example:

RP/0/RP0/CPU0:router# show version | begin 0/0/CPU0

Gives the workspace (directory) and the build server where the image was built.

Step 3 

show log
Example:

RP/0/RP0/CPU0:router# show log

Provides background on what was happening at the time of the crash. The syslog messages from the dumper process at the time of the crash include a list of the dynamic libraries (DLLs) that were loaded by the process and the addresses where they were mapped; this information is required to decode the program counters in the stack trace that fall within a DLL. The log also shows the location where the core dump was saved.

The core dump is the .Z file mentioned in the log.

Step 4 

show exception

If you do not have the log, use the show exception command to find out where the core dump is saved.

If the problem is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.

Troubleshooting a Memory Leak

To troubleshoot a memory leak, use the show processes files [job-id] [detail] command to display detailed information about open files and open communication channels. The job-id argument restricts the display to the associated process instance.

The following example shows output from the show processes command with the files and detail keywords:

RP/0/RP0/CPU0:router# show processes files 351 detail
 
   
Sun Jan 21 04:35:18.451 EDT 
Jid: 351      Total open files: 352      Name: tacacsd             
-------------------------------------------------------------
 
   
File Descriptor   Process Name        
---------------   ------------        
0                 pid: 1                               
1                 pid: 1                               
2                 syslog_dev                           
3                 dllmgr                               
4                 pid: 1                               
5                 pid: 1                               
6                 sysdb_svr_local                      
7                 sysdb_svr_local                      
8                 sysmgr                               
9                 sysdb_svr_local                      
10                sysdb_svr_local                      
11                sysdb_mc                             
12                sysdb_svr_local                      
13                sysdb_mc                             
14                sysdb_svr_local                      
15                sysdb_mc                             
16                sysdb_mc                             
17                sysdb_mc                             
18                pid: 1                               
19                tcp                                  
20                pid: 1                               
21                tcp                                  
22                tcp                                  
23                tcp                                  
24                tcp                                  
25                tcp                                  
26                tcp                                  
27                tcp                                  
28                tcp                                  
29                tcp                                  
30                tcp                                  
31                tcp                                  
32                tcp                                  
33                tcp                                  
34                tcp                                  
35                tcp                                  
36                tcp                                  
37                tcp                                  
38                tcp                                  
39                tcp                                  
40                tcp                                  
41                tcp 
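A file-descriptor leak typically shows up as a Total open files count that grows steadily between captures of this output. A sketch of that comparison, with the output format assumed from the example above:

```python
import re

def total_open_files(output):
    """Extract the 'Total open files' count from
    'show processes files <jid> detail' output."""
    m = re.search(r"Total open files:\s+(\d+)", output)
    return int(m.group(1)) if m else None

# Two captures of the header line taken some time apart (illustrative values).
sample_before = "Jid: 351      Total open files: 352      Name: tacacsd"
sample_after = "Jid: 351      Total open files: 410      Name: tacacsd"

# A count that keeps growing between samples suggests a descriptor leak.
growth = total_open_files(sample_after) - total_open_files(sample_before)
print(growth)
```

A count that rises and later falls back is normal churn; sustained growth across many samples is the pattern to report to Cisco Technical Support.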
 
   

Troubleshooting a Hardware Failure

Hardware failure can have a major impact on normal CPU operation. If a problem is detected, messages appear in the syslog, or you can obtain a node name from the output of the show processes command with the blocked keyword.

Troubleshooting SNMP Timeouts

This section explains how to troubleshoot a typical SNMP timeout scenario.

The service provider typically initiates an SNMP query by means of an SNMP server in the network operations center. When you set up an SNMP query on an SNMP server, you set the parameters of the query, including a timer. If the timer expires before the server receives the query results, this means the query has timed out. If the requested SNMP query involves a large amount of data from the Cisco CRS-1 (and this is generally true for SNMP queries), the Cisco CRS-1 might experience very high CPU utilization as it searches for the data. The Cisco CRS-1 might not be able to complete the query and data transfer before the timer on the SNMP server expires.

To correct a problem with SNMP timeouts, set the timer on the SNMP server to a higher value.

Process timeouts can also occur if communication among multiple processes causes high CPU utilization. For information about this scenario, see the "Troubleshooting Communication Among Multiple Processes" section.

Troubleshooting Communication Among Multiple Processes

If communication among multiple processes is causing high CPU utilization, you must stop the request process (for example, Simple Network Management Protocol [SNMP], Internet Control Message Protocol [ICMP], or TCP).

To check the communication blocks among multiple processes, use the show processes command. Use the blocked keyword (multiple times) to display details about the blocked process. Use the cpu keyword (multiple times) to display the CPU usage for each process.

The following example shows the details about the blocked processes from the show processes command with the blocked keyword:

RP/0/RP0/CPU0:router# show processes blocked
 
   
Jid       Pid Tid                 Name State  Blocked-on
65546      8202   1                  ksh Reply    8200  devc-conaux
   51     36889   2              attachd Reply   32790  eth_server
   51     36889   3              attachd Reply   12301  mqueue
   74     36892   5                 qnet Reply   32790  eth_server
   74     36892   6                 qnet Reply   32790  eth_server
   74     36892   7                 qnet Reply   32790  eth_server
   74     36892   8                 qnet Reply   32790  eth_server
   50     36898   2        attach_server Reply   12301  mqueue
  361    118859   1          tftp_server Reply   12301  mqueue
  259    139360   2               locald Reply  233685  tacacsd
  261    155823   2              lpts_fm Reply  139336  lpts_pa
65717  51572917   1                 exec Reply       1  kernel
  370    233689   1            udp_snmpd Reply  155811  udp
65759  51589343   1                 more Reply   12299  pipe
65774  51589358   1       show_processes Reply       1  kernel
 
   

The following example shows the CPU usage for each process from the show processes command with the cpu keyword:

RP/0/RP0/CPU0:router# show processes cpu | exclude  0%      0%       0%
 
   
PID    1Min    5Min    15Min Process
1        3%      1%       1% kernel
32790    4%      4%       3% eth_server
36892    1%      0%       0% qnet
114740   9%      8%       8% sysdb_svr_shared  <--!!!
118855   1%      0%       0% netio
118857  11%     10%      10% gsp               <--!!!
118862   7%      6%       6% sysdb_mc
155799   1%      1%       1% ipv4_rib
155802   2%      2%       2% ipv4_arm
233707   1%      1%       1% bgp
 
   

Troubleshooting a Process Restart

A process restart does not typically cause a major problem for the network, nor does it typically cause a loss of traffic. However, it can be helpful to troubleshoot the event, to find out why the process crashed, and whether you need to consider taking any action to locate or correct a problem.

To troubleshoot a process restart, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.

Collect the following information before contacting Cisco Technical Support.

show context all location all command output

show version command output

show dll command output

show log command output

Collect core dumps. See the "show context Command" section on page 1-11 for information on how to collect core dumps.