Cisco IOS XR Troubleshooting Guide, Release 3.5
Troubleshooting Memory

Table Of Contents

Troubleshooting Memory

Watchdog System Monitor

Memory Monitoring

Configuring and Displaying Memory Thresholds

Examples

Setting Timeout for Persistent CPU Hogs

Memory Usage Analyzer

Troubleshooting Global Memory

Troubleshooting Process Memory

Identifying Process Memory Problems

Examples

Resolving Process Memory Problems


Troubleshooting Memory


Troubleshooting memory requires determining if there is a memory problem, what type of memory problem it is, and how to resolve the problem.

This chapter contains the following sections:

Watchdog System Monitor

Troubleshooting Global Memory

Troubleshooting Process Memory

Watchdog System Monitor

Watchdog system monitor (wdsysmon) is part of the high-availability (HA) infrastructure of the Cisco IOS XR software. Wdsysmon runs on the distributed route processor (DRP) and line cards (LCs) with the primary goal of monitoring the system for problem conditions and attempting to recover from them. Wdsysmon monitors the processes on each node for memory and CPU usage, deadlocks, and event monitor conditions, as well as disk usage. If thresholds are crossed, an appropriate syslog message is generated. The information is collected in disk0:/wdsysmon_debug. Recovery actions are taken when a memory or CPU hog or a deadlock condition is detected, whereby the process responsible for the condition is terminated. Wdsysmon also keeps historical data on processes and posts this information to a fault detector dynamic link library (DLL), which can then be queried by manageability applications.

Memory Monitoring

The wdsysmon memory-hog detection algorithm checks the memory state of each node at regular intervals (every 1/10th of a second). It defines four node state thresholds:

Normal

Minor

Severe

Critical


Note Processes can declare a hard memory limit in their startup file with the memory_limit keyword.


The definition of the node state thresholds depends on the size of the physical memory. For instance, on a node with 2 GB of physical memory, the memory state is considered NORMAL as long as the free memory is greater than 48 MB.

If a memory threshold is crossed, wdsysmon immediately checks if a process has exceeded its memory limit. All such processes are stopped after a debug script runs on the process identifier (PID) to collect detailed information on the memory hog. If memory usage is still high after this step, wdsysmon sends notifications to registered clients. Clients can then take preventive and recovery actions.

The memory state can be verified using the show watchdog memory-state location node-id command. The following example shows node 0/RP0/CPU0 in the normal memory state.

RP/0/RP0/CPU0:router# show watchdog memory-state location 0/rp0/cpu0 
 
   
Memory information:
    Physical Memory: 4096     MB
    Free Memory:     3485.671 MB
    Memory State:         Normal
 
   

If the memory state is changing from normal to minor, use the show processes memory [job-id] location node-id command to list top memory users and identify possible memory leaks. After top memory users have been identified, use the memory usage analyzer to discover the processes causing a memory leak. See the "Memory Usage Analyzer" section. Involve your technical representative at this point to collect the appropriate data and take the corresponding actions, such as a process restart.

Wdsysmon has a procedure to recover from memory-depletion conditions. When wdsysmon determines that the state of a node is severe, it attempts to find a process, or set of processes, that has likely leaked memory and led to the depletion condition. The process or set of processes is stopped to recover the memory. Avoid this situation by regularly checking the watchdog memory state.

Configuring and Displaying Memory Thresholds

Memory thresholds can be configured. Threshold values can be applied to all cards, or unique threshold settings can be applied to specific cards. If the local threshold settings are removed, the local settings return to those set globally. In addition, you can view default and configured thresholds.

Table 7-1 provides the recommended memory threshold value calculations if the minor threshold is set to 20 percent, the severe threshold is set to 10 percent, and the critical threshold is set to 5 percent.

Table 7-1 Recommended Memory Threshold Values

All values are in MB. Each threshold is the stated percentage of available memory.

Total Available Memory    Minor Threshold (20 percent)    Severe Threshold (10 percent)    Critical Threshold (5 percent)
128                       25.6                            12.8                             6.4
256                       51.2                            25.6                             12.8
512                       102.4                           51.2                             25.6
1024                      204.8                           102.4                            51.2
2048                      409.6                           204.8                            102.4
4096                      819.2                           409.6                            204.8
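The values in Table 7-1 follow from multiplying the total available memory by each percentage. The following Python sketch reproduces that arithmetic so that values for other memory sizes or percentages can be worked out the same way. The function name threshold_mb and its parameters are chosen here for illustration and are not part of the Cisco IOS XR software.

```python
def threshold_mb(total_mb, minor_pct=20, severe_pct=10, critical_pct=5):
    """Return the (minor, severe, critical) thresholds in MB.

    Each threshold is simply the given percentage of total_mb,
    rounded to one decimal place as in Table 7-1.
    """
    return tuple(
        round(total_mb * pct / 100, 1)
        for pct in (minor_pct, severe_pct, critical_pct)
    )


# Reproduce the rows of Table 7-1.
for total in (128, 256, 512, 1024, 2048, 4096):
    minor, severe, critical = threshold_mb(total)
    print(f"{total:>5}  {minor:>6}  {severe:>6}  {critical:>6}")
```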


To identify, configure, and display memory thresholds, perform the following procedure.

SUMMARY STEPS

1. configure

2. watchdog threshold memory [location node-id] minor percentage-memory-available severe percentage-memory-available critical percentage-memory-available

3. end

or

commit

4. exit

5. show watchdog [memory-state | threshold memory {configured | defaults}] [location node-id]

6. Contact Cisco Technical Support if the problem is not resolved.

DETAILED STEPS

 
Command or Action
Purpose

Step 1 

configure

Example:

RP/0/RP0/CPU0:router# configure

Enters global configuration mode.

Step 2 

watchdog threshold memory [location node-id] minor percentage-memory-available severe percentage-memory-available critical percentage-memory-available

Example:

RP/0/RP0/CPU0:router(config)# watchdog threshold memory location 0/RP0/CPU0 minor 30 severe 20 critical 10

Configures the percentage of memory available for each alarm threshold. For example, if the minor alarm threshold is set to 30 percent, the severe alarm threshold is set to 20 percent, and the critical alarm threshold is set to 10 percent, the node goes into a minor memory alarm when the amount of memory available falls below 30 percent of the total memory on the card. In other words, this alarm occurs when more than 70 percent of the total memory is in use. The severe memory alarm activates when the amount of memory available falls below 20 percent, and the critical memory alarm activates when the amount of memory available falls below 10 percent.

Step 3 

end

or

commit

Example:

RP/0/RP0/CPU0:router(config)# end

or

RP/0/RP0/CPU0:router(config)# commit

Saves configuration changes.

When you issue the end command, the system prompts you to commit changes:

Uncommitted changes found, commit them before exiting(yes/no/cancel)? [cancel]:
 
        

Entering yes saves configuration changes to the running configuration file, exits the configuration session, and returns the router to EXEC mode.

Entering no exits the configuration session and returns the router to EXEC mode without committing the configuration changes.

Entering cancel leaves the router in the current configuration session without exiting or committing the configuration changes.

Use the commit command to save the configuration changes to the running configuration file and remain within the configuration session.

Step 4 

exit

Example:

RP/0/RP0/CPU0:router(config)# exit

Exits global configuration mode and enters EXEC mode.

Step 5 

show watchdog [memory-state | threshold memory {configured | defaults}] [location node-id]

Example:

RP/0/RP0/CPU0:router# show watchdog threshold memory configured location 0/RP0/CPU0

Displays information about the threshold memory in the specified locations, either as configured by the user or for the default settings.

Step 6 

Contact Cisco Technical Support.

If the problem has not been determined and is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation, Obtaining Support, and Security Guidelines" section in the Preface.
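The relationship between the configured percentages and the resulting memory state, as described in Step 2, can be sketched in Python as follows. This is an illustration of the comparison described in the text, not the actual wdsysmon implementation; the function name memory_state and its parameters are assumptions made for this example.

```python
def memory_state(total_mb, free_mb, minor_pct, severe_pct, critical_pct):
    """Classify a node from its free memory and the configured thresholds.

    A threshold is crossed when the memory available falls below the
    given percentage of the total memory on the card.
    """
    free_pct = 100.0 * free_mb / total_mb
    if free_pct < critical_pct:
        return "Critical"
    if free_pct < severe_pct:
        return "Severe"
    if free_pct < minor_pct:
        return "Minor"
    return "Normal"


# Thresholds from the Step 2 example: minor 30, severe 20, critical 10.
print(memory_state(4096, 3485.671, 30, 20, 10))
```

With the Step 2 thresholds, a node with 4096 MB of physical memory and 3485.671 MB free has about 85 percent of its memory available, which is above the minor threshold.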

Examples

The watchdog threshold memory command enables you to set the memory thresholds.

RP/0/RP0/CPU0:router(config)# watchdog threshold memory location 0/RP0/CPU0 minor 30 severe 20 critical 10
 
   

The show watchdog threshold memory defaults command enables you to display the default memory thresholds.

RP/0/RP0/CPU0:router# show watchdog threshold memory defaults location all
 
   
---- node0_RP1_CPU0 ---
 Default memory thresholds:
 Minor:     409      MB
 Severe:    204      MB 
 Critical:  102.399 MB 
Memory information:
    Physical Memory: 2048     MB
    Free Memory:     1236.296 MB
    Memory State:         Normal
---- node0_3_SP ---
 Default memory thresholds:
 Minor:      25      MB
 Severe:     12      MB 
 Critical:    6.399 MB 
Memory information:
    Physical Memory:  128     MB
    Free Memory:       41.187 MB
    Memory State:         Normal
---- node0_0_SP ---
 Default memory thresholds:
 Minor:      25      MB
 Severe:     12      MB 
 Critical:    6.399 MB 
Memory information:
    Physical Memory:  128     MB
    Free Memory:       40.683 MB
    Memory State:         Normal
---- node0_SM0_SP ---
 Default memory thresholds:
 Minor:      25      MB
 Severe:     12      MB 
 Critical:    6.399 MB 
Memory information:
    Physical Memory:  128     MB
    Free Memory:       34.394 MB
    Memory State:         Normal
---- node0_0_CPU0 ---
 Default memory thresholds:
 Minor:     204      MB
 Severe:    102      MB 
 Critical:   51.199 MB 
Memory information:
    Physical Memory: 1024     MB
    Free Memory:      463.304 MB
    Memory State:         Normal
---- node0_RP0_CPU0 ---
 Default memory thresholds:
 Minor:     819      MB
 Severe:    409      MB 
 Critical:  204.799 MB 
Memory information:
    Physical Memory: 4096     MB
    Free Memory:     33
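When this output is collected for many nodes, it can be convenient to turn it into per-node records. The following Python sketch assumes the layout shown in the example above (a dashed node header followed by "Name: value MB" lines). The function name parse_watchdog and the regular expressions are illustrative assumptions, and the sample text copies two complete node blocks from the example.

```python
import re

SAMPLE = """\
---- node0_0_CPU0 ---
 Default memory thresholds:
 Minor:     204      MB
 Severe:    102      MB
 Critical:   51.199 MB
Memory information:
    Physical Memory: 1024     MB
    Free Memory:      463.304 MB
    Memory State:         Normal
---- node0_3_SP ---
 Default memory thresholds:
 Minor:      25      MB
 Severe:     12      MB
 Critical:    6.399 MB
Memory information:
    Physical Memory:  128     MB
    Free Memory:       41.187 MB
    Memory State:         Normal
"""

# "---- node0_0_CPU0 ---" style node header.
NODE_RE = re.compile(r"^-+\s*(\S+)\s*-+\s*$")
# "Name: value [MB]" lines; section titles have no value and do not match.
FIELD_RE = re.compile(r"^\s*([A-Za-z ]+?):\s+([\d.]+|\w+)\s*(?:MB)?\s*$")


def parse_watchdog(output):
    """Return {node_name: {field_name: float or str}}."""
    nodes, current = {}, None
    for line in output.splitlines():
        header = NODE_RE.match(line)
        if header:
            current = nodes.setdefault(header.group(1), {})
            continue
        field = FIELD_RE.match(line)
        if field and current is not None:
            key, value = field.group(1).strip(), field.group(2)
            try:
                current[key] = float(value)   # sizes in MB
            except ValueError:
                current[key] = value          # for example, "Normal"
    return nodes


for node, info in parse_watchdog(SAMPLE).items():
    print(node, info["Memory State"], info["Free Memory"])
```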

Setting Timeout for Persistent CPU Hogs

If wdsysmon detects a CPU hog on the card, it resets the node after 30 seconds. If required, use the watchdog monitor cpu-hog persistent timeout command to change this default timeout value.

Memory Usage Analyzer

The memory usage analyzer tool records brief details about the heap memory usage of all processes on the router at different moments in time and compares the results. This makes it very useful for detecting patterns of memory usage during events such as restarting processes or configuring interfaces. It is also useful for troubleshooting memory leaks.

When the memory usage analyzer tool is instructed to take a snapshot, it saves output similar to the show memory heap summary command output for each process running on the router to a file. When instructed to show a report, the files are read and the differences between the memory values are displayed.

The memory usage analyzer tool uses the following commands in sequence:

1. show memory compare start command—This command takes an initial snapshot of the process memory usage.

2. show memory compare end command—This command takes another snapshot of the process memory usage.

3. show memory compare report command—This command displays the differences between the memory values.

The command output contains information about each process whose heap memory usage has changed over the test period. It is ordered by the size of the change, starting with the process with the largest increase. To detect memory leaks, use the memory usage analyzer on a stable system when no configuration changes are being made.
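The comparison that the report performs can be sketched in Python as follows. This is a simplified illustration, not the tool's implementation: each snapshot is modeled as a dictionary of process name to heap bytes, and the function name compare_snapshots is an assumption. The first four entries use the before and after values from the report example later in this chapter; the entry unchanged_proc is hypothetical and shows that processes with no change are left out.

```python
def compare_snapshots(before, after):
    """Return (name, before, after, difference) rows for changed processes.

    Rows are ordered by the size of the change, largest increase first.
    Processes that appear in only one snapshot are ignored in this sketch.
    """
    rows = []
    for name in before.keys() & after.keys():
        difference = after[name] - before[name]
        if difference != 0:
            rows.append((name, before[name], after[name], difference))
    # Sort by difference (descending), then by name for a stable order.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows


start = {
    "top_procs": 584300,
    "qsm": 334144,
    "sysdb_svr_local": 1445844,
    "i2c_server": 14464,
    "unchanged_proc": 4096,   # hypothetical entry, not from the example
}
end = {
    "top_procs": 587052,
    "qsm": 334920,
    "sysdb_svr_local": 1446004,
    "i2c_server": 14624,
    "unchanged_proc": 4096,   # hypothetical entry, not from the example
}

for row in compare_snapshots(start, end):
    print(row)
```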

Troubleshooting Global Memory

To begin troubleshooting a router in a low memory state, get a high-level view of where the memory is being used.

Use the show memory summary command to display system memory information.

RP/0/RP0/CPU0:router# show memory summary 
 
   
Physical Memory: 4096M total
 Application Memory : 3949M (3540M available)
 Image: 17M (bootram: 17M)
 Reserved: 128M, IOMem: 2028M, flashfsys: 0
 Total shared window: 7M
 
   

The output shows the amount of physical memory installed on the device, the memory available for the system to use (total memory minus image size, reserved, and flashfsys), the image size, the reserved space for packet memory, and the I/O memory used as a backup for packet memory.
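To judge how close a node is to a low memory state, the available application memory can be expressed as a percentage of the total application memory. The following Python sketch extracts both values from the example output above. It assumes the "Application Memory" line has the format shown in the example; the function name application_memory is illustrative.

```python
import re

SUMMARY = """\
Physical Memory: 4096M total
 Application Memory : 3949M (3540M available)
 Image: 17M (bootram: 17M)
 Reserved: 128M, IOMem: 2028M, flashfsys: 0
 Total shared window: 7M
"""


def application_memory(output):
    """Return (total_mb, available_mb, percent_available)."""
    match = re.search(
        r"Application Memory\s*:\s*(\d+)M\s*\((\d+)M available\)", output
    )
    if match is None:
        raise ValueError("Application Memory line not found")
    total, available = int(match.group(1)), int(match.group(2))
    return total, available, 100.0 * available / total


total, available, percent = application_memory(SUMMARY)
print(f"{available}M of {total}M available ({percent:.1f} percent)")
```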

If there is not sufficient memory, install more memory. See the Cisco CRS-1 Carrier Routing System documentation at the following URL:

http://www.cisco.com/en/US/products/ps5763/tsd_products_support_series_home.html

Troubleshooting Process Memory

The Cisco IOS XR Process Placement feature balances application processes between the available route processors (RPs) and distributed route processors (DRPs) on a Cisco CRS-1 system, based on memory usage and other criteria.

Under normal operating conditions, processes are managed automatically by the Cisco IOS XR software. Processes are started, stopped, or restarted as required by the running configuration of the router. In addition, processes are checkpointed to optimize performance during process restart and automatic switchover.

Identifying Process Memory Problems

To identify process memory problems, perform the following procedure.

SUMMARY STEPS

1. show watchdog memory-state location node-id

2. show processes memory [job-id] location node-id

3. show memory job-id

4. show process memory job-id

5. show memory compare start

6. show memory compare end

7. show memory compare report

8. Contact Cisco Technical Support if the problem is not resolved.

DETAILED STEPS

 
Command or Action
Purpose

Step 1 

show watchdog memory-state location node-id

Example:

RP/0/RP0/CPU0:router# show watchdog memory-state location 0/RP0/CPU0

Displays the memory state for the node. If the node is not in the normal state, proceed to Step 2 to list top memory users and identify possible memory leaks.

Step 2 

show processes memory [job-id] location node-id

Example:

RP/0/RP0/CPU0:router# show process memory location 0/RP0/CPU0

Displays information about the text, data, and stack usage for all active processes on a specified node. The output lists top memory users and identifies possible memory leaks. After top memory users have been identified, note the job ID and use the memory usage analyzer to discover the processes causing a memory leak. See Step 5 through Step 7 for how to use the memory usage analyzer.

Step 3 

show memory job-id

Example:

RP/0/RP0/CPU0:router# show memory 123

Displays the available physical memory and memory usage information of a specific process.

Step 4 

show process memory job-id

Example:

RP/0/RP0/CPU0:router# show process memory 123

Displays information about the text, data, and stack usage for a specific process.

Step 5 

show memory compare start

Example:

RP/0/RP0/CPU0:router# show memory compare start

Takes the initial snapshot of heap memory usage for all processes on the router and sends the report to a temporary file named /tmp/memcmp_start.out.

Step 6 

show memory compare end

Example:

RP/0/RP0/CPU0:router# show memory compare end

Takes the second snapshot of heap memory usage for all processes on the router and sends the report to a temporary file named /tmp/memcmp_end.out. This snapshot is compared with the initial snapshot when displaying the heap memory usage comparison report.

Step 7 

show memory compare report

Example:

RP/0/RP0/CPU0:router# show memory compare report

Displays the heap memory comparison report, comparing heap memory usage between the two snapshots of heap memory usage.

Step 8 

Contact Cisco Technical Support.

If the problem has not been determined and is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation, Obtaining Support, and Security Guidelines" section in the Preface.

Examples

The show watchdog memory-state location node-id command allows you to determine the memory state of a specified node.

RP/0/RP0/CPU0:router# show watchdog memory-state location 0/rp0/cpu0 
 
   
Memory information:
    Physical Memory: 4096     MB
    Free Memory:     3485.671 MB
    Memory State:         Normal
 
   

The show process memory [job-id] location node-id command allows you to determine the processes with the highest dynamic memory usage. The output of the command is sorted by the Dynamic memory usage.

RP/0/RP0/CPU0:router# show processes memory location 0/rp0/cpu0 
JID    Text     Data     Stack    Dynamic  Process
59     65536    32768    57344    38064128 eth_server
164    147456   4096     24576    13217792 fgid_server
289    90112    4096     94208    8437760  parser_server
65554  40960    0        32768    7430144  devb-ata
181    110592   4096     151552   3350528  gsp
57     28672    0        28672    3284992  dllmgr
335    4096     4096     36864    3194880  sysdb_svr_local
280    53248    4096     20480    2682880  nvgen_server
319    16384    4096     12288    2482176  schema_server
329    81920    4096     40960    2412544  statsd_manager
67     28672    12288    24576    2306048  nvram
360    552960   4096     69632    2232320  wdsysmon
216    98304    4096     65536    1908736  ipv4_rib
336    4096     4096     77824    1806336  sysdb_svr_shared
193    217088   4096     86016    1613824  ifmgr
273    45056    4096     122880   1544192  netio
234    98304    4096     53248    1486848  ipv6_rib
190    36864    4096     36864    1429504  hd_drv
333    53248    4096     65536    1327104  sysdb_mc
175    45056    4096     49152    1277952  fabricq_mgr
379    4096     0        94208    1212416  xmlagent
334    4096     4096     73728    1200128  sysdb_svr_admin
204    40960    4096     24576    262144   ipv4_acl_dispatch
162    12288    4096     12288    262144   ether_caps_partner
375    4096     0        12288    245760   ipsec_simp
196    12288    4096     12288    237568   imaedm_server
123    4096     4096     16384    237568   bgp_policy_reg_agent
222    32768    4096     20480    233472   ipv6_acl_daemon
315    16384    4096     61440    229376   rt_check_mgr
.
.
.
 
   

The show memory job-id command displays the memory available and memory usage information for the process associated with the specified job ID. The command output allows you to see exactly what memory is allocated by the process.

RP/0/RP0/CPU0:router# show memory 123 
 
   
Physical Memory: 4096M total
 Application Memory : 3949M (3540M available)
 Image: 17M (bootram: 17M)
 Reserved: 128M, IOMem: 2028M, flashfsys: 0
 Shared window ipv4_fib: 1M
 Shared window infra_ital: 323K
 Shared window ifc-mpls: 961K
 Shared window ifc-ipv6: 1M
 Shared window ifc-ipv4: 1M
 Shared window ifc-protomax: 641K
 Shared window aib: 203K
 Shared window infra_statsd: 3K
 Shared window PFI_IFH: 155K
 Shared window squid: 2M
 Shared window atc_cache: 35K
 Total shared window: 7M
 
   
pkg/bin/bgp_policy_reg_agent: jid 123
Address         Bytes           What
4817f000        4096            Program Stack (pages not allocated)
48180000        507904          Program Stack (pages not allocated)
481fc000        16384           Program Stack
48200000        16384           Shared Memory
48204000        4096            Program Text or Data
48205000        4096            Program Text or Data
48206000        16384           Allocated Memory
4820a000        16384           Allocated Memory
4820e000        16384           Allocated Memory
48212000        16384           Allocated Memory
48216000        16384           Allocated Memory
4821a000        16384           Allocated Memory
4821e000        16384           Allocated Memory
48222000        16384           Allocated Memory
60100000        8192            Shared Memory
60102000        36864           Shared Memory
6010b000        102400          Shared Memory
60124000        8192            Shared Memory
.
.
.
fd214000        106496          DLL Text liboradock.dll
fd22e000        4096            DLL Data liboradock.dll
fd241000        49152           DLL Text librasf.dll
Total Allocated Memory: 131072
Total Shared Memory: 978944
 
   

The output shows the starting address in memory and the size of memory allocated. For example, the starting address for the first entry is 4817f000 and the size of the memory allocated is 4096 bytes.
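To see how a process's memory is divided among region types, the Bytes column can be summed for each value in the What column. The following Python sketch does this for the rows visible in the example above. The function name bytes_by_region is illustrative, and because the example output is abbreviated, only the Allocated Memory total can be compared with the totals printed by the router.

```python
from collections import defaultdict

ROWS = """\
Address         Bytes           What
4817f000        4096            Program Stack (pages not allocated)
48180000        507904          Program Stack (pages not allocated)
481fc000        16384           Program Stack
48200000        16384           Shared Memory
48204000        4096            Program Text or Data
48205000        4096            Program Text or Data
48206000        16384           Allocated Memory
4820a000        16384           Allocated Memory
4820e000        16384           Allocated Memory
48212000        16384           Allocated Memory
48216000        16384           Allocated Memory
4821a000        16384           Allocated Memory
4821e000        16384           Allocated Memory
48222000        16384           Allocated Memory
"""


def bytes_by_region(output):
    """Sum the Bytes column for each distinct value of the What column."""
    totals = defaultdict(int)
    for line in output.splitlines():
        parts = line.split(None, 2)   # address, size, description
        if len(parts) < 3:
            continue
        address, size, what = parts
        try:
            int(address, 16)          # skip the header and other text lines
            nbytes = int(size)
        except ValueError:
            continue
        totals[what.strip()] += nbytes
    return dict(totals)


for what, nbytes in bytes_by_region(ROWS).items():
    print(f"{nbytes:>8}  {what}")
```

For these rows, the Allocated Memory sum is 131072 bytes, which matches the Total Allocated Memory line in the example.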

The shared memory window is where processes share common memory space. Shared memory is faster than protected memory, but one process can write over the data of another process, causing memory corruption.

The show processes memory job-id command displays information about the text, data, and stack usage for the specified job ID.


Note A process has its own private memory space. A process cannot access the memory of another process.


RP/0/RP0/CPU0:router# show processes memory 123 
 
   
JID    Text     Data     Stack    Dynamic  Process
123    4096     4096     16384    241664   bgp_policy_reg_agent
 
   

The output shows the size of the text region (process executable), the size of the data region (initialized and uninitialized variables), the size of the process stack, and the size of the dynamically allocated memory.
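When the show processes memory output is saved to a file, the top users of dynamic memory can be picked out with a short script. The following Python sketch assumes the six-column layout shown in the examples above; the function names parse_process_memory and top_dynamic are illustrative, and the sample rows are copied from the examples.

```python
OUTPUT = """\
JID    Text     Data     Stack    Dynamic  Process
59     65536    32768    57344    38064128 eth_server
164    147456   4096     24576    13217792 fgid_server
289    90112    4096     94208    8437760  parser_server
123    4096     4096     16384    241664   bgp_policy_reg_agent
"""


def parse_process_memory(output):
    """Return one dictionary per process row; the header line is skipped."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        jid, text_b, data_b, stack_b, dynamic_b = map(int, fields[:5])
        rows.append({
            "jid": jid,
            "text": text_b,
            "data": data_b,
            "stack": stack_b,
            "dynamic": dynamic_b,
            "process": fields[5],
        })
    return rows


def top_dynamic(rows, count=3):
    """Return the rows with the largest dynamic memory usage."""
    return sorted(rows, key=lambda row: row["dynamic"], reverse=True)[:count]


for row in top_dynamic(parse_process_memory(OUTPUT)):
    print(row["jid"], row["process"], row["dynamic"])
```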

The show memory compare command displays details about heap memory usage for all processes on the router at different moments in time, comparing the results. This command is useful for detecting patterns of memory usage during events such as restarting processes, configuring interfaces, or looking for memory leaks.

RP/0/RP0/CPU0:router# show memory compare start 
 
   
Successfully stored memory snapshot /harddisk:/malloc_dump/memcmp_start.out 
 
   
RP/0/RP0/CPU0:router# show memory compare end 
 
   
Successfully stored memory snapshot /harddisk:/malloc_dump/memcmp_end.out 
 
   
RP/0/RP0/CPU0:router# show memory compare report 
 
   
JID   name                 mem before   mem after    difference mallocs restart
---   ----                 ----------   ---------    ---------- ------- -------
346   top_procs            584300       587052       2752       0           
303   qsm                  334144       334920       776        8           
335   sysdb_svr_local      1445844      1446004      160        4           
61    i2c_server           14464        14624        160        1           
 
   
 
   
You are now free to remove snapshot memcmp_start.out and memcmp_end.out under /p
 
   

The show memory compare start command takes the initial snapshot of heap memory usage for all processes on the router and sends the report to a temporary file. The show memory compare end command takes the second snapshot and sends the report to a second temporary file. The show memory compare report command compares the second snapshot with the initial snapshot and displays the heap memory comparison report, which reflects the amount of memory allocated and deallocated between the two snapshots. The report contains information about each process whose heap memory usage changed between the first and second snapshots. The process with the largest memory difference is listed first. The memory usage analyzer should be used on a stable system when no configuration changes are in progress.

Resolving Process Memory Problems

The following conditions can be the cause of process memory problems:

Memory leak: Occurs when a process requests or allocates memory and then fails to free (deallocate) the memory when finished with that task. As a result, the memory block remains reserved until the process is restarted or the router is reloaded. Over time, more and more memory blocks are allocated by that process until there is no free memory available.

To detect a memory leak, first use the show memory compare start command once to take the initial snapshot, which creates the process comparison table. Then use the show memory compare end and show memory compare report commands multiple times at regular intervals (either at set hours or once each day). If the difference for a specific process whose memory usage should not be growing increases constantly, a memory leak is probable. Use the process restart job-id location node-id command to restart the process, which frees the memory and stops the memory leak. If a process restart does not resolve the memory leak problem, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation, Obtaining Support, and Security Guidelines" section in the Preface.

Large quantity of memory used for normal or abnormal processes: A normal or abnormal event (for example, a large routing instability) causes the router to use an unusually large amount of processor memory for a short period of time, during which the memory runs out. The memory shortage may also be due to a combination of factors, such as:

A memory leak that has consumed a large amount of memory, and then a network instability pushes the free memory to zero.

The router does not have enough memory to begin with, but the problem is discovered only during a rare network event.

If the large memory usage is due to a normal event, install more memory. If the large memory usage is due to an abnormal event, fix the related problem.

Dead process: A dead process is not a real process. The process is there to account for the memory allocated under the context of another process that has terminated. If you suspect that it may be a real process, restart the process using the process restart job-id location node-id command. If it is not a real process, or if it is a real process and it does not restart, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation, Obtaining Support, and Security Guidelines" section in the Preface.

Memory fragmentation: A process has consumed a large amount of processor memory and then released most or all of it, leaving fragments of memory still allocated either by the process or by other processes that allocated memory during the problem. If the same event occurs several times, the memory may fragment into very small blocks, to the point where processes requiring a larger block of memory cannot get the amount of memory that they need. If the memory is badly fragmented, this may affect router operation to the extent that you cannot connect to the router and get a prompt.

If memory fragmentation is detected, shut down some interfaces. This may free the fragmented blocks. If this works, the memory is behaving normally, and you should add more memory. If shutting down interfaces does not resolve the problem, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation, Obtaining Support, and Security Guidelines" section in the Preface.
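The memory leak check described in this section relies on spotting a process whose heap usage keeps increasing across repeated comparison reports. The following Python sketch shows one simple way to flag such a trend from readings collected at regular intervals. The function name steadily_growing, the three-reading minimum, and the sample readings are all assumptions made for illustration; the readings are hypothetical and were not taken from a router.

```python
def steadily_growing(readings, min_readings=3):
    """Return True if every reading is larger than the one before it.

    readings: heap usage in bytes for one process, oldest first.
    Fewer than min_readings values are treated as inconclusive.
    """
    if len(readings) < min_readings:
        return False
    return all(later > earlier
               for earlier, later in zip(readings, readings[1:]))


# Hypothetical readings, one per interval, for two placeholder processes.
history = {
    "process_a": [1000, 1400, 1900, 2600],   # grows at every interval
    "process_b": [5000, 5200, 5100, 5150],   # fluctuates
}

suspects = [name for name, readings in history.items()
            if steadily_growing(readings)]
print(suspects)
```

A process flagged this way is only a candidate. Some processes grow legitimately as the configuration or routing tables grow, so the result still needs to be judged against what is expected for that process.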