Table Of Contents
Troubleshooting Memory
Watchdog System Monitor
Memory Monitoring
Configuring and Displaying Memory Thresholds
Examples
Setting Timeout for Persistent CPU Hogs
Memory Usage Analyzer
Troubleshooting Global Memory
Troubleshooting Process Memory
Identifying Process Memory Problems
Examples
Resolving Process Memory Problems
Troubleshooting Memory
Troubleshooting memory requires determining if there is a memory problem, what type of memory problem it is, and how to resolve the problem.
This chapter contains the following sections:
•
Watchdog System Monitor
•
Troubleshooting Global Memory
•
Troubleshooting Process Memory
Watchdog System Monitor
Watchdog system monitor (wdsysmon) is part of the high-availability (HA) infrastructure of the Cisco IOS XR software. Wdsysmon runs on the distributed route processor (DRP) and line cards (LCs) with the primary goal of monitoring the system for problem conditions and attempting to recover from them. Wdsysmon monitors the processes on each node for memory and CPU usage, deadlocks, and event monitor conditions, as well as disk usage. If thresholds are crossed, an appropriate syslog message is generated. The information is collected in disk0:/wdsysmon_debug. Recovery actions are taken when a memory or CPU hog or a deadlock condition is detected, whereby the process responsible for the condition is terminated. Wdsysmon also keeps historical data on processes and posts this information to a fault detector dynamic link library (DLL), which can then be queried by manageability applications.
Memory Monitoring
The wdsysmon memory-hog detection algorithm checks the memory state of each node at regular intervals (every 1/10th of a second). It defines four node state thresholds:
•
Normal
•
Minor
•
Severe
•
Critical
Note
Processes can declare a hard memory limit in their startup file with the memory_limit keyword.
The definition of the node state thresholds depends on the size of the physical memory. For instance, on a node with 2 GB of physical memory, the memory state is considered NORMAL as long as the free memory is greater than 48 MB.
If a memory threshold is crossed, wdsysmon immediately checks if a process has exceeded its memory limit. All such processes are stopped after a debug script runs on the process identifier (PID) to collect detailed information on the memory hog. If memory usage is still high after this step, wdsysmon sends notifications to registered clients. Clients can then take preventive and recovery actions.
The memory state can be verified using the show watchdog memory-state location node-id command. The following example shows node 0/RP0/CPU0 in the normal memory state.
RP/0/0/CPU0:router# show watchdog memory-state location 0/rp0/cpu0
If the memory state is changing from normal to minor, use the show processes memory [job-id] location node-id command to list top memory users and identify possible memory leaks. After top memory users have been identified, use the memory usage analyzer to discover the processes causing a memory leak. See the "Memory Usage Analyzer" section. At this point, involve your technical representative to collect the appropriate data and take corrective action, such as a process restart.
Wdsysmon has a procedure to recover from memory-depletion conditions. When wdsysmon determines that the state of a node is severe, it attempts to find a process, or set of processes, that has likely leaked memory and led to the depletion condition. The process or set of processes is then stopped to recover the memory. Avoid this situation by regularly checking the watchdog memory state.
Configuring and Displaying Memory Thresholds
Memory thresholds can be configured. Threshold values can be applied to all cards, or unique threshold settings can be applied to specific cards. If the local threshold settings are removed, the local settings return to those set globally. In addition, you can view default and configured thresholds.
Table 7-1 provides the recommended memory threshold value calculations if the minor threshold is set to 20 percent, the severe threshold is set to 10 percent, and the critical threshold is set to 5 percent.
Table 7-1 Recommended Memory Threshold Values
Total Available Memory (MB) | Minor Threshold (MB; 20 percent of available memory) | Severe Threshold (MB; 10 percent of available memory) | Critical Threshold (MB; 5 percent of available memory)
128 | 25.6 | 12.8 | 6.4
256 | 51.2 | 25.6 | 12.8
512 | 102.4 | 51.2 | 25.6
1024 | 204.8 | 102.4 | 51.2
2048 | 409.6 | 204.8 | 102.4
4096 | 819.2 | 409.6 | 204.8
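The values in Table 7-1 follow directly from the percentages. The following Python sketch reproduces the arithmetic; the function name and its defaults are illustrative assumptions and are not part of the Cisco IOS XR software.

```python
def threshold_values_mb(total_mb, minor_pct=20, severe_pct=10, critical_pct=5):
    """Return (minor, severe, critical) free-memory thresholds in MB.

    Hypothetical helper for illustration only. It reproduces the
    arithmetic behind Table 7-1 and is not part of any router software.
    """
    return tuple(round(total_mb * pct / 100, 1)
                 for pct in (minor_pct, severe_pct, critical_pct))


# Recompute each row of Table 7-1.
for total in (128, 256, 512, 1024, 2048, 4096):
    minor, severe, critical = threshold_values_mb(total)
    print(f"{total:>5} MB: minor {minor}, severe {severe}, critical {critical}")
```

Passing different percentages (for example, 30, 20, and 10) gives the corresponding values for a custom threshold configuration.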
To identify, configure, and display memory thresholds, perform the following procedure.
SUMMARY STEPS
1.
configure
2.
watchdog threshold memory [location node-id] minor percentage-memory-available severe percentage-memory-available critical percentage-memory-available
3.
end
or
commit
4.
exit
5.
show watchdog [memory-state | threshold memory {configured | defaults}] [location node-id]
6.
Contact Cisco Technical Support if the problem is not resolved.
DETAILED STEPS
| |
Command or Action
|
Purpose
|
Step 1
|
configure
Example:
RP/0/0/CPU0:router# configure
|
Enters global configuration mode.
|
Step 2
|
watchdog threshold memory [location node-id] minor percentage-memory-available severe percentage-memory-available critical percentage-memory-available
Example:
RP/0/0/CPU0:router(config)# watchdog threshold memory location 0/RP0/CPU0 minor 30 severe 20 critical 10
|
Configures the percentage of memory available for each alarm threshold. For example, if the minor alarm threshold is set to 30 percent, the severe alarm threshold is set to 20 percent, and the critical alarm threshold is set to 10 percent, the node goes into a minor memory alarm when the amount of memory available falls below 30 percent of the total memory on the card. In other words, this alarm occurs when more than 70 percent of the total memory is in use. The severe memory alarm activates when the amount of memory available falls below 20 percent, and the critical memory alarm activates when the amount of memory available falls below 10 percent.
|
Step 3
|
end
or
commit
Example:
RP/0/0/CPU0:router(config)# end
or
RP/0/0/CPU0:router(config)# commit
|
Saves configuration changes.
• When you issue the end command, the system prompts you to commit changes:
Uncommitted changes found, commit them before
exiting(yes/no/cancel)?
[cancel]:
– Entering yes saves configuration changes to the running configuration file, exits the configuration session, and returns the router to EXEC mode.
– Entering no exits the configuration session and returns the router to EXEC mode without committing the configuration changes.
– Entering cancel leaves the router in the current configuration session without exiting or committing the configuration changes.
• Use the commit command to save the configuration changes to the running configuration file and remain within the configuration session.
|
Step 4
|
exit
Example:
RP/0/0/CPU0:router(config)# exit
|
Exits global configuration mode and enters EXEC mode.
|
Step 5
|
show watchdog [memory-state | threshold memory {configured | defaults}] [location node-id]
Example:
RP/0/0/CPU0:router# show watchdog threshold memory configured location 0/RP0/CPU0
|
Displays information about the threshold memory in the specified locations, either as configured by the user or for the default settings.
|
Step 6
|
Contact Cisco Technical Support.
|
If the problem has not been determined and is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.
|
Examples
The watchdog threshold memory command enables you to set the memory thresholds.
RP/0/0/CPU0:router(config)# watchdog threshold memory location 0/RP0/CPU0 minor 30 severe 20 critical 10
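The alarm behavior that these thresholds produce can be modeled in a few lines of Python. This is an illustrative sketch only: it assumes a node is classified purely by the percentage of memory still available, and the function name and the handling of values that fall exactly on a threshold are assumptions, not the wdsysmon implementation.

```python
def memory_state(free_mb, total_mb, minor=30, severe=20, critical=10):
    """Classify a node by the percentage of memory still available.

    Illustrative model only. The thresholds are percentages of total
    memory, as in 'minor 30 severe 20 critical 10'.
    """
    free_pct = 100 * free_mb / total_mb
    if free_pct < critical:
        return "Critical"
    if free_pct < severe:
        return "Severe"
    if free_pct < minor:
        return "Minor"
    return "Normal"


# 1024 MB free out of 4096 MB is 25 percent available.
print(memory_state(free_mb=1024, total_mb=4096))
```

With the example thresholds, 25 percent available memory is below the minor threshold (30) but above the severe threshold (20), so the model reports the minor state.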
The show watchdog threshold memory defaults command enables you to display the default memory thresholds.
RP/0/RP0/CPU0:router# show watchdog threshold memory defaults location all
---- node0_RP1_CPU0 ---
Default memory thresholds:
Default memory thresholds:
Default memory thresholds:
Default memory thresholds:
Default memory thresholds:
Default memory thresholds:
Setting Timeout for Persistent CPU Hogs
If wdsysmon detects a CPU hog on the card, it resets the node after 30 seconds. You can change this default timeout value, if required, using the watchdog monitor cpu-hog persistent timeout command.
Memory Usage Analyzer
The memory usage analyzer tool records brief details about the heap memory usage of all processes on the router at different moments in time and compares the results. This makes it very useful for detecting patterns of memory usage during events such as restarting processes or configuring interfaces. It is also useful for troubleshooting memory leaks.
When the memory usage analyzer tool is instructed to take a snapshot, it saves output similar to the show memory heap summary command output for each process running on the router to a file. When instructed to show a report, the tool reads the snapshot files and displays the differences between the memory values.
The memory usage analyzer tool uses the following commands in sequence:
1.
show memory compare start command—This command takes an initial snapshot of the process memory usage.
2.
show memory compare end command—This command takes another snapshot of the process memory usage.
3.
show memory compare report command—This command displays the differences between the memory values.
The command output contains information about each process whose heap memory usage has changed over the test period. It is ordered by the size of the change, starting with the process with the largest increase. To detect memory leaks, use the memory usage analyzer on a stable system when no configuration changes are being made.
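The comparison that the report performs can be sketched in Python. This is a minimal illustration, not the tool's actual implementation: the two dictionaries are a hypothetical stand-in for the snapshot files, the function name is invented for this example, and the sample byte counts are illustrative.

```python
def compare_snapshots(before, after):
    """Return (name, before, after, difference) rows for processes whose
    heap usage changed, largest increase first.

    'before' and 'after' map process names to heap bytes. These dicts are
    a hypothetical stand-in for the snapshot files the router writes.
    """
    rows = []
    for name, old in before.items():
        new = after.get(name)
        # Skip processes that are missing from the second snapshot
        # or whose heap usage did not change.
        if new is not None and new != old:
            rows.append((name, old, new, new - old))
    return sorted(rows, key=lambda row: row[3], reverse=True)


before = {"top_procs": 584300, "qsm": 334144, "i2c_server": 14464, "gsp": 3350528}
after = {"top_procs": 587052, "qsm": 334920, "i2c_server": 14624, "gsp": 3350528}
for row in compare_snapshots(before, after):
    print(row)
```

In this sample, gsp is omitted from the result because its heap usage is unchanged, and top_procs is listed first because it has the largest increase.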
Troubleshooting Global Memory
To begin troubleshooting a router in a low memory state, get a high-level view of where the memory is being used.
Use the show memory summary command to display system memory information.
RP/0/0/CPU0:router# show memory summary
Physical Memory: 4096M total
Application Memory : 3949M (3540M available)
Image: 17M (bootram: 17M)
Reserved: 128M, IOMem: 2028M, flashfsys: 0
The output shows the amount of physical memory installed on the device, the memory available for the system to use (total memory minus image size, reserved, and flashfsys), the image size, the reserved space for packet memory, and the I/O memory used as a backup for packet memory.
If there is not sufficient memory, install more memory. See the Cisco CRS-1 Carrier Routing System documentation at the following URL:
http://www.cisco.com/en/US/products/ps5763/tsd_products_support_series_home.html
Troubleshooting Process Memory
Under normal operating conditions, processes are managed automatically by the Cisco IOS XR software. Processes are started, stopped, or restarted as required by the running configuration of the router. In addition, processes are checkpointed to optimize performance during process restart and automatic switchover.
Identifying Process Memory Problems
To identify process memory problems, perform the following procedure.
SUMMARY STEPS
1.
show watchdog memory-state location node-id
2.
show processes memory [job-id] location node-id
3.
show memory job-id
4.
show processes memory job-id
5.
show memory compare start
6.
show memory compare end
7.
show memory compare report
8.
Contact Cisco Technical Support if the problem is not resolved.
DETAILED STEPS
| |
Command or Action
|
Purpose
|
Step 1
|
show watchdog memory-state location node-id
Example:
RP/0/0/CPU0:router# show watchdog memory-state location 0/RP0/CPU0
|
Displays the memory state for the node. If the node is not in the normal state, proceed to Step 2 to list top memory users and identify possible memory leaks.
|
Step 2
|
show processes memory [job-id] location node-id
Example:
RP/0/0/CPU0:router# show processes memory location 0/RP0/CPU0
|
Displays information about the text, data, and stack usage for all active processes on a specified node. The output lists top memory users and helps identify possible memory leaks. After top memory users have been identified, note the job ID and use the memory usage analyzer to discover the processes causing a memory leak. See Step 5 through Step 7 for how to use the memory usage analyzer.
|
Step 3
|
show memory job-id
Example:
RP/0/0/CPU0:router# show memory 123
|
Displays the available physical memory and memory usage information of a specific process.
|
Step 4
|
show processes memory job-id
Example:
RP/0/0/CPU0:router# show processes memory 123
|
Displays information about the text, data, and stack usage for a specific process.
|
Step 5
|
show memory compare start
Example:
RP/0/0/CPU0:router# show memory compare start
|
Takes the initial snapshot of heap memory usage for all processes on the router and sends the report to a temporary file named /tmp/memcmp_start.out.
|
Step 6
|
show memory compare end
Example:
RP/0/0/CPU0:router# show memory compare end
|
Takes the second snapshot of heap memory usage for all processes on the router and sends the report to a temporary file named /tmp/memcmp_end.out. This snapshot is compared with the initial snapshot when displaying the heap memory usage comparison report.
|
Step 7
|
show memory compare report
Example:
RP/0/0/CPU0:router# show memory compare report
|
Displays the heap memory comparison report, comparing heap memory usage between the two snapshots of heap memory usage.
|
Step 8
|
Contact Cisco Technical Support.
|
If the problem has not been determined and is not resolved, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.
|
Examples
The show watchdog memory-state location node-id command allows you to determine the memory state of a specified node.
RP/0/0/CPU0:router# show watchdog memory-state location 0/rp0/cpu0
The show processes memory [job-id] location node-id command allows you to determine the processes with the highest dynamic memory usage. The output of the command is sorted by dynamic memory usage.
RP/0/0/CPU0:router# show processes memory location 0/rp0/cpu0
JID      Text   Data   Stack   Dynamic  Process
59      65536  32768   57344  38064128  eth_server
164    147456   4096   24576  13217792  fgid_server
289     90112   4096   94208   8437760  parser_server
65554   40960      0   32768   7430144  devb-ata
181    110592   4096  151552   3350528  gsp
57      28672      0   28672   3284992  dllmgr
335      4096   4096   36864   3194880  sysdb_svr_local
280     53248   4096   20480   2682880  nvgen_server
319     16384   4096   12288   2482176  schema_server
329     81920   4096   40960   2412544  statsd_manager
67      28672  12288   24576   2306048  nvram
360    552960   4096   69632   2232320  wdsysmon
216     98304   4096   65536   1908736  ipv4_rib
336      4096   4096   77824   1806336  sysdb_svr_shared
193    217088   4096   86016   1613824  ifmgr
273     45056   4096  122880   1544192  netio
234     98304   4096   53248   1486848  ipv6_rib
190     36864   4096   36864   1429504  hd_drv
333     53248   4096   65536   1327104  sysdb_mc
175     45056   4096   49152   1277952  fabricq_mgr
379      4096      0   94208   1212416  xmlagent
334      4096   4096   73728   1200128  sysdb_svr_admin
204     40960   4096   24576    262144  ipv4_acl_dispatch
162     12288   4096   12288    262144  ether_caps_partner
375      4096      0   12288    245760  ipsec_simp
196     12288   4096   12288    237568  imaedm_server
123      4096   4096   16384    237568  bgp_policy_reg_agent
222     32768   4096   20480    233472  ipv6_acl_daemon
315     16384   4096   61440    229376  rt_check_mgr
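Because the Dynamic column identifies the top memory users, saved output from the show processes memory command can also be ranked offline. The following Python sketch assumes the six-column layout shown above and that the output has been captured as text; the function name and the shortened, deliberately unsorted sample are illustrative.

```python
SAMPLE_OUTPUT = """\
JID Text Data Stack Dynamic Process
181 110592 4096 151552 3350528 gsp
59 65536 32768 57344 38064128 eth_server
164 147456 4096 24576 13217792 fgid_server
"""


def top_dynamic_users(output, count=2):
    """Parse saved 'show processes memory' text and return the processes
    with the largest Dynamic column as (jid, name, dynamic_bytes) tuples."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have six fields and start with a numeric job ID;
        # this skips the header line and any blank lines.
        if len(fields) == 6 and fields[0].isdigit():
            jid, _text, _data, _stack, dynamic, name = fields
            rows.append((int(jid), name, int(dynamic)))
    return sorted(rows, key=lambda row: row[2], reverse=True)[:count]


print(top_dynamic_users(SAMPLE_OUTPUT))
```

The job IDs returned by such a ranking are the ones to pass to the show memory job-id and show processes memory job-id commands.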
The show memory job-id command displays the memory available and memory usage information for the process associated with the specified job ID. The command output allows you to see exactly what memory is allocated by the process.
RP/0/0/CPU0:router# show memory 123
Physical Memory: 4096M total
Application Memory : 3949M (3540M available)
Image: 17M (bootram: 17M)
Reserved: 128M, IOMem: 2028M, flashfsys: 0
Shared window ipv4_fib: 1M
Shared window infra_ital: 323K
Shared window ifc-mpls: 961K
Shared window ifc-ipv6: 1M
Shared window ifc-ipv4: 1M
Shared window ifc-protomax: 641K
Shared window infra_statsd: 3K
Shared window PFI_IFH: 155K
Shared window atc_cache: 35K
pkg/bin/bgp_policy_reg_agent: jid 123
4817f000 4096 Program Stack (pages not allocated)
48180000 507904 Program Stack (pages not allocated)
481fc000 16384 Program Stack
48200000 16384 Shared Memory
48204000 4096 Program Text or Data
48205000 4096 Program Text or Data
48206000 16384 Allocated Memory
4820a000 16384 Allocated Memory
4820e000 16384 Allocated Memory
48212000 16384 Allocated Memory
48216000 16384 Allocated Memory
4821a000 16384 Allocated Memory
4821e000 16384 Allocated Memory
48222000 16384 Allocated Memory
60100000 8192 Shared Memory
60102000 36864 Shared Memory
6010b000 102400 Shared Memory
60124000 8192 Shared Memory
fd214000 106496 DLL Text liboradock.dll
fd22e000 4096 DLL Data liboradock.dll
fd241000 49152 DLL Text librasf.dll
Total Allocated Memory: 131072
Total Shared Memory: 978944
The output shows the starting address in memory and the size of memory allocated. For example, the starting address for the first entry is 4817f000 and the size of the memory allocated is 4096 bytes.
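As a cross-check, the eight Allocated Memory entries of 16384 bytes each in the output above sum to 131072 bytes, which matches the Total Allocated Memory line. The following Python sketch performs that sum on captured output; the function name and the abbreviated sample are illustrative assumptions.

```python
SAMPLE_MAP = """\
4817f000 4096 Program Stack (pages not allocated)
481fc000 16384 Program Stack
48200000 16384 Shared Memory
48206000 16384 Allocated Memory
4820a000 16384 Allocated Memory
4820e000 16384 Allocated Memory
48212000 16384 Allocated Memory
48216000 16384 Allocated Memory
4821a000 16384 Allocated Memory
4821e000 16384 Allocated Memory
48222000 16384 Allocated Memory
60100000 8192 Shared Memory
"""


def total_by_type(output, region_type):
    """Sum the size column for every entry whose description is exactly
    region_type, for example 'Allocated Memory'."""
    total = 0
    for line in output.splitlines():
        # Each entry is: start address, size in bytes, description.
        parts = line.split(maxsplit=2)
        if len(parts) == 3 and parts[1].isdigit() and parts[2].strip() == region_type:
            total += int(parts[1])
    return total


print(total_by_type(SAMPLE_MAP, "Allocated Memory"))
```

The same check does not work for the shared memory total in this example, because only some of the shared memory entries appear in the output shown.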
The shared memory window is a region where processes share a common memory space. Shared memory is faster than protected memory, but one process can write over the data of another process, causing memory corruption.
The show processes memory job-id command displays information about the text, data, and stack usage for the specified job ID.
Note
A process has its own private memory space. A process cannot access the memory of another process.
RP/0/0/CPU0:router# show processes memory 123
JID      Text   Data   Stack   Dynamic  Process
123      4096   4096   16384    241664  bgp_policy_reg_agent
The output shows the size of the text region (process executable), size of the data region (initialized and uninitialized variables), size of the process stack, and size of the dynamically allocated memory.
The show memory compare command displays details about heap memory usage for all processes on the router at different moments in time, comparing the results. This command is useful for detecting patterns of memory usage during events such as restarting processes or configuring interfaces, and for finding memory leaks.
RP/0/RP0/CPU0:router# show memory compare start
Successfully stored memory snapshot /harddisk:/malloc_dump/memcmp_start.out
RP/0/RP0/CPU0:router# show memory compare end
Successfully stored memory snapshot /harddisk:/malloc_dump/memcmp_end.out
RP/0/0/CPU0:router# show memory compare report
JID   name             mem before  mem after  difference  mallocs  restart
---   ----             ----------  ---------  ----------  -------  -------
346   top_procs            584300     587052        2752        0
303   qsm                  334144     334920         776        8
335   sysdb_svr_local     1445844    1446004         160        4
61    i2c_server            14464      14624         160        1
You are now free to remove snapshot memcmp_start.out and memcmp_end.out under /p
The show memory compare start command takes the initial snapshot of heap memory usage for all processes on the router and saves it to a temporary file. The show memory compare end command takes the second snapshot and saves it to another temporary file. The show memory compare report command then compares the two snapshots (the amount of memory allocated and deallocated during the session) and displays the heap memory comparison report. The report contains information about each process whose heap memory usage changed between the first and second snapshots, with the process that has the largest memory difference listed first. The memory usage analyzer should be used on a stable system when no configuration changes are in progress.
Resolving Process Memory Problems
The following conditions can be the cause of process memory problems:
•
Memory leak—Occurs when a process requests or allocates memory and then forgets to free (deallocate) the memory when finished with that task. As a result, the memory block is reserved until the router is reloaded. Over time, more and more memory blocks are allocated by that process until there is no free memory available.
To detect a memory leak, use the show memory compare start command once to create the baseline process comparison table, and then use the show memory compare end and show memory compare report commands multiple times at regular intervals (either at set hours or once each day). If the difference for a specific process keeps increasing, and the memory usage of that process is not expected to grow, a memory leak is probable. To free the memory and stop the memory leak, restart the process using the process restart job-id location node-id command. If a process restart does not resolve the memory leak problem, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.
•
Large quantity of memory used for normal or abnormal processes—A normal or abnormal event (for example, a large routing instability) causes the router to use an unusually large amount of processor memory for a short period of time, during which the memory has run out. The memory shortage may also be due to a combination of factors, such as:
–
A memory leak has consumed a large amount of memory, and then a network instability pushes the free memory to zero.
–
The router does not have enough memory to begin with, but the problem is discovered only during a rare network event.
If the large memory usage is due to a normal event, install more memory. If the large memory usage is due to an abnormal event, fix the related problem.
•
Dead process—A dead process is not a real process. The process is there to account for the memory allocated under the context of another process that has terminated. If you suspect that it may be a real process, restart the process using the process restart job-id location node-id command. If it is not a real process or if it is a real process and it does not restart, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.
•
Memory fragmentation—A process has consumed a large amount of processor memory and then released most or all of it, leaving fragments of memory still allocated either by the process or by other processes that allocated memory during the problem. If the same event occurs several times, the memory may fragment into very small blocks, to the point where all processes requiring a larger block of memory cannot get the amount of memory that they need. This may affect router operation to the extent that you cannot connect to the router and get a prompt if the memory is badly fragmented.
If memory fragmentation is detected, shut down some interfaces. This may free the fragmented blocks. If this works, the memory is behaving normally. (You should add more memory.) If shutting down interfaces does not resolve the problem, contact Cisco Technical Support. For Cisco Technical Support contact information, see the "Obtaining Documentation and Submitting a Service Request" section in the Preface.
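The memory leak check described in this section (a difference that keeps increasing across repeated comparison reports) can be expressed as a simple trend test. The following Python sketch is a hypothetical heuristic, not a Cisco tool: the function name, the minimum number of samples, and the sample values are all assumptions made for illustration.

```python
def probable_leak(differences, min_samples=3):
    """Return True if the reported difference grows at every interval.

    'differences' holds the 'difference' value for one process, taken
    from successive comparison reports, oldest first. The min_samples
    cutoff is an arbitrary assumption for this sketch.
    """
    if len(differences) < min_samples:
        return False
    return all(later > earlier
               for earlier, later in zip(differences, differences[1:]))


print(probable_leak([2752, 5480, 8301, 11010]))  # grows at every interval
print(probable_leak([2752, 1200, 2900, 800]))    # fluctuates
```

A steadily growing difference is only an indication. Processes that legitimately hold more state over time (for example, while routes are being learned) can show the same pattern, which is why the comparison should be made on a stable system.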