Checking the Memory Statistics of the Device Service Processes
- Check and analyze logs. Search for the keyword svm_mem_stats_show_device_proc_mem in the user-mode plog files on the host. The following is an example of the search result:
run/plog/plog-4176893_20240629161823165.log:474:[INFO] DRV(4176893,python3.7):2024-06-29-16:19:29.085.761 [ascend][curpid: 4176893, 4185695][drv][devmm][svm_mem_stats_show_device_proc_mem 358]DEV_PROC_MEM dev6 Mem stats (Bytes). (module_name=AICPU; module_id=36; total_size=75464704)
run/plog/plog-4176893_20240629161823165.log:475:[INFO] DRV(4176893,python3.7):2024-06-29-16:19:29.085.765 [ascend][curpid: 4176893, 4185695][drv][devmm][svm_mem_stats_show_device_proc_mem 358]DEV_PROC_MEM dev6 Mem stats (Bytes). (module_name=CUSTOM; module_id=76; total_size=75464704)
DEV_PROC_MEM indicates that the statistics belong to a process on the device. module_name identifies that process (for example, the AI CPU process, the custom operator (CUSTOM) process, or the HCCP process). total_size indicates the physical memory used by the process, in bytes. It includes the resident physical memory that the process obtains through malloc and the sharedpool memory allocated when the process calls the buff allocation API.
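The fields described above can be extracted from the search result with a short script. The following is a minimal sketch, not an official tool: the regular expression is inferred from the sample lines in this section, and the names DEV_PROC_MEM_RE and parse_dev_proc_mem are hypothetical.

```python
import re

# Field layout inferred from the sample log lines in this section.
# This is an assumption, not a documented log format.
DEV_PROC_MEM_RE = re.compile(
    r"svm_mem_stats_show_device_proc_mem \d+\]DEV_PROC_MEM dev(?P<dev>\d+) "
    r"Mem stats \(Bytes\)\. \(module_name=(?P<name>\w+); "
    r"module_id=(?P<id>\d+); total_size=(?P<size>\d+)\)"
)


def parse_dev_proc_mem(text):
    """Return (device_id, module_name, module_id, total_size_bytes) tuples."""
    return [
        (int(m["dev"]), m["name"], int(m["id"]), int(m["size"]))
        for m in DEV_PROC_MEM_RE.finditer(text)
    ]


# The two sample lines from the search result above.
sample = (
    "run/plog/plog-4176893_20240629161823165.log:474:[INFO] "
    "DRV(4176893,python3.7):2024-06-29-16:19:29.085.761 "
    "[ascend][curpid: 4176893, 4185695][drv][devmm]"
    "[svm_mem_stats_show_device_proc_mem 358]DEV_PROC_MEM dev6 "
    "Mem stats (Bytes). (module_name=AICPU; module_id=36; total_size=75464704)\n"
    "run/plog/plog-4176893_20240629161823165.log:475:[INFO] "
    "DRV(4176893,python3.7):2024-06-29-16:19:29.085.765 "
    "[ascend][curpid: 4176893, 4185695][drv][devmm]"
    "[svm_mem_stats_show_device_proc_mem 358]DEV_PROC_MEM dev6 "
    "Mem stats (Bytes). (module_name=CUSTOM; module_id=76; total_size=75464704)\n"
)

records = parse_dev_proc_mem(sample)
for dev, name, mod_id, size in records:
    # total_size is reported in bytes; convert to MiB for readability.
    print(f"dev{dev} {name} (module_id={mod_id}): {size / 2**20:.1f} MiB")
```

In practice, the text passed to parse_dev_proc_mem would be the content of the plog files or the saved output of the keyword search.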
- Identify the device service process whose increased memory allocation causes the out-of-memory (OOM) error.
- If this is the first time the application is run, use the memory statistics in the logs to check whether the memory usage of each process is as expected and whether it exceeds the physical memory of the hardware. If the usage is not as expected or exceeds the physical memory of the hardware, contact technical support for further fault locating.
- If the application has run successfully before, compare the memory statistics from the historical success scenario with those from the current failure scenario, and focus on the service process whose allocated memory differs greatly between the two.
The following is an example of the memory statistics for a historical success scenario (about 72 MB of memory is allocated to the AI CPU process):
[INFO] DRV(4052516,main_aarch64):2024-07-08-22:27:45.334.457 [ascend][curpid: 4052516, 4052516][drv][devmm][svm_mem_stats_show_device_proc_mem 358]DEV_PROC_MEM dev0 Mem stats (Bytes). (module_name=AICPU; module_id=36; total_size=75505664)
[INFO] DRV(4052516,main_aarch64):2024-07-08-22:27:45.334.461 [ascend][curpid: 4052516, 4052516][drv][devmm][svm_mem_stats_show_device_proc_mem 358]DEV_PROC_MEM dev0 Mem stats (Bytes). (module_name=CUSTOM; module_id=76; total_size=75505664)
The following is an example of the memory statistics for the current failure scenario (about 150 MB of memory is allocated to the AI CPU process):
[INFO] DRV(4052533,main_aarch64):2024-07-08-22:47:49.335.448 [ascend][curpid: 4052533, 4052533][drv][devmm][svm_mem_stats_show_device_proc_mem 358]DEV_PROC_MEM dev0 Mem stats (Bytes). (module_name=AICPU; module_id=36; total_size=157286400)
[INFO] DRV(4052533,main_aarch64):2024-07-08-22:47:49.335.458 [ascend][curpid: 4052533, 4052533][drv][devmm][svm_mem_stats_show_device_proc_mem 358]DEV_PROC_MEM dev0 Mem stats (Bytes). (module_name=CUSTOM; module_id=76; total_size=75505664)
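The comparison between the success and failure scenarios can also be automated. The following is a minimal sketch under stated assumptions: the field pattern is inferred from the sample lines above, the helper names module_sizes and memory_growth are hypothetical, and the growth threshold of 1.5 is an arbitrary illustrative value, not a recommended limit.

```python
import re

# Assumed field layout, inferred from the sample lines in this section.
FIELDS_RE = re.compile(r"module_name=(\w+); module_id=\d+; total_size=(\d+)")


def module_sizes(log_text):
    """Map module_name to total_size in bytes (the last value seen wins)."""
    return {name: int(size) for name, size in FIELDS_RE.findall(log_text)}


def memory_growth(success_log, failure_log, threshold=1.5):
    """Return modules whose total_size grew by at least `threshold` times.

    The threshold is a hypothetical value chosen for illustration only.
    """
    before = module_sizes(success_log)
    after = module_sizes(failure_log)
    grown = {}
    for name, new_size in after.items():
        old_size = before.get(name)
        if old_size and new_size / old_size >= threshold:
            grown[name] = (old_size, new_size)
    return grown


# Tails of the sample log lines from the two scenarios above.
success = (
    "DEV_PROC_MEM dev0 Mem stats (Bytes). "
    "(module_name=AICPU; module_id=36; total_size=75505664)\n"
    "DEV_PROC_MEM dev0 Mem stats (Bytes). "
    "(module_name=CUSTOM; module_id=76; total_size=75505664)\n"
)
failure = (
    "DEV_PROC_MEM dev0 Mem stats (Bytes). "
    "(module_name=AICPU; module_id=36; total_size=157286400)\n"
    "DEV_PROC_MEM dev0 Mem stats (Bytes). "
    "(module_name=CUSTOM; module_id=76; total_size=75505664)\n"
)

suspects = memory_growth(success, failure)
for name, (old_size, new_size) in suspects.items():
    print(f"{name}: {old_size / 2**20:.0f} MiB -> {new_size / 2**20:.0f} MiB")
```

With the sample data, only the AI CPU process is flagged, because its total_size roughly doubles while the CUSTOM process is unchanged. This sketch ignores the device ID, so logs covering several devices would need to be split by device first.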