run/plog/plog-4176893_20240629161823165.log:460:[INFO] DRV(4176893,python3.7):2024-06-29-16:19:28.436.386 [ascend][curpid: 4176893, 4185695][drv][devmm][_svm_mem_stats_show 148]SVM_MEM Mem stats (Bytes). (module_name=RUNTIME; module_id=7; current_alloced_size=19927040; alloced_peak_size=19927040; alloc_cnt=37; free_cnt=0)
run/plog/plog-4176893_20240629161823165.log:461:[INFO] DRV(4176893,python3.7):2024-06-29-16:19:28.436.393 [ascend][curpid: 4176893, 4185695][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev6 Mem stats (Bytes). (module_name=HCCL; module_id=3; current_alloced_size=419467264; alloced_peak_size=419500032; alloc_cnt=6; free_cnt=2)
run/plog/plog-4176893_20240629161823165.log:463:[INFO] DRV(4176893,python3.7):2024-06-29-16:19:28.436.401 [ascend][curpid: 4176893, 4185695][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev6 Mem stats (Bytes). (module_name=APP; module_id=33; current_alloced_size=64680361984; alloced_peak_size=64680361984; alloc_cnt=3113; free_cnt=0)
Log level: ERROR when a memory allocation fails. The memory statistics are also printed when the process exits.
Print format: [memory attribute] Mem stats (Bytes). (module_name=[module name]; module_id=[id]; current_alloced_size=[size]; alloced_peak_size=[size]; alloc_cnt=[count]; free_cnt=[count])
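The print format above is regular enough to be parsed mechanically. The following is a minimal sketch, not an official tool: the regular expression and the names `STATS_RE`, `INT_FIELDS`, and `parse_mem_stats` are assumptions introduced for illustration, derived only from the format and the sample log lines in this document.

```python
import re

# Hypothetical pattern built from the documented print format:
# [memory attribute] Mem stats (Bytes). (module_name=...; module_id=...; ...)
STATS_RE = re.compile(
    r"(?P<mem_attr>[A-Z_]+(?: dev\d+)?) Mem stats \(Bytes\)\. "
    r"\(module_name=(?P<module_name>[^;]+); "
    r"module_id=(?P<module_id>\d+); "
    r"current_alloced_size=(?P<current_alloced_size>\d+); "
    r"alloced_peak_size=(?P<alloced_peak_size>\d+); "
    r"alloc_cnt=(?P<alloc_cnt>\d+); "
    r"free_cnt=(?P<free_cnt>\d+)\)"
)

# Fields that hold numbers and should be converted from text to int.
INT_FIELDS = (
    "module_id",
    "current_alloced_size",
    "alloced_peak_size",
    "alloc_cnt",
    "free_cnt",
)


def parse_mem_stats(line):
    """Return a dict of the statistics in one log line, or None if absent."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    record = match.groupdict()
    for key in INT_FIELDS:
        record[key] = int(record[key])
    return record


if __name__ == "__main__":
    sample = (
        "[INFO] DRV(4176893,python3.7):2024-06-29-16:19:28.436.393 "
        "[ascend][curpid: 4176893, 4185695][drv][devmm]"
        "[_svm_mem_stats_show 148]DEV_MEM dev6 Mem stats (Bytes). "
        "(module_name=HCCL; module_id=3; current_alloced_size=419467264; "
        "alloced_peak_size=419500032; alloc_cnt=6; free_cnt=2)"
    )
    print(parse_mem_stats(sample))
```

Subtracting `free_cnt` from `alloc_cnt` in the returned record gives the number of allocations still outstanding for that module, which can help when judging whether memory is being held or leaked.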
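For example, take a large model with 13B parameters, where B stands for billion, so 13B means 13 billion parameters. At full precision each parameter is a float32, which occupies 32 bits, that is, 4 bytes. With 1 GB = 1024^3 bytes, the weights of a 13B model occupy 13 * 10^9 * 4 bytes ÷ 1024^3 ≈ 48.4 GB. If the hardware has only about 50 GB of memory, running the model will very likely exceed the physical memory and cause an OOM, because the memory needed at run time comes on top of the weights.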
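The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the weight-size estimate from the example. The function name `estimate_weight_memory_gb` and the 50 GB capacity figure are illustrative assumptions, and the estimate covers weights only.

```python
def estimate_weight_memory_gb(num_params, bytes_per_param=4):
    """Estimate the memory taken by model weights, with 1 GB = 1024^3 bytes.

    bytes_per_param defaults to 4 (float32), as in the example above.
    """
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    needed = estimate_weight_memory_gb(13 * 10**9)
    capacity_gb = 50  # assumed device memory, taken from the example
    print(f"weights: {needed:.1f} GB, capacity: {capacity_gb} GB")
    print(f"headroom: {capacity_gb - needed:.1f} GB")
```

With under 2 GB of headroom left after loading the weights, any additional run-time allocation can push the process past the physical limit.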
[INFO] DRV(4052516,main_aarch64):2024-07-08-22:27:45.354.856 [ascend][curpid: 4052516, 4052516][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=RUNTIME; module_id=7; current_alloced_size=44138496; alloced_peak_size=44138496; alloc_cnt=18; free_cnt=0)
[INFO] DRV(4052516,main_aarch64):2024-07-08-22:27:45.354.866 [ascend][curpid: 4052516, 4052516][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=APP; module_id=33; current_alloced_size=2078195712; alloced_peak_size=2078195712; alloc_cnt=996; free_cnt=0)
Example memory statistics from the current problematic version (APP allocates about 20 GB of memory):
[INFO] DRV(4052522,main_aarch64):2024-07-09-22:47:56.352.884 [ascend][curpid: 4052522, 4052522][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=RUNTIME; module_id=7; current_alloced_size=44138496; alloced_peak_size=44138496; alloc_cnt=18; free_cnt=0)
[INFO] DRV(4052522,main_aarch64):2024-07-09-22:47:56.352.894 [ascend][curpid: 4052522, 4052522][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=APP; module_id=33; current_alloced_size=20781957120; alloced_peak_size=20781957120; alloc_cnt=996; free_cnt=0)
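To put the two APP entries side by side, the byte counts can be converted and compared as sketched below. The values are copied from the two logs above. Treating the 2024-07-08 entry as the baseline for comparison is an assumption, and the names `to_gb`, `baseline_app_peak`, and `problem_app_peak` are illustrative.

```python
BYTES_PER_GB = 1024**3  # 1 GB = 1024^3 bytes, as used elsewhere in this document


def to_gb(num_bytes):
    """Convert a byte count from the log into GB."""
    return num_bytes / BYTES_PER_GB


# alloced_peak_size of module_name=APP in each log.
baseline_app_peak = 2078195712   # 2024-07-08 log (assumed to be the baseline)
problem_app_peak = 20781957120   # 2024-07-09 log (problematic version)

if __name__ == "__main__":
    print(f"baseline: {to_gb(baseline_app_peak):.2f} GB")
    print(f"problem:  {to_gb(problem_app_peak):.2f} GB")
    print(f"ratio:    {problem_app_peak / baseline_app_peak:.1f}x")
```

Both logs report the same alloc_cnt (996) for APP, so the difference lies in how much each allocation requests rather than in the number of allocations.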
Notes: