如何通过Device日志查看资源占用过高的信息
有些业务程序运行异常的情况可能由于Device的内存被耗尽、CPU占用率过高、文件句柄数达到上限、进程数达到上限等问题引起,因此在定位问题时,我们可以通过日志中的一些关键信息排查这部分问题。
- 在Host侧服务器上,在某个有读、写、执行权限的目录(如“/var/log/npu/report”,下文以此路径为例)下执行msnpureport工具,通过msnpureport工具导出Device日志。msnpureport工具命令示例如下,其中/usr/local/Ascend是驱动包的默认安装路径,请根据实际情况替换。
/usr/local/Ascend/driver/tools/msnpureport -f
Device侧系统进程产生的运行日志默认在/var/log/npu/report/*/slog/device-os-id/run/device-os/目录下,其中,*表示时间戳,device-os-id中的id表示Device ID。
- 在Device日志中,通过SYSMONITOR模块的日志查看资源占用情况,包括内存占用率、CPU占用率、文件句柄数、僵尸进程数。
- 查看内存占用率
memory usage alarm:出现了内存占用率超过阈值上限(90%)的告警。
memory usage stat:一个周期(一小时)结束时,如果周期内有告警事件,打印该周期的统计结果。
slog/dev-os-0/run/device-os/device-os_19700101080010358.log:16:[INFO] SYSMONITOR(2150,log-daemon):1970-01-01-08:00:09.728.115 [sys_monitor_frame.c:60][tid:2168]>>> system resource monitor start, period: 10000ms slog/dev-os-0/run/device-os/device-os_19700101080010358.log:151:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:03.421.505 [sys_memory_monitor.c:233][tid:2168]>>> memory usage alarm: 93.7% slog/dev-os-0/run/device-os/device-os_19700101080010358.log:165:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.581.257 [sys_memory_monitor.c:225][tid:2168]>>> PID VSZ %VSZ COMMAND slog/dev-os-0/run/device-os/device-os_19700101080010358.log:166:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.422 [sys_memory_monitor.c:225][tid:2168]>>> 2112 950m 2.1 /usr/bin/mdc/base-plat/aosservice/iammgr slog/dev-os-0/run/device-os/device-os_19700101080010358.log:167:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.519 [sys_memory_monitor.c:225][tid:2168]>>> 2096 969m 2.1 /var/slogd slog/dev-os-0/run/device-os/device-os_19700101080010358.log:168:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.529 [sys_memory_monitor.c:225][tid:2168]>>> 2142 734m 1.6 /var/resource_mgr slog/dev-os-0/run/device-os/device-os_19700101080010358.log:169:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.537 [sys_memory_monitor.c:225][tid:2168]>>> 2088 726m 1.6 /usr/bin/mdc/base-plat/process-manager/process-manager slog/dev-os-0/run/device-os/device-os_19700101080010358.log:170:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.544 [sys_memory_monitor.c:225][tid:2168]>>> 2150 622m 1.4 /var/log-daemon slog/dev-os-0/run/device-os/device-os_19700101080010358.log:171:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.550 [sys_memory_monitor.c:225][tid:2168]>>> 2241 522m 1.1 /var/tsdaemon slog/dev-os-0/run/device-os/device-os_19700101080010358.log:172:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.558 [sys_memory_monitor.c:225][tid:2168]>>> 2195 510m 1.1 /usr/bin/mdc/base-plat/process-manager/proc_launcher slog/dev-os-0/run/device-os/device-os_19700101080010358.log:173:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.564 [sys_memory_monitor.c:225][tid:2168]>>> 2155 486m 1.0 /var/dmp_daemon slog/dev-os-0/run/device-os/device-os_19700101080010358.log:174:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.570 [sys_memory_monitor.c:225][tid:2168]>>> 2222 284m 0.6 /var/hdcd slog/dev-os-0/run/device-os/device-os_19700101080010358.log:175:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.576 [sys_memory_monitor.c:225][tid:2168]>>> 2211 284m 0.6 /var/hdcd slog/dev-os-0/run/device-os/device-os_19700101080010358.log:176:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-17:17:33.503.564 [sys_memory_monitor.c:251][tid:2168]>>> memory usage stat: minUsage= 2.3%, maxUsage=93.7%, avgUsage= 2.6%, alarmNum=1, resumeNum=1, duration=10000ms
- 查看CPU占用率
cpu usage alarm:出现了cpu占用率超过阈值上限(90%)的告警。
cpu usage stat:一个周期(一小时)结束时,如果周期内有告警事件,打印该周期的统计结果。
slog/dev-os-0/run/device-os/device-os_19700101080010358.log:16:[INFO] SYSMONITOR(2150,log-daemon):1970-01-01-08:00:09.728.115 [sys_monitor_frame.c:60][tid:2168]>>> system resource monitor start, period: 10000ms slog/dev-os-0/run/device-os/device-os_19700101080010358.log:151:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:03.421.505 [sys_cpu_monitor.c:176][tid:2168]>>> cpu usage alarm: 95.2% slog/dev-os-0/run/device-os/device-os_19700101080010358.log:165:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.581.257 [sys_cpu_monitor.c:168][tid:2168]>>> PID CPU %CPU COMMAND slog/dev-os-0/run/device-os/device-os_19700101080010358.log:166:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.422 [sys_cpu_monitor.c:168][tid:2168]>>> 2112 0 12.1 /usr/bin/mdc/base-plat/aosservice/iammgr slog/dev-os-0/run/device-os/device-os_19700101080010358.log:167:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.519 [sys_cpu_monitor.c:168][tid:2168]>>> 2096 0 2.1 /var/slogd slog/dev-os-0/run/device-os/device-os_19700101080010358.log:168:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.529 [sys_cpu_monitor.c:168][tid:2168]>>> 2142 0 1.6 /var/resource_mgr slog/dev-os-0/run/device-os/device-os_19700101080010358.log:169:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.537 [sys_cpu_monitor.c:168][tid:2168]>>> 2088 0 1.6 /usr/bin/mdc/base-plat/process-manager/process-manager slog/dev-os-0/run/device-os/device-os_19700101080010358.log:170:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.544 [sys_cpu_monitor.c:168][tid:2168]>>> 2150 0 1.4 /var/log-daemon slog/dev-os-0/run/device-os/device-os_19700101080010358.log:171:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.550 [sys_cpu_monitor.c:168][tid:2168]>>> 2241 0 1.1 /var/tsdaemon slog/dev-os-0/run/device-os/device-os_19700101080010358.log:172:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.558 [sys_cpu_monitor.c:168][tid:2168]>>> 2195 0 1.1 /usr/bin/mdc/base-plat/process-manager/proc_launcher slog/dev-os-0/run/device-os/device-os_19700101080010358.log:173:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.564 [sys_cpu_monitor.c:168][tid:2168]>>> 2155 0 1.0 /var/dmp_daemon slog/dev-os-0/run/device-os/device-os_19700101080010358.log:174:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.570 [sys_cpu_monitor.c:168][tid:2168]>>> 2222 0 0.6 /var/hdcd slog/dev-os-0/run/device-os/device-os_19700101080010358.log:175:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-16:22:07.741.576 [sys_cpu_monitor.c:168][tid:2168]>>> 2211 0 0.6 /var/hdcd slog/dev-os-0/run/device-os/device-os_19700101080010358.log:176:[INFO] SYSMONITOR(2150,log-daemon):2024-04-26-17:17:33.503.564 [sys_cpu_monitor.c:184][tid:2168]>>> cpu usage stat: minUsage= 1.8%, maxUsage=95.2%, avgUsage= 3.6%, alarmNum=1, resumeNum=1, duration=10000ms
- 查看文件句柄数
fd usage alarm:出现了文件句柄使用率超过阈值上限(90%)的告警。
fd usage stat: 一个周期(一小时)结束时,如果周期内有告警事件,打印该周期的统计结果。
[INFO] SYSMONITOR(2170,log-daemon):2024-05-15-03:17:43.820.774 [sys_fd_monitor.c:156][tid:2187]>>> fd total: 4454775, used: 4445866, fd usage alarm: 99.8% [INFO] SYSMONITOR(2170,log-daemon):2024-05-15-03:17:43.820.865 [sys_fd_monitor.c:126][tid:2187]>>> sysmonitor fd process top three [INFO] SYSMONITOR(2170,log-daemon):2024-05-15-03:17:43.933.536 [sys_fd_monitor.c:147][tid:2187]>>> pid: 12170 , fd used: 4445532 [INFO] SYSMONITOR(2170,log-daemon):2024-05-15-03:17:43.933.641 [sys_fd_monitor.c:147][tid:2187]>>> pid: 2271 , fd used: 28 [INFO] SYSMONITOR(2170,log-daemon):2024-05-15-03:17:43.933.650 [sys_fd_monitor.c:147][tid:2187]>>> pid: 2116 , fd used: 26 [INFO] SYSMONITOR(2170,log-daemon):2024-05-15-03:27:43.856.787 [INFO] SYSMONITOR(2170,log-daemon):2024-05-15-03:27:43.856.787 [sys_fd_monitor.c:162][tid:2187]>>> fd usage stat: minUsage= 0.0%, maxUsage= 99.8%, avgUsage= 87.3%, alarmNum=1, resumeNum=1, duration=80000ms
- 查看僵尸进程数
zombie process count alarm:出现了僵尸进程数超过阈值上限(5)的告警。
zombie process count stat:一个周期(一小时)结束时,如果周期内有告警事件,打印该周期的统计结果。
[INFO] SYSMONITOR(2166,log-daemon):2024-05-15-00:01:24.019.292 [sys_zp_monitor.c:119][tid:2195]>>> zombie process count alarm: 98 [INFO] SYSMONITOR(2166,log-daemon):2024-05-15-00:01:24.019.356 [sys_zp_monitor.c:134][tid:2195]>>> zombie process count stat: minCount=0, maxCount=98, avgCount=98, alarmNum=1, resumeNum=0, duration=120000ms
- 查看内存占用率
父主题: 常用定位操作