Log Collection Directory Structure
This section describes the structure of the directory to be cleaned. Collect logs and store them according to this structure.
- The total size of the log files in the Ascend-fd parse input directory affects how efficiently the cleaning command runs. The total file size must be less than 5 GB, and the total number of files cannot exceed 1,000,000.
- The size of a CANN application log file must be less than 20 MB.
- Each NPU status monitoring metric file, monitoring metric file of NPU network port statistics, and host resource information file must be smaller than 512 MB.
- The size of a user training or inference log is not limited. By default, only the last 1 MB of the log file is read.
- Host OS logs include messages, dmesg, vmcore_dmesg.txt, and sysmonitor.log. Each dumped file must be smaller than 512 MB. Only the most recent dmesg entries, up to 100,000 lines, are extracted.
- The locations of process_log, environment_check, device_log, dl_log, mindie, and amct_log are not restricted. They can be stored in any location in the collection directory.
- If you perform training or inference in a container, save logs, such as user training or inference logs and CANN App logs, to the host in a timely manner.
- Collect the NPU environment check files before and after training or inference, monitoring metric files of NPU network port statistics, NPU status monitoring metric files, host resource information, host OS logs, device logs, MindCluster component logs, MindIE component logs, and AMCT logs on the host.
- After volcano-scheduler and volcano-controller trigger dumping, the dumped logs compressed in gzip format will not be read. Ensure that related logs are stored in volcano-scheduler.log and volcano-controller.log that are not dumped.
- You can collect console logs of all pods on the master node of the Kubernetes cluster and store all MindIE pod console logs in a specified directory on a node.
- MindIE pod console logs are subject to an aging mechanism. If the collected MindIE pod console logs do not contain instance node information, multi-instance fault diagnosis is not supported.
- You can summarize all logs to the same collection directory for cleaning. The following is an example of the directory structure of the files to be cleaned:
- Host log directory structure
Collection_directory
|-- messages                          # Host OS logs
|-- dmesg                             # Host kernel message logs
|-- crash
    |-- Directory combining the host name and fault occurrence time (e.g., 127.xx.xx.1-2024-09-23-11:25:29)
        |-- vmcore_dmesg.txt          # Host kernel message log file saved when the system breaks down
|-- sysmonitor.log                    # System monitoring log file
|-- rank-0.txt                        # Training and inference console log file
|-- dmidecode.txt                     # dmidecode output log file
...
|-- rank-7.txt                        # Training and inference console log file
|-- process_log                       # Original App logs of CANN in the process_log directory
|-- device_log                        # Device logs, which must be stored in the device_log directory.
|-- dl_log                            # MindCluster component file, whose name must be dl_log.
    |-- devicePlugin                  # Ascend Device Plugin logs
    |-- noded                         # NodeD logs
    |-- ascend-docker-runtime         # Ascend Docker Runtime logs
    |-- volcano-scheduler             # volcano-scheduler logs
    |-- volcano-controller            # volcano-controller logs
    |-- npu-exporter                  # NPU Exporter logs
|-- mindie                            # MindIE component logs
    |-- log
        |-- debug                     # Run logs of MindIE components
        |-- security                  # Audit logs of MindIE components
        |-- mindie_cluster_log        # MindIE pod console logs
|-- amct_log                          # AMCT logs
|-- environment_check                 # Information about the NPU network port, status, and resource
    |-- npu_smi_0_details.csv         # NPU status monitoring metric file
    ...
    |-- npu_smi_7_details.csv         # NPU status monitoring metric file
    |-- npu_0_details.csv             # NPU network port monitoring metric file
    ...
    |-- npu_7_details.csv             # NPU network port monitoring metric file
    |-- npu_info_before/after.txt     # NPU environment check file before or after training or inference
    |-- host_metrics_{core_num}.json  # Monitoring metric file of host resources
- BMC log directory structure
Collection_directory/dump_info/AppDump/*/*.log
Collection_directory/dump_info/DeviceDump/*/*.log
Collection_directory/dump_info/LogDump/*/*.log
Collection_directory/dump_info/AppDump/frudata/fruinfo.txt     # BMC extension board SNs
Collection_directory/dump_info/AppDump/chassis/mdb_info.log    # SuperPoD information for BMC devices
- LCNE log directory structure
Collection_directory/*/diagnostic_information/slot_1/tempdir/devm_bddrvadp.log    # LCNE extension board SNs
Collection_directory/*/diag_display_info.txt                                      # SuperPoD information for LCNE devices
Collection_directory/*/log.log
Collection_directory/*/log_1_*.log
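The size limits listed earlier in this section can be checked before running the cleaning command. The following is a minimal sketch, not part of the tool itself. The function names are hypothetical, 1 MB is assumed to mean 1024 x 1024 bytes, and the mapping from a file's location to its limit is an approximation based on the directory structure shown above.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB

# Limits taken from the list at the start of this section.
MAX_TOTAL_BYTES = 5 * GB
MAX_FILE_COUNT = 1_000_000
MAX_PLOG_BYTES = 20 * MB      # CANN application log files
MAX_METRIC_BYTES = 512 * MB   # metric files, host resource files, host OS logs

HOST_LOG_NAMES = {"dmesg", "vmcore_dmesg.txt", "sysmonitor.log"}


def per_file_limit(rel_path):
    """Return the per-file size limit in bytes for a relative path, or None."""
    rel_path = Path(rel_path)
    parts = rel_path.parts
    if "process_log" in parts and rel_path.suffix == ".log":
        return MAX_PLOG_BYTES
    if "environment_check" in parts and rel_path.suffix in {".csv", ".json"}:
        return MAX_METRIC_BYTES
    if rel_path.name in HOST_LOG_NAMES or rel_path.name.startswith("messages"):
        return MAX_METRIC_BYTES
    return None  # for example, user training or inference logs


def check_collection_dir(root):
    """Return a list of human-readable limit violations found under root."""
    root = Path(root)
    problems = []
    total_bytes = 0
    file_count = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        size = path.stat().st_size
        total_bytes += size
        file_count += 1
        limit = per_file_limit(path.relative_to(root))
        # "Must be less than" means a file of exactly the limit is too large.
        if limit is not None and size >= limit:
            problems.append(f"{path}: {size} bytes, limit is {limit} bytes")
    if total_bytes >= MAX_TOTAL_BYTES:
        problems.append(f"total size {total_bytes} bytes reaches the 5 GB limit")
    if file_count > MAX_FILE_COUNT:
        problems.append(f"{file_count} files exceed the limit of {MAX_FILE_COUNT}")
    return problems
```

Calling check_collection_dir("Collection_directory") returns an empty list when no limit is violated. Otherwise it returns one message per violation.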
Table 1 describes the log files stored in each directory.
Table 1 Log file list

| File Type | Log File | Description | Storage Path |
|---|---|---|---|
| CANN App logs | plog-{pid}_{time}.log | App logs on the host | Collection_directory/process_log/debug or run/plog/plog-{pid}_{time}.log |
| CANN App logs | device-{pid}_{time}.log | App logs on the device | Collection_directory/process_log/debug or run/device-{id}/device-{pid}_{time}.log |
| User training or inference logs | rank-{id}.txt; rank-{id}.log; worker-{id}.txt; worker-{id}.log | Training and inference console logs | Collection_directory/rank-{id}.*?.txt; Collection_directory/rank-{id}.*?.log; Collection_directory/worker-{id}.*?.log; Collection_directory/worker-{id}.*?.txt |
| NPU network port resource information | npu_info_before.txt | NPU network port check file before training or inference | Collection_directory/environment_check/npu_info_before.txt |
| NPU network port resource information | npu_info_after.txt | NPU network port check file after training or inference | Collection_directory/environment_check/npu_info_after.txt |
| NPU network port resource information | npu_smi_{npu_id}_details.csv | NPU status monitoring metric file | Collection_directory/environment_check/npu_smi_{npu_id}_details.csv |
| NPU network port resource information | npu_{npu_id}_details.csv | Monitoring metric file of NPU network port statistics | Collection_directory/environment_check/npu_{npu_id}_details.csv |
| Host resource information | host_metrics_{core_num}.json | Host resource monitoring metric file | Collection_directory/environment_check/host_metrics_{core_num}.json |
| Host resource information | dmidecode.txt | Log file containing DMI hardware information on the host | Collection_directory/dmidecode.txt |
| Host logs | dmesg | Kernel message file on the host | Collection_directory/dmesg |
| Host logs | sysmonitor.log | System monitoring file on the host | Collection_directory/sysmonitor.log |
| Host logs | messages-*? | Host OS log file | Collection_directory/messages-*? |
| Host logs | vmcore_dmesg.txt | Host kernel message file saved when the system breaks down | Collection_directory/crash/Directory_combining_the_host_name_and_fault_occurrence_time (e.g., 127.xx.xx.1-2024-09-23-11:25:29)/vmcore_dmesg.txt |
| Device logs | device-os_{time}.log | System logs of Ctrl CPUs on the device | Collection_directory/device_log/slog/dev-os-{id}/debug or run/device-os/device-os_{time}.log |
| Device logs | event_{time}.log | EVENT-level system logs of Ctrl CPUs on the device | Ascend HDK 23.0.3 and later versions: Collection_directory/device_log/slog/dev-os-{id}/run/event/event_{time}.log |
| Device logs | device-{id}_{time}.log | System logs of non-Ctrl CPUs on the device | Ascend HDK 23.0.RC3: Collection_directory/device_log/slog/dev-os-{id}/device-{id}/device-{id}_{time}.log; Ascend HDK 23.0.3 and later versions: Collection_directory/device_log/slog/dev-os-{id}/debug/device-{id}/device-{id}_{time}.log |
| Device logs | history.log | Black Box logs | Collection_directory/device_log/hisi_logs/device-{id}/history.log |
| MindCluster component logs | devicePlugin*.log | SuperPoD logs and Ascend Device Plugin logs | Collection_directory/dl_log/devicePlugin/devicePlugin*.log |
| MindCluster component logs | noded*.log | AI server logs | Collection_directory/dl_log/noded/noded*.log |
| MindCluster component logs | runtime-run*.log | Logs generated when ascend-docker-runtime of Ascend Docker Runtime is executed | Collection_directory/dl_log/ascend-docker-runtime/runtime-run*.log |
| MindCluster component logs | hook-run*.log | Logs generated when ascend-docker-hook of Ascend Docker Runtime is executed | Collection_directory/dl_log/ascend-docker-runtime/hook-run*.log |
| MindCluster component logs | volcano-scheduler*.log | volcano-scheduler logs | Collection_directory/dl_log/volcano-scheduler/volcano-scheduler*.log |
| MindCluster component logs | volcano-controller*.log | volcano-controller logs | Collection_directory/dl_log/volcano-controller/volcano-controller*.log |
| MindCluster component logs | npu-exporter*.log | NPU Exporter logs | Collection_directory/dl_log/npu-exporter/npu-exporter*.log |
| MindIE component logs | mindie-{module}_{pid}_{datetime}.log | Logs of MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, and MindIE Client | Collection_directory/mindie/log/debug/mindie-{module}_{pid}_{datetime}.log |
| AMCT logs | amct_{framework}.log | AMCT logs | Collection_directory/amct_log/amct_{framework}.log |
| BMC logs | All .log files in the out-of-band environment | Out-of-band logs collected by one click | Collection_directory/dump_info/AppDump/*/*.log; Collection_directory/dump_info/DeviceDump/*/*.log; Collection_directory/dump_info/LogDump/*/*.log; Collection_directory/dump_info/AppDump/frudata/fruinfo.txt; Collection_directory/dump_info/AppDump/chassis/mdb_info.log |
| LCNE logs | All .log files of LCNE | LCNE logs | Collection_directory/*/diagnostic_information/slot_1/tempdir/devm_bddrvadp.log; Collection_directory/*/diag_display_info.txt; Collection_directory/*/log.log; Collection_directory/*/log_1_*.log |
| MindIE pod console logs | {podname}.log | MindIE pod console logs | Collection_directory/mindie/log/mindie_cluster_log/{podname}.log |
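The storage paths in Table 1 can be turned into patterns for sorting collected files by file type. The sketch below is illustrative only and does not reflect how the cleaning command itself matches files. The RULES list and the classify function are hypothetical names, placeholders such as {pid}, {time}, and {id} are matched loosely as "any characters except a slash" (an assumption), and the BMC and LCNE rows are omitted. Directories that may be stored anywhere in the collection directory (process_log, environment_check, device_log, dl_log, mindie, and amct_log) are allowed at any depth.

```python
import re

# (file type, path pattern, directory may sit at any depth).
# Patterns are derived from the Storage Path column of Table 1.
RULES = [
    ("CANN App logs", r"process_log/(debug|run)/plog/plog-[^/]+_[^/]+\.log", True),
    ("CANN App logs",
     r"process_log/(debug|run)/device-[^/]+/device-[^/]+_[^/]+\.log", True),
    ("User training or inference logs", r"(rank|worker)-[^/]+\.(txt|log)", False),
    ("NPU network port resource information",
     r"environment_check/npu_info_(before|after)\.txt", True),
    ("NPU network port resource information",
     r"environment_check/npu_[^/]+_details\.csv", True),
    ("Host resource information", r"environment_check/host_metrics_[^/]+\.json", True),
    ("Host resource information", r"dmidecode\.txt", False),
    ("Host logs", r"dmesg|sysmonitor\.log|messages[^/]*", False),
    ("Host logs", r"crash/[^/]+/vmcore_dmesg\.txt", False),
    ("Device logs", r"device_log/slog/dev-os-[^/]+/.+\.log", True),
    ("Device logs", r"device_log/hisi_logs/device-[^/]+/history\.log", True),
    ("MindCluster component logs",
     r"dl_log/(devicePlugin|noded|ascend-docker-runtime|volcano-scheduler"
     r"|volcano-controller|npu-exporter)/[^/]+\.log", True),
    ("MindIE component logs", r"mindie/log/debug/mindie-[^/]+\.log", True),
    ("MindIE pod console logs", r"mindie/log/mindie_cluster_log/[^/]+\.log", True),
    ("AMCT logs", r"amct_log/amct_[^/]+\.log", True),
]

# Compile once; relocatable directories get an optional leading path prefix.
COMPILED = [
    (file_type, re.compile((r"(?:.*/)?" if anywhere else "") + "(?:" + body + ")"))
    for file_type, body, anywhere in RULES
]


def classify(rel_path):
    """Return the Table 1 file type for a POSIX path relative to the
    collection directory, or None if no rule matches."""
    for file_type, pattern in COMPILED:
        if pattern.fullmatch(rel_path):
            return file_type
    return None
```

For example, classify("dl_log/noded/noded-1.log") returns "MindCluster component logs", and a path that matches no row returns None.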
- You can also use the input parameters of the corresponding cleaning command to clean the log directory. The storage structure of the log file corresponding to each parameter is as follows. For details about the cleaning command parameters, see Table 1.
|-- ${Paths specified by --process_log}
    |-- debug/plog/plog-{pid}_{time}.log
    |-- run/plog/plog-{pid}_{time}.log
    |-- debug/device-*/device-{pid}_{time}.log
    |-- run/device-*/device-{pid}_{time}.log
|-- ${Paths specified by --device_log}
    |-- slog/dev-os-*/debug/device-os/device-os_*.log
    |-- slog/dev-os-*/run/device-os/device-os_*.log
    |-- slog/dev-os-*/run/event/event_*.log            # Displayed only in Ascend HDK 23.0.3 and later versions.
    |-- slog/dev-os-*/device-*/device-*_*.log          # Path of device-*_*.log in Ascend HDK 23.0.RC3
    |-- slog/dev-os-*/debug/device-*/device-*_*.log    # Path of device-*_*.log in Ascend HDK 23.0.3 and later versions
    |-- hisi_logs/device-*/history.log
    ....
|-- ${Paths specified by --env_check}
    |-- npu_info_before.txt
    |-- npu_info_after.txt
    |-- npu_smi_0_details.csv
    ...
    |-- npu_smi_7_details.csv
    |-- npu_0_details.csv
    ...
    |-- npu_7_details.csv
|-- ${Paths specified by --train_log}
    |-- rank-0.txt
    ...
    |-- rank-7.txt
|-- ${Paths specified by --host_log}
    |-- messages
    |-- crash
        |-- Directory combining the host name and fault occurrence time (e.g., 127.xx.xx.1-2024-09-23-11:25:29)
            |-- vmcore_dmesg.txt
    |-- dmesg
    |-- sysmonitor.log
|-- ${Paths specified by --dl_log}
    |-- devicePlugin/devicePlugin*.log
    |-- noded/noded*.log
    |-- ascend-docker-runtime/runtime-run*.log
    |-- ascend-docker-runtime/hook-run*.log
    |-- volcano-scheduler/volcano-scheduler*.log
    |-- volcano-controller/volcano-controller*.log
    |-- npu-exporter/npu-exporter*.log
|-- ${Path specified by --mindie_log}
    |-- log/debug/mindie-{module}_{pid}_{datetime}.log
    |-- log/mindie_cluster_log/{podname}.log
|-- ${Path specified by --amct_log}
    |-- amct_{framework}.log

| File Type | Log File | Description | Log Directory |
|---|---|---|---|
| CANN App logs | plog-{pid}_{time}.log | App logs on the host | ${--process_log}/debug/plog/plog-{pid}_{time}.log; ${--process_log}/run/plog/plog-{pid}_{time}.log |
| CANN App logs | device-{pid}_{time}.log | App logs on the device | ${--process_log}/debug/device-{id}/device-{pid}_{time}.log; ${--process_log}/run/device-{id}/device-{pid}_{time}.log |
| User training or inference logs | rank-{id}.txt; rank-{id}.log; worker-{id}.txt; worker-{id}.log | Training and inference console logs | ${--train_log}/rank-{id}.*?.txt; ${--train_log}/rank-{id}.*?.log; ${--train_log}/worker-{id}.*?.log; ${--train_log}/worker-{id}.*?.txt |
| NPU network port resource information | npu_info_before.txt | NPU network port check file before training | ${--env_check}/npu_info_before.txt |
| NPU network port resource information | npu_info_after.txt | NPU network port check file after training | ${--env_check}/npu_info_after.txt |
| NPU network port resource information | npu_smi_{npu_id}_details.csv | NPU status monitoring metric file | ${--env_check}/npu_smi_{npu_id}_details.csv |
| NPU network port resource information | npu_{npu_id}_details.csv | Monitoring metric file of NPU network port statistics | ${--env_check}/npu_{npu_id}_details.csv |
| Host resource information | host_metrics_{core_num}.json | Host resource monitoring metric file | ${--env_check}/host_metrics_{core_num}.json |
| Host logs | messages-*? | Host OS log file | ${--host_log}/messages-*? |
| Host logs | dmesg | Kernel message file on the host | ${--host_log}/dmesg |
| Host logs | vmcore_dmesg.txt | Host kernel message file saved when the system breaks down | ${--host_log}/crash/Directory_combining_the_host_name_and_fault_occurrence_time (e.g., 127.xx.xx.1-2024-09-23-11:25:29)/vmcore_dmesg.txt |
| Host logs | sysmonitor.log | System monitoring file on the host | ${--host_log}/sysmonitor.log |
| Device logs | device-os_{time}.log | System logs of Ctrl CPUs on the device | ${--device_log}/slog/dev-os-{id}/debug/device-os/device-os_{time}.log |
| Device logs | event_{time}.log | EVENT-level system logs of Ctrl CPUs on the device | Ascend HDK 23.0.3 and later versions: ${--device_log}/slog/dev-os-{id}/run/event/event_{time}.log |
| Device logs | device-{id}_{time}.log | System logs of non-Ctrl CPUs on the device | Ascend HDK 23.0.RC3: ${--device_log}/slog/dev-os-{id}/device-{id}/device-{id}_{time}.log; Ascend HDK 23.0.3 and later versions: ${--device_log}/slog/dev-os-{id}/debug/device-{id}/device-{id}_{time}.log |
| Device logs | history.log | Black Box logs | ${--device_log}/hisi_logs/device-{id}/history.log |
| MindCluster component logs | devicePlugin*.log | SuperPoD logs and Ascend Device Plugin logs | ${--dl_log}/devicePlugin/devicePlugin*.log |
| MindCluster component logs | noded*.log | AI server logs | ${--dl_log}/noded/noded*.log |
| MindCluster component logs | runtime-run*.log | Logs generated when ascend-docker-runtime of Ascend Docker Runtime is executed | ${--dl_log}/ascend-docker-runtime/runtime-run*.log |
| MindCluster component logs | hook-run*.log | Logs generated when ascend-docker-hook of Ascend Docker Runtime is executed | ${--dl_log}/ascend-docker-runtime/hook-run*.log |
| MindCluster component logs | volcano-scheduler*.log | volcano-scheduler logs | ${--dl_log}/volcano-scheduler/volcano-scheduler*.log |
| MindCluster component logs | volcano-controller*.log | volcano-controller logs | ${--dl_log}/volcano-controller/volcano-controller*.log |
| MindCluster component logs | npu-exporter*.log | NPU Exporter logs | ${--dl_log}/npu-exporter/npu-exporter*.log |
| MindIE component logs | mindie-{module}_{pid}_{datetime}.log | Logs of MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, and MindIE Client | ${--mindie_log}/log/debug/mindie-{module}_{pid}_{datetime}.log |
| MindIE pod console logs | {podname}.log | MindIE pod console logs | ${--mindie_log}/log/mindie_cluster_log/{podname}.log |
| AMCT logs | amct_{framework}.log | AMCT logs | ${--amct_log}/amct_{framework}.log |
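Before passing individual directories to the cleaning command, it can help to confirm that each one contains files in the expected layout. The following is a minimal sketch under stated assumptions. The parameter names come from this section, but EXPECTED_GLOBS and find_logs are hypothetical helpers. Placeholders such as {pid} and {time} are approximated with the * wildcard, and the host_metrics_*.json entry under --env_check is taken from the table rather than the tree.

```python
from pathlib import Path

# Glob patterns relative to the path given for each parameter.
EXPECTED_GLOBS = {
    "--process_log": [
        "debug/plog/plog-*.log", "run/plog/plog-*.log",
        "debug/device-*/device-*.log", "run/device-*/device-*.log",
    ],
    "--device_log": [
        "slog/dev-os-*/debug/device-os/device-os_*.log",
        "slog/dev-os-*/run/device-os/device-os_*.log",
        "slog/dev-os-*/run/event/event_*.log",
        "slog/dev-os-*/device-*/device-*_*.log",
        "slog/dev-os-*/debug/device-*/device-*_*.log",
        "hisi_logs/device-*/history.log",
    ],
    "--env_check": [
        "npu_info_before.txt", "npu_info_after.txt",
        "npu_smi_*_details.csv", "npu_*_details.csv", "host_metrics_*.json",
    ],
    "--train_log": ["rank-*.txt", "rank-*.log", "worker-*.txt", "worker-*.log"],
    "--host_log": ["messages*", "dmesg", "sysmonitor.log", "crash/*/vmcore_dmesg.txt"],
    "--dl_log": [
        "devicePlugin/devicePlugin*.log", "noded/noded*.log",
        "ascend-docker-runtime/runtime-run*.log",
        "ascend-docker-runtime/hook-run*.log",
        "volcano-scheduler/volcano-scheduler*.log",
        "volcano-controller/volcano-controller*.log",
        "npu-exporter/npu-exporter*.log",
    ],
    "--mindie_log": ["log/debug/mindie-*.log", "log/mindie_cluster_log/*.log"],
    "--amct_log": ["amct_*.log"],
}


def find_logs(param_paths):
    """Map each parameter to the sorted list of files under its directory
    that match one of the expected patterns."""
    found = {}
    for param, base in param_paths.items():
        base = Path(base)
        matches = set()
        if base.is_dir():
            for pattern in EXPECTED_GLOBS.get(param, []):
                matches.update(p for p in base.glob(pattern) if p.is_file())
        found[param] = sorted(matches)
    return found
```

An empty list for a parameter suggests that the directory is missing or that its files are not stored in the layout described above.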