Log Collection Directory Structure

This section describes the structure of the directory to be cleaned. You can collect logs and store them in the corresponding structure.

  • The total size of the log files in the Ascend-fd parse input directory affects how efficiently the cleaning command runs. The total file size must be less than 5 GB, and the total number of files cannot exceed 1,000,000.
  • Each CANN application log file must be smaller than 20 MB.
  • Each NPU status monitoring metric file, monitoring metric file of NPU network port statistics, and host resource information file must be smaller than 512 MB.
  • The size of a user training or inference log is not limited. By default, only the last 1 MB of the log file is read.
  • Host OS logs include messages, dmesg, vmcore_dmesg.txt, and sysmonitor.log. Each dumped file must be smaller than 512 MB. For dmesg, at most the latest 100,000 lines are extracted.
  • The locations of process_log, environment_check, device_log, dl_log, mindie, and amct_log are not restricted. They can be stored in any location in the collection directory.
  • If you perform training or inference in a container, save logs, such as user training or inference logs and CANN App logs, to the host in a timely manner.
  • Collect the NPU environment check files before and after training or inference, monitoring metric files of NPU network port statistics, NPU status monitoring metric files, host resource information, host OS logs, device logs, MindCluster component logs, MindIE component logs, and AMCT logs on the host.
  • Logs that volcano-scheduler and volcano-controller have dumped and compressed in gzip format are not read. Ensure that the relevant logs are still in the volcano-scheduler.log and volcano-controller.log files that have not been dumped.
  • You can collect console logs of all pods on the master node of the Kubernetes cluster and store all MindIE pod console logs in a specified directory on a node.
  • MindIE pod console logs are subject to an aging mechanism. If the collected MindIE pod console logs do not contain instance node information, multi-instance fault diagnosis is not supported.
  • You can gather all logs into a single collection directory for cleaning. The following is an example of the directory structure of the files to be cleaned:
    • Host log directory structure
      Collection_directory
      |-- messages              # Host OS logs
      |-- dmesg                # Host kernel message logs
      |-- crash
          |-- Directory combining the host name and fault occurrence time (e.g., 127.xx.xx.1-2024-09-23-11:25:29)
              |-- vmcore_dmesg.txt     # Host kernel message log file saved when the system breaks down
      |-- sysmonitor.log       # System monitoring log file
      |-- rank-0.txt           # Training and inference console log file
      ...
      |-- rank-7.txt           # Training and inference console log file
      |-- dmidecode.txt        # dmidecode output log file
      |-- process_log          # Original App logs of CANN in the process_log directory
      |-- device_log           # Device logs, which must be stored in the device_log directory.
      |-- dl_log               # MindCluster component logs. The directory must be named dl_log.
          |-- devicePlugin       # Ascend Device Plugin logs
          |-- noded              # NodeD logs
          |-- ascend-docker-runtime        # Ascend Docker Runtime logs
          |-- volcano-scheduler            # volcano-scheduler logs
          |-- volcano-controller           # volcano-controller logs
          |-- npu-exporter                 # NPU Exporter logs
      |-- mindie               # MindIE component logs
          |-- log
              |-- debug        # Run logs of MindIE components
              |-- security     # Audit logs of MindIE components
              |-- mindie_cluster_log    # MindIE pod console logs
      |-- amct_log             # AMCT logs
      |-- environment_check    # NPU network port, NPU status, and host resource information
          |-- npu_smi_0_details.csv   # NPU status monitoring metric file
           ...
          |-- npu_smi_7_details.csv   # NPU status monitoring metric file
          |-- npu_0_details.csv       # NPU network port monitoring metric file
           ...    
          |-- npu_7_details.csv       # NPU network port monitoring metric file
          |-- npu_info_before/after.txt  # NPU environment check file before or after training or inference
          |-- host_metrics_{core_num}.json # Monitoring metric file of host resources
    • BMC log directory structure
      Collection_directory/dump_info/AppDump/*/*.log
      Collection_directory/dump_info/DeviceDump/*/*.log
      Collection_directory/dump_info/LogDump/*/*.log
      Collection_directory/dump_info/AppDump/frudata/fruinfo.txt # BMC extension board SNs
      Collection_directory/dump_info/AppDump/chassis/mdb_info.log # SuperPoD information for BMC devices
    • LCNE log directory structure
      Collection_directory/*/diagnostic_information/slot_1/tempdir/devm_bddrvadp.log # LCNE extension board SNs
      Collection_directory/*/diag_display_info.txt # SuperPoD information for LCNE devices
      Collection_directory/*/log.log
      Collection_directory/*/log_1_*.log

      Table 1 describes the log files stored in each directory.

      Table 1 Log file list

      | File Type | Log File | Description | Storage Path |
      |---|---|---|---|
      | CANN App logs | plog-{pid}_{time}.log | App logs on the host | Collection_directory/process_log/debug or run/plog/plog-{pid}_{time}.log |
      | CANN App logs | device-{pid}_{time}.log | App logs on the device | Collection_directory/process_log/debug or run/device-{id}/device-{pid}_{time}.log |
      | User training or inference logs | rank-{id}.txt; rank-{id}.log; worker-{id}.txt; worker-{id}.log | Training and inference console logs | Collection_directory/rank-{id}.*?.txt; Collection_directory/rank-{id}.*?.log; Collection_directory/worker-{id}.*?.log; Collection_directory/worker-{id}.*?.txt |
      | NPU network port resource information | npu_info_before.txt | NPU network port check file before training or inference | Collection_directory/environment_check/npu_info_before.txt |
      | NPU network port resource information | npu_info_after.txt | NPU network port check file after training or inference | Collection_directory/environment_check/npu_info_after.txt |
      | NPU network port resource information | npu_smi_{npu_id}_details.csv | NPU status monitoring metric file | Collection_directory/environment_check/npu_smi_{npu_id}_details.csv |
      | NPU network port resource information | npu_{npu_id}_details.csv | Monitoring metric file of NPU network port statistics | Collection_directory/environment_check/npu_{npu_id}_details.csv |
      | Host resource information | host_metrics_{core_num}.json | Host resource monitoring metric file | Collection_directory/environment_check/host_metrics_{core_num}.json |
      | Host resource information | dmidecode.txt | Log file containing DMI hardware information on the host | Collection_directory/dmidecode.txt |
      | Host logs | dmesg | Kernel message file on the host | Collection_directory/dmesg |
      | Host logs | sysmonitor.log | System monitoring file on the host | Collection_directory/sysmonitor.log |
      | Host logs | messages-*? | Host OS log file | Collection_directory/messages-*? |
      | Host logs | vmcore_dmesg.txt | Host kernel message file saved when the system breaks down | Collection_directory/crash/Directory_combining_the_host_name_and_fault_occurrence_time (e.g., 127.xx.xx.1-2024-09-23-11:25:29)/vmcore_dmesg.txt |
      | Device logs | device-os_{time}.log | System logs of Ctrl CPUs on the device | Collection_directory/device_log/slog/dev-os-{id}/debug or run/device-os/device-os_{time}.log |
      | Device logs | event_{time}.log | EVENT-level system logs of Ctrl CPUs on the device | Ascend HDK 23.0.3 and later versions: Collection_directory/device_log/slog/dev-os-{id}/run/event/event_{time}.log |
      | Device logs | device-{id}_{time}.log | System logs of non-Ctrl CPUs on the device | Ascend HDK 23.0.RC3: Collection_directory/device_log/slog/dev-os-{id}/device-{id}/device-{id}_{time}.log; Ascend HDK 23.0.3 and later versions: Collection_directory/device_log/slog/dev-os-{id}/debug/device-{id}/device-{id}_{time}.log |
      | Device logs | history.log | Black Box logs | Collection_directory/device_log/hisi_logs/device-{id}/history.log |
      | MindCluster component logs | devicePlugin*.log | SuperPoD logs and Ascend Device Plugin logs | Collection_directory/dl_log/devicePlugin/devicePlugin*.log |
      | MindCluster component logs | noded*.log | AI server logs | Collection_directory/dl_log/noded/noded*.log |
      | MindCluster component logs | runtime-run*.log | Logs generated when ascend-docker-runtime of Ascend Docker Runtime is executed | Collection_directory/dl_log/ascend-docker-runtime/runtime-run*.log |
      | MindCluster component logs | hook-run*.log | Logs generated when ascend-docker-hook of Ascend Docker Runtime is executed | Collection_directory/dl_log/ascend-docker-runtime/hook-run*.log |
      | MindCluster component logs | volcano-scheduler*.log | volcano-scheduler logs | Collection_directory/dl_log/volcano-scheduler/volcano-scheduler*.log |
      | MindCluster component logs | volcano-controller*.log | volcano-controller logs | Collection_directory/dl_log/volcano-controller/volcano-controller*.log |
      | MindCluster component logs | npu-exporter*.log | NPU Exporter logs | Collection_directory/dl_log/npu-exporter/npu-exporter*.log |
      | MindIE component logs | mindie-{module}_{pid}_{datetime}.log | Logs of MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, and MindIE Client | Collection_directory/mindie/log/debug/mindie-{module}_{pid}_{datetime}.log |
      | AMCT logs | amct_{framework}.log | AMCT logs | Collection_directory/amct_log/amct_{framework}.log |
      | BMC logs | All .log files in the out-of-band environment | Out-of-band logs collected by one click | Collection_directory/dump_info/AppDump/*/*.log; Collection_directory/dump_info/DeviceDump/*/*.log; Collection_directory/dump_info/LogDump/*/*.log; Collection_directory/dump_info/AppDump/frudata/fruinfo.txt; Collection_directory/dump_info/AppDump/chassis/mdb_info.log |
      | LCNE logs | All .log files of LCNE | LCNE logs | Collection_directory/*/diagnostic_information/slot_1/tempdir/devm_bddrvadp.log; Collection_directory/*/diag_display_info.txt; Collection_directory/*/log.log; Collection_directory/*/log_1_*.log |
      | MindIE pod console logs | {podname}.log | MindIE pod console logs | Collection_directory/mindie/log/mindie_cluster_log/{podname}.log |
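
The size and count limits listed at the start of this section can be checked before the cleaning command is run. The following is a minimal Python sketch, not part of Ascend-fd. The helper names (per_file_limit, check_collection_directory) are hypothetical, 1 MB is assumed to mean 1024 x 1024 bytes, and a file is matched to a limit only by the directory and file names used in the example structure above.

```python
import os
from pathlib import Path

MB = 1024 ** 2  # assumption: the documented limits use binary units
GB = 1024 ** 3

MAX_TOTAL_BYTES = 5 * GB             # total size must be less than 5 GB
MAX_FILE_COUNT = 1_000_000           # total number of files cannot exceed this
MAX_CANN_LOG_BYTES = 20 * MB         # each CANN application log file
MAX_METRIC_OR_HOST_BYTES = 512 * MB  # metric files and host OS logs

HOST_LOG_NAMES = {"dmesg", "vmcore_dmesg.txt", "sysmonitor.log"}


def per_file_limit(relative_path):
    """Return the per-file size limit in bytes, or None if none is stated."""
    parts = relative_path.parts
    name = relative_path.name
    if "process_log" in parts:
        return MAX_CANN_LOG_BYTES
    if "environment_check" in parts and relative_path.suffix in (".csv", ".json"):
        return MAX_METRIC_OR_HOST_BYTES
    if name in HOST_LOG_NAMES or name.startswith("messages"):
        return MAX_METRIC_OR_HOST_BYTES
    return None


def check_collection_directory(root):
    """Walk root and return a list of human-readable limit violations."""
    root = Path(root)
    problems = []
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            size = path.lstat().st_size  # lstat avoids failing on broken links
            total_bytes += size
            file_count += 1
            limit = per_file_limit(path.relative_to(root))
            if limit is not None and size >= limit:
                problems.append(f"{path}: {size} bytes, must be less than {limit}")
    if total_bytes >= MAX_TOTAL_BYTES:
        problems.append(f"total size {total_bytes} bytes, must be less than {MAX_TOTAL_BYTES}")
    if file_count > MAX_FILE_COUNT:
        problems.append(f"{file_count} files, must not exceed {MAX_FILE_COUNT}")
    return problems
```

Calling check_collection_directory("Collection_directory") returns an empty list when no documented limit is violated. User training or inference logs are deliberately not checked, because their size is not limited.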

  • You can also pass individual log directories to the cleaning command through its input parameters. The storage structure expected under the path given for each parameter is as follows. For details about the cleaning command parameters, see Table 1.
    |-- ${Paths specified by --process_log}
            |-- debug/plog/plog-{pid}_{time}.log
            |-- run/plog/plog-{pid}_{time}.log
            |-- debug/device-*/device-{pid}_{time}.log
            |-- run/device-*/device-{pid}_{time}.log
    
    |-- ${Paths specified by --device_log}
            |-- slog/dev-os-*/debug/device-os/device-os_*.log
            |-- slog/dev-os-*/run/device-os/device-os_*.log
            |-- slog/dev-os-*/run/event/event_*.log      # Displayed only in Ascend HDK 23.0.3 and later versions.
            |-- slog/dev-os-*/device-*/device-*_*.log    # Path of device-*_*.log in Ascend HDK 23.0.RC3
            |-- slog/dev-os-*/debug/device-*/device-*_*.log   # Path of device-*_*.log in Ascend HDK 23.0.3 and later versions
            |-- hisi_logs/device-*/history.log
            ....
    
    |-- ${Paths specified by --env_check}
           |-- npu_info_before.txt 
           |-- npu_info_after.txt 
           |-- npu_smi_0_details.csv
           ...
           |-- npu_smi_7_details.csv
           |-- npu_0_details.csv
           ...
           |-- npu_7_details.csv
    
    |-- ${Paths specified by --train_log}
           |-- rank-0.txt      
           ...
           |-- rank-7.txt  
     
    |-- ${Paths specified by --host_log}
           |-- messages
           |-- crash
                  |-- Directory combining the host name and fault occurrence time (e.g., 127.xx.xx.1-2024-09-23-11:25:29)
                         |-- vmcore_dmesg.txt
           |-- dmesg 
           |-- sysmonitor.log   
    
    |-- ${Paths specified by --dl_log}
           |-- devicePlugin/devicePlugin*.log
           |-- noded/noded*.log
           |-- ascend-docker-runtime/runtime-run*.log
           |-- ascend-docker-runtime/hook-run*.log
           |-- volcano-scheduler/volcano-scheduler*.log
           |-- volcano-controller/volcano-controller*.log
           |-- npu-exporter/npu-exporter*.log
    
    |-- ${Path specified by --mindie_log}
           |-- log/debug/mindie-{module}_{pid}_{datetime}.log
           |-- log/mindie_cluster_log/{podname}.log
    
    |-- ${Path specified by --amct_log}
           |-- amct_{framework}.log

    | File Type | Log File | Description | Log Directory |
    |---|---|---|---|
    | CANN App logs | plog-{pid}_{time}.log | App logs on the host | ${--process_log}/debug/plog/plog-{pid}_{time}.log; ${--process_log}/run/plog/plog-{pid}_{time}.log |
    | CANN App logs | device-{pid}_{time}.log | App logs on the device | ${--process_log}/debug/device-{id}/device-{pid}_{time}.log; ${--process_log}/run/device-{id}/device-{pid}_{time}.log |
    | User training or inference logs | rank-{id}.txt; rank-{id}.log; worker-{id}.txt; worker-{id}.log | Training and inference console logs | ${--train_log}/rank-{id}.*?.txt; ${--train_log}/rank-{id}.*?.log; ${--train_log}/worker-{id}.*?.log; ${--train_log}/worker-{id}.*?.txt |
    | NPU network port resource information | npu_info_before.txt | NPU network port check file before training | ${--env_check}/npu_info_before.txt |
    | NPU network port resource information | npu_info_after.txt | NPU network port check file after training | ${--env_check}/npu_info_after.txt |
    | NPU network port resource information | npu_smi_{npu_id}_details.csv | NPU status monitoring metric file | ${--env_check}/npu_smi_{npu_id}_details.csv |
    | NPU network port resource information | npu_{npu_id}_details.csv | Monitoring metric file of NPU network port statistics | ${--env_check}/npu_{npu_id}_details.csv |
    | Host resource information | host_metrics_{core_num}.json | Host resource monitoring metric file | ${--env_check}/host_metrics_{core_num}.json |
    | Host logs | messages-*? | Host OS log file | ${--host_log}/messages-*? |
    | Host logs | dmesg | Kernel message file on the host | ${--host_log}/dmesg |
    | Host logs | vmcore_dmesg.txt | Host kernel message file saved when the system breaks down | ${--host_log}/crash/Directory_combining_the_host_name_and_fault_occurrence_time (e.g., 127.xx.xx.1-2024-09-23-11:25:29)/vmcore_dmesg.txt |
    | Host logs | sysmonitor.log | System monitoring file on the host | ${--host_log}/sysmonitor.log |
    | Device logs | device-os_{time}.log | System logs of Ctrl CPUs on the device | ${--device_log}/slog/dev-os-{id}/debug/device-os/device-os_{time}.log |
    | Device logs | event_{time}.log | EVENT-level system logs of Ctrl CPUs on the device | Ascend HDK 23.0.3 and later versions: ${--device_log}/slog/dev-os-{id}/run/event/event_{time}.log |
    | Device logs | device-{id}_{time}.log | System logs of non-Ctrl CPUs on the device | Ascend HDK 23.0.RC3: ${--device_log}/slog/dev-os-{id}/device-{id}/device-{id}_{time}.log; Ascend HDK 23.0.3 and later versions: ${--device_log}/slog/dev-os-{id}/debug/device-{id}/device-{id}_{time}.log |
    | Device logs | history.log | Black Box logs | ${--device_log}/hisi_logs/device-{id}/history.log |
    | MindCluster component logs | devicePlugin*.log | SuperPoD logs and Ascend Device Plugin logs | ${--dl_log}/devicePlugin/devicePlugin*.log |
    | MindCluster component logs | noded*.log | AI server logs | ${--dl_log}/noded/noded*.log |
    | MindCluster component logs | runtime-run*.log | Logs generated when ascend-docker-runtime of Ascend Docker Runtime is executed | ${--dl_log}/ascend-docker-runtime/runtime-run*.log |
    | MindCluster component logs | hook-run*.log | Logs generated when ascend-docker-hook of Ascend Docker Runtime is executed | ${--dl_log}/ascend-docker-runtime/hook-run*.log |
    | MindCluster component logs | volcano-scheduler*.log | volcano-scheduler logs | ${--dl_log}/volcano-scheduler/volcano-scheduler*.log |
    | MindCluster component logs | volcano-controller*.log | volcano-controller logs | ${--dl_log}/volcano-controller/volcano-controller*.log |
    | MindCluster component logs | npu-exporter*.log | NPU Exporter logs | ${--dl_log}/npu-exporter/npu-exporter*.log |
    | MindIE component logs | mindie-{module}_{pid}_{datetime}.log | Logs of MindIE Server, MindIE LLM, MindIE SD, MindIE RT, MindIE Torch, MindIE MS, MindIE Benchmark, and MindIE Client | ${--mindie_log}/log/debug/mindie-{module}_{pid}_{datetime}.log |
    | MindIE pod console logs | {podname}.log | MindIE pod console logs | ${--mindie_log}/log/mindie_cluster_log/{podname}.log |
    | AMCT logs | amct_{framework}.log | AMCT logs | ${--amct_log}/amct_{framework}.log |
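
Before passing per-parameter directories to the cleaning command, it can help to confirm that each directory actually contains files in the expected layout. The sketch below is one possible way to do that in Python and is not part of Ascend-fd. The names EXPECTED_PATTERNS and find_logs are hypothetical, and the glob patterns are an approximate transcription of the listing above in which placeholders such as {pid} and {time} are replaced by the wildcard *.

```python
from pathlib import Path

# Glob patterns per parameter, relative to the directory given for it.
# Assumption: placeholders such as {pid} or {time} are approximated by "*".
EXPECTED_PATTERNS = {
    "--process_log": [
        "debug/plog/plog-*.log",
        "run/plog/plog-*.log",
        "debug/device-*/device-*.log",
        "run/device-*/device-*.log",
    ],
    "--device_log": [
        "slog/dev-os-*/debug/device-os/device-os_*.log",
        "slog/dev-os-*/run/device-os/device-os_*.log",
        "slog/dev-os-*/run/event/event_*.log",
        "slog/dev-os-*/device-*/device-*_*.log",
        "slog/dev-os-*/debug/device-*/device-*_*.log",
        "hisi_logs/device-*/history.log",
    ],
    "--env_check": [
        "npu_info_before.txt",
        "npu_info_after.txt",
        "npu_smi_*_details.csv",
        "npu_[0-9]*_details.csv",  # digit first, so npu_smi_* is not matched twice
        "host_metrics_*.json",
    ],
    "--train_log": ["rank-*.txt", "rank-*.log", "worker-*.txt", "worker-*.log"],
    "--host_log": ["messages*", "dmesg", "sysmonitor.log", "crash/*/vmcore_dmesg.txt"],
    "--dl_log": [
        "devicePlugin/devicePlugin*.log",
        "noded/noded*.log",
        "ascend-docker-runtime/runtime-run*.log",
        "ascend-docker-runtime/hook-run*.log",
        "volcano-scheduler/volcano-scheduler*.log",
        "volcano-controller/volcano-controller*.log",
        "npu-exporter/npu-exporter*.log",
    ],
    "--mindie_log": ["log/debug/mindie-*.log", "log/mindie_cluster_log/*.log"],
    "--amct_log": ["amct_*.log"],
}


def find_logs(param_dirs):
    """Map each parameter to the sorted files that match its expected layout."""
    found = {}
    for param, directory in param_dirs.items():
        base = Path(directory)
        matches = set()
        for pattern in EXPECTED_PATTERNS.get(param, []):
            matches.update(p for p in base.glob(pattern) if p.is_file())
        found[param] = sorted(matches)
    return found
```

For example, find_logs({"--process_log": "/data/cann", "--env_check": "/data/env"}) (both paths are placeholders) returns a dictionary in which an empty list indicates that nothing in that directory matches the expected structure.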