Cleaning and Dumping Logs

  • The output directory specified in the cleaning command must have more than 5 GB of free drive space. If the space is insufficient, some cleaning results may be lost, which can make diagnosis results abnormal or inaccurate.
  • During cleaning, MindCluster Ascend FaultDiag reads the log files and monitoring metric files collected by users. Ensure that the directories do not contain sensitive information to prevent information leakage.
  • During cleaning, ensure that the directory to be cleaned contains only the original logs and monitoring metric files of a single training device. If the directory contains files related to other devices, cleaning may fail.
  • To clean data for the device resource and network congestion detection modules, specify the --performance (-p) parameter. If this parameter is not specified, the program cleans only the data of the root cause node and fault event modules by default.
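
The 5 GB requirement above can be checked before cleaning starts. The following is a minimal sketch, not part of ascend-fd: has_enough_space is a hypothetical helper name, and it assumes a POSIX df whose -P output lists the available space in the fourth column.

```shell
#!/bin/sh
# Sketch only: has_enough_space is a hypothetical helper, not an ascend-fd command.
# Usage: has_enough_space DIR [MIN_KB]
# Succeeds only if the filesystem holding DIR has MORE than MIN_KB kilobytes
# available. The default threshold is 5 GB = 5 * 1024 * 1024 KB.
has_enough_space() {
    dir=$1
    min_kb=${2:-5242880}
    # -P forces the one-line-per-filesystem POSIX format, -k reports in KB;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -gt "$min_kb" ]
}
```

For example, has_enough_space Cleaning_output_directory || echo "less than 5 GB free" reports a shortfall before the cleaning command is run.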
  1. (Optional) If the component is installed as the root user, configure environment variables to use it as a common user. If a dependency cannot be found, check whether it has been installed and whether its permissions are correct.
    1. Log in as the root user and query the component location.
      which ascend-fd

      The following information is displayed. The actual location is subject to the query result.

      /usr/local/python3.7.5/bin/ascend-fd
    2. Log in as a common user and configure environment variables.
      export PATH=$PATH:/usr/local/python3.7.5/bin
    3. Run the command to check whether the configuration is complete.
      ascend-fd -h

      If the following information is displayed, the configuration is complete:

      usage: ascend-fd [-h] {version,parse,diag,blacklist,config,entity,single-diag} ...
      Ascend Fault Diag
      positional arguments:
        {version,parse,diag,blacklist,config,entity,single-diag}
          version             show ascend-fd version
          parse               parse origin log files
          diag                diag parsed log files
          blacklist           filter invalid CANN logs by blacklist for parsing
          config              custom configuration parsing files
          entity              perform operations on the user-defined faulty entity.
          single-diag         single parse and diag log files
      optional arguments:
        -h, --help            show this help message and exit
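
The sub-steps above (locate the tool, extend PATH, verify) can be combined into one check. This is a sketch under assumptions: ensure_on_path is a hypothetical helper, and the directory passed to it should be the one reported by which on your own system.

```shell
#!/bin/sh
# Sketch only: ensure_on_path is a hypothetical helper, not an ascend-fd command.
# Usage: ensure_on_path TOOL BIN_DIR
# Makes TOOL resolvable in the current shell by appending BIN_DIR to PATH
# when TOOL is not already found.
ensure_on_path() {
    tool=$1
    bin_dir=$2
    # Already resolvable: nothing to do.
    if command -v "$tool" >/dev/null 2>&1; then
        return 0
    fi
    # Append the directory only if it really contains an executable TOOL.
    if [ -x "$bin_dir/$tool" ]; then
        PATH=$PATH:$bin_dir
        export PATH
        return 0
    fi
    echo "$tool not found in $bin_dir" >&2
    return 1
}
```

With the location shown in the example above, the call would be ensure_on_path ascend-fd /usr/local/python3.7.5/bin, followed by ascend-fd -h to confirm. The change applies only to the current shell session.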
  2. Collect training device logs by referring to Collecting Logs.

    Upload the logs to any directory on the server, for example, /home. If you use the -i parameter, place all logs in a single collection directory for cleaning. The directory structure is as follows:

    • Host
      Collection directory
      |-- messages        # Host OS logs
      |-- dmesg                # Host kernel message logs
      |-- crash
          |-- Directory combining the host name and fault occurrence time (for example, 127.xx.xx.1-2024-09-23-11:25:29)
              |-- vmcore_dmesg.txt     # Host kernel message log file saved when the system breaks down
      |-- sysmonitor.log       # System monitoring log
      |-- rank-0.txt     # Training console logs
      ...
      |-- rank-7.txt     # Training console logs
      |-- process_log          # Original App logs of CANN in the process_log directory
      |-- device_log           # Device logs, which must be stored in the device_log directory.
      |-- dl_log               # MindCluster component file, whose name must be dl_log.
          |-- devicePlugin       # Ascend Device Plugin logs
          |-- noded              # NodeD logs
          |-- ascend-docker-runtime              # Ascend Docker Runtime logs
          |-- volcano-scheduler            # volcano-scheduler logs
          |-- volcano-controller             # volcano-controller logs
          |-- npu-exporter                 # NPU Exporter logs
      |-- mindie               # MindIE component logs
          |-- log
              |-- debug        # Run logs of MindIE components
              |-- security     # Audit logs of MindIE components
              |-- mindie_cluster_log    # MindIE pod console logs
      |-- amct_log             # AMCT logs
      |-- environment_check # Information about the NPU network port, status, and resource
          |-- npu_smi_0_details.csv   # NPU status monitoring metric file
           ...
          |-- npu_smi_7_details.csv   # NPU status monitoring metric file
          |-- npu_0_details.csv       # Monitoring metric file of the NPU network port statistics
           ...    
          |-- npu_7_details.csv       # Monitoring metric file of the NPU network port statistics
          |-- npu_info_before/after.txt  # NPU network port status file before or after training
          |-- host_metrics_{core_num}.json # Monitoring metric file of host resources
    • BMC and LCNE:
      Recursively decompress the BMC and LCNE logs exported from Computing ToolKit or CCAE, place them on a single server, and clean them there.
      ascend-fd parse --lcne_log Decompressed_LCNE_log_directory_of_a_single_node -o Cleaning_output_directory
      ascend-fd parse --bmc_log Decompressed_BMC_log_directory_of_a_single_node -o Cleaning_output_directory
      • Use CCAE to collect logs. For details, see LingQu Log Collection.
      • Use Computing ToolKit to collect logs. For details, see "Using Computing ToolKit" > "Log Collect" > "Usage Guide" > "Collecting BMC, IES, and Switch Logs" in Computing Toolkit User Guide.
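
Before cleaning, it can help to compare the host collection directory against the layout shown above. The sketch below only reports which top-level entries are present. check_collection_dir is a hypothetical helper, and because the layout does not state which entries are mandatory for a given job, a missing entry is reported rather than treated as an error.

```shell
#!/bin/sh
# Sketch only: check_collection_dir is a hypothetical helper.
# The entry names are taken from the host directory layout in this section.
# Usage: check_collection_dir COLLECTION_DIR
check_collection_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    for entry in messages dmesg crash sysmonitor.log process_log \
                 device_log dl_log mindie amct_log environment_check; do
        if [ -e "$dir/$entry" ]; then
            echo "present: $entry"
        else
            echo "absent: $entry"
        fi
    done
    # Training console logs are named rank-0.txt, rank-1.txt, and so on.
    rank_count=0
    for f in "$dir"/rank-*.txt; do
        if [ -e "$f" ]; then
            rank_count=$((rank_count + 1))
        fi
    done
    echo "rank logs: $rank_count"
}
```

The check is useful mainly for catching misnamed directories, since device logs must be stored in device_log and the MindCluster component directory must be named dl_log.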
  3. Create a log cleaning output directory.
    mkdir Cleaning_output_directory
  4. Run the command to start cleaning logs.
    ascend-fd parse -i Collection_directory -o Cleaning_output_directory --performance

    Command output:

    The parse job starts. Please wait. Job id: [****], run log file is [****].
    These job ['Module 1', 'Module 2'...] succeeded.
    The parse job is complete.

    Structure of the cleaning output directory:

    └── Cleaning output directory
       ├── ascend-kg-parser.json        # Cleaning result of fault event analysis, which is the input file of the inference engine
       ├── ascend-kg-analyzer.json      # Cleaning result of fault event analysis
       ├── ascend-rc-parser.json        # Cleaning result of the root cause node analysis
       ├── device_ip_info.json          # Device IP address
       ├── mindie-cluster-info.json    # Cleaning result of the MindIE pod console logs
       ├── server-info.json           # Cleaning result of the MindIE component logs
       ├── nad_clean.csv                # Cleaning result of compute frequency reduction
       ├── nic_clean.csv                      # Cleaning result of network congestion analysis
       ├── process_{core_num}.csv       # Cleaning result of CPU resource preemption analysis
       ├── plog-parser-{pid}-{0/1}.log # Logs after root cause node analysis and cleaning, including key information such as error and trace. The logs are saved by PID.
        ...
       └── plog-parser-{pid}-{0/1}.log
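
Steps 3 and 4 can be wrapped in one small function. This is a sketch rather than an official script: run_parse is a hypothetical name, and ASCEND_FD is an assumed override variable (not an ascend-fd feature) that lets you substitute another executable, such as echo, for a dry run. Only the parse options shown in this section (-i, -o, --performance) are used.

```shell
#!/bin/sh
# Sketch only: run_parse and the ASCEND_FD variable are hypothetical.
# Usage: run_parse COLLECTION_DIR OUTPUT_DIR [performance]
run_parse() {
    collection_dir=$1
    output_dir=$2
    mode=${3:-}
    if [ ! -d "$collection_dir" ]; then
        echo "collection directory not found: $collection_dir" >&2
        return 1
    fi
    # Step 3: create the cleaning output directory if it does not exist.
    mkdir -p "$output_dir" || return 1
    # Step 4: start cleaning; pass --performance only when requested.
    if [ "$mode" = "performance" ]; then
        "${ASCEND_FD:-ascend-fd}" parse -i "$collection_dir" -o "$output_dir" --performance
    else
        "${ASCEND_FD:-ascend-fd}" parse -i "$collection_dir" -o "$output_dir"
    fi
}
```

Setting ASCEND_FD=echo prints the command line that would be run instead of starting a job, which is a cheap way to check the paths first.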
  5. Dump logs.

    Copy all files from the cleaning output directory of each server to a single central location. The dump directory structure is as follows:

    Diagnosis input directory        
        |--Cleaning output directory 1
           |--plog-parser-{pid}-{0/1}.log        # Logs after root cause node analysis and cleaning, including key information such as error and trace. The logs are saved by PID.
           |--nic_clean.csv                      # Cleaning result of network congestion analysis
           |--nad_clean.csv                      # Cleaning result of compute frequency reduction
           |--mem_used.csv                       # Cleaning result of memory resource preemption analysis. This file is reserved.
           |--process_{core_num}.csv             # Cleaning result of CPU resource preemption analysis
           |--device_ip_info.json                # Device IP address
           |--ascend-kg-parser.json              # Cleaning result of fault event analysis, which is the input file of the inference engine
           |--ascend-kg-analyzer.json            # Cleaning result of fault event analysis
           |--ascend-rc-parser.json            # Cleaning result of the root cause node analysis
           |--mindie-cluster-info.json           # Cleaning result of the MindIE pod console logs
           |--server-info.json                   # Cleaning result of the MindIE component logs
                   
        |--Cleaning output directory 2
           |--plog-parser-{pid}-{0/1}.log        
           |--nic_clean.csv  
           |--nad_clean.csv  
           |--mem_used.csv  
           |--process_{core_num}.csv
           |--device_ip_info.json
           |--ascend-kg-parser.json
           |--ascend-kg-analyzer.json               
           |--ascend-rc-parser.json
           |--server-info.json
        ...
        |--Cleaning output directory n
    • Rename each cleaning output directory so that the name identifies the device node, for example, host1-192.168.x.x.
    • Store the MindIE pod console log cleaning result (mindie-cluster-info.json) on only one node.
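
The dump step can be sketched as a copy into a per-node subdirectory of the diagnosis input directory. dump_cleaning_output is a hypothetical helper, and it assumes the cleaning output directory is reachable on the local filesystem. Transferring files between servers (for example, with scp or rsync) is outside this sketch.

```shell
#!/bin/sh
# Sketch only: dump_cleaning_output is a hypothetical helper.
# Usage: dump_cleaning_output CLEANING_OUTPUT_DIR DIAG_INPUT_DIR NODE_NAME
# Copies the contents of CLEANING_OUTPUT_DIR to DIAG_INPUT_DIR/NODE_NAME.
dump_cleaning_output() {
    src=$1
    diag_dir=$2
    node_name=$3
    dest=$diag_dir/$node_name
    if [ ! -d "$src" ]; then
        echo "cleaning output directory not found: $src" >&2
        return 1
    fi
    # Refuse to reuse a node directory so results of two nodes are never mixed.
    if [ -e "$dest" ]; then
        echo "destination already exists: $dest" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # "$src/." copies the directory contents rather than the directory itself.
    cp -R "$src/." "$dest/"
}
```

Passing a node name such as host1-192.168.x.x follows the naming advice above and keeps each server's results in its own subdirectory.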