--op_debug_config

Description

Sets the path (including the file name) of the configuration file used to enable global memory (DDR) detection.

See Also

None

Argument

Argument: Path of the configuration file, including the file name.

Format: The path (including the file name) can contain letters, digits, underscores (_), hyphens (-), periods (.), and Chinese characters.

Restrictions:

The configuration file supports the following options. Separate multiple options with commas (,).

  • oom: Checks whether memory overwriting occurs in the global memory during operator execution.
    • Configuring this option retains the binary operator file (.o) and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
    • If this option is used, the following detection logic is added during operator build. You can use the dump_cce option to view the following code in the generated .cce file:
      inline __aicore__ void  CheckInvalidAccessOfDDR(xxx) {
          if (access_offset < 0 || access_offset + access_extent > ddr_size) {
              if (read_or_write == 1) {
                  trap(0X5A5A0001);
              } else {
                  trap(0X5A5A0002);
              }
          }
      }

      During actual execution, if memory overwriting occurs, the error code EZ9999 is reported.

  • dump_bin: Retains the binary operator file (.o) and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
  • dump_cce: Retains the operator CCE file (.cce), binary operator file (.o), and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
  • dump_loc: Retains the Python-CCE mapping file (*_loc.json) in the kernel_meta folder under the current execution directory during operator build.
  • ccec_O0: Enables the CCEC option -O0 during operator build. This option disables compiler optimization so that debugging information is not optimized away, which facilitates later analysis of AI Core errors.
  • ccec_g: Enables the CCEC option -g during operator build. This option generates debugging information for later analysis of AI Core errors.
  • check_flag: Checks whether pipeline synchronization signals in operators match each other during operator execution.
    • Configuring this option retains the binary operator file (.o) and operator description file (.json) in the kernel_meta folder under the current execution directory during operator build.
    • If this option is used, the following detection logic is added during operator build. You can use the dump_cce option to view the following code in the generated .cce file:
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
        set_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID3);
        ...
        pipe_barrier(PIPE_MTE3);
        pipe_barrier(PIPE_MTE2);
        pipe_barrier(PIPE_M);
        pipe_barrier(PIPE_V);
        pipe_barrier(PIPE_MTE1);
        pipe_barrier(PIPE_ALL);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID0);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID1);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID2);
        wait_flag(PIPE_MTE3, PIPE_MTE2, EVENT_ID3);
        ...

      During actual inference, if the pipeline synchronization signals in operators do not match each other, a timeout error is reported at the faulty operator, and the program is terminated. The following is an example of the error message:

      Aicore kernel execute failed, ..., fault kernel_name=operator name,...
      rtStreamSynchronizeWithTimeout execute failed....
  • When ccec_O0 and ccec_g are enabled, the operator kernel file (*.o) becomes larger. In dynamic shape scenarios, all possible shapes are traversed during operator build, so the build may fail because the operator kernel file is too large. In this case, do not enable these CCE compiler options.

    If a build failure is caused by the large operator kernel file, the following log is displayed:

    message:link error ld.lld: error: InputSection too large for range extension thunk ./kernel_meta_xxxxx.o:
  • The ccec_O0 and oom options cannot both be enabled. Otherwise, an AI Core error may be reported. The following is an example of the error message:
    ...there is an aivec error exception, core id is 49, error code = 0x4 ...
  • If the NPU_COLLECT_PATH environment variable is configured, global memory overwriting detection cannot be enabled (that is, oom cannot be set in the file specified by --op_debug_config). Otherwise, an error is reported when the compiled model file or operator kernel package is used.
  • When the build options oom, dump_bin, dump_cce, and dump_loc are configured, if the model contains any of the following MC2 operators, the *.o, *.json, and *.cce files of these operators are not generated in the kernel_meta directory:
    • MatMulAllReduce
    • MatMulAllReduceAddRmsNorm
    • AllGatherMatMul
    • MatMulReduceScatter
    • AlltoAllAllGatherBatchMatMul
    • BatchMatMulReduceScatterAlltoAll
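
The option names and restrictions listed above can be checked with a small script before a build is started. The following is a minimal Python sketch, not part of ATC: the function name validate_op_debug_options and its messages are hypothetical, an empty NPU_COLLECT_PATH value is assumed to mean "not configured", and only the rules stated in this section are covered.

```python
import os

# Option names listed in this section.
KNOWN_OPTIONS = {
    "oom", "dump_bin", "dump_cce", "dump_loc",
    "ccec_O0", "ccec_g", "check_flag",
}


def validate_op_debug_options(value, env=None):
    """Return a list of problems found in a comma-separated option string.

    Hypothetical helper for illustration only; ATC performs its own checks.
    """
    env = os.environ if env is None else env
    options = [item.strip() for item in value.split(",") if item.strip()]
    problems = []

    # Every option must be one of the names documented above.
    for option in options:
        if option not in KNOWN_OPTIONS:
            problems.append(f"unknown option: {option}")

    # ccec_O0 and oom cannot both be enabled.
    if "ccec_O0" in options and "oom" in options:
        problems.append("ccec_O0 and oom cannot both be enabled")

    # oom cannot be used while NPU_COLLECT_PATH is configured
    # (assumption: an empty value counts as not configured).
    if "oom" in options and env.get("NPU_COLLECT_PATH"):
        problems.append("oom cannot be enabled while NPU_COLLECT_PATH is set")

    return problems


if __name__ == "__main__":
    print(validate_op_debug_options("ccec_g,oom", env={}))
    print(validate_op_debug_options("ccec_O0,oom", env={}))
```

An empty result means that none of the restrictions checked by the sketch is violated. It does not guarantee that the build succeeds.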

Suggestions and Benefits

None

Example

Assume that the configuration file for enabling global memory detection is gm_debug.cfg.

op_debug_config=ccec_g,oom

Upload the file to any directory (for example, $HOME/module) on the server where ATC is located.

--op_debug_config=$HOME/module/gm_debug.cfg
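
The two steps above can also be scripted. The sketch below is illustrative only: write_debug_config is a hypothetical helper, and a temporary directory stands in for $HOME/module. It writes the configuration file and returns the --op_debug_config argument to append to the atc command.

```python
import tempfile
from pathlib import Path


def write_debug_config(directory, options, file_name="gm_debug.cfg"):
    """Write the configuration file and return the matching ATC argument.

    Hypothetical helper; the file content follows the example above.
    """
    cfg_path = Path(directory) / file_name
    cfg_path.parent.mkdir(parents=True, exist_ok=True)
    # Options are joined with commas, as required by the configuration file.
    cfg_path.write_text("op_debug_config=" + ",".join(options) + "\n")
    return f"--op_debug_config={cfg_path}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(write_debug_config(tmp, ["ccec_g", "oom"]))
```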

Restrictions

During operator compilation, to compile only some AI Core operators instead of all of them, add the op_debug_list field to the gm_debug.cfg configuration file. Only the operators specified in the list are then compiled based on the options configured in op_debug_config. Entries in op_debug_list are separated with commas (,), and an entry can carry the opType:: prefix.

A configuration example is provided as follows:

Add the following information to the configuration file (for example, gm_debug.cfg) specified by op_debug_config:

op_debug_config=ccec_g,oom
op_debug_list=GatherV2,opType::ReduceSum

Upload the file to any directory (for example, $HOME/module) on the server where ATC is located.

--op_debug_config=$HOME/module/gm_debug.cfg

During actual model conversion, the GatherV2 and ReduceSum operators are compiled based on the ccec_g and oom options.
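
To make the format of the two fields concrete, the following sketch parses the example configuration into its options and operator entries. It is a hypothetical reader written for illustration, not the parser that ATC uses, and it assumes that an entry prefixed with opType:: specifies an operator type while any other entry specifies an operator name.

```python
def parse_debug_config(text):
    """Parse the key=value lines of a debug configuration file.

    Hypothetical helper; the opType:: interpretation is an assumption.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        # Values are comma-separated lists.
        fields[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]

    names, types = [], []
    for entry in fields.get("op_debug_list", []):
        if entry.startswith("opType::"):
            types.append(entry[len("opType::"):])
        else:
            names.append(entry)

    return {
        "options": fields.get("op_debug_config", []),
        "op_names": names,
        "op_types": types,
    }


if __name__ == "__main__":
    example = (
        "op_debug_config=ccec_g,oom\n"
        "op_debug_list=GatherV2,opType::ReduceSum\n"
    )
    print(parse_debug_config(example))
```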

Applicability

Atlas 200/300/500 Inference Product

Atlas Training Series Product