Overflow/Underflow Operator Data Collection and Analysis

Prerequisites

Before converting a model with the ATC tool, make sure that the conversion command includes the --status_check parameter set to 1. This setting adds the overflow/underflow detection logic during operator compilation.

For details about the ATC tool and its parameters, see ATC Instructions.

Collecting Overflow/Underflow Operator Information

Add the overflow/underflow operator dump configuration to the JSON configuration file that is passed to the acl.init API when pyACL is initialized.

The following is an example of the content in the JSON configuration file. In the example, dump_path is a relative path.
{
    "dump":{
        "dump_path":"output",
        "dump_debug":"on"
    }
}
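
As a minimal sketch, the configuration file could also be generated from Python before pyACL is initialized. The helper name write_dump_config and the file name acl.json are assumptions made for illustration; only the dump_path and dump_debug keys come from the example above.

```python
import json
from pathlib import Path


def write_dump_config(config_path="acl.json", dump_path="output"):
    """Write the overflow/underflow dump configuration and return its path."""
    config = {"dump": {"dump_path": dump_path, "dump_debug": "on"}}
    path = Path(config_path)
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return str(path)


if __name__ == "__main__":
    config_file = write_dump_config()
    # The returned path would then be passed to acl.init, for example:
    #   import acl
    #   ret = acl.init(config_file)
    print(config_file)
```

Generating the file in code keeps dump_path next to the rest of the application setup, which may be convenient when the output directory changes between runs.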

If dump_path is set to a relative path, the exported data files are stored in the {application_executable_files}/{dump_path} directory. Two data files are exported for each overflow/underflow operator:

  • The dump file of an overflow/underflow operator is named {op_type}.{op_name}.{taskid}.{stream_id}.{timestamp}. Any period (.), slash (/), backslash (\), or space in the op_type or op_name field is replaced with an underscore (_).

    You can identify an overflow/underflow operator based on the preceding information. To view the operator input and output, refer to Analyzing the Dump File of an Overflow/Underflow Operator.

  • The data file of an overflow/underflow operator is named OpDebug.Node_Opdebug.{taskid}.{stream_id}.{timestamp}. Here, taskid is not the task ID of the overflow/underflow operator and can be ignored.

    To obtain the overflow information, including the model where the overflow/underflow operator is located and the AI Core status register, refer to Analyzing the Data File of an Overflow/Underflow Operator.
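
The naming rule for dump files can be sketched in Python as follows. Because periods inside op_type and op_name are replaced with underscores, splitting the file name on periods should yield exactly five fields. The names sanitize, parse_dump_name, and DumpFileName are hypothetical helpers, not part of Toolkit, and the stream ID 3 in the usage lines is a made-up value.

```python
import re
from typing import NamedTuple


class DumpFileName(NamedTuple):
    op_type: str
    op_name: str
    task_id: int
    stream_id: int
    timestamp: int


def sanitize(field: str) -> str:
    """Replace each period, slash, backslash, or space with an underscore."""
    return re.sub(r"[./\\ ]", "_", field)


def parse_dump_name(file_name: str) -> DumpFileName:
    """Split {op_type}.{op_name}.{taskid}.{stream_id}.{timestamp} into fields."""
    parts = file_name.split(".")
    if len(parts) != 5:
        raise ValueError(f"unexpected dump file name: {file_name!r}")
    op_type, op_name, task_id, stream_id, timestamp = parts
    return DumpFileName(op_type, op_name, int(task_id), int(stream_id), int(timestamp))


if __name__ == "__main__":
    # Hypothetical file name that follows the documented pattern.
    print(parse_dump_name("Pooling.pool1.1147.3.1589195081588018"))
    print(sanitize("res2a/branch1.conv 1"))
```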

Analyzing the Dump File of an Overflow/Underflow Operator

  1. Upload the {op_type}.{op_name}.{taskid}.{stream_id}.{timestamp} file to the environment with Toolkit installed.
  2. Go to the directory where the parsing script is located. For example, if Toolkit is installed in /home/HwHiAiUser/Ascend/ascend-toolkit/latest:
    cd /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/operator_cmp/compare
  3. Run the msaccucmp.py script to convert the dump file into the NumPy format. For example:
    python3 msaccucmp.py convert -d /home/HwHiAiUser/dump -out /home/HwHiAiUser/dumptonumpy -v 2

    The -d option accepts either a single dump file or a directory. If a directory is specified, all dump files in it are converted.

  4. Use Python to save the NumPy data into a .txt file. For example:
    $ python3
    >>> import numpy as np
    >>> a = np.load("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1147.1589195081588018.output.0.npy")
    >>> b = a.flatten()
    >>> np.savetxt("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1147.1589195081588018.output.0.txt", b)
    

    The .txt file does not retain the shape or dtype of the original array. For more details, visit the NumPy website.
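
Because the flattened .txt file holds one value per line, it can be scanned with plain Python. The following is one possible sketch for a first look at an operator output; it is not part of Toolkit. The helper name summarize_txt is hypothetical, and the code assumes the default np.savetxt output used in the example above, where inf and nan are written as text that float() can parse. Non-finite values in the output are a typical symptom worth checking when a floating-point overflow is reported.

```python
import math


def summarize_txt(txt_path):
    """Count finite and non-finite values in a one-value-per-line text file."""
    total = 0
    non_finite = 0
    lo, hi = math.inf, -math.inf  # stay at these sentinels if no finite value is found
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and np.savetxt header comments
            value = float(line)
            total += 1
            if math.isfinite(value):
                lo, hi = min(lo, value), max(hi, value)
            else:
                non_finite += 1
    return {"total": total, "non_finite": non_finite, "min": lo, "max": hi}


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py <path to the .txt file>
    print(summarize_txt(sys.argv[1]))
```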

Analyzing the Data File of an Overflow/Underflow Operator

The generated overflow/underflow data is in binary format, so you must parse the binary file into a readable format, such as JSON.

  1. Upload the overflow/underflow data file OpDebug.Node_Opdebug.{taskid}.{timestamp} to the Toolkit installation environment.
  2. Go to the directory where the parsing script is located. For example, if Toolkit is installed in /home/HwHiAiUser/Ascend/ascend-toolkit/latest:
    cd /home/HwHiAiUser/Ascend/ascend-toolkit/latest/toolkit/tools/operator_cmp/compare
  3. Run the parse command.
    python3 msaccucmp.py convert -d /home/HwHiAiUser/opdebug/Opdebug.Node_OpDebug.59.1597922031178434 -out /home/HwHiAiUser/result
    

    The key options are described as follows:

    • -d: path of the overflow/underflow data file, including the file name
    • -out: directory of the parsing result. If it is not specified, the current directory is used.
  4. View the parsing result. For example:
    {
        "DHA Atomic Add": {
            "model_id": 0,
            "stream_id": 0,
            "task_id": 0,
            "task_type": 0,
            "pc_start": "0x0",
            "para_base": "0x0",
            "status": 0
        },
        "L2 Atomic Add": {
            "model_id": 0,
            "stream_id": 0,
            "task_id": 0,
            "task_type": 0,
            "pc_start": "0x0",
            "para_base": "0x0",
            "status": 0
        },
        "AI Core": {
            "model_id": 514,
            "stream_id": 563,
            "task_id": 57,
            "task_type": 0,
            "pc_start": "0x1008005b0000",
            "para_base": "0x100800297000",
            "kernel_code": "0x1008005ae000",
            "block_idx": 1,
            "status": 32
        }
    }

    The fields are described as follows:

    • model_id: ID of the model where an overflow/underflow operator is located
    • stream_id: ID of the stream where an overflow/underflow operator is located
    • task_id: task ID of an overflow/underflow operator
    • task_type: task type of an overflow/underflow operator
    • pc_start: memory start address of an overflow/underflow operator code program
    • para_base: memory start address of an overflow/underflow operator parameter
    • kernel_code: memory start address of an overflow/underflow operator code program, the same as pc_start
    • block_idx: block ID of an overflow/underflow operator
    • status: AI Core status register. Analyze this field to identify the specific overflow/underflow error. The value is reported as a decimal number, so convert it to hexadecimal and match it against the error codes listed below. If several errors occur together, the value is the sum of their codes.

      For example, assume that the value of status is 272. The hexadecimal equivalent is 0x00000110, which is 0x00000010 + 0x00000100, so both of the corresponding errors occurred.

      • 0x00000008: overflow/underflow when the sign of the minimum negative value of a signed integer is inverted
      • 0x00000010: integer addition, subtraction, multiplication, or multiply-add overflow/underflow
      • 0x00000020: floating-point overflow/underflow
      • 0x00000080: negative input for the conversion of floating-point to unsigned data
      • 0x00000100: float32 to float16 conversion or 32-bit signed integer to float16 conversion overflow/underflow
      • 0x00000400: cube accumulation overflow/underflow
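
The status decoding described above is a bitmask test and can be sketched in Python as follows. The names STATUS_BITS, decode_status, and report are hypothetical helpers, the descriptions are shortened paraphrases of the list above, and the assumption that the parsing result is a JSON object keyed by unit name follows the example result shown earlier.

```python
import json

# Error codes of the AI Core status register, as listed in this section.
STATUS_BITS = {
    0x00000008: "signed-integer minimum-negative sign inversion",
    0x00000010: "integer arithmetic",
    0x00000020: "floating-point",
    0x00000080: "negative input to float-to-unsigned conversion",
    0x00000100: "float32 or int32 to float16 conversion",
    0x00000400: "cube accumulation",
}


def decode_status(status):
    """Return (mask, description) pairs for every listed bit set in status."""
    return [(mask, desc) for mask, desc in STATUS_BITS.items() if status & mask]


def report(parsed):
    """Return one summary line per unit whose status is non-zero."""
    lines = []
    for unit, info in parsed.items():
        status = info.get("status", 0)
        if not status:
            continue
        causes = ", ".join(f"0x{mask:08x} ({desc})" for mask, desc in decode_status(status))
        lines.append(
            f"{unit}: model_id={info.get('model_id')} stream_id={info.get('stream_id')} "
            f"task_id={info.get('task_id')} status=0x{status:08x} -> {causes}"
        )
    return lines


if __name__ == "__main__":
    import sys

    # Usage: python3 this_script.py <path to the parsed JSON result>
    with open(sys.argv[1], encoding="utf-8") as f:
        for line in report(json.load(f)):
            print(line)
```

With the example result shown earlier, only the AI Core entry has a non-zero status (32, that is 0x00000020), so it would be the only unit reported.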