Overflow/Underflow Operator Data Collection and Analysis
If a floating-point exception occurs while an AI service is running, you can collect and parse overflow/underflow operator data to locate the fault.
Collecting Data of Overflowed/Underflowed Operators
- In the offline inference scenario, collect overflow/underflow data by referring to More Features > Overflow/Underflow Operator Data Collection and Analysis in CANN AscendCL Application Software Development Guide (C&C++).
- In the TensorFlow 1.x training/online inference scenario, collect overflow/underflow data by referring to More Features > Overflow/Underflow Data Collection in TensorFlow 1.15 Model Porting Guide.
Viewing Data of Overflowed/Underflowed Operators
By default, the generated overflow/underflow operator data file is stored in the {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index} directory, for example, /home/HwHiAiUser/output/20200808163566/0/npu_cluster_0/11/0. If no overflow/underflow data is collected, that is, no overflow occurs, the preceding directory is not generated.
The storage path and file naming rules are as follows:
- dump_path: user-defined path for storing overflow/underflow data, for example, /home/HwHiAiUser/output
- time: timestamp (for example, 20200808163566)
- deviceid: device ID
- model_name: submodel name. Multiple folders may exist at the model_name layer. If a period (.), slash (/), backslash (\), or space character appears in model_name, it is converted to an underscore (_).
- model_id: subgraph ID.
- data_index: index of the iteration in which overflow/underflow is detected.
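The directory layout and the character replacement rule above can be sketched in Python as follows. This is only an illustration: the helper names sanitize_name and split_dump_dir are hypothetical and are not part of any CANN tool.

```python
import re
from pathlib import PurePosixPath

# Directory levels below {dump_path}, in the order given in the text.
FIELDS = ("time", "deviceid", "model_name", "model_id", "data_index")


def sanitize_name(name: str) -> str:
    """Replace period, slash, backslash, and space with an underscore."""
    return re.sub(r"[./\\ ]", "_", name)


def split_dump_dir(dump_dir: str, dump_path: str) -> dict:
    """Split {dump_path}/{time}/{deviceid}/{model_name}/{model_id}/{data_index}."""
    relative = PurePosixPath(dump_dir).relative_to(PurePosixPath(dump_path))
    if len(relative.parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} levels below dump_path: {dump_dir}")
    return dict(zip(FIELDS, relative.parts))


info = split_dump_dir(
    "/home/HwHiAiUser/output/20200808163566/0/npu_cluster_0/11/0",
    "/home/HwHiAiUser/output",
)
print(info["model_name"], info["model_id"], info["data_index"])
```

All values are returned as strings, because the text does not state how each level should be interpreted numerically.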
Two types of overflow/underflow data files are generated in the preceding directory:
- The dump file of an overflow/underflow operator is named as: {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}. Any period (.), slash (/), backslash (\), or space in the op_type or op_name field is replaced by an underscore (_).
You can identify an overflow/underflow operator based on its dump file name. To view the inputs and outputs of an overflow/underflow operator, refer to Analyzing the Dump File of an Overflow/Underflow Operator.
- The overflow/underflow data file of an operator is named as: Opdebug.Node_OpDebug.{task_id}.{stream_id}.{timestamp}. To locate the overflow/underflow cause, follow the instructions in Analyzing the Data File of an Overflow/Underflow Operator.
- The task_id in the Opdebug.Node_OpDebug file name is not the task ID of the overflow/underflow operator and can be ignored.
- When the command is executed in a Docker container, the generated data is stored in the container.
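Because periods in op_type and op_name are replaced by underscores, both file name patterns above split into exactly five period-separated fields. The following minimal sketch relies on that; the names DumpFileName, parse_dump_file_name, and is_opdebug_file are hypothetical, and the sketch assumes that task_id, stream_id, and timestamp are decimal integers, as in the examples in this section.

```python
from typing import NamedTuple


class DumpFileName(NamedTuple):
    op_type: str
    op_name: str
    task_id: int
    stream_id: int
    timestamp: int


def parse_dump_file_name(file_name: str) -> DumpFileName:
    """Split {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}."""
    parts = file_name.split(".")
    if len(parts) != 5:
        raise ValueError(f"unexpected dump file name: {file_name}")
    op_type, op_name, task_id, stream_id, timestamp = parts
    # Assumption: the last three fields are decimal integers.
    return DumpFileName(op_type, op_name, int(task_id), int(stream_id), int(timestamp))


def is_opdebug_file(parsed: DumpFileName) -> bool:
    """True for the Opdebug.Node_OpDebug data file, False for an operator dump file."""
    return parsed.op_type == "Opdebug" and parsed.op_name == "Node_OpDebug"


opdebug = parse_dump_file_name("Opdebug.Node_OpDebug.1.5.1732082705016774")
pooling = parse_dump_file_name("Pooling.pool1.1.5.1732082705016774")
print(is_opdebug_file(opdebug), is_opdebug_file(pooling))
```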
Analyzing the Dump File of an Overflow/Underflow Operator
- Upload the collected data files to the Toolkit installation environment.
- Go to the ${INSTALL_DIR}/tools/operator_cmp/compare directory. Replace ${INSTALL_DIR} with the actual CANN component directory. If the Ascend-CANN-Toolkit package is installed as the root user, the CANN component directory is /usr/local/Ascend/ascend-toolkit/latest.
- Run the msaccucmp.py script to convert the dump file into a NumPy file. The following is an example:
python3 msaccucmp.py convert -d /home/HwHiAiUser/dump -out /home/HwHiAiUser/dumptonumpy -v 2
The -d option enables the conversion of a single dump file or all dump files in a path.
- Use Python to save the NumPy data into a .txt file. The following is an example:
$ python3
>>> import numpy as np
>>> a = np.load("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1.5.1732082705016774.output.0.npy")
>>> b = a.flatten()
>>> np.savetxt("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1.5.1732082705016774.output.0.txt", b)
The shape and dtype information of the array is not preserved in the .txt file. For details, visit the NumPy website.
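If the shape and dtype are needed later, one option is to record them in a comment header when saving. The following is a minimal sketch, not part of the msaccucmp.py workflow: the helper name save_npy_as_txt is hypothetical, and the demonstration uses a small temporary array instead of a real dump file.

```python
import os
import tempfile

import numpy as np


def save_npy_as_txt(npy_path: str, txt_path: str) -> str:
    """Save a .npy array as flat text, keeping shape and dtype in a header line."""
    data = np.load(npy_path)
    header = f"shape={data.shape} dtype={data.dtype}"
    # np.savetxt writes the header as a comment line starting with "# ".
    np.savetxt(txt_path, data.flatten(), header=header)
    return header


with tempfile.TemporaryDirectory() as tmp:
    npy_file = os.path.join(tmp, "example.output.0.npy")
    txt_file = os.path.join(tmp, "example.output.0.txt")
    np.save(npy_file, np.arange(6, dtype=np.float32).reshape(2, 3))
    header = save_npy_as_txt(npy_file, txt_file)
    # np.loadtxt skips comment lines, so the header does not affect the data.
    restored = np.loadtxt(txt_file, dtype=np.float32).reshape(2, 3)

print(header)
```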
Analyzing the Data File of an Overflow/Underflow Operator
Since the generated overflow/underflow data is in binary format, you need to interpret the binary file into a readable format, such as JSON.
- Upload the overflow/underflow data file of the operator to the Toolkit installation environment.
You are advised to go to the data_index directory with the minimum value, and use the dump file with the minimum {timestamp} for data parsing.
- Go to the ${INSTALL_DIR}/tools/operator_cmp/compare directory. Replace ${INSTALL_DIR} with the actual CANN component directory. If the Ascend-CANN-Toolkit package is installed as the root user, the CANN component directory is /usr/local/Ascend/ascend-toolkit/latest.
- Run the parse command.
python3 msaccucmp.py convert -d /home/HwHiAiUser/opdebug/Opdebug.Node_OpDebug.1.5.1732082705016774 -out /home/HwHiAiUser/result
The key options are described as follows:
- -d: path of the overflow/underflow data file, including the file name
- -out: directory of the parsing result. If it is not specified, the current directory is used.
- Find the parsed result as follows.
{
    "DHA Atomic Add": {
        "model_id": 0,
        "stream_id": 0,
        "task_id": 0,
        "task_type": 0,
        "pc_start": "0x0",
        "para_base": "0x0",
        "status": 0
    },
    "L2 Atomic Add": {
        "model_id": 0,
        "stream_id": 0,
        "task_id": 0,
        "task_type": 0,
        "pc_start": "0x0",
        "para_base": "0x0",
        "status": 0
    },
    "AI Core": {
        "model_id": 514,
        "stream_id": 563,
        "task_id": 57,
        "task_type": 0,
        "pc_start": "0x1008005b0000",
        "para_base": "0x100800297000",
        "kernel_code": "0x1008005ae000",
        "block_idx": 1,
        "status": 32
    }
}
If both AI Core operator overflow/underflow detection and Atomic Add overflow/underflow detection are enabled, only the earliest out-of-range record is displayed.
In the preceding example, the earliest out-of-range record is an AI Core operator overflow/underflow.
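The earlier advice to parse the data file with the minimum {timestamp} in the data_index directory with the minimum value can be sketched as follows. The function name earliest_opdebug_file is hypothetical, and the sketch assumes that data_index directories have decimal names and that {timestamp} is the last period-separated field of the file name. The demonstration builds a temporary directory tree instead of using real dump data.

```python
import tempfile
from pathlib import Path
from typing import Optional


def earliest_opdebug_file(model_id_dir: str) -> Optional[Path]:
    """Return the Opdebug file with the smallest timestamp in the smallest data_index."""
    root = Path(model_id_dir)
    # Sort numerically so that "2" comes before "10".
    index_dirs = sorted(
        (d for d in root.iterdir() if d.is_dir() and d.name.isdigit()),
        key=lambda d: int(d.name),
    )
    for index_dir in index_dirs:
        candidates = [
            f for f in index_dir.iterdir()
            if f.name.startswith("Opdebug.Node_OpDebug.")
        ]
        if candidates:
            return min(candidates, key=lambda f: int(f.name.rsplit(".", 1)[1]))
    return None


with tempfile.TemporaryDirectory() as tmp:
    for index, stamp in (("2", "100"), ("10", "50"), ("2", "90")):
        index_dir = Path(tmp) / index
        index_dir.mkdir(exist_ok=True)
        (index_dir / f"Opdebug.Node_OpDebug.1.5.{stamp}").touch()
    chosen = earliest_opdebug_file(tmp)
    chosen_name = chosen.name
    chosen_index = chosen.parent.name

print(chosen_index, chosen_name)
```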
Complete field description:
The following lists all the fields that can be parsed, which may vary with actual product.
- model_id: ID of the model where an overflow/underflow operator is located
- stream_id: ID of the stream where an overflow/underflow operator is located
- task_id: task ID of an overflow/underflow operator.
- task_type: task type of an overflow/underflow operator.
- pc_start: start address of the code of the overflow/underflow operator.
- para_base: parameter start address of the overflow/underflow operator.
- kernel_code: start address of the code of the overflow/underflow operator, which is equivalent to pc_start.
- block_idx: block index of an overflow/underflow operator.
- status: status of the AI Core status register, including the overflow/underflow information. You can analyze the value of status to obtain the specific overflow/underflow error.
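As an illustration of how these fields might be read programmatically, the following sketch scans a parsed result for records whose status is greater than 0. The function name overflow_records is hypothetical. The JSON text is an abbreviated copy of the example in this section and is embedded as a string, because the name of the file that the parse command writes is not stated here.

```python
import json

# Abbreviated copy of the parsed result shown in this section.
PARSED_RESULT = """
{
    "DHA Atomic Add": {"stream_id": 0, "task_id": 0, "status": 0},
    "L2 Atomic Add": {"stream_id": 0, "task_id": 0, "status": 0},
    "AI Core": {"stream_id": 563, "task_id": 57, "block_idx": 1, "status": 32}
}
"""


def overflow_records(parsed: dict) -> dict:
    """Keep only the detection units whose status is greater than 0."""
    return {
        unit: record
        for unit, record in parsed.items()
        if record.get("status", 0) > 0
    }


records = overflow_records(json.loads(PARSED_RESULT))
for unit, record in records.items():
    print(unit, "task_id:", record["task_id"], "status:", record["status"])
```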
Analyzing the Error Cause Based on the Status
- The status field that reflects the AI Core operator overflow/underflow detection result is in decimal format. You need to convert it into the hexadecimal format before locating the fault.
For example, assume that the value of status is 272. The hexadecimal equivalent of the value is 0x00000110, that is, 0x00000010 + 0x00000100. Therefore, the error is a combination of the causes listed for 0x00000010 and 0x00000100.
- 0x00000008: overflow when negating the minimum negative value of a signed integer
- 0x00000010: integer addition, subtraction, multiplication, or multiply-add overflow
- 0x00000020: floating-point overflow
- 0x00000080: negative input for floating-point to unsigned conversion
- 0x00000100: float32-to-float16 conversion or 32-bit signed integer-to-float16 conversion overflow
- 0x00000400: Cube accumulation overflow/underflow
Note: Each of the preceding exceptions corresponds to one bit of the hexadecimal status value, so a single status value can represent a combination of exceptions.
- The status field that reflects the DHA Atomic Add overflow/underflow detection result is in decimal format. If the value is greater than 0, DHA Atomic Add overflows/underflows.
- The status field that reflects the L2 Atomic Add overflow/underflow detection result is in decimal format. If the value is greater than 0, L2 Atomic Add overflows/underflows.
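The decoding of the AI Core status value described above can be sketched as a bit-mask test. This is a minimal illustration: the names AI_CORE_STATUS_BITS and decode_ai_core_status are hypothetical, the descriptions are shortened from the list above, and any bit not listed in this section is reported as unlisted rather than interpreted.

```python
# Bit masks and shortened descriptions taken from the list in this section.
AI_CORE_STATUS_BITS = {
    0x00000008: "signed integer minimum-value negation overflow",
    0x00000010: "integer arithmetic overflow",
    0x00000020: "floating-point overflow",
    0x00000080: "negative input for floating-point to unsigned conversion",
    0x00000100: "float32-to-float16 or int32-to-float16 conversion overflow",
    0x00000400: "Cube accumulation overflow/underflow",
}


def decode_ai_core_status(status: int) -> list:
    """Return the description of every listed bit that is set in status."""
    causes = []
    known_mask = 0
    for bit, description in AI_CORE_STATUS_BITS.items():
        known_mask |= bit
        if status & bit:
            causes.append(description)
    unlisted = status & ~known_mask
    if unlisted:
        causes.append(f"unlisted bits: {unlisted:#010x}")
    return causes


for value in (272, 32):
    print(f"{value} = {value:#010x}:", decode_ai_core_status(value))
```

For the DHA Atomic Add and L2 Atomic Add records no decoding is needed, because any status value greater than 0 already indicates an overflow/underflow.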