Overflow/Underflow Operator Data Collection and Analysis
If a floating-point exception occurs while an AI service is running, you can collect and parse the data of the overflow/underflow operators to locate the fault.
Collecting Data of Overflowed/Underflowed Operators
- Offline inference: For details about how to collect overflow/underflow data, see Overflow/Underflow Operator Data Collection and Analysis in Application Development Guide (C&C++).
- TensorFlow 1.15 training/online inference: For details about how to collect overflow/underflow data, see "Additional Features" > "Overflow/Underflow Data Collection" in TensorFlow 1.15 Model Porting Guide.
Viewing Data of Overflowed/Underflowed Operators
By default, the generated overflow/underflow operator data file is stored in the {dump_path}/{time}/{device_id}/{model_name}/{model_id}/{data_index} directory, for example, /home/HwHiAiUser/output/20200808163566/0/npu_cluster_0/11/0. If no overflow/underflow data is collected, that is, no overflow occurs, the preceding directory is not generated.
The storage path and file naming rules are as follows:
- dump_path: user-defined path for storing overflow/underflow data, for example, /home/HwHiAiUser/output
- time: timestamp (for example, 20200808163566)
- device_id: device ID.
- model_name: submodel name. Multiple folders may exist at the model_name layer. If a period (.), slash (/), backslash (\), or space character appears in model_name, it is converted to an underscore (_).
- model_id: subgraph ID.
- data_index: index of the iteration in which overflow/underflow is detected.
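The directory layout above can be sketched in Python as follows. This is a minimal illustration, not part of the toolkit: the helper names sanitize and overflow_dir are hypothetical, and the character replacement is applied only to model_name, as the list above describes.

```python
import re
from pathlib import PurePosixPath


def sanitize(name: str) -> str:
    """Replace '.', '/', '\\' and space with '_', as described for model_name."""
    return re.sub(r"[./\\ ]", "_", name)


def overflow_dir(dump_path, time, device_id, model_name, model_id, data_index):
    """Compose {dump_path}/{time}/{device_id}/{model_name}/{model_id}/{data_index}."""
    return PurePosixPath(
        dump_path,
        str(time),
        str(device_id),
        sanitize(model_name),
        str(model_id),
        str(data_index),
    )


# Values taken from the example path in this section.
print(overflow_dir("/home/HwHiAiUser/output", "20200808163566", 0, "npu_cluster_0", 11, 0))
```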
Two types of overflow/underflow data files are generated in the preceding directory:
- The dump file of an overflow/underflow operator is generally named as: {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp}. Any period (.), slash (/), backslash (\), or space in the op_type or op_name field is replaced by an underscore (_).
You can identify the name of the overflow/underflow operator based on the {op_name} field. In special scenarios (for example, single-operator API execution), the {op_name} field may not completely correspond to the operator name, but you can roughly determine the operator name and use Analyzing the Dump File of an Overflow/Underflow Operator to determine the input and output of the operator.
- The overflow/underflow data file of an operator is named: Opdebug.Node_OpDebug.{task_id}.{stream_id}.{timestamp}.
To locate the overflow/underflow cause, follow the instructions in Analyzing the Data File of an Overflow/Underflow Operator.
- The task_id in this file name is not the task ID of the overflow/underflow operator and can be ignored.
- When the command is executed in a Docker container, the generated data is stored in the container.
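As an illustration of the two naming patterns above, the following sketch splits a file name into its five fields and tells the Opdebug data file apart from an ordinary operator dump file. The names DumpName, parse_dump_name, and is_opdebug are hypothetical helpers. The sketch assumes the general pattern always holds; the text says files are only "generally" named this way, so other layouts are rejected with an error.

```python
from typing import NamedTuple


class DumpName(NamedTuple):
    op_type: str
    op_name: str
    task_id: int
    stream_id: int
    timestamp: int


def parse_dump_name(filename: str) -> DumpName:
    """Split {op_type}.{op_name}.{task_id}.{stream_id}.{timestamp} into fields."""
    # Periods inside op_type and op_name are already replaced by '_',
    # so a name that follows the general pattern has exactly five parts.
    parts = filename.split(".")
    if len(parts) != 5:
        raise ValueError(f"unexpected dump file name: {filename!r}")
    op_type, op_name, task_id, stream_id, timestamp = parts
    return DumpName(op_type, op_name, int(task_id), int(stream_id), int(timestamp))


def is_opdebug(name: DumpName) -> bool:
    """True for the Opdebug.Node_OpDebug overflow/underflow data file."""
    return name.op_type == "Opdebug" and name.op_name == "Node_OpDebug"


info = parse_dump_name("Opdebug.Node_OpDebug.1.5.1732082705016774")
print(info, is_opdebug(info))
```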
Analyzing the Dump File of an Overflow/Underflow Operator
- Upload the collected data files to the Toolkit installation environment.
- Go to the ${INSTALL_DIR}/tools/operator_cmp/compare directory. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.
- Run the msaccucmp.py script to convert the dump file into a NumPy file. The following is an example:
python3 msaccucmp.py convert -d /home/HwHiAiUser/dump -out /home/HwHiAiUser/dumptonumpy -v 2
The -d option accepts either a single dump file or a directory. If a directory is specified, all dump files in it are converted.
- Use Python to save the NumPy data into a .txt file. The following is an example:
$ python3
>>> import numpy as np
>>> a = np.load("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1.5.1732082705016774.output.0.npy")
>>> b = a.flatten()
>>> np.savetxt("/home/HwHiAiUser/dumptonumpy/Pooling.pool1.1.5.1732082705016774.output.0.txt", b)
The .txt file does not retain the shape or dtype of the original array. For details, visit the NumPy website.
Analyzing the Data File of an Overflow/Underflow Operator
Because the generated overflow/underflow data is in binary format, you need to convert the binary file into a readable format, such as JSON.
- Upload the overflow/underflow data file of the operator to the Toolkit installation environment.
You are advised to go to the data_index directory with the minimum value, and use the dump file with the minimum {timestamp} for data parsing.
- Go to the ${INSTALL_DIR}/tools/operator_cmp/compare directory. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.
- Run the parse command.
python3 msaccucmp.py convert -d /home/HwHiAiUser/opdebug/Opdebug.Node_OpDebug.1.5.1732082705016774 -out /home/HwHiAiUser/result
The key options are described as follows:
- -d: path of the overflow/underflow data file, including the file name
- -out: directory of the parsing result. If it is not specified, the current directory is used.
- Find the parsed result as follows.
- If both AI Core operator overflow/underflow detection and Atomic Add overflow/underflow detection are enabled, only the earliest overflow/underflow record is displayed.
- In the following example, the AI Core operator overflow/underflow information appears first. Consequently, any overflow/underflow information from an Atomic Add operation will be suppressed and not displayed.
{
    "DHA Atomic Add": {
        "model_id": 0,
        "stream_id": 0,
        "task_id": 0,
        "task_type": 0,
        "pc_start": "0x0",
        "para_base": "0x0",
        "status": 0
    },
    "L2 Atomic Add": {
        "model_id": 0,
        "stream_id": 0,
        "task_id": 0,
        "task_type": 0,
        "pc_start": "0x0",
        "para_base": "0x0",
        "status": 0
    },
    "AI Core": {
        "model_id": 514,
        "stream_id": 563,
        "task_id": 57,
        "task_type": 0,
        "pc_start": "0x1008005b0000",
        "para_base": "0x100800297000",
        "kernel_code": "0x1008005ae000",
        "block_id": 1,
        "status": 32
    }
}
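The advice in the first step of this procedure (start from the data_index directory with the minimum value and the data file with the minimum {timestamp}) can be sketched as follows. The function name earliest_opdebug_file is hypothetical. The sketch assumes that data_index directories have purely numeric names, that it is given the {model_id} directory as input, and that the timestamp is the last dot-separated field of the file name.

```python
from pathlib import Path
from typing import Optional


def earliest_opdebug_file(model_id_dir: Path) -> Optional[Path]:
    """Return the Opdebug data file with the smallest timestamp in the
    smallest data_index directory that contains one, or None."""
    # Sort numerically, so that data_index 2 comes before 10.
    index_dirs = sorted(
        (d for d in model_id_dir.iterdir() if d.is_dir() and d.name.isdigit()),
        key=lambda d: int(d.name),
    )
    for index_dir in index_dirs:
        candidates = [
            f for f in index_dir.iterdir()
            if f.name.startswith("Opdebug.Node_OpDebug.")
        ]
        if candidates:
            # The timestamp is the last dot-separated field of the file name.
            return min(candidates, key=lambda f: int(f.name.rsplit(".", 1)[1]))
    return None
```

The returned path can then be passed to the -d option of the parse command shown above.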
Complete field description:
The following lists all the fields that can be parsed. The fields actually available may vary by product.
- model_id: ID of the model where an overflow/underflow operator is located
- stream_id: ID of the stream where an overflow/underflow operator is located
- task_id: task ID of an overflow/underflow operator.
- task_type: task type of an overflow/underflow operator.
- context_id: context ID, which is reserved.
- thread_id: thread ID, which is reserved.
- pc_start: start address of the code program of the overflow/underflow operator.
- para_base: parameter start address of the overflow/underflow operator.
- src_addr: communication source address in the SDMA transmission scenario.
- dst_addr: communication destination address in the SDMA transmission scenario.
- channel_id: channel ID.
- core_id: AI Core ID.
- kernel_code: start address of the code program of the overflow/underflow operator, which is equivalent to pc_start.
- block_id: block ID of an overflow/underflow operator.
- status: value of the AI Core status register, which contains the overflow/underflow information. You can analyze the value of status to obtain the specific overflow/underflow error.
Analyzing the Error Cause Based on the Status
- The status field that reflects the AI Core operator overflow/underflow detection result is in decimal format. You need to convert it into the hexadecimal format before locating the fault.
For example, assume that the value of status is 272. Its hexadecimal equivalent is 0x00000110, which decomposes into 0x00000010 + 0x00000100, so both of the corresponding error causes in the following list apply.
- 0x00000008: overflow/underflow when the sign bit of the minimum negative value of a signed integer is inverted (negation)
- 0x00000010: integer addition, subtraction, multiplication, or multiply-add overflow/underflow
- 0x00000020: floating-point overflow/underflow
- 0x00000080: negative input for the conversion of floating-point to unsigned data
- 0x00000100: Float32 to Float16 conversion or 32-bit signed integer to Float16 conversion overflow/underflow
- 0x00000400: cube accumulation overflow/underflow
Note: Each of the preceding exceptions corresponds to one bit of the hexadecimal status value, so a single status value may indicate a combination of several exceptions.
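The bit decomposition described above can be sketched as follows. The function name decode_ai_core_status is hypothetical, the masks are copied from the preceding list, and the descriptions are shortened paraphrases. Bits that do not appear in the list are returned separately and are not interpreted.

```python
# Bit masks copied from the list above; descriptions are abbreviated.
AI_CORE_STATUS_BITS = {
    0x00000008: "negation overflow of the minimum negative signed integer",
    0x00000010: "integer arithmetic overflow/underflow",
    0x00000020: "floating-point overflow/underflow",
    0x00000080: "negative input for float-to-unsigned conversion",
    0x00000100: "Float32 or int32 to Float16 conversion overflow/underflow",
    0x00000400: "cube accumulation overflow/underflow",
}


def decode_ai_core_status(status: int):
    """Return (matched, unknown): the (mask, meaning) pairs whose bit is set
    in status, and any remaining bits that are not in the table."""
    matched = [
        (mask, text) for mask, text in AI_CORE_STATUS_BITS.items() if status & mask
    ]
    known = 0
    for mask in AI_CORE_STATUS_BITS:
        known |= mask
    return matched, status & ~known


# The status value from the example in the text: 272 == 0x110.
matched, unknown = decode_ai_core_status(272)
for mask, text in matched:
    print(f"0x{mask:08x}: {text}")
```

For the status value 32 shown in the parsed result earlier, the same function reports only 0x00000020, that is, a floating-point overflow/underflow.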
- The status field that reflects the DHA Atomic Add overflow/underflow detection result is in decimal format. If the value is greater than 0, DHA Atomic Add overflows/underflows.
- The status field that reflects the L2 Atomic Add overflow/underflow detection result is in decimal format. If the value is greater than 0, L2 Atomic Add overflows/underflows.
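The three checks above can be combined into one pass over the parsed result. The following is a minimal sketch that assumes the JSON layout shown in the parsing example (one object per detection source, each with a decimal status field). The function name overflow_sources is hypothetical, and the sample data is abbreviated to the fields the check needs.

```python
import json


def overflow_sources(parsed: dict) -> list:
    """Return the names of the detection sources whose status is greater than 0."""
    return [
        name
        for name, record in parsed.items()
        if isinstance(record, dict) and record.get("status", 0) > 0
    ]


# Abbreviated from the parsed result shown earlier in this section.
example = json.loads("""
{
    "DHA Atomic Add": {"model_id": 0, "stream_id": 0, "task_id": 0, "status": 0},
    "L2 Atomic Add": {"model_id": 0, "stream_id": 0, "task_id": 0, "status": 0},
    "AI Core": {"model_id": 514, "stream_id": 563, "task_id": 57, "status": 32}
}
""")

print(overflow_sources(example))
```

For an AI Core entry, the status value still needs to be decomposed into its hexadecimal bits to find the specific cause. For the two Atomic Add entries, a value greater than 0 is already the result.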