Troubleshooting
Troubleshooting Process
Figure 1 shows the process of locating accuracy issues in a traditional model.
- Check whether the hardware environment configuration, CANN package version, and other dependency library versions match.
- If yes, keep troubleshooting.
- If no, reconfigure the environment, software packages, and dependency libraries, and check whether the accuracy issue is resolved. If the issue persists, keep troubleshooting.
- Select a bad case with obvious accuracy issues for analysis.
- Use the traditional model accuracy comparison tool (msit debug compare) to collect and compare the full dump data of the model.
- If the comparison shows minor errors, check the preprocessing and postprocessing processes to determine whether the errors are accumulated.
- If the comparison shows major errors, find the first operator that does not meet the accuracy standard and analyze the root cause of the error.
- If the error is caused by operator implementation, contact technical support.
Troubleshooting Procedure
- Check the environment configuration.
If inference with the traditional model produces normal accuracy on one hardware or environment setup but not on others, verify that all component versions match one another and support the current hardware. For details about the version mapping, see "Version Mapping" in MindStudio Release Notes.
To check the CANN version, go to the directory that contains the ascend_toolkit_install.info file and run the following command. For example, on an AArch64 Linux system, the file is in the aarch64-linux folder under the CANN installation path:
cat ascend_toolkit_install.info
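When the same check has to be repeated across several machines, the file contents can be read programmatically. The following is a minimal sketch, assuming the file consists of key=value lines; the key names used here (such as version and arch) and the sample contents are illustrative assumptions, not taken from the CANN documentation.

```python
def parse_install_info(text):
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip()
    return info


# Hypothetical file contents, for illustration only.
sample = "package_name=example-toolkit\nversion=0.0.0\narch=aarch64\n"
info = parse_install_info(sample)
print(info.get("version"), info.get("arch"))
```

Comparing the resulting dictionaries from a working and a failing environment makes version mismatches easy to spot.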
- Use the msit debug compare tool to quickly obtain the model comparison result.
The one-click full-process accuracy comparison (inference) function automates accuracy comparison. Given the original model (ONNX), the corresponding offline model, and input data, it outputs the network-wide comparison result. Alternatively, you can supply operator data that has already been dumped on the CPU and NPU for comparison. For details about how to install the msit debug compare tool, see msit debug compare User Guide.
- Run the following command to start the accuracy comparison:
msit debug compare -gm ${golden_model_path} -om ${om_model_path} [optional parameter]
For details about the parameters, see Input Options.
If input data from real scenarios is available, use it for debugging. The following is an example:
msit debug compare -gm /home/HwHiAiUser/onnx_produce_data/resnet_official.onnx -om /home/HwHiAiUser/onnx_produce_data/model/resnet50.om \
-i /home/HwHiAiUser/result/test/input_0.bin -c /usr/local/Ascend/cann -o /home/HwHiAiUser/result/test
-i (or --input) specifies the input data path of the model. Separate multiple inputs with commas (,), for example, /home/input_0.bin,/home/input_1.bin.
In this scenario, the batch size used for inference is calculated from the shape of the input file and the model-defined shape. The shape of the input file may differ from the model-defined input shape only in the batch dimension; all other dimensions must be identical. If the input is an .npy file, the tool automatically converts it to a .bin file.
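The shape constraint above can be checked before running the tool. The following is a minimal sketch, assuming that the batch dimension is axis 0 and that the input batch must be a whole multiple of the model-defined batch; both are assumptions made for illustration, and batch_multiple is a hypothetical helper, not part of msit.

```python
def batch_multiple(input_shape, model_shape):
    """Return input batch / model batch, or raise if the shapes are incompatible."""
    if len(input_shape) != len(model_shape):
        raise ValueError("input and model shapes have different ranks")
    if tuple(input_shape[1:]) != tuple(model_shape[1:]):
        raise ValueError("only the batch dimension (axis 0) may differ")
    if model_shape[0] <= 0:
        raise ValueError("model batch dimension must be positive")
    quotient, remainder = divmod(input_shape[0], model_shape[0])
    if quotient < 1 or remainder != 0:
        raise ValueError("input batch is not a whole multiple of the model batch")
    return quotient


# An input holding 8 images for a model defined with batch size 1.
print(batch_multiple((8, 3, 224, 224), (1, 3, 224, 224)))
```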
- For details about the directory structure of flushed data, see Comparison Result Description.
The result_{timestamp}.csv file is the comparison result file, which contains the dump data comparison results of all operators on the entire network. It is used to analyze model accuracy issues. The meaning of the comparison result is the same as that of the basic accuracy comparison tool. For details about each field, see "Parameters in the Complete Comparison Result" in the Accuracy Debugging Tool Guide.
Table 1 shows the core comparison indicators. If any indicator falls outside its threshold range, the accuracy is abnormal.
Table 1 Core comparison indicators
Error Comparison Algorithm | Description | Threshold
CosineSimilarity | Cosine similarity, which is the result of the cosine similarity comparison. | > 0.99
RelativeEuclideanDistance | Relative Euclidean distance, which is the result of the relative Euclidean distance comparison. | < 0.05
KullbackLeiblerDivergence | Kullback-Leibler divergence, which is the result of the Kullback-Leibler divergence comparison. | < 0.005
RootMeanSquareError | Root-mean-square error (RMSE). | < 1.0
MeanRelativeError | Mean relative error. | < 1.0
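To build intuition for what these indicators measure, the following sketch computes three of them on flattened tensors using textbook definitions. The exact formulas used by msit debug compare may differ (for example, in how the relative distance is normalized); the definitions here, including normalizing by the golden tensor, are assumptions for illustration.

```python
import math


def cosine_similarity(golden, actual):
    """Cosine of the angle between two flattened tensors (NaN for a zero vector)."""
    dot = sum(g * a for g, a in zip(golden, actual))
    norm_g = math.sqrt(sum(g * g for g in golden))
    norm_a = math.sqrt(sum(a * a for a in actual))
    if norm_g == 0.0 or norm_a == 0.0:
        return float("nan")
    return dot / (norm_g * norm_a)


def relative_euclidean_distance(golden, actual):
    """Euclidean distance between the tensors, divided by the golden tensor's norm."""
    diff = math.sqrt(sum((g - a) ** 2 for g, a in zip(golden, actual)))
    norm_g = math.sqrt(sum(g * g for g in golden))
    if norm_g == 0.0:
        return float("nan")
    return diff / norm_g


def root_mean_square_error(golden, actual):
    """Square root of the mean squared element-wise difference."""
    squared = sum((g - a) ** 2 for g, a in zip(golden, actual))
    return math.sqrt(squared / len(golden))


golden = [1.0, 2.0, 3.0]
actual = [1.0, 2.0, 3.1]
print(cosine_similarity(golden, actual))
print(relative_euclidean_distance(golden, actual))
print(root_mean_square_error(golden, actual))
```

Cosine similarity ignores magnitude, so a tensor scaled by a constant still scores close to 1; the distance-based indicators catch that case, which is one reason to read the indicators together.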
- You can first check the CosineSimilarity and RelativeEuclideanDistance indicators to assess the overall results. CosineSimilarity indicates whether the directions of two high-dimensional tensors are the same, and RelativeEuclideanDistance measures the distance between two vectors.
- When evaluating whether the model accuracy meets the requirements, the primary step is to check whether the overall network output meets the accuracy criteria. If it does, ignore any issues with intermediate nodes, including operator overflow. If not, examine each faulty node individually. For details about these metrics, see Comparing the Results.
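The threshold check against Table 1 can be automated over the comparison result file. The following is a minimal sketch: the indicator column names come from Table 1, while the operator name column (NPUDump here) and the sample rows are assumptions for illustration and should be adjusted to the actual CSV header.

```python
import csv
import io

# Pass conditions taken from Table 1.
THRESHOLDS = {
    "CosineSimilarity": lambda v: v > 0.99,
    "RelativeEuclideanDistance": lambda v: v < 0.05,
    "KullbackLeiblerDivergence": lambda v: v < 0.005,
    "RootMeanSquareError": lambda v: v < 1.0,
    "MeanRelativeError": lambda v: v < 1.0,
}


def failing_indicators(row):
    """Return the indicators in one CSV row that fall outside their threshold."""
    failed = []
    for name, passes in THRESHOLDS.items():
        raw = row.get(name)
        if raw is None or raw.strip() == "":
            continue  # column absent or empty: nothing to check
        try:
            value = float(raw)
        except ValueError:
            continue  # non-numeric cell: skipped in this sketch
        if not passes(value):  # a NaN value fails every comparison
            failed.append(name)
    return failed


def scan(csv_text, name_column="NPUDump"):
    """Return (operator name, failed indicators) for every abnormal row."""
    report = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        failed = failing_indicators(row)
        if failed:
            report.append((row.get(name_column, ""), failed))
    return report


# Hypothetical rows, for illustration only.
sample = (
    "NPUDump,CosineSimilarity,RelativeEuclideanDistance\n"
    "conv1,0.999,0.01\n"
    "relu1,0.95,0.2\n"
)
print(scan(sample))
```

For a real file, read its contents with open(path, newline="") and pass the text to scan.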
- Analyze the causes.
- Core analysis
The order of operators shown in the comparison result file differs from their actual execution sequence. It does not fully reflect the model's topological structure. Analyze the file to locate the first operator with accuracy issues, check its location in the model topology, and check whether the accuracy of the operator and its upstream operator meets the standard. The following situations are involved:
- If the inputs of an operator are consistent but the outputs are inconsistent, the operator has accuracy issues. In this case, analyze the cause of the error.
- If an operator has multiple inputs and outputs, NaN occurs in some inputs, and the other inputs are consistent but the outputs are inconsistent, the operator may have accuracy issues. In this case, locate the output of the previous node of the operator based on the model structure and use the output as the input of the next node for further analysis.
- If an operator has multiple inputs and outputs, the inputs are consistent, NaN occurs in some outputs, and the subsequent operator inputs are inconsistent, the operator may have accuracy issues.
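Because the result file does not list operators in execution order, it helps to order the failing operators by the model topology before picking the first one. The following is a minimal sketch that uses graphlib from the Python standard library (Python 3.9 or later). It assumes the operator dependencies have already been extracted from the model into a dictionary; the upstream mapping and the operator names are hypothetical.

```python
from graphlib import TopologicalSorter


def first_failing_operator(upstream, failing):
    """Return the earliest failing operator in topological order, or None.

    upstream maps each operator name to the names of the operators that
    feed it. Every operator that precedes the returned one passed the
    check, so its inputs can be treated as consistent.
    """
    for op in TopologicalSorter(upstream).static_order():
        if op in failing:
            return op
    return None


# Hypothetical four-operator chain in which the last two operators fail.
upstream = {
    "conv1": [],
    "relu1": ["conv1"],
    "pool1": ["relu1"],
    "fc1": ["pool1"],
}
print(first_failing_operator(upstream, {"fc1", "pool1"}))
```

In this example pool1 is reported, even if fc1 appears first in the result file, because the error at fc1 may simply be inherited from pool1.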
- Analysis of accumulated errors and pre- and postprocessing issues
- If the comparison result does not show a sharp decrease in accuracy yet produces different network outputs, and the preprocessing and postprocessing methods of the inference process are the same as those of the benchmark, the issue stems from accumulated errors. In this case, you can try to improve the model accuracy by adding accuracy control parameters during ATC conversion. For details, see "Operator Tuning Options" in the ATC Instructions.
- If no obvious accuracy drop occurs in the comparison result and the network outputs are the same but the service outputs are different, check whether the input data is inconsistent and whether the output data processing modes are different based on the service scenario.
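One quick way to check whether the input data is inconsistent is to compare the files fed to the benchmark and to the inference service byte for byte. The following is a minimal sketch using file hashes; the function names are hypothetical, and identical hashes only show that the raw inputs match, not that the postprocessing does.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_input(path_a, path_b):
    """Return True if the two input files have identical contents."""
    return file_digest(path_a) == file_digest(path_b)
```

If the hashes differ, inspect the preprocessing steps that produced each file before looking further into the model.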
