Troubleshooting Process

During NPU training, accuracy occasionally fails to meet the requirement because of intermittent errors. For example, the loss changes abruptly in a single iteration. Such a problem cannot be reproduced reliably, and dumping data costs considerable time and memory, so comparing accuracy by dumping data is impractical. In this case, compare the model files instead. Compare every variable in the model file saved when the exception occurs with the same variable in a model file saved during normal training, and find the variable with the lowest cosine similarity. If the cosine similarity of a variable is lower than a threshold, for example, 0.98, the operator that outputs the variable is the probable source of the problem.
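
The comparison described above can be sketched as follows. This is a minimal illustration, not the tooling shipped with the product: it assumes the variables of both model files have already been loaded into dictionaries that map variable names to NumPy arrays (how they are loaded depends on the framework and is not shown). The names cosine_similarity and rank_variables are hypothetical helpers, and treating variables that contain NaN or Inf as a similarity of 0.0 is an assumption made so that they are always flagged.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity of two arrays, compared as flattened vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    # Assumption: a variable containing NaN/Inf is treated as fully dissimilar.
    if not (np.all(np.isfinite(a)) and np.all(np.isfinite(b))):
        return 0.0
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < eps:
        # At least one vector is all zeros: identical only if both are.
        return 1.0 if np.array_equal(a, b) else 0.0
    return float(np.dot(a, b) / denom)


def rank_variables(normal_vars, abnormal_vars, threshold=0.98):
    """Return (all results sorted by similarity, results below threshold)."""
    results = []
    for name in sorted(set(normal_vars) & set(abnormal_vars)):
        ref, cur = normal_vars[name], abnormal_vars[name]
        if np.shape(ref) != np.shape(cur):
            continue  # shapes differ, so the variables are not comparable
        results.append((name, cosine_similarity(ref, cur)))
    results.sort(key=lambda item: item[1])  # lowest similarity first
    suspects = [(n, s) for n, s in results if s < threshold]
    return results, suspects


if __name__ == "__main__":
    # Hypothetical variables standing in for two loaded model files.
    normal = {
        "dense/kernel": np.array([[1.0, 2.0], [3.0, 4.0]]),
        "dense/bias": np.array([0.1, 0.2]),
    }
    abnormal = {
        "dense/kernel": np.array([[1.0, 2.0], [3.0, 4.0]]),
        "dense/bias": np.array([0.2, -0.1]),
    }
    ranked, suspects = rank_variables(normal, abnormal)
    for name, sim in ranked:
        print(f"{name}: {sim:.4f}")
    print("below threshold:", [name for name, _ in suspects])
```

The first entry of the sorted list is the variable with the lowest cosine similarity. The 0.98 default follows the example value in the text and may need tuning for a given model.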