Overview

When training a large model that has been migrated from GPU to Ascend, or one developed natively on Ascend, accuracy issues such as overflow, loss curve deviation, and non-convergence often occur. The training loss alone cannot precisely demarcate these issues, so the faulty layer of the training software stack cannot be identified from it. The accuracy debugging tool msprobe (MindStudio Probe) helps users quickly demarcate and locate such issues.

msprobe collects multi-dimensional data from the training process, such as module-level and API-level forward and backward inputs and outputs, and weight gradients. The collected data can then be compared with benchmark data from a normal training run for further analysis.

msprobe is an accuracy tool in the mstt toolchain. For details, see MindStudio Training Tools.

This section describes how to use msprobe for accuracy data collection, pre-check, and comparison.

Accuracy Data Collection

Accuracy data collection is the dump function of msprobe. It collects forward and backward input and output data at the API or module level during model training. The collected data includes the module hierarchy, the actual input and output data and statistics of each module or API, and the call stack of each module or API. For details, see Accuracy Collection.
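To keep dump files small, a statistics-mode collection records lightweight per-tensor summaries rather than full tensor data. The sketch below illustrates the kind of statistics such a dump might record; the function name, statistic set, and record layout are illustrative assumptions, not msprobe's actual implementation:

```python
import math

def tensor_statistics(values):
    """Summarize a flattened tensor the way a statistics-mode dump might:
    record max, min, mean, and L2 norm instead of the full data.
    (Illustrative sketch; msprobe's actual record format may differ.)"""
    n = len(values)
    return {
        "max": max(values),
        "min": min(values),
        "mean": sum(values) / n,
        "l2norm": math.sqrt(sum(v * v for v in values)),
    }

# A statistic like max is enough to flag an overflow (inf/nan) without
# storing the whole tensor.
stats = tensor_statistics([1.0, -2.0, 3.0, 0.5])
```

Comparing such statistics between an NPU run and a benchmark run is much cheaper than comparing full tensors, at the cost of coarser localization.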

Accuracy Pre-check

The msprobe pre-check function creates a test case for each API in the network, checks its accuracy, and determines whether the API's accuracy on the NPU meets requirements based on different comparison algorithms (such as the absolute threshold method and the benchmark comparison method). In this way, APIs with accuracy differences can be found quickly. For details, see Accuracy Pre-check.
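A simplified sketch of an absolute-threshold check: an API output "passes" if every element is within a per-dtype tolerance of the CPU golden output. The threshold table and pass criterion here are hypothetical; msprobe's real thresholds and algorithms vary by API and dtype:

```python
# Hypothetical per-dtype absolute error thresholds (illustrative values only).
THRESHOLDS = {"float16": 1e-3, "float32": 1e-6}

def absolute_threshold_check(npu_out, cpu_out, dtype):
    """Pass if the maximum element-wise absolute error between the NPU
    output and the CPU (golden) output stays under the dtype threshold.
    Sketch of the idea behind the absolute threshold method, not
    msprobe's actual pre-check logic."""
    threshold = THRESHOLDS[dtype]
    max_abs_err = max(abs(a - b) for a, b in zip(npu_out, cpu_out))
    return max_abs_err <= threshold, max_abs_err

# The same 1e-4 error passes under the looser fp16 tolerance but
# fails under the tighter fp32 tolerance.
ok_fp16, _ = absolute_threshold_check([1.0001, 2.0], [1.0, 2.0], "float16")
ok_fp32, _ = absolute_threshold_check([1.0001, 2.0], [1.0, 2.0], "float32")
```

Running such a case per API isolates each operator from the rest of the network, so a failure points directly at the operator rather than at accumulated upstream error.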

Accuracy Comparison

The msprobe comparison function depends on the data collected by the dump tool. It calculates error metrics (such as cosine similarity, the proportion of elements with relative error below 1‰, and maximum error) between the NPU data and data from a benchmark device (such as the CPU or GPU), and marks suspicious APIs or modules with abnormal accuracy so that root causes can be located quickly. For details, see Accuracy Comparison.
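The metrics above can be sketched as follows. The definitions (cosine similarity, share of elements within 1‰ relative error, maximum absolute error) follow their standard formulas; msprobe's exact formulas, thresholds, and edge-case handling may differ:

```python
import math

def compare(npu, bench, rel_tol=1e-3):
    """Error metrics between an NPU output and a benchmark (CPU/GPU) output:
    cosine similarity, proportion of elements with relative error < 1 per
    mille, and maximum absolute error. Illustrative sketch of the metric
    definitions, not msprobe's actual comparison code."""
    dot = sum(a * b for a, b in zip(npu, bench))
    norm_n = math.sqrt(sum(a * a for a in npu))
    norm_b = math.sqrt(sum(b * b for b in bench))
    cosine = dot / (norm_n * norm_b)
    # Element passes if |a - b| <= rel_tol * |b| (relative to the benchmark).
    rel_ok = sum(1 for a, b in zip(npu, bench)
                 if abs(a - b) <= rel_tol * abs(b)) / len(npu)
    max_abs_err = max(abs(a - b) for a, b in zip(npu, bench))
    return {"cosine": cosine, "rel_lt_1e-3": rel_ok, "max_abs_err": max_abs_err}

metrics = compare([1.0005, 2.0, 3.0], [1.0, 2.0, 3.0])
```

A cosine similarity near 1.0 with a low in-tolerance proportion is a typical "suspicious" signature: the overall direction matches while individual elements drift, which points at the marked API or module for closer inspection.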