Overview
This document describes how to quickly get started with development tools for training scenarios, including tools for model development and migration, accuracy debugging, and performance tuning throughout the training process.
Main tools:
- msprobe:
Large language models (LLMs) that are developed on Ascend or migrated from GPUs to the Ascend NPU environment may run into accuracy issues during training, such as overflows, abnormal loss curves, or non-convergence. Training loss and other high-level metrics alone cannot pinpoint the faulty module. This document introduces MindStudio Probe (msprobe), an accuracy debugging tool, for quick fault demarcation. The tool is referred to as msprobe in the following sections.
msprobe is the accuracy tool of the MSTT tool chain. It collects and compares the training accuracy data in the benchmark environment (such as the debugged CPU, GPU, or Ascend NPU environment) and the actual Ascend NPU environment to find the differences.
msprobe provides various functions. For details, see msprobe Instructions.
- MindSpore Profiler: collects profile data in MindSpore training scenarios.
- Ascend PyTorch Profiler: collects profile data in PyTorch training scenarios.
- msprof-analyze: collects statistics, analyzes data, and outputs tuning suggestions.
- MindStudio Insight: visualizes profile data.
Procedure
Environment Setup
- Prepare a training server based on an Ascend 910 AI Processor and install the NPU driver and firmware as instructed in "NPU Driver and Firmware Installation".
- Install the CANN Toolkit package and ops operator package of the required version, and configure CANN environment variables. For details, see CANN Software Installation Guide.
- Install a framework.
In the MindSpore training scenario, versions 2.6.0 and 2.7.0 are used as examples. For details, see MindSpore Installation Guide.
In the PyTorch training scenario, version 2.1.0 is used as an example. For details, see "Plugin Development (PyTorch)".
- Configure the environment variables.
After the CANN package is installed, log in to the environment as the CANN operating user and run the source ${INSTALL_DIR}/set_env.sh command to set the environment variables. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.
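The environment variable step can be sketched as a small guarded shell snippet. It assumes the default root installation path /usr/local/Ascend/cann given in the example above. The fallback value and the file existence check are illustrative additions, not part of the official procedure, so adjust INSTALL_DIR to match the actual installation.

```shell
# Assumption: CANN was installed by the root user to the default path.
# Export INSTALL_DIR beforehand to point at a different installation.
INSTALL_DIR="${INSTALL_DIR:-/usr/local/Ascend/cann}"

if [ -f "${INSTALL_DIR}/set_env.sh" ]; then
    # "." is the POSIX equivalent of "source": it loads the CANN
    # environment variables into the current shell session.
    . "${INSTALL_DIR}/set_env.sh"
    echo "Loaded CANN environment from ${INSTALL_DIR}"
else
    # Illustrative guard: report a wrong path instead of failing silently.
    echo "set_env.sh not found under ${INSTALL_DIR}; check the CANN installation path" >&2
fi
```

The variables set this way apply only to the current shell session, so the command needs to be run again in each new session.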