Overview

This document describes how to quickly get started with development tools for training scenarios, including tools for model development and migration, accuracy debugging, and performance tuning throughout the training process.

Main tools:

  • msprobe:

    For large language models (LLMs) developed on Ascend or migrated from GPUs to the Ascend NPU environment, issues such as overflows, abnormal loss curves, or non-convergence may occur during training. Training loss and other metrics alone cannot pinpoint the faulty module. This document introduces MindStudio Probe (msprobe), an accuracy debugging tool for quick fault demarcation. The tool is referred to as msprobe in the following sections.

    msprobe is the accuracy tool of the MSTT tool chain. It collects training accuracy data in both the benchmark environment (such as a debugged CPU, GPU, or Ascend NPU environment) and the actual Ascend NPU environment, then compares the two data sets to find the differences.

    msprobe provides various functions. For details, see msprobe Instructions.

  • MindSpore Profiler: collects profile data in MindSpore training scenarios.
  • Ascend PyTorch Profiler: collects profile data in PyTorch training scenarios.
  • msprof-analyze: collects statistics, analyzes data, and outputs tuning suggestions.
  • MindStudio Insight: visualizes profile data.

Procedure

Table 1 Main procedure and tool operation procedure

Procedure

Tool and Operation Procedure

Model development and migration

Currently, no migration tool is provided for MindSpore training scenarios. This document uses the training script developed in the Ascend NPU environment as an example.

In the PyTorch training scenario, the analysis and migration tool is used to migrate models from GPUs to the Ascend NPU environment.

Model accuracy debugging

msprobe is used to perform the following operations during model accuracy debugging:

  1. Pre-training configuration check

    Identifies configuration differences between the benchmark environment and the Ascend NPU environment that can affect accuracy.

  2. (Optional) Training status monitoring

    Monitors computation, communication, and optimizer exceptions during training.

  3. Accuracy data collection

    Collects the input and output data of forward and backward propagation at the API or module level during training.

  4. Accuracy pre-check

    Scans API data to identify APIs with accuracy issues.

  5. Accuracy comparison

    Compares the API data on NPUs with that in the benchmark environment to quickly locate accuracy issues.
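
The comparison step can be illustrated with a minimal, tool-independent sketch. It is not msprobe's implementation or API. It only shows the kind of per-API metric (cosine similarity and maximum absolute error) that such a comparison can rely on. The function names, the dump layout (API name mapped to a flattened list of values), the sample API names, and the 0.99 threshold are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def compare_outputs(npu: Sequence[float], bench: Sequence[float]) -> Dict[str, float]:
    """Compare one flattened API output from the NPU run with the benchmark run."""
    if len(npu) != len(bench):
        raise ValueError("outputs must have the same number of elements")
    if not npu:
        raise ValueError("outputs must not be empty")

    dot = sum(a * b for a, b in zip(npu, bench))
    norm_npu = math.sqrt(sum(a * a for a in npu))
    norm_bench = math.sqrt(sum(b * b for b in bench))

    if norm_npu == 0.0 and norm_bench == 0.0:
        cosine = 1.0  # both outputs are all zeros, so they agree
    elif norm_npu == 0.0 or norm_bench == 0.0:
        cosine = 0.0  # only one output is all zeros, so they disagree
    else:
        cosine = dot / (norm_npu * norm_bench)

    max_abs_err = max(abs(a - b) for a, b in zip(npu, bench))
    return {"cosine": cosine, "max_abs_err": max_abs_err}


def find_suspects(
    npu_dump: Dict[str, Sequence[float]],
    bench_dump: Dict[str, Sequence[float]],
    cosine_threshold: float = 0.99,  # hypothetical threshold, tune per model
) -> List[str]:
    """Return the names of APIs whose cosine similarity is below the threshold."""
    suspects = []
    for name, npu_values in npu_dump.items():
        if name not in bench_dump:
            continue  # API was not captured in the benchmark run
        metrics = compare_outputs(npu_values, bench_dump[name])
        if metrics["cosine"] < cosine_threshold:
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    # Hypothetical API names and values, for illustration only.
    npu_dump = {"linear.0.forward": [1.0, 2.0, 3.0], "relu.0.forward": [1.0, 0.0, -1.0]}
    bench_dump = {"linear.0.forward": [1.0, 2.0, 3.0], "relu.0.forward": [1.0, 0.0, 1.0]}
    print(find_suspects(npu_dump, bench_dump))
```

In this sketch the first API in execution order that falls below the threshold is usually the most useful starting point, because later APIs may only be inheriting its error.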

Model performance tuning

In the MindSpore training scenario, perform the following operations to tune the model performance:

  1. Use MindSpore Profiler to collect profile data.
  2. Use msprof-analyze to analyze profile data.
  3. Use MindStudio Insight to visualize profile data.

In the PyTorch training scenario, perform the following operations to tune the model performance:

  1. Use Ascend PyTorch Profiler to collect profile data.
  2. Use msprof-analyze to analyze profile data.
  3. Use MindStudio Insight to visualize profile data.

Environment Setup

  1. Prepare a training server based on an Ascend 910 AI Processor and install the NPU driver and firmware as instructed in "NPU Driver and Firmware Installation".
  2. Install the CANN Toolkit package and ops operator package of the required version, and configure CANN environment variables. For details, see CANN Software Installation Guide.
  3. Install a framework.

    In the MindSpore training scenario, versions 2.6.0 and 2.7.0 are used as examples. For details, see MindSpore Installation Guide.

    In the PyTorch training scenario, version 2.1.0 is used as an example. For details, see "Plugin Development (PyTorch)".

  4. Configure the environment variables.

    After the CANN package is installed, log in to the environment as the CANN operating user and run the source ${INSTALL_DIR}/set_env.sh command to set environment variables. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.
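
Before launching a training job, it can help to confirm that the script referenced in step 4 exists at the expected location. The following is a minimal sketch, not part of CANN. The helper names are hypothetical, and reading INSTALL_DIR from the process environment is an assumption, because this document uses ${INSTALL_DIR} only as a placeholder. The fallback path is the root-user default given above.

```python
import os
from pathlib import PurePosixPath
from typing import Optional

# Default path for an installation performed by the root user (from step 4 above).
DEFAULT_INSTALL_DIR = "/usr/local/Ascend/cann"


def set_env_script(install_dir: Optional[str] = None) -> str:
    """Return the expected path of set_env.sh for the given CANN directory."""
    # Assumption: fall back to an INSTALL_DIR environment variable, then the default.
    chosen = install_dir or os.environ.get("INSTALL_DIR") or DEFAULT_INSTALL_DIR
    return str(PurePosixPath(chosen) / "set_env.sh")


def source_command(install_dir: Optional[str] = None) -> str:
    """Return the shell command that loads the CANN environment variables."""
    return f"source {set_env_script(install_dir)}"


if __name__ == "__main__":
    script = set_env_script()
    status = "found" if os.path.isfile(script) else "missing"
    print(f"{script}: {status}")
    print(f"Run in your shell: {source_command()}")
```

The sketch only reports whether the file is present and prints the command. The environment variables take effect only when the command is run in the shell that later starts the training script.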