Problem Information Collection

Before troubleshooting, collect accurate fault information. For details, see Table 1.

Table 1 Problem information collection template

Type

Main Information

Description

Basic information

Model type

Model structure, for example, Llama-like, GPT-like, or MoE-like.

Operation scale

Number of cards and devices.

Parallelism strategies

Specific parameter settings of each parallelism strategy in use.

Framework and version

  • Specify the CANN, MindSpore, and PyTorch versions.
  • Check whether any of these versions changed recently, and determine whether the issue first appeared before or after the change.
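The Python-side framework versions can be gathered with a short script so that they are recorded consistently each time. The following is a minimal sketch, not an official tool. The helper name `collect_framework_versions` is hypothetical, and the package names `torch`, `torch_npu`, and `mindspore` are assumed to match how the frameworks were installed in your environment. CANN is not a Python package, so its version must be recorded separately from the CANN installation.

```python
from importlib import metadata


def collect_framework_versions(packages=("torch", "torch_npu", "mindspore")):
    """Return {package name: installed version, or None if not installed}.

    The default package names are assumptions; adjust them to your environment.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this Python environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_framework_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running the script before and after an upgrade and comparing the two outputs helps determine whether the issue coincides with a version change.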

Key issue description

Issue scenarios

During training or inference, the model performance falls short of expectations, lags behind competing products, or behaves abnormally. Typical scenarios:
  • Performance is lower than expected, for example, compared with a competing product. This generally occurs after model migration.
  • During long-running, otherwise stable model training, performance fluctuates either randomly or when specific events occur.
  • Cluster linearity is insufficient. After the cluster is scaled out, the model performance does not increase as expected.
  • The performance of the pure model is abnormal under the same configuration. For details, see the training problem troubleshooting in this document.
  • Serving scheduling requires tuning.

Current performance metrics

Clarify the current performance problem. For details about the priorities of computing performance metrics, see the PyTorch Training Model Porting and Tuning Guide.

Tuning objectives

Performance tuning objectives

Specify the tuning objective and its source, for example, competitor benchmarking or linear scalability calculations.

NOTE:

If the tuning involves methods such as increasing the batch size, do not use metrics such as the single-step time, because the time per step changes with the batch size. Select proper metrics by referring to the PyTorch Training Model Porting and Tuning Guide.
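The following sketch illustrates why single-step time is a poor metric when the batch size changes, and how a linear scalability target can be derived from a baseline run. It is an illustration under stated assumptions, not a method prescribed by the guide. The function names are hypothetical, all numbers are made up, and linearity is assumed to be defined as the measured throughput divided by the ideal, linearly scaled throughput.

```python
def throughput(global_batch_size, step_time_s, seq_len=1):
    """Samples per second, or tokens per second when seq_len is given."""
    if step_time_s <= 0:
        raise ValueError("step_time_s must be positive")
    return global_batch_size * seq_len / step_time_s


def linearity(base_throughput, base_devices, scaled_throughput, scaled_devices):
    """Measured throughput relative to ideal linear scaling (assumed definition)."""
    ideal = base_throughput * scaled_devices / base_devices
    return scaled_throughput / ideal


# Hypothetical measurements, for illustration only.
# Doubling the batch size makes each step slower (2.0 s -> 2.5 s) ...
before = throughput(global_batch_size=64, step_time_s=2.0)
after = throughput(global_batch_size=128, step_time_s=2.5)
# ... yet more samples are processed per second, so the tuning helped.
print(f"throughput before: {before:.1f} samples/s, after: {after:.1f} samples/s")

# Linear scaling target: 8 devices reach 32 samples/s, so 16 devices
# would ideally reach 64 samples/s. A measured 57.6 samples/s is about 0.9.
ratio = linearity(base_throughput=32.0, base_devices=8,
                  scaled_throughput=57.6, scaled_devices=16)
print(f"linearity: {ratio:.2f}")
```

A throughput metric stays comparable across batch sizes, whereas the step time in this example increases even though the configuration is faster overall.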