Problem Information Collection

Before troubleshooting, collect accurate fault information. For details, see Table 1.

**Table 1** Problem information collection template
Type	Main Information	Description
Basic information	Model type	Model structure, for example, Llama-like, GPT-like, or MoE-like.
	Operation scale	Number of cards and devices.
	Parallelism strategies	Specific parallel parameter configuration.
	Framework and version	Specify the CANN, MindSpore, and PyTorch versions. Check whether any of these versions changed recently and, if so, whether the issue first appeared before or after the change.
Key issue description	Issue scenarios	During training or inference, the model performance falls short of expectations, trails a competing product, or behaves abnormally. Identify which scenario applies: (1) Performance is below expectations, which usually happens after model migration. (2) Performance is lower than that of a competing product. (3) During long-running, otherwise stable training, performance fluctuates randomly or when specific events occur. (4) Cluster linearity is insufficient: after the cluster is scaled out, model performance does not increase as expected. (5) The performance of the pure model is abnormal under the same configuration. For details, see the training problem troubleshooting in this document. (6) Serving scheduling requires tuning.
	Current performance metrics	State the current performance problem in terms of measured metrics. For the priorities of computing performance metrics, see the PyTorch Training Model Porting and Tuning Guide.
Tuning objectives	Performance tuning objectives	Specify the tuning objective and its source, for example, competitor benchmarking or a linear scalability calculation. NOTE: If the tuning involves methods such as increasing the batch size, do not use metrics such as single-step time, because step times measured at different batch sizes are not comparable. Select suitable metrics by referring to the PyTorch Training Model Porting and Tuning Guide.
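
The note on metric selection can be illustrated with a small calculation. The following is a minimal sketch, not a definition taken from the referenced guide: it assumes throughput is computed as global batch size divided by step time, and that linearity is the measured speedup divided by the ideal speedup implied by the device count. The function names and sample numbers are hypothetical.

```python
def throughput(global_batch_size: int, step_time_s: float, seq_len: int = 1) -> float:
    """Samples per second, or tokens per second when seq_len is supplied."""
    if step_time_s <= 0:
        raise ValueError("step_time_s must be positive")
    return global_batch_size * seq_len / step_time_s


def linearity(base_throughput: float, base_devices: int,
              scaled_throughput: float, scaled_devices: int) -> float:
    """Measured speedup divided by ideal speedup (assumed definition)."""
    ideal_speedup = scaled_devices / base_devices
    measured_speedup = scaled_throughput / base_throughput
    return measured_speedup / ideal_speedup


# Doubling the batch size makes each step slower (2.0 s -> 3.2 s),
# yet more samples are processed per second.
before = throughput(global_batch_size=64, step_time_s=2.0)
after = throughput(global_batch_size=128, step_time_s=3.2)
print(f"throughput before: {before:.1f} samples/s, after: {after:.1f} samples/s")

# Scaling from 8 to 64 devices with hypothetical throughput figures.
print(f"linearity: {linearity(1000.0, 8, 7200.0, 64):.2f}")
```

In this example the single-step time gets worse while throughput improves, which is why a step-time target would be misleading once the batch size changes.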

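One way to keep the collected information consistent across reports is to capture the rows of Table 1 in a small record. The sketch below is illustrative only: the class name, the field names, and the parallelism keys are hypothetical, and the CANN version is left as a manually entered field because this sketch does not assume any particular way of querying it. Python package versions are read with the standard library `importlib.metadata`.

```python
from dataclasses import dataclass, field, asdict
from importlib import metadata
from typing import Dict, List, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


@dataclass
class ProblemReport:
    # Basic information
    model_type: str                      # e.g. "Llama-like", "GPT-like", "MoE-like"
    device_count: int                    # number of cards/devices
    parallelism: Dict[str, int]          # hypothetical keys, e.g. {"tp": 2, "dp": 4}
    cann_version: Optional[str] = None   # entered by hand in this sketch
    framework_versions: Dict[str, Optional[str]] = field(default_factory=dict)
    # Key issue description
    issue_scenario: str = ""
    current_metrics: Dict[str, float] = field(default_factory=dict)
    # Tuning objectives
    tuning_objective: str = ""
    objective_source: str = ""           # e.g. "competitor benchmarking"

    def missing_fields(self) -> List[str]:
        """Names of fields that are still empty, in declaration order."""
        return [name for name, value in asdict(self).items()
                if value is None or value == "" or value == {}]


report = ProblemReport(
    model_type="Llama-like",
    device_count=8,
    parallelism={"tp": 2, "dp": 4},
    framework_versions={pkg: installed_version(pkg) for pkg in ("torch", "mindspore")},
)
print("Still to collect:", report.missing_fields())
```

Listing the empty fields before troubleshooting starts makes it easier to spot which rows of the template have not been filled in yet.
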