Problem Information Collection
Before troubleshooting, collect accurate fault information. For details, see Table 1.
| Type | Main Information | Description |
|---|---|---|
| Basic information | Model type | Model structure (for example, Llama-like, GPT-like, or MoE-like). |
| | Operation scale | Number of cards and devices. |
| | Parallelism strategies | Specific parallel parameter configuration. |
| | Framework and version | |
| Key issue description | Issue scenarios | During training or inference, the model performance does not meet the expected standards, is lower than that of competing products, or is otherwise abnormal. |
| | Current performance metrics | Clarify the current performance problem. For details about the priorities of computing performance metrics, see the PyTorch Training Model Porting and Tuning Guide. |
| Tuning objectives | Performance tuning objectives | Specify the tuning objective and its source, for example, competitor benchmarking or linear scalability calculations. NOTE: If the tuning involves methods such as increasing the batch size, do not use metrics such as the single-step time. Select proper metrics by referring to the PyTorch Training Model Porting and Tuning Guide. |
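
The note on batch size and the mention of linear scalability calculations can be illustrated with a small calculation. The following is a minimal sketch, not a method prescribed by the guide: the function names and all numbers are hypothetical, and tokens per second is assumed here as one common throughput metric. It shows why single-step time is misleading when the batch size changes, and how a linear-scaling objective might be derived from a smaller baseline run.

```python
def tokens_per_second(global_batch_size: int, seq_length: int, step_time_s: float) -> float:
    """Throughput of one training step, in tokens per second."""
    if step_time_s <= 0:
        raise ValueError("step_time_s must be positive")
    return global_batch_size * seq_length / step_time_s


def linear_scaling_target(base_throughput: float, base_devices: int, target_devices: int) -> float:
    """Throughput expected on target_devices if scaling were perfectly linear."""
    if base_devices <= 0 or target_devices <= 0:
        raise ValueError("device counts must be positive")
    return base_throughput * target_devices / base_devices


def scaling_efficiency(measured_throughput: float, ideal_throughput: float) -> float:
    """Ratio of measured throughput to the ideal linear-scaling throughput."""
    return measured_throughput / ideal_throughput


# Hypothetical measurements: doubling the batch size raises the step time
# from 8 s to 12 s, yet throughput still improves.
before = tokens_per_second(global_batch_size=64, seq_length=4096, step_time_s=8.0)
after = tokens_per_second(global_batch_size=128, seq_length=4096, step_time_s=12.0)
print(f"throughput before: {before:.1f} tokens/s, after: {after:.1f} tokens/s")

# Hypothetical objective: scale the 8-device baseline linearly to 64 devices,
# then compare an assumed measured value against that target.
target = linear_scaling_target(before, base_devices=8, target_devices=64)
measured = 209715.2  # assumed measurement on 64 devices
print(f"linear target: {target:.1f} tokens/s, "
      f"efficiency: {scaling_efficiency(measured, target):.2f}")
```

Recording the objective as a throughput figure together with its source (here, a linear extrapolation from the 8-device run) keeps the target comparable across runs that use different batch sizes.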