Loss Inconsistency in Long-Term Stable Training

Before using the tool to locate the fault, rule out the configuration item problems described in Checking the Environment and the randomness problems described in Reproducing an Issue.

Case: After a search model is converted from FP32 to BF16, the loss difference is small in the early stage of training but becomes large in the later stage.

Figure 1 Abnormal loss values

Locating method: Because the abnormal loss appears only after a large number of steps and a full dump would produce a large volume of data, the monitor tool (training status monitoring) is used first to collect data.

Check grad_norm. Its trend is consistent with that of the loss.
Because abnormal gradients can cause sudden shifts in the loss, gradient data is collected with the following configuration. The content of monitor_config.json is as follows:
{
    "targets": {},
    "wg_distribution": true,
    "format": "csv",
    "ops": [
        "norm",
        "mean",
        "min",
        "max"
    ]
}
The insertion method in the code is as follows:

Figure 2 Inserting the monitor tool in the code
After the collection, the grad_unreduced-xx-xx.csv and grad_reduced-xx-xx.csv files are generated on each card, where xx indicates the number of steps.
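When many such files accumulate, it can help to pick out the ones that cover a particular step. The following is a minimal sketch, not part of the tool. It assumes that the two xx fields in each file name are integer start and end step numbers. The helper name files_covering_step is hypothetical.

```python
import re

# Assumed naming scheme: grad_<kind>-<start>-<end>.csv, where <start> and
# <end> are integer step numbers. Adjust the pattern if the actual files differ.
_GRAD_FILE = re.compile(r"^grad_(unreduced|reduced)-(\d+)-(\d+)\.csv$")


def files_covering_step(filenames, step, kind="unreduced"):
    """Return the file names whose assumed step range contains `step`."""
    selected = []
    for name in filenames:
        match = _GRAD_FILE.match(name)
        if match is None:
            continue  # not a gradient file under the assumed naming scheme
        file_kind = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if file_kind == kind and start <= step <= end:
            selected.append(name)
    return sorted(selected)
```

The list of names would typically come from os.listdir() on the output directory of each card.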
View the gradient data of each layer before reduction at the point after step 360 where the loss starts to rise. The result is as follows:

Figure 3 Abnormal training gradient data collected by monitor

The horizontal axis lists the layers in reverse order, with the output layer on the left and the embedding layer on the right. In the abnormal (BF16) run, the gradient norm is high near the embedding layer, whereas the FP32 gradient remains steady at the same position.

Figure 4 Normal training gradient data collected by monitor

Therefore, it is suspected that the gradient of the embedding layer is numerically less stable in BF16 than in FP32.
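The visual comparison of the two figures can also be done numerically. The following is a minimal sketch that takes the per-layer gradient norms of the BF16 and FP32 runs at the same step and lists the layers where the BF16 norm is much larger. The function name, the threshold of 2.0, and the dictionary input format are assumptions made for illustration. Reading the norm values out of the CSV files is omitted because the column layout is not shown here.

```python
def flag_divergent_layers(bf16_norms, fp32_norms, ratio_threshold=2.0, eps=1e-12):
    """Return (layer, ratio) pairs where the BF16 norm is at least
    `ratio_threshold` times the FP32 norm, largest ratio first.

    Both inputs map a layer name to its gradient norm at the same step.
    """
    flagged = []
    for layer, bf16_value in bf16_norms.items():
        if layer not in fp32_norms:
            continue  # compare only layers present in both runs
        # eps avoids division by zero when the FP32 norm is exactly 0
        ratio = bf16_value / (fp32_norms[layer] + eps)
        if ratio >= ratio_threshold:
            flagged.append((layer, ratio))
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged
```

A layer that appears at the top of the returned list at the step where the loss starts to rise is a candidate for closer inspection.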

Solution: Apply gradient clipping to the gradient of the embedding layer.
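Which clipping method was used in this case is not stated. The following is a minimal sketch of the common clip-by-norm rule, written in plain Python on a flat list of gradient values so that the arithmetic is visible. The function name and the max_norm value are illustrative assumptions. In a PyTorch training script, the same rule is usually applied by calling torch.nn.utils.clip_grad_norm_ on the parameters of the embedding layer only.

```python
import math


def clip_by_norm(grad, max_norm, eps=1e-6):
    """Scale `grad` so that its L2 norm does not exceed `max_norm`.

    Returns the clipped gradient and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grad))
    # The scale is capped at 1.0, so gradients already within the limit
    # are left unchanged. eps guards against division by zero.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grad], total_norm
```

Clipping only the embedding layer leaves the gradients of the other layers untouched, which matches the observation that the abnormal norm is concentrated near the embedding layer.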

Result: The loss converges properly.
