Special Cases

Case 1: NaN occurs in a multimodal model.

With stream synchronization disabled, NaN values appear during training. With stream synchronization enabled, the NaN values disappear and the model converges normally. The cause is memory corruption during concurrent computation, so enable stream synchronization for this kind of model.
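One way to confirm this diagnosis is to compare the loss traces of two otherwise identical runs, one with stream synchronization and one without. The sketch below is only an illustration: the helper name first_non_finite_step and the loss values are hypothetical, and how the losses are collected from the training log is left open.

```python
import math
from typing import Optional, Sequence


def first_non_finite_step(losses: Sequence[float]) -> Optional[int]:
    """Return the index of the first NaN or inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical loss traces, made up for illustration only.
without_sync = [2.31, 2.07, float("nan"), float("nan")]
with_sync = [2.31, 2.07, 1.92, 1.80]

print(first_non_finite_step(without_sync))  # 2
print(first_non_finite_step(with_sync))     # None
```

If only the run without synchronization reports a step index, the behavior matches Case 1.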

Case 2: NaN occurs in an MoE model.

Remove the overlap-param-gather option from the training arguments, then rerun training.
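When the launch arguments are assembled in a script, the removal in Case 2 can be sketched as a small filter. This assumes the option is passed as a valueless switch spelled --overlap-param-gather, as in Megatron-style command lines; the helper name and the other arguments in the list are illustrative only.

```python
from typing import List

# Assumed spelling of the option: a boolean switch that takes no value.
FLAG = "--overlap-param-gather"


def drop_overlap_param_gather(args: List[str]) -> List[str]:
    """Return a copy of the argument list without the overlap-param-gather switch."""
    return [arg for arg in args if arg != FLAG]


# Illustrative launch arguments, not taken from a real configuration.
launch_args = [
    "--use-distributed-optimizer",
    "--overlap-grad-reduce",
    "--overlap-param-gather",
    "--bf16",
]

print(drop_overlap_param_gather(launch_args))
```

The filter leaves every other argument in its original order, so the rest of the launch configuration is unchanged.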