Process-Level Online Recovery Fails and "There is unsafe data in the input tensor" Is Reported

Symptom

A fault occurs in a training job for which process-level online recovery (step-level recomputation recovery) is enabled. The recovery completes and training resumes, but the error "There is unsafe data in the input tensor" is reported before the first iteration finishes. As a result, the recovery fails.

Figure 1 Recovery failure

Cause Analysis

When a fault occurs, the tensors related to the fault are marked as unsafe data and are no longer trusted. If such a tensor is a global variable, you must rebuild and repair it. When a tensor marked as unsafe data is accessed during computation, the error "There is unsafe data in the input tensor" is reported. Use the error stack to locate the tensor object that was accessed.
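The lookup in the error stack can be sketched as follows. This is a minimal illustration, assuming the error surfaces in Python as a RuntimeError whose message contains "unsafe data"; read_tracker and train_step are hypothetical stand-ins for framework code and are not part of MindIO TTP.

```python
import traceback

UNSAFE_MARKER = "unsafe data"  # substring of the reported error message


def unsafe_access_frames(exc):
    """Return 'file:line in function' per stack frame if exc is an unsafe-data error."""
    if UNSAFE_MARKER not in str(exc):
        return []
    frames = traceback.extract_tb(exc.__traceback__)
    return [f"{f.filename}:{f.lineno} in {f.name}" for f in frames]


def read_tracker():
    # Hypothetical stand-in for the code that touches the unsafe tensor.
    raise RuntimeError("There is unsafe data in the input tensor")


def train_step():
    # Hypothetical stand-in for one training iteration.
    read_tracker()


try:
    train_step()
except RuntimeError as err:
    frames = unsafe_access_frames(err)
    for line in frames:
        print(line)
```

The last printed frame is the innermost call, which is usually the closest to the tensor access and the best starting point for identifying the global variable involved.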

Solution

After locating the accessed tensor object, first determine how it affects the training process.

Scenario 1: If the tensor does not depend on the training iteration, re-initialize and release it in the rollback phase.
Scenario 2: If the tensor depends on the training iteration and its dependency is consistent with the replica mapping of the optimizer, rebuild the tensor in the repair phase and repair its data through point-to-point communication.

During process-level rescheduling and process-level online recovery, valid replicas are searched for and combined into a complete set of optimizer state data. If the training cluster has too many faults and a complete set of data cannot be assembled, rescheduling cannot proceed.
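The completeness condition can be illustrated with a small check. This is only a sketch under an assumed layout: replica_groups, the rank numbers, and the function name are hypothetical, and the actual replica search in the fault repair framework is not described here.

```python
def missing_shards(replica_groups, failed_ranks):
    """Return indices of shards for which every replica rank has failed."""
    failed = set(failed_ranks)
    return [
        shard
        for shard, ranks in enumerate(replica_groups)
        if all(rank in failed for rank in ranks)
    ]


# Hypothetical layout: 4 optimizer-state shards, each held by 2 ranks.
replica_groups = [[0, 4], [1, 5], [2, 6], [3, 7]]

# Every shard still has one healthy replica, so a complete set can be assembled.
print(missing_shards(replica_groups, failed_ranks=[1, 4]))  # -> []

# Both holders of shard 1 have failed, so the data set is incomplete.
print(missing_shards(replica_groups, failed_ranks=[1, 5]))  # -> [1]
```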

In addition, do not access the tensor object before it is rebuilt or reinitialized. Global tensors that are marked as unsafe data and left unrepaired cause the repair to fail, so you are advised to check the global tensors used in the training framework and ensure that the fault repair framework tracks and correctly repairs them.
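One way to start such a check is to list the tensors reachable from a module's global namespace. The helper below is a hypothetical sketch, not part of the fault repair framework; FakeTensor and the demo dictionary are made-up stand-ins so that the example runs without a device.

```python
def find_global_tensors(namespace, tensor_type, max_depth=2):
    """List paths of tensor_type objects reachable from a namespace dict."""
    found = []

    def visit(path, value, depth):
        if isinstance(value, tensor_type):
            found.append(path)
        elif depth < max_depth and isinstance(value, dict):
            for key, item in value.items():
                visit(f"{path}[{key!r}]", item, depth + 1)
        elif depth < max_depth and isinstance(value, (list, tuple)):
            for index, item in enumerate(value):
                visit(f"{path}[{index}]", item, depth + 1)

    for name, value in namespace.items():
        visit(name, value, 0)
    return sorted(found)


class FakeTensor:
    """Stand-in type for the demo; pass the real tensor class in practice."""


# Made-up module globals: one nested tracker, one plain tensor, one scalar.
demo_globals = {
    "_TRACKER": {"aux_loss": {"values": FakeTensor()}},
    "global_tensor": FakeTensor(),
    "learning_rate": 0.1,
}
report = find_global_tensors(demo_globals, FakeTensor)
print(report)
```

In a real job the call would look like find_global_tensors(vars(some_module), torch.Tensor), where some_module is a module of the training framework. The scan only reports candidates; whether each one needs the Scenario 1 or Scenario 2 treatment still has to be decided from the code context.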

Handling Cases

Solution to scenario 1
As shown in Figure 1 Recovery failure, in the Megatron framework, when TensorBoard is used to record training logs and the MoE feature is enabled, a tensor is created in _MOE_AUX_LOSSES_LOGGING_TRACKER to record the loss data that TensorBoard reads. If a fault causes the tensor in _MOE_AUX_LOSSES_LOGGING_TRACKER to be marked as unsafe data, an error is reported the next time the tensor is accessed. You can locate the tensor from the call stack.

According to the code context, the tensor is used only to record the loss value, which is reset to 0 after each iteration. Therefore, this global variable is reset to an empty dictionary in the rollback phase, and the framework then reuses it in subsequent training. For details, see the feature_rollback function in mindio_ttp/adaptor/modellink_adaptor.py.

Solution to scenario 2
Assume that the name of the user-defined global variable tensor is global_tensor. Add the rebuilding and receiving logic to the end of the recv_rank_repair function in mindio_ttp/adaptor/modellink_adaptor.py so that the faulty card can receive the data saved by the replica card.

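# size and dtype stand for the shape and data type of global_tensor on the replica card.
recv_tensor = torch.empty(size, dtype=dtype, device="npu")
# Receive the data sent by the replica card (src_rank).
torch.distributed.recv(recv_tensor, src=src_rank, group=repair_group)
# Copy the received data into the global tensor in place.
global_tensor.data.copy_(recv_tensor)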

Add the sending logic to the end of the send_rank_repair function in mindio_ttp/adaptor/modellink_adaptor.py to send the global tensor data to the faulty card.

torch.distributed.send(global_tensor, dst=dest_rank, group=repair_group)
