EI0012 Execution_Error_SDMA
Symptom
SDMA memory copy task exception occurred. Remote rank: [%s]. Base information: [%s]. Task information: [%s]. Communicator information: [%s].
Possible Cause
1. A network connection exception occurred during SDMA task execution.
2. The peer process exited abnormally.
3. The input or output memory address is not allocated, the actual allocated size is smaller than the input data size, or the memory is freed before the operator execution is complete.
Solution
1. Check whether the network link was abnormal during execution.
2. Check whether any process in the cluster exited before the error was reported. If so, locate the cause of that process exit.
3. Check whether the input/output memory size is correct and whether the memory or communicator is released prematurely.
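
The memory check in step 3 can be sketched as a pre-flight validation run before the communication operator is launched. This is a minimal illustration only, not part of HCCL: BufferInfo, required_bytes, and validate_buffer are hypothetical names, and the code assumes the caller tracks each buffer's allocated size and release state itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BufferInfo:
    """Hypothetical bookkeeping record for one device buffer."""
    name: str
    allocated_bytes: int      # size actually allocated; 0 means not allocated
    released: bool = False    # True once the buffer has been freed


def required_bytes(count: int, dtype_size: int) -> int:
    """Bytes needed to hold `count` elements of `dtype_size` bytes each."""
    return count * dtype_size


def validate_buffer(buf: BufferInfo, count: int, dtype_size: int) -> List[str]:
    """Return a list of problems that match possible cause 3; empty if none."""
    problems = []
    needed = required_bytes(count, dtype_size)
    if buf.released:
        problems.append(f"{buf.name}: buffer was released before the operator completed")
    if buf.allocated_bytes <= 0:
        problems.append(f"{buf.name}: buffer is not allocated")
    elif buf.allocated_bytes < needed:
        problems.append(
            f"{buf.name}: allocated size {buf.allocated_bytes} is smaller than "
            f"required size {needed}"
        )
    return problems


if __name__ == "__main__":
    # 256 four-byte elements need 1024 bytes; the output buffer is too small.
    send = BufferInfo("input", allocated_bytes=1024)
    recv = BufferInfo("output", allocated_bytes=512)
    for buf in (send, recv):
        for problem in validate_buffer(buf, count=256, dtype_size=4):
            print(problem)
```

A check of this kind only catches size and lifetime mistakes on the local rank. It does not cover causes 1 and 2, which still require inspecting the network link and the peer process.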