HCCL_DFS_CONFIG
Description
HCCL provides multiple fault detection functions, including a link setup fault detection time setting, a cluster heartbeat monitoring switch, and a process suspension detection switch. After these detection functions are enabled, fault information can be quickly located and displayed when a service exception occurs, helping you rectify the fault in a timely manner.
This environment variable has the following four configuration items:
- connection_fault_detection_time: link setup fault detection time.
When link setup times out, HCCL starts locating the root node where the link setup failed and propagates the information about that root node. The entire process takes the time specified by connection_fault_detection_time plus the 10 seconds required for propagating the root node information.
The value of connection_fault_detection_time can be 0 or an integer in the range [20, 7200]. The unit is seconds and the default value is 20.
If this parameter is set to 0, the link setup fault detection function is disabled. That is, when a connection fails to be set up, there is no extra waiting time and the link setup process exits immediately.
- cluster_heartbeat: cluster heartbeat monitoring function, which is used to propagate fault information and record the fault root node information in run logs if the execution of a communication operation times out.
This parameter can be set to on (indicating to enable the heartbeat monitoring function) or off (indicating to disable the heartbeat monitoring function). The default value is on.
Note: After the cluster heartbeat monitoring function is disabled, communication operation execution timeouts cannot be detected, the cluster fault propagation capability is lost, and the root node fault information is not recorded in run logs.
- stuck_detection: process suspension detection function.
The value can be on (indicating to enable the process suspension detection function) or off (indicating to disable the process suspension detection function). The default value is on.
In scenarios that are sensitive to communication performance, you can use this parameter to disable the process suspension detection function. However, after the process suspension detection function is disabled, service suspension faults are not proactively detected and reported.
- inconsistent_check: operator delivery inconsistency detection function.
The value can be on (indicating to enable the operator delivery inconsistency detection function) or off (indicating to disable the operator delivery inconsistency detection function). The default value is off.
You can use this parameter to enable the operator delivery inconsistency detection function, but performance will deteriorate to some extent. Note that this function is disabled by default, in which case the system does not proactively detect and record operator delivery inconsistency issues.
Note: This function does not support the HcclBatchSendRecv operator or graph mode scenarios. After this function is enabled, a data cache is generated, which occupies host memory.
This detection function is used only to assist in locating cluster fault points. In some complex scenarios, the fault points may not be the root cause of cluster service failures. Determine the location of the root node based on the generation time of the detection event and the error reported by the detected node.
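The value rules for connection_fault_detection_time described above can be checked before a job is launched. The following is a minimal sketch under stated assumptions: the helper names is_valid_detection_time and total_locating_time are hypothetical and are not part of HCCL. The sketch only mirrors the documented range (0, or 20 to 7200) and the documented 10-second propagation time.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCCL): checks a candidate value for
# connection_fault_detection_time against the documented rules.
is_valid_detection_time() {
    case "$1" in
        # Reject empty input, non-digits, and leading zeros such as "030".
        ''|*[!0-9]*|0?*) return 1 ;;
    esac
    # 0 disables the link setup fault detection function.
    [ "$1" -eq 0 ] && return 0
    # Otherwise the value must lie in the range [20, 7200] seconds.
    [ "$1" -ge 20 ] && [ "$1" -le 7200 ]
}

# Hypothetical helper: prints the time in seconds spent locating and
# propagating the root node after a link setup timeout. Call it only with
# a value accepted by is_valid_detection_time.
total_locating_time() {
    if [ "$1" -eq 0 ]; then
        # Detection disabled: no extra waiting time.
        echo 0
    else
        # Configured detection time plus 10 seconds for propagation.
        echo $(( $1 + 10 ))
    fi
}
```

For example, with connection_fault_detection_time set to 30, total_locating_time 30 prints 40.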
Example
export HCCL_DFS_CONFIG="connection_fault_detection_time:30,cluster_heartbeat:on,stuck_detection:on,inconsistent_check:off"
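As a further illustration, a job that is sensitive to communication performance might disable stuck_detection while leaving the other items at their default values. The sketch below assembles the string from four values with a hypothetical helper, build_hccl_dfs_config, which is not part of HCCL. All four items are listed explicitly because this section does not state whether items may be omitted.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCCL): assembles the HCCL_DFS_CONFIG value.
# Arguments, in order: connection_fault_detection_time, cluster_heartbeat,
# stuck_detection, inconsistent_check.
build_hccl_dfs_config() {
    printf 'connection_fault_detection_time:%s,cluster_heartbeat:%s,stuck_detection:%s,inconsistent_check:%s' \
        "$1" "$2" "$3" "$4"
}

# Defaults for every item except stuck_detection, which is turned off here.
HCCL_DFS_CONFIG="$(build_hccl_dfs_config 20 on off off)"
export HCCL_DFS_CONFIG
```

With stuck_detection set to off, service suspension faults are no longer proactively detected and reported, so this trade-off is best limited to scenarios where the performance gain matters.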
Restrictions
None