HCCL_DETERMINISTIC
Description
Enables or disables deterministic computing or order preserving for reduction communication operators, including AllReduce, ReduceScatter, ReduceScatterV, and Reduce. Order preserving is a stricter form of deterministic computing: in addition to producing deterministic results, it keeps the reduction order consistent.
When deterministic computing or order preserving is enabled for a reduction operator, executing the operator multiple times with the same hardware and input produces the same output. The variable accepts the following values:
- false (default): disables deterministic computing.
- true: enables deterministic computing for reduction communication operators.
  - For Atlas A2 training products/Atlas A2 inference products, the AllReduce, ReduceScatter, ReduceScatterV, and Reduce communication operators are supported.
  - For Atlas A3 training products/Atlas A3 inference products, if the orchestration expansion position of the algorithm is the AI CPU, all reduction operators use deterministic computing and are not affected by this environment variable. If the orchestration expansion position of the algorithm is the vector core, only the AllReduce and ReduceScatter communication operators involve non-deterministic computing. Setting this environment variable to true makes them use deterministic computing.
- strict: enables strict deterministic computing for a reduction communication operator, that is, the order-preserving function (ensuring that the reduction sequence of all bits is consistent on the basis of determinism). To set this environment variable to this value, the following conditions must be met:
  - This value is supported only for Atlas A2 training products/Atlas A2 inference products in the symmetric multi-device distribution scenario. It is not supported in the asymmetric distribution scenario (that is, an asymmetric number of devices).
  - For Atlas A3 training products/Atlas A3 inference products, in single-operator mode, setting this variable to strict has the same effect as setting it to true. Static graph mode does not support strict.
  - The AllReduce, ReduceScatter, and ReduceScatterV communication operators are supported.
  - When order preserving is enabled, saturation mode is not supported; only INF/NaN mode is supported.
  - Order preserving performs worse than deterministic computing. You are advised to use this function in inference scenarios.
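The text above does not say why the reduction order affects results. The usual reason is that floating-point addition is not associative, so summing the same values in a different order can change the low-order bits. The following sketch illustrates this with awk in double precision. It is a generic illustration rather than HCCL behavior, and the values are arbitrary.

```shell
# Sum the same three values in two different orders (double precision).
left=$(awk 'BEGIN { printf "%.17g", (0.1 + 0.2) + 0.3 }')
right=$(awk 'BEGIN { printf "%.17g", 0.1 + (0.2 + 0.3) }')

echo "(0.1 + 0.2) + 0.3 = $left"
echo "0.1 + (0.2 + 0.3) = $right"
```

The two printed sums differ in their last digits. Order preserving avoids this class of difference by keeping the reduction order fixed.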
Generally, deterministic computing or order preserving does not need to be enabled for reduction operators. If a model produces different results across repeated runs, or if precision needs to be tuned, you can enable deterministic computing or order preserving to assist model debugging and tuning. Note that enabling either function slows down operator execution and degrades performance.
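One way to check whether results differ across runs is to run the same command twice and compare digests of its output. The following is a minimal sketch. The helper name `runs_match` is hypothetical, and `echo 0.25` is a placeholder for a command that prints your model's outputs. It assumes `sha256sum` is available, as on most Linux systems.

```shell
# Hypothetical helper: run a command twice and compare SHA-256 digests of its stdout.
runs_match() {
    rm_first=$("$@" | sha256sum | cut -d' ' -f1)
    rm_second=$("$@" | sha256sum | cut -d' ' -f1)
    [ "$rm_first" = "$rm_second" ]
}

export HCCL_DETERMINISTIC=true

# Placeholder command: replace `echo 0.25` with one that prints your model's outputs.
if runs_match echo 0.25; then
    echo "outputs identical across two runs"
else
    echo "outputs differ across two runs"
fi
```

A mismatch shows that the output is not reproducible. Two matching runs are only weak evidence of determinism, so repeat the check more times if it matters.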
Example
export HCCL_DETERMINISTIC=true
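Launch scripts can guard against typos by checking the value before exporting it. The following is a minimal sketch. The helper name `set_hccl_deterministic` is hypothetical and not part of HCCL, and it accepts only the three lowercase values documented above. How HCCL itself treats other spellings is not covered here.

```shell
# Hypothetical helper: export HCCL_DETERMINISTIC only for a documented value.
set_hccl_deterministic() {
    case "$1" in
        false|true|strict)
            export HCCL_DETERMINISTIC="$1"
            ;;
        *)
            echo "unsupported HCCL_DETERMINISTIC value: $1" >&2
            return 1
            ;;
    esac
}

# Enable order preserving for processes started from this shell.
set_hccl_deterministic strict
echo "HCCL_DETERMINISTIC=${HCCL_DETERMINISTIC}"
```

If the helper rejects a value, the variable keeps whatever setting it had before the call.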
Restrictions
If you call the HCCL C APIs to initialize a communicator with specific configurations and set the deterministic computing function using the hcclDeterministic parameter of HcclCommConfig, the configuration of the communicator takes precedence.