HCCL_DETERMINISTIC

Description

Enables or disables deterministic computing or order preserving for reduction communication operators, including AllReduce, ReduceScatter, ReduceScatterV, and Reduce. Order preserving is a stricter form of deterministic computing: in addition to producing deterministic results, it keeps the reduction order consistent.

When deterministic computing or order preserving is enabled for a reduction operator, executing the operator multiple times on the same hardware with the same input produces the same output.

HCCL_DETERMINISTIC supports the following values:
  • false (default): disables deterministic computing.
  • true: enables deterministic computing for reduction communication operators.
    • For Atlas A2 training products / Atlas A2 inference products, the AllReduce, ReduceScatter, ReduceScatterV, and Reduce communication operators are supported.
    • For Atlas A3 training products / Atlas A3 inference products, the behavior depends on the orchestration expansion position of the algorithm. If it is the AI CPU, all reduction operators already use deterministic computing and are not affected by this environment variable. If it is the vector core, only the AllReduce and ReduceScatter communication operators involve non-deterministic computing; setting this environment variable to true enables deterministic computing for them.
  • strict: enables strict deterministic computing, that is, order preserving, for reduction communication operators. On top of determinism, order preserving ensures that the reduction order of all bits is consistent. The following conditions apply to this value:
    • For Atlas A2 training products / Atlas A2 inference products, this value is supported only in the symmetric multi-device distribution scenario. It is not supported in the asymmetric distribution scenario (that is, with an asymmetric number of devices).
    • For Atlas A3 training products / Atlas A3 inference products, setting this environment variable to strict in single-operator mode has the same effect as setting it to true. Static graph mode does not support strict.
    • The AllReduce, ReduceScatter, and ReduceScatterV communication operators are supported.
    • When order preserving is enabled, the saturation mode is not supported, and only the INF/NaN mode is supported.
    • Enabling order preserving degrades performance compared with deterministic computing. You are advised to use this function in inference scenarios.
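
Because only three values are listed above, a launch script can check the value before exporting it. The following is a minimal sketch: set_hccl_deterministic is a hypothetical helper, not part of HCCL, and it assumes the lowercase spellings listed above. It does not check the product or mode conditions that apply to strict.

```shell
# Hypothetical helper (not provided by HCCL): export HCCL_DETERMINISTIC only
# if the requested value is one of the three documented values.
set_hccl_deterministic() {
    case "$1" in
        false|true|strict)
            export HCCL_DETERMINISTIC="$1"
            ;;
        *)
            # Reject anything else and leave the current setting untouched.
            echo "unsupported HCCL_DETERMINISTIC value: $1" >&2
            return 1
            ;;
    esac
}

# Request order preserving for the processes started from this shell.
set_hccl_deterministic strict
```

Rejecting unknown values early avoids launching a job with a misspelled setting whose effect is not described here.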

Generally, deterministic computing or order preserving does not need to be enabled for reduction operators. If a model produces different results across repeated executions, or if its precision needs to be tuned, you can enable either function to assist model debugging and tuning. Note that enabling it slows down operator execution and degrades performance.
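
When debugging run-to-run differences, one simple check is to run the same job twice with deterministic computing enabled and compare the outputs. The sketch below assumes that the job writes its results to standard output; outputs_match is a hypothetical helper, and the echo command is only a placeholder for a real launch command.

```shell
# Hypothetical helper: run the given command twice with
# HCCL_DETERMINISTIC=true and compare checksums of its standard output.
outputs_match() {
    first=$(env HCCL_DETERMINISTIC=true "$@" | cksum)
    second=$(env HCCL_DETERMINISTIC=true "$@" | cksum)
    [ "$first" = "$second" ]
}

# Placeholder command; substitute the real launch command of the job.
if outputs_match echo "placeholder workload"; then
    echo "outputs match"
else
    echo "outputs differ"
fi
```

If the outputs still differ with the variable set, the remaining variation probably comes from somewhere other than the reduction communication operators.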

Example

export HCCL_DETERMINISTIC=true
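
An exported variable affects every process started later from the same shell. To limit the setting to a single job, standard shell syntax allows the assignment to prefix the command. In this sketch the sh -c command is only a stand-in for a real launch command, and strict is subject to the conditions listed in the description.

```shell
# Apply the setting to one command only; the environment of the calling
# shell is not modified by this form.
HCCL_DETERMINISTIC=strict sh -c 'echo "job sees: $HCCL_DETERMINISTIC"'

# Show what the current shell itself would pass on to later commands.
echo "shell has: ${HCCL_DETERMINISTIC:-unset}"
```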

Restrictions

If you call the HCCL C APIs to initialize a communicator with specific configurations and set deterministic computing through the hcclDeterministic parameter of HcclCommConfig, the communicator configuration takes precedence over this environment variable.

Applicability

Atlas A2 training products / Atlas A2 inference products (only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported)

Atlas A3 training products / Atlas A3 inference products