HcclCommConfig

Description

Defines the configurations (including the buffer size, deterministic computing switch, and communicator name) of a communicator during initialization.

Prototype

const uint32_t HCCL_COMM_CONFIG_INFO_BYTES = 24;
const uint32_t COMM_NAME_MAX_LENGTH = 128;
const uint32_t BUFFER_NAME_MAX_LENGTH = 128;
const uint32_t UDI_MAX_LENGTH = 128;
const uint32_t HCCL_COMM_ALGO_MAX_LENGTH = 1600;
const uint32_t HCCL_COMM_RETRY_ENABLE_MAX_LENGTH = 50;
const uint32_t HCCL_COMM_RETRY_PARAMS_MAX_LENGTH = 128;
typedef struct HcclCommConfigDef {
    char reserved[HCCL_COMM_CONFIG_INFO_BYTES];    /* Reserved field, which cannot be modified. */
    uint32_t hcclBufferSize;
    uint32_t hcclDeterministic;
    char hcclCommName[COMM_NAME_MAX_LENGTH];
    char hcclUdi[UDI_MAX_LENGTH];
    uint32_t hcclOpExpansionMode;
    uint32_t hcclRdmaTrafficClass;
    uint32_t hcclRdmaServiceLevel;
    uint32_t hcclWorldRankID;
    uint64_t hcclJobID;
    uint8_t aclGraphZeroCopyEnable;
    int32_t hcclExecTimeOut;
    char hcclAlgo[HCCL_COMM_ALGO_MAX_LENGTH];
    char hcclRetryEnable[HCCL_COMM_RETRY_ENABLE_MAX_LENGTH];
    char hcclRetryParams[HCCL_COMM_RETRY_PARAMS_MAX_LENGTH];
} HcclCommConfig;
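As a minimal sketch of how the fixed-size fields in this structure can be filled safely, the following example uses a local stand-in type rather than the HCCL header. DemoCommConfig, SetCommName, and FillDemoConfig are hypothetical names introduced for illustration only. The sketch also assumes that the 128-byte name field must hold a terminating NUL character, which this document does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Mirrors COMM_NAME_MAX_LENGTH from the prototype above.
constexpr uint32_t kCommNameMaxLength = 128;

// Hypothetical stand-in for a subset of HcclCommConfig so that the sketch
// compiles on its own. Real code should use the type from the HCCL header.
struct DemoCommConfig {
    uint32_t hcclBufferSize;     // in MB, must be >= 1
    uint32_t hcclDeterministic;  // 0, 1, or 2
    char hcclCommName[kCommNameMaxLength];
};

// Copies `name` into the fixed-size field. Returns false if the name does
// not fit together with its terminating NUL (assumption, see lead-in).
bool SetCommName(DemoCommConfig &cfg, const char *name) {
    const std::size_t len = std::strlen(name);
    if (len >= kCommNameMaxLength) {
        return false;
    }
    std::memcpy(cfg.hcclCommName, name, len + 1);
    return true;
}

// Zero-initializes the stand-in config and fills three fields.
// Returns false if a value is outside the documented range.
bool FillDemoConfig(DemoCommConfig &cfg, uint32_t bufferMb,
                    uint32_t deterministic, const char *name) {
    if (bufferMb < 1 || deterministic > 2) {
        return false;
    }
    cfg = DemoCommConfig{};
    cfg.hcclBufferSize = bufferMb;
    cfg.hcclDeterministic = deterministic;
    return SetCommName(cfg, name);
}
```

For example, FillDemoConfig(cfg, 200, 1, "comm_a") succeeds, while a buffer size of 0 or a name of 128 or more characters is rejected before anything is copied.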

Parameters

  • hcclBufferSize: size of the buffer for shared data, in MB. The value must be greater than or equal to 1.
  • hcclDeterministic: switch of deterministic computing. It is supported in the following products:
    • Atlas A3 training products / Atlas A3 inference products
    • Atlas A2 training products / Atlas A2 inference products
    The following table lists the parameter values and their description.
    Table 1 Values of the hcclDeterministic parameter

    Value

    Description

    0

    Disables deterministic computing. This is the default value.

    1

    Enables deterministic computing for reduction communication operators.

    2

    Enables strict deterministic computing for reduction communication operators, that is, the order-preserving function, which on top of determinism keeps the reduction order consistent for all bits. The following conditions apply to this value:
    • For the Atlas A3 training products / Atlas A3 inference products, in single-operator mode, setting this parameter to 2 has the same effect as setting it to 1. The static graph mode does not support setting this parameter to 2.
    • This value is supported only for the Atlas A2 training products / Atlas A2 inference products in the symmetric multi-server distribution scenario.
    • The AllReduce, ReduceScatter, and ReduceScatterV communication operators are supported.
    • When order preserving is enabled, the saturation mode is not supported, and only the INF/NaN mode is supported.
    • Compared with deterministic computing, enabling order preserving will cause performance deterioration. You are advised to use this function in inference scenarios.

    If deterministic computing is disabled, repeated executions may produce different results. This is generally caused by asynchronous multi-threaded execution in the operator implementation, which changes the accumulation order of floating-point numbers. When deterministic computing is enabled, an operator executed multiple times with the same hardware and input generates the same output.

    By default, neither deterministic computing nor order preserving needs to be enabled. If repeated model executions produce different results, or if you are tuning accuracy, you can enable either function to assist debugging and optimization. Note that enabling either function slows down operator execution and degrades performance.

  • hcclCommName: communicator name, with a maximum length of 128.

    The specified communicator name must be unique. If not specified, a name is generated by HCCL automatically.

  • hcclUdi: user-defined information, with a maximum length of 128. By default, this parameter is left empty.
  • hcclOpExpansionMode: location for expanding the communication algorithm orchestration, which is configured at the communicator granularity. It is supported in the following products:
    • Atlas A3 training products / Atlas A3 inference products
    • Atlas A2 training products / Atlas A2 inference products
    The following table lists the parameter values and their description.
    Table 2 Values of the hcclOpExpansionMode parameter

    Value

    Description

    0

    Uses the default algorithm orchestration expansion location.

    • For the Atlas A3 training products / Atlas A3 inference products, if this parameter is not set, the value of the environment variable HCCL_OP_EXPANSION_MODE is used. The default value of the environment variable is the AI CPU.
    • For the Atlas A2 training products / Atlas A2 inference products, if this parameter is not set, the value of the environment variable HCCL_OP_EXPANSION_MODE is used. The default value of the environment variable is the CPU on the host.

    1

    Uses the CPU on the host as the location for expanding the orchestration of the communication algorithm.
    • Atlas A3 training products / Atlas A3 inference products: This configuration is not supported.
    • Atlas A2 training products / Atlas A2 inference products: This configuration is supported.

    2

    Uses the AI CPU compute unit on the device as the location for expanding the orchestration of the communication algorithm.
    • Atlas A3 training products / Atlas A3 inference products: This configuration is not supported.
    • Atlas A2 training products / Atlas A2 inference products: This configuration is not supported.

    3

    Uses the Vector Core compute unit on the device as the location for expanding the orchestration of the communication algorithm. This configuration supports only the symmetric networking and inference features.

    In this configuration, if the data size does not meet the running requirements of the Vector Core, some operators are automatically switched to the default mode.
    • For the Atlas A3 training products / Atlas A3 inference products:
      • This configuration option supports only the Broadcast, AllReduce, ReduceScatter, AllGather, AlltoAll, AlltoAllV, and AlltoAllVC operators.
        • For the Broadcast operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only the single-operator mode with eight or fewer devices in the single-server scenario is supported.
        • For the AllReduce operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
        • For the ReduceScatter operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
        • For the AllGather, AlltoAll, AlltoAllV, and AlltoAllVC operators, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
      • For the AllReduce, ReduceScatter, AllGather, and AlltoAll (single-server communication) operators, when the data size exceeds a certain value, the system automatically switches to mode 2 (AI CPU) to prevent performance deterioration. The threshold is not fixed and may vary with factors such as the operator running mode, whether deterministic computing is enabled, and the network scale. For the AlltoAllV, AlltoAllVC, and AlltoAll (multi-server communication) operators, the system does not automatically switch to mode 2 (AI CPU). To prevent performance deterioration, you are advised to use mode 3 (AIV) when the maximum communication data size between any two ranks does not exceed 1 MB, and mode 2 (AI CPU) otherwise.
      • Under this configuration option, the collective communication supports the core control capability. You are advised to configure the number of vector cores based on the concurrency of compute operators and communication operators in actual application scenarios.
        • For the Broadcast operator, you are advised to allocate at least ranksize vector cores.
        • For the AllReduce, ReduceScatter, AllGather, AlltoAll, AlltoAllV, and AlltoAllVC operators, you are advised to allocate at least max(2, ranksize/20 + 1) vector cores.
    • For the Atlas A2 training products / Atlas A2 inference products:
      • This configuration option supports only the Broadcast, AllReduce, AlltoAll, AlltoAllV, AlltoAllVC, AllGather, ReduceScatter, AllGatherV, and ReduceScatterV operators.
        • For the Broadcast operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only the single-operator mode with eight or fewer devices in the single-server scenario is supported.
        • For the AllReduce operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min.
        • For the AlltoAll, AlltoAllV, and AlltoAllVC operators, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. For the AlltoAllV and AlltoAllVC operators, only single-server scenarios are supported. The graph mode of the AlltoAll operator supports only single-server scenarios.
        • For the AllGather operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. The graph mode of this operator supports only single-server scenarios.
        • For the ReduceScatter operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min. The graph mode of this operator supports only single-server scenarios.
        • For the AllGatherV operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only the single-operator mode is supported.
        • For the ReduceScatterV operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min.
      • Under this configuration option, the collective communication supports the core control capability. You are advised to configure the number of vector cores based on the concurrency of compute operators and communication operators in actual application scenarios.
        • For the AllReduce, ReduceScatter, and ReduceScatterV operators, you are advised to allocate at least 24 cores.
        • For the Broadcast, AlltoAll, AlltoAllV, AlltoAllVC, AllGather, and AllGatherV operators, you are advised to allocate at least 16 cores.

    4

    Uses the Vector Core compute unit on the device as the location for expanding the orchestration of the communication algorithm. This configuration supports only the symmetric networking and inference features. In this configuration, the mode does not change with the data size: the Vector Core is always used for computing. If the running conditions of the Vector Core are not met, an error is reported and the system exits.
    • This configuration option supports only the AllReduce, ReduceScatter, AllGather, AlltoAll, AlltoAllV, and AlltoAllVC operators.
    • For the Atlas A3 training products / Atlas A3 inference products:
      • For the AllReduce operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduction operation type can only be sum, max, or min. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
      • For the ReduceScatter operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduction operation type can only be sum, max, or min. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
      • For the AllGather, AlltoAll, AlltoAllV, and AlltoAllVC operators, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
    • For the Atlas A2 training products / Atlas A2 inference products:
      • For the AllReduce operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min.
      • For the ReduceScatter operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min. The graph mode of this operator supports only single-server scenarios.
      • For the AllGather operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. The graph mode of this operator supports only single-server scenarios.
      • For the AlltoAll, AlltoAllV, and AlltoAllVC operators, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. For the AlltoAllV and AlltoAllVC operators, only single-server scenarios are supported. The graph mode of the AlltoAll operator supports only single-server scenarios.
    • In this configuration, the collective communication supports the core control capability. The number of Vector Cores required by different operators of different product models is the same as that of configuration option 3.
    • In the multi-communicator parallel scenario, this parameter cannot be set to 3 or 4 (AIV Only mode) for multiple communicators at the same time.
    • For the Atlas A2 training products / Atlas A2 inference products, when the algorithm orchestration expansion location is set to 3 or 4 and hcclDeterministic is set to 1 (to enable deterministic computing), deterministic computing takes effect only for the AllReduce and ReduceScatter operators when the data size is less than or equal to 8 MB in the single-operator and graph modes of a single server. In other scenarios and for other operators, the hcclDeterministic configuration is used.
    • For the Atlas A2 training products / Atlas A2 inference products, if hcclDeterministic is set to 2 (to enable the order-preserving function), hcclOpExpansionMode cannot be set to 3 or 4. The order-preserving function is used.
    • For the Atlas A3 training products / Atlas A3 inference products, when the algorithm orchestration expansion location is set to 3 or 4 and hcclDeterministic is set to 1 (to enable deterministic computing) or 2 (to enable the order-preserving function), deterministic computing takes effect only for the AllReduce and ReduceScatter operators when the data size is less than 8 MB. In other scenarios and for other operators, the hcclDeterministic configuration is used.
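The Vector Core guidance given for option 3 on the Atlas A3 products can be worked through with a small helper. This is only a sketch: OpGroup and AdvisedMinVectorCoresA3 are hypothetical names, and the helper assumes that "ranksize/20" means integer (floor) division, which this document does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// The two operator groups that the A3 guidance distinguishes.
enum class OpGroup {
    kBroadcast,  // Broadcast
    kOther       // AllReduce, ReduceScatter, AllGather, AlltoAll, AlltoAllV, AlltoAllVC
};

// Minimum advised number of Vector Cores for option 3 on Atlas A3 products.
// Assumption: ranksize/20 is evaluated as integer (floor) division.
uint32_t AdvisedMinVectorCoresA3(OpGroup group, uint32_t rankSize) {
    if (group == OpGroup::kBroadcast) {
        return rankSize;  // at least ranksize cores
    }
    return std::max<uint32_t>(2, rankSize / 20 + 1);  // max(2, ranksize/20 + 1)
}
```

Under that assumption, a communicator with 8 ranks yields 2 cores for AllReduce and 8 for Broadcast, and a communicator with 40 ranks yields 3 cores for AllReduce.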
  • hcclRdmaTrafficClass: traffic class of the RDMA NIC. The value range is [0, 255], and the value must be an integral multiple of 4.

    In the RoCE v2 protocol, this parameter corresponds to the Type of Service (ToS) field in the IP packet header. The parameter consists of 8 bits in total: bits [0, 1] are fixed at 0, and bits [2, 7] represent the DSCP value (calculated by dividing this parameter value by 4).

    Note:

    0xFFFFFFFF serves as the priority judgment identifier. When this parameter is set to 0xFFFFFFFF, the communicator-level setting is ignored, and the environment variable configuration or, if that is not set, the default value 132 is used.
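The relationship between hcclRdmaTrafficClass and DSCP can be checked with a few lines of arithmetic. The helper names below are hypothetical and are not part of HCCL. They only restate the rules above: the value lies in [0, 255], its two low bits are 0, and the DSCP is the value divided by 4.

```cpp
#include <cassert>
#include <cstdint>

// True if `tc` is within [0, 255] and an integral multiple of 4.
bool IsValidTrafficClass(uint32_t tc) {
    return tc <= 255 && (tc % 4) == 0;
}

// DSCP occupies bits [2, 7], so it equals the traffic class divided by 4.
uint32_t DscpFromTrafficClass(uint32_t tc) {
    return tc >> 2;
}

// Inverse mapping: a 6-bit DSCP value shifted into bits [2, 7].
uint32_t TrafficClassFromDscp(uint32_t dscp) {
    return dscp << 2;
}
```

For the default value 132, the DSCP is 132 / 4 = 33. A value such as 133 is rejected because it is not a multiple of 4.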

  • hcclRdmaServiceLevel: service level (SL) of the RDMA NIC. The value must be the same as the PFC priority set for the NIC. Otherwise, performance may deteriorate.

    The value must be an unsigned integer ranging from 0 to 7.

    Note:

    0xFFFFFFFF serves as the priority judgment identifier. When this parameter is set to 0xFFFFFFFF, the communicator-level setting is ignored, and the environment variable configuration or, if that is not set, the default value 4 is used.

  • hcclWorldRankID: This parameter is used in the Network Scale Load Balance-Data Plane (NSLB-DP) scenario and indicates the global rank ID of the current process in the AI framework (such as PyTorch).
  • hcclJobID: This parameter is used in the NSLB-DP scenario and indicates the unique ID of the current distributed service, which is generated by the AI framework.
  • aclGraphZeroCopyEnable: This parameter is valid only for Reduce operators in graph capture mode (aclgraph) and is used to determine whether to enable the zero-copy function for these operators.
    • 0 (default): disables the zero-copy function.
    • 1: enables the zero-copy function.
  • hcclExecTimeOut: During distributed training or inference, tasks executed by different device processes may be inconsistent; for example, only specific processes may save checkpoint data. This parameter controls the synchronization wait time during task execution between devices: within the configured time, each device process waits for the other devices to perform communication synchronization. The unit is seconds. For details about the value range and restrictions for different product models, see the environment variable HCCL_EXEC_TIMEOUT.

    Note:

    0xFFFFFFFF serves as the priority judgment identifier. When this parameter is set to 0xFFFFFFFF, the communicator-level setting is ignored, and the environment variable configuration or, if that is not set, the default value 1836 is used.

  • hcclAlgo: configures the communication algorithms between servers and between supernodes for collective communication. The algorithms can be configured globally or by operator. Note that HCCL provides adaptive algorithm selection, which by default picks an appropriate algorithm based on the product form, data size, and number of servers, so manual configuration is usually unnecessary. Specifying the inter-server communication algorithm through this parameter disables adaptive algorithm selection.
    For details about the parameters and algorithm types supported by different product models, see the environment variable HCCL_ALGO. The configuration method is as follows:
    • Configuring the algorithm globally: hcclAlgo = "level0:NA;level1:<algo>;level2:<algo>". Example:
      hcclAlgo = "level0:NA;level1:H-D_R"
    • Configuring the algorithm by operator: hcclAlgo = "<op0>=level0:NA;level1:<algo0>;level2:<algo1>/<op1>=level0:NA;level1:<algo3>;level2:<algo4>". Example:
      # The AllReduce operator uses the Ring algorithm and the AllGather operator uses the RHD algorithm. Other operators automatically select a communication algorithm based on the product form, number of ranks, and data size.
      hcclAlgo = "allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:H-D_R"
  • hcclRetryEnable: used to enable or disable the retry feature of the HCCL operator. If an SDMA or RDMA CQE error is reported during the execution of a communication operator, HCCL attempts to retry the communication operator. It is supported only by the Atlas A3 training products / Atlas A3 inference products.

    You can use this parameter to configure whether to enable the retry feature in the communicators of the inter-server and inter-supernode physical layers. Each layer supports two states: enabled and disabled. For details about the restrictions, see the environment variable HCCL_OP_RETRY_ENABLE. The configuration method is as follows: hcclRetryEnable = "L1:1, L2:0". The parameter values are as follows:

    • L1 indicates that the physical scope of the communicator is the communicator between servers. 0 (default value) indicates that the retry feature is disabled for inter-server communication tasks in the communicator, while 1 indicates that the retry feature is enabled for inter-server communication tasks in the communicator.
    • L2 indicates that the physical scope of the communicator is the communicator between supernodes. 0 (default value) indicates that the retry feature is disabled for inter-supernode communication tasks in the communicator, while 1 indicates that the retry feature is enabled for inter-supernode communication tasks in the communicator.
  • hcclRetryParams: used to configure the wait period for the first retry, the maximum number of retries, and the interval between two retries after the HCCL operator retry feature is enabled through the hcclRetryEnable parameter. It is supported only by the Atlas A3 training products / Atlas A3 inference products.
    For details about the restrictions, see the environment variable HCCL_OP_RETRY_PARAMS. The configuration method is hcclRetryParams = "MaxCnt:3, HoldTime:5000, IntervalTime:1000". The parameter values are as follows:
    • MaxCnt: maximum number of retries. The value is of the uint32 type. The value range is [1,10]. The default value is 1.
    • HoldTime: wait period from the time when a communication operator execution failure is detected to the time when the communication operator is retried for the first time. The value is of the uint32 type. The value range is [0,60000], with the default value of 5000, in milliseconds.
    • IntervalTime: interval between two retries of the same communication operator. The value is of the uint32 type. The value range is [0,60000], with the default value of 1000, in milliseconds.
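The string-valued parameters hcclAlgo, hcclRetryEnable, and hcclRetryParams all follow small textual formats, so building them programmatically can help avoid typos. The following is a sketch only: every function and type name in it is hypothetical, and it reproduces the formats exactly as written in the examples above, including the ", " separator. FitsField assumes that each character array must also hold a terminating NUL, which this document does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Joins (operator, level specification) pairs with '/', giving for example
// "allreduce=level0:NA;level1:ring/allgather=level0:NA;level1:H-D_R".
std::string BuildPerOpAlgo(
    const std::vector<std::pair<std::string, std::string>> &entries) {
    std::string out;
    for (const auto &entry : entries) {
        if (!out.empty()) {
            out += '/';
        }
        out += entry.first + "=" + entry.second;
    }
    return out;
}

// Formats the per-layer retry switch, for example "L1:1, L2:0".
std::string FormatRetryEnable(bool l1, bool l2) {
    return std::string("L1:") + (l1 ? "1" : "0") + ", L2:" + (l2 ? "1" : "0");
}

// Retry settings with the documented default values.
struct RetryParams {
    uint32_t maxCnt = 1;             // range [1, 10]
    uint32_t holdTimeMs = 5000;      // range [0, 60000]
    uint32_t intervalTimeMs = 1000;  // range [0, 60000]
};

// Checks the documented value ranges.
bool IsValidRetryParams(const RetryParams &p) {
    return p.maxCnt >= 1 && p.maxCnt <= 10 &&
           p.holdTimeMs <= 60000 && p.intervalTimeMs <= 60000;
}

// Formats the settings, for example "MaxCnt:3, HoldTime:5000, IntervalTime:1000".
std::string FormatRetryParams(const RetryParams &p) {
    return "MaxCnt:" + std::to_string(p.maxCnt) +
           ", HoldTime:" + std::to_string(p.holdTimeMs) +
           ", IntervalTime:" + std::to_string(p.intervalTimeMs);
}

// True if `text` plus a terminating NUL fits into a field of `capacity` bytes.
bool FitsField(const std::string &text, std::size_t capacity) {
    return text.size() < capacity;
}
```

Before copying a result into the structure, a caller would check it against the matching constant, for example FitsField(algo, 1600) for hcclAlgo and FitsField(params, 128) for hcclRetryParams.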

Configuration Priority

Table 3 Configuration priority

Parameter | Configuration Priority
--- | ---
hcclBufferSize | Parameter hcclBufferSize (communicator-granularity configuration) > environment variable HCCL_BUFFSIZE (global configuration) > default value 200
hcclDeterministic | Parameter hcclDeterministic (communicator-granularity configuration) > environment variable HCCL_DETERMINISTIC (global configuration) > default value 0 (disabling deterministic computing)
hcclOpExpansionMode | Parameter hcclOpExpansionMode (communicator-granularity configuration) > environment variable HCCL_OP_EXPANSION_MODE (global configuration) > default value 0
hcclRdmaTrafficClass | Parameter hcclRdmaTrafficClass (communicator-granularity configuration) > environment variable HCCL_RDMA_TC (global configuration) > default value 132
hcclRdmaServiceLevel | Parameter hcclRdmaServiceLevel (communicator-granularity configuration) > environment variable HCCL_RDMA_SL (global configuration) > default value 4
hcclExecTimeOut | Parameter hcclExecTimeOut (communicator-granularity configuration) > environment variable HCCL_EXEC_TIMEOUT (global configuration) > default value 1836
hcclAlgo | Parameter hcclAlgo (communicator-granularity configuration) > environment variable HCCL_ALGO (global configuration) > default value
hcclRetryEnable | Parameter hcclRetryEnable (communicator-granularity configuration) > environment variable HCCL_OP_RETRY_ENABLE (global configuration) > default value
hcclRetryParams | Parameter hcclRetryParams (communicator-granularity configuration) > environment variable HCCL_OP_RETRY_PARAMS (global configuration) > default value
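The priority chain in Table 3 (communicator parameter, then environment variable, then default value) can be illustrated for the integer parameters as follows. This is a sketch of the documented ordering, not HCCL's internal implementation: ResolveU32 and kUnset are hypothetical names. It relies on the 0xFFFFFFFF identifier that the parameter notes describe for hcclRdmaTrafficClass, hcclRdmaServiceLevel, and hcclExecTimeOut. How the other parameters signal "not set" is not stated in this document. The environment text is passed in by the caller, for example from std::getenv("HCCL_RDMA_TC").

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Identifier meaning "no communicator-level setting" (see the parameter notes).
constexpr uint32_t kUnset = 0xFFFFFFFFu;

// Resolves a value in priority order:
//   1. the communicator-level value, unless it equals kUnset;
//   2. the environment variable text, if present and a valid decimal number;
//   3. the default value.
uint32_t ResolveU32(uint32_t commValue, const char *envText,
                    uint32_t defaultValue) {
    if (commValue != kUnset) {
        return commValue;
    }
    if (envText != nullptr && *envText != '\0') {
        char *end = nullptr;
        const unsigned long parsed = std::strtoul(envText, &end, 10);
        // Accept only fully parsed text that fits into 32 bits.
        if (end != envText && *end == '\0' && parsed <= 0xFFFFFFFFul) {
            return static_cast<uint32_t>(parsed);
        }
    }
    return defaultValue;
}
```

With a communicator value of 64, the result is 64 regardless of the environment. With kUnset and an environment text of "32", the result is 32. With kUnset and no environment variable, the default (132 for the traffic class) is returned.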