HCCL_OP_EXPANSION_MODE

Description

Specifies where the orchestration of the communication algorithm is expanded. The options are as follows:
  • AI_CPU: The orchestration of the communication algorithm is expanded on the AI CPU on the device. The device automatically selects a scheduler based on the hardware model.
  • AIV: The orchestration of the communication algorithm is expanded on the vector core on the device, and the execution is also performed on the vector core.
  • HOST: The orchestration of the communication algorithm is expanded on the CPU on the host. The device automatically selects a scheduler based on the hardware model.
  • HOST_TS: The orchestration of the communication algorithm is expanded on the CPU on the host. The host delivers tasks to the task scheduler of the device, and the task scheduler of the device schedules and executes the tasks.
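
As a minimal sketch of how a launch script might guard against typos in this variable, the following shell function checks a value against the four options listed above. The function name is_valid_expansion_mode is hypothetical and not part of HCCL, the match is assumed to be case-sensitive, and whether a given option is actually supported still depends on the product.

```shell
# Hypothetical helper (not part of HCCL): succeeds only for the four documented options.
# Case-sensitive matching is an assumption of this sketch.
is_valid_expansion_mode() {
    case "$1" in
        AI_CPU|AIV|HOST|HOST_TS) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: warn before launching if the current value is not a documented option.
mode="${HCCL_OP_EXPANSION_MODE:-}"
if [ -n "$mode" ] && ! is_valid_expansion_mode "$mode"; then
    echo "warning: unexpected HCCL_OP_EXPANSION_MODE value: $mode" >&2
fi
```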

The following table lists the configurations supported by different products and the related scenarios. Products not listed in the table do not support this environment variable. If an unsupported value is set, the default value is used.

Table 1 Supported configurations of HCCL_OP_EXPANSION_MODE

Product

Supported Configuration

Constraints

Default Value

Atlas 300I Duo inference card

AI_CPU

  • Only the single-server single-communicator scenario is supported.
  • Only the AllReduce operator is supported. For details about the data types supported by the AllReduce operator, see the HcclAllReduce API.
  • If this parameter is set to AI_CPU, the communication operator does not support profiling data collection and analysis.
  • For static-shape graphs, the AI_CPU configuration is not supported.

Default value: HOST

HOST

Constraints: None

Atlas A2 training products / Atlas A2 inference products

(Among Atlas A2 training products / Atlas A2 inference products, only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported.)

AIV

  • This configuration option supports only the symmetric networking and inference features.
  • This configuration option does not support the multi-communicator parallel scenario, because the AIV mode cannot be used by multiple communicators at the same time. Using it in that scenario may cause unexpected behavior. When initializing a communicator with specific configurations, you can set the location for expanding the algorithm orchestration of that communicator to AIV through HcclCommConfig.
  • This configuration option supports only the Broadcast, AllReduce, AlltoAll, AlltoAllV, AlltoAllVC, AllGather, ReduceScatter, AllGatherV, and ReduceScatterV operators.
    • For the Broadcast operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only the single-operator mode with eight or fewer devices in the single-server scenario is supported.
    • For the AllReduce operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min.
    • For the AlltoAll, AlltoAllV, and AlltoAllVC operators, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. For the AlltoAllV and AlltoAllVC operators, only single-server scenarios are supported. The graph mode of the AlltoAll operator supports only single-server scenarios.
    • For the AllGather operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. The graph mode of this operator supports only single-server scenarios.
    • For the ReduceScatter operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min. The graph mode of this operator supports only single-server scenarios.
    • For the AllGatherV operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only the single-operator mode is supported.
    • For the ReduceScatterV operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min.
  • Under this configuration option, the collective communication supports the core control capability. You are advised to configure the number of vector cores based on the concurrency of compute operators and communication operators in actual application scenarios.
    • For the AllReduce, ReduceScatter, and ReduceScatterV operators, you are advised to allocate at least 24 cores.
    • For the Broadcast, AlltoAll, AlltoAllV, AlltoAllVC, AllGather, and AllGatherV operators, you are advised to allocate at least 16 cores.

Notes:

  • If the HCCL_DETERMINISTIC environment variable is set to true, deterministic computing and AIV can be both enabled for the AllReduce and ReduceScatter operators in the single-operator and graph mode of a single server when the data size is less than or equal to 8 MB. In other scenarios and for other operators, this configuration item does not take effect and the HCCL_DETERMINISTIC configuration is used.
  • If HCCL_DETERMINISTIC is set to strict, this configuration item does not take effect and the HCCL_DETERMINISTIC configuration is used.
  • The Atlas 200T A2 Box16 heterogeneous subrack does not support inter-subrack communication.

Default value: HOST

HOST

Constraints: None

HOST_TS

Constraints: None

Atlas A3 training products / Atlas A3 inference products

AI_CPU

All communication operators are supported within a supernode and between supernodes.

For the Reduce, ReduceScatter, ReduceScatterV, and AllReduce operators, the data type can only be int8, int16, int32, float16, float32, or bfp16, and the reduce operation type can only be sum, max, or min.

For details about the data types supported by other communication operators, see the corresponding collective communication APIs.

Notes:

  • In the graph mode (Ascend IR) or graph capture (aclgraph) scenario, when the communication algorithm uses the default AI CPU mode, the number of concurrent graphs on a single device cannot exceed 6. Otherwise, the communication may be blocked because the AI CPU cores are fully occupied.
  • In this mode, the communication function depends on the open AI CPU user mode to deliver scheduling tasks, which poses security risks. You need to ensure the security and reliability of custom operators to prevent malicious attacks.

Default value: AI_CPU

AIV

  • This configuration option supports only the symmetric networking and inference features.
  • This configuration option does not support the multi-communicator parallel scenario, because the AIV mode cannot be used by multiple communicators at the same time. Using it in that scenario may cause unexpected behavior. When initializing a communicator with specific configurations, you can set the location for expanding the algorithm orchestration of that communicator to AIV through HcclCommConfig.
  • This configuration option supports only the Broadcast, AllReduce, ReduceScatter, AllGather, AlltoAll, AlltoAllV, and AlltoAllVC operators.
    • For the Broadcast operator, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only the single-operator mode with eight or fewer devices in the single-server scenario is supported.
    • For the AllReduce operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
    • For the ReduceScatter operator, the data type can be int8, int16, int32, float16, float32, or bfp16. The reduce operation type can only be sum, max, or min. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
    • For the AllGather, AlltoAll, AlltoAllV, and AlltoAllVC operators, the data type can be int8, uint8, int16, uint16, int32, uint32, float16, float32, or bfp16. Only single-server or multi-server communication within a supernode is supported. Communication between supernodes is not supported.
  • For the AllReduce, ReduceScatter, AllGather, and AlltoAll (single-server communication) operators, when the data size exceeds a certain value, the system automatically switches to the AI_CPU mode to prevent performance deterioration. (The threshold is not fixed and may vary with factors such as the operator running mode, whether deterministic computing is enabled, and network scale.) For the AlltoAllV, AlltoAllVC, and AlltoAll (multi-server communication) operators, when the AIV mode is used, the system does not automatically switch to the AI_CPU mode. To prevent performance deterioration, you are advised to use the AIV mode when the maximum communication data size between any two ranks does not exceed 1 MB. Otherwise, use the AI_CPU mode.
  • Under this configuration option, the collective communication supports the core control capability. You are advised to configure the number of vector cores based on the concurrency of compute operators and communication operators in actual application scenarios.
    • For the Broadcast operator, you are advised to allocate at least ranksize vector cores.
    • For the AllReduce, ReduceScatter, AllGather, AlltoAll, AlltoAllV, and AlltoAllVC operators, you are advised to allocate at least max(2, ranksize/20 + 1) vector cores.

Notes:

When the location for expanding the algorithm orchestration is set to AIV and the HCCL_DETERMINISTIC environment variable is set to true or strict, if the data size is less than 8 MB, only the deterministic computing of the AllReduce and ReduceScatter operators takes effect. In other scenarios and for other operators, the HCCL_DETERMINISTIC configuration is used.
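
The vector-core guidance for the AIV option on Atlas A3 products can be turned into a small calculation. The sketch below is illustrative only: the function name recommended_aiv_cores is hypothetical, and it assumes that ranksize/20 is an integer (floor) division, which the table does not state.

```shell
# Hypothetical helper: minimum recommended vector cores for AIV on Atlas A3 products.
# $1 = operator name, $2 = ranksize (number of ranks in the communicator).
recommended_aiv_cores() {
    op="$1"
    ranksize="$2"
    case "$op" in
        Broadcast)
            # Broadcast: at least ranksize vector cores.
            echo "$ranksize"
            ;;
        AllReduce|ReduceScatter|AllGather|AlltoAll|AlltoAllV|AlltoAllVC)
            # max(2, ranksize/20 + 1); integer division is an assumption.
            n=$(( ranksize / 20 + 1 ))
            if [ "$n" -lt 2 ]; then n=2; fi
            echo "$n"
            ;;
        *)
            echo "unsupported operator for AIV: $op" >&2
            return 1
            ;;
    esac
}

recommended_aiv_cores AllReduce 8     # 8/20 + 1 = 1, raised to the minimum of 2
recommended_aiv_cores AllReduce 64    # 64/20 + 1 = 4
recommended_aiv_cores Broadcast 8     # 8
```

These values are lower bounds from the recommendation; the actual allocation should still reflect how many compute operators run concurrently with communication.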

Example

export HCCL_OP_EXPANSION_MODE="HOST"
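
For Atlas A3 products, the table above advises using AIV for AlltoAllV, AlltoAllVC, and multi-server AlltoAll only when the largest transfer between any two ranks does not exceed 1 MB, and AI_CPU otherwise. A launch script could apply that rule as sketched below. The function name pick_expansion_mode and the variable MAX_PAIR_BYTES are hypothetical, and 1 MB is assumed to mean 1048576 bytes.

```shell
# Hypothetical helper: choose a mode from the largest per-pair transfer size in bytes.
pick_expansion_mode() {
    limit=1048576   # 1 MB, assumed to be 1024 * 1024 bytes
    if [ "$1" -le "$limit" ]; then
        echo "AIV"
    else
        echo "AI_CPU"
    fi
}

# MAX_PAIR_BYTES is a placeholder; set it from your own workload.
MAX_PAIR_BYTES="${MAX_PAIR_BYTES:-524288}"
export HCCL_OP_EXPANSION_MODE="$(pick_expansion_mode "$MAX_PAIR_BYTES")"
echo "HCCL_OP_EXPANSION_MODE=$HCCL_OP_EXPANSION_MODE"
```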

Restrictions

  • If you call the HCCL C APIs to initialize a communicator with specific configurations and specify the location for expanding the communication algorithm orchestration using the hcclOpExpansionMode parameter of HcclCommConfig, the configuration of the communicator takes precedence.
  • For the inference feature of the Atlas A2 training products / Atlas A2 inference products:
    If AIV is configured and the process is forcibly terminated by pressing Ctrl+C, the device log file exported by the msnpureport tool may contain errors indicating that the device accessed an invalid address. The log keywords are devmm_page_fault_d2h_query_flag, devmm_svm_device_fault, and ipc_fault_msg_para_check, as shown in the following log excerpt. This does not affect the device status or the execution of new tasks.
    [ERROR] KERNEL(5044,sklogd):2024-07-29-10:33:22.646.254 [klogd.c:247][257382.266115] [ascend] [ERROR] [devmm] [devmm_page_fault_d2h_query_flag 810] <kworker/u16:2:14887,14887> Host page fault send message fail.(hostpid=2131021; devid=0; vfid=0; ret=-22; va=0x12c700300000; hostpid=2131021; devid=0; vfid=0)
    [ERROR] KERNEL(5044,sklogd):2024-07-29-10:33:22.646.284 [klogd.c:247][257382.266124] [ascend] [ERROR] [devmm] [devmm_svm_device_fault 468] <kworker/u16:2:14887,14887> Vm fault failed. (hostpid=2131021; devid=0; vfid=0; ret=64; fault_addr=0x12c700300000; start=0x12c700300000)
    [ERROR] KERNEL(5044,sklogd):2024-07-29-10:33:22.659.429 [klogd.c:247][257382.282181] [ascend] [ERROR] [tsdrv] [ipc_fault_msg_para_check 309] <swapper/3:0> Invalid node id. (devid=0; node_type=100; node_id=40; node_num=25)
    ................
    [ERROR] KERNEL(5044,sklogd):2024-07-29-10:33:24.874.211 [klogd.c:247][257384.473533] [ascend] [ERROR] [tsdrv] [tsdrv_hb_cq_callback 332] <kworker/0:0:20353> receive ts exception msg, call excep_code=0xb4060006, time=1722249204.850014098s, devid=0 tsid=0
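
When triaging exported device logs, it can help to list the lines that carry the three keywords named above. The sketch below uses standard grep; the function name find_aiv_abort_messages and the log path in the usage comment are hypothetical.

```shell
# Hypothetical helper: print log lines containing any of the three documented keywords.
# $1 = path to a device log file exported by msnpureport.
# Returns a non-zero status when no line matches (standard grep behavior).
find_aiv_abort_messages() {
    grep -E 'devmm_page_fault_d2h_query_flag|devmm_svm_device_fault|ipc_fault_msg_para_check' "$1"
}

# Usage (placeholder path):
#   find_aiv_abort_messages ./device-0.log
```

A match that appears right after a Ctrl+C interruption with AIV configured is consistent with the benign case described above.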