HCCL_OP_EXPANSION_MODE
Description
- AI_CPU: The orchestration of the communication algorithm is expanded on the AI CPU on the device. The device automatically selects a scheduler based on the hardware model.
- AIV: The orchestration of the communication algorithm is expanded on the vector core on the device, and the execution is also performed on the vector core.
- HOST: The orchestration of the communication algorithm is expanded on the CPU on the host. The device automatically selects a scheduler based on the hardware model.
- HOST_TS: The orchestration of the communication algorithm is expanded on the CPU on the host. The host delivers tasks to the task scheduler of the device, and the task scheduler of the device schedules and executes the tasks.
The following table lists the configurations supported by each product, together with the related constraints. Products not listed in the table do not support this environment variable. If an unsupported value is set, the default value is used.
| Product | Supported Configuration | Constraints | Default Value |
|---|---|---|---|
| Atlas 300I Duo inference card | AI_CPU | | HOST |
| | HOST | None | |
| (For | AIV | Notes: | HOST |
| | HOST | None | |
| | HOST_TS | None | |
| | AI_CPU | Full communication operators are supported within a supernode and between supernodes. For the Reduce, ReduceScatter, ReduceScatterV, and AllReduce operators, the data type can only be int8, int16, int32, float16, float32, or bfp16, and the reduce operation type can only be sum, max, or min. For details about the data types supported by other communication operators, see the corresponding collective communication APIs. Notes: | AI_CPU |
| | AIV | Notes: When the location for expanding the algorithm orchestration is set to AIV and the HCCL_DETERMINISTIC environment variable is set to true or strict, if the data size is less than 8 MB, only the deterministic computing of the AllReduce and ReduceScatter operators takes effect. In other scenarios and for other operators, the HCCL_DETERMINISTIC configuration is used. | |
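The fallback rule stated before the table can be modeled in a few lines of shell. This is only a sketch of the documented behavior, not how HCCL implements it: resolve_mode is a hypothetical helper, and the supported list and default value passed to it are taken from the Atlas 300I Duo inference card row (AI_CPU and HOST supported, default HOST).

```shell
# Hypothetical helper: models "an unsupported value falls back to the default".
# $1 = requested value, $2 = space-separated supported values, $3 = default value.
resolve_mode() {
    requested=$1
    supported=$2
    default=$3
    for mode in $supported; do
        if [ "$requested" = "$mode" ]; then
            printf '%s\n' "$mode"
            return 0
        fi
    done
    printf '%s\n' "$default"
}

# Atlas 300I Duo inference card row: supports AI_CPU and HOST, default HOST.
resolve_mode "AI_CPU" "AI_CPU HOST" "HOST"   # supported, kept as requested
resolve_mode "AIV" "AI_CPU HOST" "HOST"      # not supported on this product, default is used
```

A check of this kind in a launch script makes an unsupported setting visible before the job starts, rather than leaving the silent fallback to be discovered later.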
Example
export HCCL_OP_EXPANSION_MODE="HOST"
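As a small illustration of standard shell behavior (nothing here is specific to HCCL), the variable can be set either for a single command or for every process started from the current shell. The sh -c command below is only a stand-in for the real launch command, which is an assumption of this sketch.

```shell
# Set the variable for one command only; the surrounding shell is unchanged.
# "sh -c ..." stands in for the real training or inference launch command.
HCCL_OP_EXPANSION_MODE="HOST" sh -c 'echo "mode seen by child: $HCCL_OP_EXPANSION_MODE"'

# After export, every process started from this shell inherits the value.
export HCCL_OP_EXPANSION_MODE="HOST"
sh -c 'echo "mode seen by child: $HCCL_OP_EXPANSION_MODE"'
```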
Restrictions
- If you call the HCCL C APIs to initialize a communicator with specific configurations and specify the location for expanding the communication algorithm orchestration using the hcclOpExpansionMode parameter of HcclCommConfig, the configuration of the communicator takes precedence.
- For the inference feature of the Atlas A2 training products/Atlas A2 inference products: If AIV is configured and the process is forcibly ended by pressing Ctrl+C, the device log file exported by the msnpureport tool may contain an error indicating that the device accessed an invalid address. The log keyword is devmm_page_fault_d2h_query_flag, devmm_svm_device_fault, or ipc_fault_msg_para_check, as shown in the following log. This scenario does not affect the device status or the execution of new tasks.
[ERROR] KERNEL(5044,sklogd):2024-07-29-10:33:22.646.254 [klogd.c:247][257382.266115] [ascend] [ERROR] [devmm] [devmm_page_fault_d2h_query_flag 810] <kworker/u16:2:14887,14887> Host page fault send message fail.(hostpid=2131021; devid=0; vfid=0; ret=-22; va=0x12c700300000; hostpid=2131021; devid=0; vfid=0)
[ERROR] KERNEL(5044,sklogd):2024-07-29-10:33:22.646.284 [klogd.c:247][257382.266124] [ascend] [ERROR] [devmm] [devmm_svm_device_fault 468] <kworker/u16:2:14887,14887> Vm fault failed. (hostpid=2131021; devid=0; vfid=0; ret=64; fault_addr=0x12c700300000; start=0x12c700300000)
[ERROR] KERNEL(5044,sklogd):2024-07-29-10:33:22.659.429 [klogd.c:247][257382.282181] [ascend] [ERROR] [tsdrv] [ipc_fault_msg_para_check 309] <swapper/3:0> Invalid node id. (devid=0; node_type=100; node_id=40; node_num=25)
................
[ERROR] KERNEL(5044,sklogd):2024-07-29-10:33:24.874.211 [klogd.c:247][257384.473533] [ascend] [ERROR] [tsdrv] [tsdrv_hb_cq_callback 332] <kworker/0:0:20353> receive ts exception msg, call excep_code=0xb4060006, time=1722249204.850014098s, devid=0 tsid=0
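To check whether an exported device log contains this signature, the three keywords can be searched for in one pass. The following is a minimal sketch: the temporary file and its single shortened sample line are placeholders that keep the snippet self-contained, and in practice log_file would point to a log file exported by msnpureport.

```shell
# Placeholder log file for illustration; in practice, use a device log exported by msnpureport.
log_file=$(mktemp)
printf '%s\n' '[ascend] [ERROR] [devmm] [devmm_svm_device_fault 468] Vm fault failed.' > "$log_file"

# Count the lines that contain any of the three documented keywords.
keywords='devmm_page_fault_d2h_query_flag|devmm_svm_device_fault|ipc_fault_msg_para_check'
match_count=$(grep -c -E "$keywords" "$log_file")
echo "matching lines: $match_count"

rm -f "$log_file"
```

A nonzero count that appears only after an AIV run was interrupted with Ctrl+C matches the scenario described above, which does not affect the device status or new tasks.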