HCCL_INTRA_ROCE_ENABLE
Description
Specifies whether to use the RoCE link for communication on a server or supernode.
- For the Atlas training products and Atlas A2 training products/Atlas A2 inference products, this environment variable is used to configure whether to use the RoCE link for communication on a server. The default value is 0. This environment variable can be configured separately or used together with the environment variable HCCL_INTRA_PCIE_ENABLE. The following table lists the communication links used on a server in different configuration combinations.

Table 1 Configuration combinations supported by HCCL_INTRA_PCIE_ENABLE and HCCL_INTRA_ROCE_ENABLE

| HCCL_INTRA_PCIE_ENABLE | HCCL_INTRA_ROCE_ENABLE | Intra-Server Communication Link |
|------------------------|------------------------|---------------------------------|
| 1                      | Not configured         | PCIe                            |
| 1                      | 0                      | PCIe                            |
| 0                      | 1                      | RoCE                            |
| Not configured         | 1                      | RoCE                            |
| 0                      | 0                      | PCIe                            |
| Not configured         | Not configured         | PCIe                            |

HCCL_INTRA_PCIE_ENABLE and HCCL_INTRA_ROCE_ENABLE cannot be set to 1 at the same time.
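
As a rough illustration of Table 1, the following shell sketch prints the intra-server link that the table predicts for the current environment and rejects the unsupported combination. The function name hccl_intra_link is a hypothetical helper written for this example and is not part of HCCL. Treating an empty value the same as "Not configured" is also an assumption of the sketch.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCCL): prints the intra-server link that
# Table 1 predicts for the current values of the two environment variables.
# Assumption: an empty value is treated the same as "Not configured".
hccl_intra_link() {
    pcie="${HCCL_INTRA_PCIE_ENABLE:-unset}"
    roce="${HCCL_INTRA_ROCE_ENABLE:-unset}"

    # The two variables cannot be set to 1 at the same time.
    if [ "$pcie" = "1" ] && [ "$roce" = "1" ]; then
        echo "error: HCCL_INTRA_PCIE_ENABLE and HCCL_INTRA_ROCE_ENABLE cannot both be 1" >&2
        return 1
    fi

    # In every remaining row of Table 1, RoCE is used only when
    # HCCL_INTRA_ROCE_ENABLE is 1; all other rows resolve to PCIe.
    if [ "$roce" = "1" ]; then
        echo "RoCE"
    else
        echo "PCIe"
    fi
}
```

For example, after `export HCCL_INTRA_ROCE_ENABLE=1` with HCCL_INTRA_PCIE_ENABLE not configured, calling `hccl_intra_link` prints RoCE. Values other than 0 and 1 are not covered by Table 1, and the sketch does not validate them.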
- For the Atlas A3 training products/Atlas A3 inference products, this environment variable is valid only when LLM-DataDist is used as the cluster management component. It is used to configure whether to use the RoCE link for communication on a supernode. The default value is 0. The configuration is described as follows:
  - 0: The default HCCS link or PCIe link is used for communication (including LLM-DataDist communication and HCCL communication) on a supernode.
  - 1: For the Atlas 800T A3, Atlas 800I A3, and Atlas 900 A3 SuperPoD, the RoCE link is used for LLM-DataDist communication on a supernode, and HCCL communication is not affected. For the A200T A3 Box8, the RoCE link is used for both LLM-DataDist and HCCL communication.
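
The Atlas A3 behavior described above can be summarized as a small lookup. This is only a sketch: a3_intra_supernode_links is a hypothetical helper, the product strings are illustrative labels rather than identifiers that HCCL reads, and "HCCL communication is not affected" is interpreted here as HCCL staying on the default HCCS or PCIe link.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCCL or LLM-DataDist): prints which link
# each kind of traffic is expected to use on a supernode for a given product.
# Usage: a3_intra_supernode_links "Atlas 800T A3"
a3_intra_supernode_links() {
    product="$1"

    # Default (0 or not configured): both kinds of traffic use HCCS or PCIe.
    if [ "${HCCL_INTRA_ROCE_ENABLE:-0}" != "1" ]; then
        echo "LLM-DataDist=HCCS/PCIe HCCL=HCCS/PCIe"
        return 0
    fi

    case "$product" in
        "Atlas 800T A3"|"Atlas 800I A3"|"Atlas 900 A3 SuperPoD")
            # Only LLM-DataDist traffic moves to RoCE.
            echo "LLM-DataDist=RoCE HCCL=HCCS/PCIe" ;;
        "A200T A3 Box8")
            # Both LLM-DataDist and HCCL traffic move to RoCE.
            echo "LLM-DataDist=RoCE HCCL=RoCE" ;;
        *)
            echo "unknown product: $product" >&2
            return 1 ;;
    esac
}
```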
Example
export HCCL_INTRA_ROCE_ENABLE=1
Restrictions
The Atlas 200T A2 Box16 heterogeneous subrack has two modules on the left and right: devices 0 to 7 and devices 8 to 15. For this product:
In the single-server scenario, when the server uses a PCIe link for internal communication and devices from both modules are used at the same time, both modules must contribute the same number of devices, and those devices must be on the same plane. That is, devices 0 and 8, 1 and 9, and so on must be used together. When the server uses a RoCE link for internal communication, this restriction does not apply.
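
The pairing rule for the PCIe case can be checked before launching a job. The sketch below uses a hypothetical helper, box16_pcie_devices_ok, which is not part of HCCL. It assumes that "same plane" means device n in the first module pairs with device n + 8 in the second, as the examples 0 and 8, 1 and 9 suggest.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCCL): returns 0 if the given device IDs
# satisfy the PCIe restriction for the Atlas 200T A2 Box16, 1 otherwise.
# Usage: box16_pcie_devices_ok 0 1 8 9
box16_pcie_devices_ok() {
    first=""
    second=""
    for d in "$@"; do
        if [ "$d" -ge 0 ] && [ "$d" -le 7 ]; then
            first="$first $d"
        elif [ "$d" -ge 8 ] && [ "$d" -le 15 ]; then
            # Map devices 8..15 to their assumed plane index 0..7.
            second="$second $((d - 8))"
        else
            echo "invalid device id: $d" >&2
            return 1
        fi
    done

    # Only one module in use: the pairing rule does not apply.
    if [ -z "$first" ] || [ -z "$second" ]; then
        return 0
    fi

    # Both modules in use: the plane indices used in each module must match,
    # which also guarantees the same number of devices per module.
    a=$(printf '%s\n' $first | sort -n | uniq)
    b=$(printf '%s\n' $second | sort -n | uniq)
    [ "$a" = "$b" ]
}
```

With this sketch, `box16_pcie_devices_ok 0 1 8 9` succeeds, while `box16_pcie_devices_ok 0 1 8` and `box16_pcie_devices_ok 0 9` fail. The check is only relevant when the PCIe link is used; with RoCE, any device combination is allowed.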