Relationship Between Communication Operator Retry and Overall Network Performance

After the HCCL communication operator retry function is enabled, the end-to-end (E2E) performance change of the entire network is closely related to the model sharding and deployment mode. This section describes the relationship between the retry function and network performance.

Key Communicator

A key communicator is a communicator whose performance change significantly changes the E2E performance of the entire network. In other words, it is the performance bottleneck of the entire network.

Generally, there are multiple communicators on the entire network, and there is usually one key communicator. This section focuses on the key communicator for performance analysis.

The following figure shows an example.

In the profiling data shown in the figure, communication actually occurs in four communicators: Group_777, Group_1289, Group_257, and Group_9.

The BatchSendRecv operator executed in Group_1289 is introduced by pipeline parallelism (PP). It generally uses asynchronous communication, which overlaps with computation and does not account for a large proportion of the time. Therefore, this communicator is not a key communicator.

Group_777 and Group_9 execute few operators and therefore have little impact on the entire network. They are not key communicators either.

Therefore, Group_257 is the key communicator. If the performance of this communicator deteriorates, the E2E performance of the entire network is directly affected.
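The selection logic above can be sketched in a few lines of Python. This is only an illustration: the CommOp record, its fields, and the sample durations are hypothetical and do not correspond to any real profiling export format. The sketch assumes that the key communicator is the one with the most communication time that is not hidden behind computation.

```python
from dataclasses import dataclass


@dataclass
class CommOp:
    """One communication operator taken from a profile (hypothetical record)."""
    group: str            # communicator name, for example "Group_257"
    duration_us: float    # total time of the communication operator
    overlapped_us: float  # part of that time hidden behind computation


def exposed_time_by_group(ops):
    """Sum, per communicator, the communication time not hidden by computation."""
    totals = {}
    for op in ops:
        exposed = max(0.0, op.duration_us - op.overlapped_us)
        totals[op.group] = totals.get(op.group, 0.0) + exposed
    return totals


def key_communicator(ops):
    """Return the communicator with the largest exposed communication time."""
    totals = exposed_time_by_group(ops)
    return max(totals, key=totals.get)


# Hypothetical durations, chosen only to mirror the reasoning in the text.
sample_ops = [
    CommOp("Group_1289", 400.0, 400.0),  # asynchronous BatchSendRecv, fully hidden
    CommOp("Group_777", 30.0, 0.0),      # few operators
    CommOp("Group_9", 20.0, 0.0),        # few operators
    CommOp("Group_257", 300.0, 50.0),
    CommOp("Group_257", 280.0, 40.0),
]

print(key_communicator(sample_ops))
```

With these sample values, Group_1289 contributes no exposed time because its communication is fully overlapped, and Group_257 dominates.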

Relationship Between Network Performance Deterioration and the Key Communicator

Focus 1: whether the retry function is enabled for the key communicator

In some common deployment modes, for example, when tensor parallelism (TP) and data parallelism (DP) are used together, the TP communicator is the key communicator. If the TP range is within a server (TP ≤ 16), the E2E performance is not affected, because the communication operator retry function is not enabled for communication within a server.
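As a rough aid for this check, the following Python sketch tests whether a communicator stays within one server. It assumes 16 dies per server (taken from the TP ≤ 16 condition above), ranks numbered contiguously server by server, and TP as the innermost parallel dimension. The helper names are hypothetical, and the sketch only checks rank placement. It does not query HCCL.

```python
def spans_multiple_servers(ranks, dies_per_server=16):
    """Return True if the ranks of a communicator sit on more than one server."""
    servers = {rank // dies_per_server for rank in ranks}
    return len(servers) > 1


def tp_groups(world_size, tp):
    """TP communicators, assuming TP is the innermost (contiguous) dimension."""
    return [list(range(start, start + tp)) for start in range(0, world_size, tp)]


def dp_groups(world_size, tp):
    """DP communicators for a TP x DP layout: ranks with the same TP offset."""
    return [list(range(offset, world_size, tp)) for offset in range(tp)]


# Layout of the 64-die example: TP = 16, DP = 4.
world_size, tp = 64, 16
tp_crosses = any(spans_multiple_servers(g) for g in tp_groups(world_size, tp))
dp_crosses = all(spans_multiple_servers(g) for g in dp_groups(world_size, tp))
print(tp_crosses, dp_crosses)
```

Under these assumptions, every TP communicator stays inside one server, while every DP communicator crosses servers. This matches the case where retry applies only to the non-key DP communicator.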

Non-key communicators have little impact on the performance of the entire network. The following table lists lab test data for several models.

Model	Sharding Mode	Deterioration Ratio	Description
Llama3-8B (running on a 64-die cluster)	TP = 16 (key communicator), DP = 4	0.03%	The retry function is enabled only for DP, which is a non-key communicator, so the impact on the E2E performance is small.
GPT4_dropLess (running on a 128-die cluster)	TP = 8 (key communicator), PP = 1, EP = 1, CP = 16	0.99%	The retry function is enabled only for context parallelism (CP), which is a non-key communicator, so the impact on the E2E performance is small.
Qwen3-moe-235B (running on a 128-die cluster)	TP = 8 (key communicator), PP = 1, EP = 64	-0.1%	The retry function is enabled only for expert parallelism (EP), which is a non-key communicator, so the impact on the E2E performance is small.

Focus 2: whether the communication expansion and computation of the key communicator can overlap

If the retry function is enabled for the key communicator, the performance of that communicator inevitably deteriorates. However, whether this causes the entire network to deteriorate depends on whether the AI CPU expansion of the key communicator can overlap with the computation.

After the retry function is enabled for a communicator, the main change is that operator expansion switches from the asynchronous mode to the synchronous mode, as shown in the following figure (from the upper mode to the lower mode).

Figure 1 Change of the operator expansion mode after the retry function is enabled

Whether the communication expansion time can be overlapped with the computation time is the key factor that determines whether the communicator affects the E2E performance. The analysis needs to be performed based on the computation operator (model structure).

As shown in the following figure, the computation operator takes only 50 μs, whereas the gap between adjacent communication operators in the synchronous AI CPU expansion mode is 150 μs. Only 50 μs of this gap can be hidden by the computation, so the overhead introduced by the retry function is 100 μs (150 - 50). Because this overhead occurs in the key communicator, it causes E2E deterioration.
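The arithmetic in this example can be written as a small Python helper. The function name is hypothetical. Clamping the result at zero is an assumption that reflects the case where the computation is long enough to cover the expansion gap completely.

```python
def exposed_retry_overhead_us(gap_us, compute_us):
    """Overhead that computation cannot hide, in microseconds.

    gap_us:     gap between communication operators in synchronous expansion mode
    compute_us: time of the computation operator that can overlap with the gap
    """
    return max(0.0, gap_us - compute_us)


# Values from the example above: 150 us gap, 50 us of computation.
print(exposed_retry_overhead_us(150.0, 50.0))   # 100.0

# If computation were longer than the gap, nothing would be exposed.
print(exposed_retry_overhead_us(150.0, 400.0))  # 0.0
```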

However, the deterioration degree depends on the proportion of operators in the key communicator on the entire network (closely related to the model structure and deployment mode) and whether the expansion in this dimension can overlap with the computation.
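A first-order estimate of this dependence is sketched below in Python. It is a simplified model, not a measured formula: it assumes every operator in the key communicator pays the same exposed overhead and that these overheads simply add to the step time. The function name and all input values are hypothetical.

```python
def estimated_deterioration_ratio(per_op_overhead_us, ops_per_step, baseline_step_us):
    """Estimate E2E deterioration as added exposed time divided by baseline step time."""
    if baseline_step_us <= 0:
        raise ValueError("baseline_step_us must be positive")
    return per_op_overhead_us * ops_per_step / baseline_step_us


# Hypothetical inputs: 100 us of exposed overhead per operator,
# 300 key-communicator operators per step, and a 1 s baseline step.
ratio = estimated_deterioration_ratio(100.0, 300, 1_000_000.0)
print(f"{ratio:.2%}")
```

The estimate grows with the number of key-communicator operators per step and shrinks when computation hides more of each expansion gap, which is why the same sharding can affect different models differently.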

For example, with the same EP = 64 sharding, different models deteriorate to different degrees.

Model	Sharding Mode	Deterioration Ratio	Description
DeepSeek V3 (running on a 64-die cluster)	EP = 64	0.06%	The retry function is enabled for EP in the key communicator. However, the model computation takes a long time, and the retry overhead can be covered by the computation. Therefore, the E2E performance deterioration of the entire network is not serious.
qwen3-moe-30b (running on a 64-die cluster)	EP = 64	3%	The retry function is enabled for EP in the key communicator. The retry overhead cannot be covered by the computation. Therefore, the E2E performance of the entire network deteriorates.

Therefore, the impact of the retry function on the E2E performance of a model is closely related to the model structure and needs to be evaluated for each model and deployment mode.
