Basics

This section covers the theoretical background of MC² operators and how to develop them. Before reading it, make sure you are familiar with cube programming and the HCCL User Guide.

MC² operators generally support the following product models:

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

MC² Operators

Unlike ordinary compute or data movement operators, an MC² operator fuses communication and compute operations that would otherwise run serially. By tiling data inside the operator, it lets compute and communication tasks execute in parallel, which improves operator performance. "MC² operators" is short for "Matrix Computation & Communication operators".

As shown in the following figure, when a communication operator and a compute operator run serially, the ideal execution duration is the sum of the two operators' durations. An MC² operator that fuses the communication and compute tasks tiles the data they need, which reduces the amount of data handled by each individual communication and compute step. Splitting the overall work into multiple batches allows the compute of one batch to run in parallel with the communication of another, which shortens the theoretical execution duration and delivers a performance benefit.

Figure 1 Comparison of theoretical execution duration before and after communication-compute fusion
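
The comparison can be approximated with a simple two-stage pipeline model. The sketch below is illustrative only and is not part of any MC² API: the function names are hypothetical, and it assumes that compute precedes communication, that all tiles are the same size, that tiling adds no per-tile overhead, and that compute and communication overlap perfectly.

```python
def serial_duration(t_compute: float, t_comm: float) -> float:
    """Ideal duration when the two operators run back to back."""
    return t_compute + t_comm


def fused_duration(t_compute: float, t_comm: float, num_tiles: int) -> float:
    """Ideal duration when the work is split into equal tiles and pipelined.

    Assumed model (not taken from the MC² implementation): the first tile's
    compute cannot be hidden, the last tile's communication cannot be hidden,
    and every step in between is bounded by the slower of the two stages.
    """
    if num_tiles < 1:
        raise ValueError("num_tiles must be at least 1")
    c = t_compute / num_tiles  # compute time per tile
    m = t_comm / num_tiles     # communication time per tile
    return c + (num_tiles - 1) * max(c, m) + m


if __name__ == "__main__":
    # Hypothetical durations in arbitrary time units.
    for n in (1, 2, 4, 8):
        print(n, fused_duration(8.0, 4.0, n))
```

With one tile the model reduces to the serial case. As the number of tiles grows, the duration approaches that of the slower stage alone. Real operators also pay a per-tile launch and synchronization cost that this model ignores, so the tile count cannot be increased indefinitely.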

Application Scenarios and Advantages

As models grow, training and inference on a single device hit limits in compute capability, memory capacity, and energy efficiency, so distributed parallel computing becomes unavoidable. Communication and compute tasks in distributed LLM training and inference fall into two types, depending on how tightly they depend on each other.

Loosely coupled compute-communication tasks
The result of a communication or compute task is not used immediately by the other. Dependencies exist between them, but other independent compute or communication tasks can be scheduled and executed in between. As shown in Figure 2, communication 1 depends on compute 1-2 and compute 4, but not on compute 1-1, 2-1, 2-2, or 3. Communication 2 depends on compute 2-2 and compute 4, but not on compute 2-1 or 3. Both communications therefore have ample room in the pipeline and, as shown in Figure 3, can be hidden behind compute tasks that are independent of them. In the model, such independent communication and compute tasks can run in parallel at the task level without operator fusion, so loosely coupled compute-communication tasks are not candidates for MC².
Figure 2 Loosely coupled compute-communication tasks

Figure 3 Scheduling simulation of loosely coupled compute-communication tasks
Tightly coupled compute-communication tasks
The result of a communication or compute task is used immediately by the other, so the two are closely dependent. As shown in the following figures, the compute and communication tasks must execute serially, and the hardware compute resources sit idle while communication 1 and communication 2 run. If a model contains many such patterns, compute utilization is low and communication becomes the main performance bottleneck. Tightly coupled compute-communication tasks are suitable for fusion into an MC² operator, which uses the MC² technology to improve performance.
Figure 4 Tightly coupled compute-communication tasks

Figure 5 Scheduling simulation of tightly coupled compute-communication tasks

Whether the MC² technology helps depends heavily on the structure of the network model. In general, tightly coupled compute-communication tasks of the kind described above are the ones that can potentially gain performance from MC² operators.
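
The difference between the two task types can be reproduced with a small scheduling simulation. The following is a minimal sketch under stated assumptions, not the scheduler used on the device: the Task class, the makespan function, and all task names and durations are hypothetical. It assumes one compute stream and one communication stream, each running its tasks in list order, and a task starts once its stream is free and all of its dependencies have finished.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    stream: str            # "compute" or "comm"
    duration: float
    deps: list = field(default_factory=list)


def makespan(tasks):
    """Return the total duration; tasks must be listed in dependency order."""
    finish = {}
    stream_free = {"compute": 0.0, "comm": 0.0}
    for task in tasks:
        # A task waits for both its dependencies and its own stream.
        ready = max((finish[d] for d in task.deps), default=0.0)
        start = max(ready, stream_free[task.stream])
        finish[task.name] = start + task.duration
        stream_free[task.stream] = finish[task.name]
    return max(finish.values())


# Tightly coupled: every task consumes the result of the previous one.
tight = [
    Task("compute_1", "compute", 4.0),
    Task("comm_1", "comm", 2.0, ["compute_1"]),
    Task("compute_2", "compute", 4.0, ["comm_1"]),
    Task("comm_2", "comm", 2.0, ["compute_2"]),
]

# Loosely coupled: compute_b is independent of comm_1 and can hide it.
loose = [
    Task("compute_a", "compute", 4.0),
    Task("compute_b", "compute", 4.0),
    Task("comm_1", "comm", 2.0, ["compute_a"]),
    Task("compute_c", "compute", 4.0, ["comm_1", "compute_b"]),
]

if __name__ == "__main__":
    for label, tasks in (("tight", tight), ("loose", loose)):
        total = sum(t.duration for t in tasks)
        print(label, "makespan:", makespan(tasks), "sum of durations:", total)
```

In the tightly coupled chain the makespan equals the sum of all task durations, so the compute stream is idle during every communication. In the loosely coupled graph the communication finishes while an independent compute task is still running, so it adds nothing to the makespan even without fusion.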
