Basics
This section describes the theoretical background of MC2 operators and how to develop them. Before reading this section, ensure that you are familiar with cube programming and the HCCL User Guide.
MC2 operators generally support the following product models:
MC2 Operators
"MC2" is short for "Matrix Computation & Communication". Unlike common computing or data movement operators, an MC2 operator fuses communication and computing operations that would otherwise run serially. By tiling the data within the operator, MC2 operators enable parallel execution of computing and communication tasks, which improves operator performance.
As shown in the following figure, when a communication operator and a computing operator run serially, the ideal total execution duration is the sum of the execution durations of the two operators. An MC2 operator that fuses the communication and computing tasks tiles the data required for communication and computation, reducing the data volume per batch. Because the overall communication and computing work is split into multiple batches, the computation of one batch can run in parallel with the communication of another. This shortens the theoretical execution duration and delivers performance benefits.
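The effect of tiling on the theoretical duration can be illustrated with a minimal timing model. The sketch below is not part of any MC2 API; the function names `serial_time` and `pipelined_time` are hypothetical, and the model assumes equal-sized tiles, no per-tile launch overhead, and that the computation of a tile depends on the communication of the same tile (the opposite order would give a symmetric result).

```python
def serial_time(comm: float, compute: float) -> float:
    """Unfused case: the computing operator starts only after communication ends."""
    return comm + compute


def pipelined_time(comm: float, compute: float, tiles: int) -> float:
    """Fused case (simplified): communication of tile i+1 overlaps computation of tile i.

    Assumptions: equal-sized tiles, zero per-tile overhead, and each tile's
    computation waits for that tile's communication to finish.
    """
    if tiles < 1:
        raise ValueError("tiles must be >= 1")
    comm_tile = comm / tiles
    compute_tile = compute / tiles
    comm_done = 0.0      # time at which the communication stream becomes free
    compute_done = 0.0   # time at which the computing stream becomes free
    for _ in range(tiles):
        comm_done += comm_tile
        # A tile is computed once its data has arrived and the previous tile is done.
        compute_done = max(compute_done, comm_done) + compute_tile
    return compute_done


if __name__ == "__main__":
    print(serial_time(4.0, 4.0))        # 8.0
    print(pipelined_time(4.0, 4.0, 4))  # 5.0
```

In this idealized model, only the communication of the first tile and the computation of the last tile remain exposed. Real operators also pay a per-tile overhead, so increasing the number of tiles does not reduce the duration indefinitely.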

Application Scenarios and Advantages
As the model scale increases, training and inference on a single device encounter bottlenecks in computing capabilities, memory capacity, and energy efficiency. Distributed parallel computing therefore becomes an inevitable technical path. Communication and computing tasks in the distributed training and inference of LLMs can be classified into two types based on the dependency between communication and computing:
- Weak-dependency computing and communication tasks: The results of communication or computation are not immediately used by the other. Although dependencies exist between them, other independent computing or communication tasks can be scheduled and executed in between. As shown in Figure 2, communication 1 depends on computing 1-2 and computing 4, but does not depend on computing 1-1, 2-1, 2-2, or 3. Communication 2 depends on computing 2-2 and computing 4, but does not depend on computing 2-1 or 3. Both communication 1 and communication 2 therefore have ample room for pipelining. As shown in Figure 3, both can be overlapped with computing tasks that are independent of them. Such independent communication and computing tasks in the model can achieve task-level parallelism without operator fusion, so weak-dependency computing and communication tasks are not suitable for MC2.
- Strong-dependency computing and communication tasks: The results of communication or computation are immediately used by the other, indicating a close dependency between them. As shown in the following figure, the computing and communication tasks must be executed serially. Hardware computing resources remain idle during the execution of communication 1 and communication 2. If a model contains many such communication and computing patterns, compute utilization will be low and communication will become the main performance bottleneck. Strong-dependency computing and communication tasks are suitable for fusion into an MC2 operator, which uses the MC2 technology to improve performance.
Figure 4 Strong-dependency computing and communication tasks
Figure 5 Scheduling simulation of strong-dependency computing and communication tasks
The benefit of the MC2 technology is closely related to the network model structure. Generally, strong-dependency computing and communication tasks of the kind described above can potentially achieve performance improvement via MC2 operators.
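The difference between the two dependency types can be sketched with a small two-stream scheduling simulation. This is an illustrative model only: the `Task` class, the `schedule` function, and the task graphs below are hypothetical and do not reproduce the figures in this section. It assumes one computing stream and one communication stream, each running its tasks in list order.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    stream: str        # "compute" or "comm"
    duration: float
    deps: tuple = ()   # names of tasks that must finish first


def schedule(tasks):
    """Return the finish time of each task.

    A task starts when its stream is free and all of its dependencies
    have finished. Tasks must be listed in a dependency-respecting order.
    """
    stream_free = {"compute": 0.0, "comm": 0.0}
    finish = {}
    for t in tasks:
        start = max([stream_free[t.stream]] + [finish[d] for d in t.deps])
        finish[t.name] = start + t.duration
        stream_free[t.stream] = finish[t.name]
    return finish


# Strong dependency: every task consumes the result of the previous one.
strong = [
    Task("compute 1", "compute", 2.0),
    Task("comm 1", "comm", 2.0, ("compute 1",)),
    Task("compute 2", "compute", 2.0, ("comm 1",)),
    Task("comm 2", "comm", 2.0, ("compute 2",)),
    Task("compute 3", "compute", 2.0, ("comm 2",)),
]

# Weak dependency: "compute b" does not depend on "comm 1" and can overlap it.
weak = [
    Task("compute a", "compute", 2.0),
    Task("comm 1", "comm", 2.0, ("compute a",)),
    Task("compute b", "compute", 2.0),
    Task("compute c", "compute", 2.0, ("comm 1", "compute b")),
]

if __name__ == "__main__":
    for label, tasks in (("strong", strong), ("weak", weak)):
        total = sum(t.duration for t in tasks)
        makespan = max(schedule(tasks).values())
        print(f"{label}: sum of durations = {total}, makespan = {makespan}")
```

In the strong-dependency chain the makespan equals the sum of all task durations, so the computing stream is idle during every communication task. In the weak-dependency graph the scheduler alone hides the communication behind independent computation, which is why fusion is only worth considering for the strong-dependency case.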

