Coupled Mode

This section describes how to use basic APIs to perform matrix multiplication in Coupled Mode.

Programming Paradigm

The cube programming paradigm divides the operator implementation process into five basic tasks: CopyIn, Split, Compute, Aggregate, and CopyOut. CopyIn moves in data, Split performs data tiling, Compute performs compute on matrix instructions, Aggregate performs data aggregation, and CopyOut moves out data.

Figure 1 Basic task design of cube programming

The inter-task communication and the flow chart are explained below.

CopyIn task
1. Call DataCopy to copy data from GlobalTensor to LocalTensor.
2. Call EnQue to place LocalTensor in the A1/B1 queue.
Split task
1. Call DeQue to obtain LocalTensor from the A1/B1 queue.
2. Call the LoadData API to move LocalTensor from the A1/B1 queue to the A2/B2 queue.
3. Call EnQue to place the compute result of LocalTensor in the A2/B2 queue.
Compute task
1. Call DeQue to obtain LocalTensor from the A2/B2 queue.
2. Call Mmad to complete cube compute.
3. Call EnQue to place the compute result of LocalTensor in the CO1 queue.
Aggregate task
1. Call DeQue to obtain the LocalTensor from the CO1 queue.
2. Call the Ascend C API to copy the result matrix to the CO2 queue.
3. Call EnQue to place the compute result of LocalTensor in the CO2 queue.
CopyOut task
1. Call DeQue to obtain the LocalTensor from the CO2 queue.
2. Call DataCopy to copy data from the LocalTensor to the GlobalTensor.

Figure 2 Cube programming queue

Development Workflow

The following figure shows the process of implementing the cube operator based on Ascend C.

Figure 3 Cube operator implementation process

Analyze the operator: Analyze the mathematical expression, input, output, and compute logic implementation of the operator, and specify the Ascend C APIs to be called.
Define the kernel function: Define the entrypoint function of the Ascend C operator.
Implement the operator class based on the cube programming paradigm: Complete the internal implementation of the kernel function, and call the private member functions CopyIn, SplitA, SplitB, Compute, Aggregate, and CopyOut to complete the five-level pipeline operation of the cube operator.

The following uses the Matmul operator as an example. The code framework of the Matmul operator is as follows. For details about the complete code, see Mmad sample.

      
       
         
         
           #include "kernel_operator.h"

// Implement the operator class based on the programming paradigm.
class KernelMatmul {
public:
    __aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c)
    {
        // ...
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        SplitA();
        AscendC::LocalTensor<half> b1Local = inQueueB1.DeQue<half>();
        AscendC::LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
        AscendC::LocalTensor<float> c2Local = outQueueCO2.AllocTensor<float>();
        // split matrix b into 2 parts, [32, 16] and [32, 16]
        for (int i = 0; i < 2; ++i) {
            SplitB(b1Local, i);
            Compute(a2Local);
            Aggregate(c2Local, i);
        }
        inQueueB1.FreeTensor(b1Local);
        inQueueA2.FreeTensor(a2Local);
        outQueueCO2.EnQue<float>(c2Local);
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        // ...
    }
    __aicore__ inline void SplitA()
    {
        // ...
    }
    __aicore__ inline void SplitB(const LocalTensor<half>& b1Local, const int bSplitIdx)
    {
        // ...
    }
    __aicore__ inline void Compute(const LocalTensor<half>& a2Local)
    {
        // ...
    }
    __aicore__ inline void Aggregate(const LocalTensor<float>& c2Local, const int bSplitIdx)
    {
        // ...
    }
    __aicore__ inline void CopyOut()
    {
        // ...
    }
private:
    // ...

};

// Kernel function definition
extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c)
{
    KernelMatmul op;
    op.Init(a, b, c);
    op.Process();
}

          

        

      
     

Operator Analysis

Before developing the operator code, analyze the mathematical expression, input, output, and compute logic implementation of the operator, and specify the Ascend C APIs to be called.

Specify the mathematical expression and compute logic of the operator.
The Matmul operator completes a matrix multiplication, and its mathematical expression is as follows. Matrix a whose shape is [m, k] is multiplied by matrix b whose shape is [k, n], to obtain matrix c whose shape is [m, n]. For convenience, m = k = n = 32.
```
c = a x b
```
If the data to be processed is too large, the data needs to be tiled, moved to A2 and B2 by block, computed separately, and then aggregated. The following compute logic is used to show examples of the Split and Aggregate phases. Determine whether to tile and aggregate data based on the size of data to be processed.
Compute logic:
1. Move input data matrices a and b to the local memories A1 and B1 respectively.
2. Move matrix a from A1 to A2. To implement partial parallelism, matrix b is divided into part1 and part2, both of which are in the shape of [k, n/2]. After the division, matrix b is moved to B2 by block.
3. A matrix multiplication operation is separately performed on part1 and part2 of matrix a and matrix b to obtain part1 and part2 of matrix c, and shapes of both part1 and part2 are [m, n/2]. The compute result is stored in CO1.
4. Part1 and part2 of matrix c are respectively copied to CO2 for aggregation.
5. Move the aggregated output data out of CO2.
Specify the input and output.
- The Matmul operator has two inputs: a and b. The output is c.
- In this example, the data type supported by the operator input is half (float16), and the data type supported by the operator output is float32.
- Shapes of the matrices a, b, and c are all [32, 32].
- The input and output data format of the operator is ND.
Define the kernel function name and parameters.
- You can customize the kernel function name. In this example, the kernel function is named matmul_custom.
- Based on the analysis of the operator input and output, the kernel function has three parameters: a, b, and c. a and b indicate the memory address of the input in the global memory, and c indicates the memory address of the output in the global memory.
Analyze restrictions.
The hardware architecture has format restrictions on the input and output of matrix multiplication. Therefore, the format conversion process needs to be added to the operator implementation.
- When matrices a and b are moved to A1 and B1, the matrices a and b in ND format are converted to the NZ format.
- When matrix a is moved from A1 to A2, matrix a in NZ format is converted to the ZZ format. When matrix b is moved from B1 to B2, matrix b in NZ format is converted to the ZN format.
- When the compute result is moved out of CO2, the matrix c in NZ format is converted to the ND format.
- For details about the data format, see Data Format.
Specify the APIs required for operator implementation.
- To move data between the global memory and local memory, you can refer to the data movement API in the Ascend C API Reference. For details, see DataCopy.
- To convert the matrix data format, you can refer to the data conversion API in the Ascend C API Reference. For details, see LoadData.
- The matrix compute process involves matrix multiplication. You can refer to the matrix computation API in the Ascend C API Reference. For details, see Mmad.
- The tensor data structure used in computation is managed by the queue, involving calling APIs such as EnQue and DeQue.

Based on the analysis, the compute flowchart and design specifications of the Ascend C Matmul operator are as follows.

Figure 4 Computation process of the Matmul operator

**Table 1** Design specifications of the Ascend C Matmul operator
OpType	Matmul
Operator Input	Name	Shape	Data Type	Format
	a	(m, k) = (32, 32)	half	ND
	b	(k, n) = (32, 32)	half	ND
Operator Output	c	(m, n) = (32, 32)	float32	ND
Kernel Function Name	matmul_custom
Main API	DataCopy: data movement API
	LoadData: matrix data format conversion API
	Mmad: matrix multiplication API
	EnQue, DeQue, and others: queue management APIs
Operator Implementation File	matmul_custom.cpp

Kernel Function Definition

Define kernel functions based on the rules described in Kernel Function.

Define the function prototype.

In this example, the function name is matmul_custom (the kernel function name can be customized). Based on the analysis of the operator input and output in Operator Analysis, there are three parameters a, b, and c, where a and b indicate the input memory and c indicates the output memory. According to the rules of kernel functions described in Kernel Function, the function prototype is defined as follows: The __global__ function type qualifier is used to identify a kernel function that can be called by <<<>>>. The __aicore__ function type qualifier is used to identify that the kernel function is executed on the AI Core on the device. For convenience, the GM_ADDR macro is used to modify input parameters. For details about the definition of the GM_ADDR macro, see Kernel Function.

        
             extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c)
{
}

Call the Init and Process functions of the operator class.

The Init function of the operator class completes memory initialization, and the Process function completes the core logic of operator implementation. For details, see Operator Class Implementation.

         
              extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c)
{
    KernelMatmul op;
    op.Init(a, b, c);
    op.Process();
}

Encapsulate the kernel function to obtain the matmul_custom_do function for the main program to call. #ifndef ASCENDC_CPU_DEBUG indicates that the encapsulated function is used only when operators on the NPU are built and run. When operators on the CPU are built and run, the matmul_custom function can be directly called. According to Defining and Calling Kernel Functions, when the kernel function is called, in addition to a, b, and c parameters, blockDim (number of cores executed by the kernel function), l2ctrl (reserved parameter, set to nullptr), and stream (stream for maintaining the execution sequence of asynchronous operations in the application) also need to be input, to specify the execution configuration of the kernel function.

        
             #ifndef ASCENDC_CPU_DEBUG
// call of kernel function
void matmul_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* a, uint8_t* b, uint8_t* c)
{
    matmul_custom<<<blockDim, l2ctrl, stream>>>(a, b, c);
}
#endif

Operator Class Implementation

The kernel function calls the Init and Process functions of the operator class. This section describes how to implement the operator class based on the programming paradigm. For details about the cube programming paradigm, see Programming Paradigm.

The operator class contains the initialization function Init, core processing function Process, and some private members used in implementation. The KernelMatmul operator class is defined as follows:

       
        
          
          
            class KernelMatmul {
public:
    __aicore__ inline KernelMatmul(){}
    // Initialization function, which is used to initialize memory.
    __aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c){}
    // Core processing function, which implements the operator logic.
    // Call the private member functions CopyIn, SplitA, SplitB, Compute, Aggregate and CopyOut to complete the five-level pipeline operation of the cube operator.
    __aicore__ inline void Process(){}

private:
    __aicore__ inline void CopyND2NZ(const LocalTensor<half>& dst, const GlobalTensor<half>& src, const uint16_t height, const uint16_t width){}
    // Move-in function, which completes the processing in the CopyIn phase in the programming paradigm. This function is called by the Process function.
    __aicore__ inline void CopyIn(){}
    // Move-in function, which completes the processing in the Split phase in the programming paradigm. This function is called by the Process function.
    __aicore__ inline void SplitA(){}
    // Move-in function, which completes the processing in the Split phase in the programming paradigm. It is called by the Process function for twice cyclically to move the two parts of matrix b.
    __aicore__ inline void SplitB(const LocalTensor<half>& b1Local, const int bSplitIdx){}
    // Compute function, which completes the processing in the Compute phase in the programming paradigm. It is called by the Process function for twice cyclically to compute the two parts of matrix c.
    __aicore__ inline void Compute(const LocalTensor<half>& a2Local){}
    // Move-out function, which completes the processing in the Aggregate phase in the programming paradigm. It is called by the Process function for twice cyclically to move out the two parts of matrix c.
    __aicore__ inline void Aggregate(const LocalTensor<float>& c2Local, const int bSplitIdx){}
    // Move-out function, which completes the processing in the CopyOut phase in the programming paradigm. This function is called by the Process function.
    __aicore__ inline void CopyOut(){}

private:
    AscendC::TPipe pipe;  // Pipe memory management object, which manages the memory of the queue.
    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;  // Input data queue. The value of TPosition is A1.
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;  // Input data queue. The value of TPosition is A2.
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;  // Input data queue. The value of TPosition is B1.
    AscendC::TQue<AscendC::TPosition::B2, 2> inQueueB2;  // Input data queue. The value of TPosition is B2.
    AscendC::TQue<AscendC::TPosition::CO1, 2> outQueueCO1;  // Output data queue. The value of TPosition is CO1.
    AscendC::TQue<AscendC::TPosition::CO2, 1> outQueueCO2;  // Output data queue. The value of TPosition is CO2.
    // Object for managing the global memory addresses of the input and output data. aGM and bGM are input, and cGM is output.
    AscendC::GlobalTensor<half> aGM, bGM;
    AscendC::GlobalTensor<float> cGM;

    uint16_t m = 32;
    uint16_t n = 32;
    uint16_t k = 32;
    uint16_t aSize, bSize, cSize, mBlocks, nBlocks, kBlocks;
};

           

         

       
      

KernelMatmul Constructor Implementation

Initialize private member variables in the constructor function. The code is as follows:

      
           __aicore__ inline KernelMatmul()
{
    aSize = m * k;
    bSize = k * n;
    cSize = m * n;
    mBlocks = m / 16;
    nBlocks = n / 16;
    kBlocks = k / 16;
}

The shape of matrix a is [m, k], the shape of matrix b is [k, n], and the shape of matrix c is [m, n]. In this example, m, n, and k are all set to 32.

aSize, bSize, and cSize are the numbers of values of matrices a, b, and c, respectively.

mBlocks, nBlocks, and kBlocks are the number of fractals occupied by m, n, and k. For the half type, the shape of a fractal is 16 x 16. The formula for computing blocks is as follows:

mBlocks = m / 16
nBlocks = n / 16
kBlocks = k / 16

For details about fractals, see Data Layout Formats.

Init Function Implementation

The Init function performs the following operations:

Set the global memory address of the input and output global tensors.

For example, to set the memory offset address of input a in the global memory, run the following command:

         
              aGM.SetGlobalBuffer((__gm__ half*)a);

In this example, the input parameter of the Init function is set to uint8_t*. Therefore, you need to forcibly convert the input parameter to a specific data type (__gm__ half*) and then perform offset.

Allocate memory for the input and output queues through TPipe.

For example, allocating memory to the inQueueB2 queue can be implemented by using the following code snippet:

        
             pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half) / 2);

In this example, matrix b is divided into two parts. When allocating memory to inQueueB2, you need to allocate two blocks of memory. The size of each memory block is half of the size of matrix b. The memory initialization of outQueueCO1 is similar.

The initialization function code is as follows:

      
           __aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c)
{
    aGM.SetGlobalBuffer((__gm__ half*)a);
    bGM.SetGlobalBuffer((__gm__ half*)b);
    cGM.SetGlobalBuffer((__gm__ float*)c);
    pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
    pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
    pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
    pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half) / 2);
    pipe.InitBuffer(outQueueCO1, 2, cSize * sizeof(float) / 2);
    pipe.InitBuffer(outQueueCO2, 1, cSize * sizeof(float));
}

Process Function Implementation

Based on the cube programming paradigm, the implementation of kernel functions is divided into five basic phases: CopyIn, Split, Compute, Aggregate, and CopyOut. Matrix a and matrix b need to be distinguished in the Split, Compute, and Aggregate phases. These functions are called in the Process function as follows:

      
       
         
         
           __aicore__ inline void Process()
{
    CopyIn();
    SplitA();
    AscendC::LocalTensor<half> b1Local = inQueueB1.DeQue<half>();
    AscendC::LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
    AscendC::LocalTensor<float> c2Local = outQueueCO2.AllocTensor<float>();
    // split matrix b into 2 parts, [32, 16] and [32, 16]
    for (int i = 0; i < 2; ++i) {
        SplitB(b1Local, i);
        Compute(a2Local);
        Aggregate(c2Local, i);
    }
    inQueueB1.FreeTensor(b1Local);
    inQueueA2.FreeTensor(a2Local);
    outQueueCO2.EnQue<float>(c2Local);
    CopyOut();
}

          

        

      
     

In the two cycles, SplitB moves the two parts of matrix b from inQueueB1, Compute computes the multiplication of matrix a with the two parts of matrix b, and Aggregate outputs the two parts of matrix c. The following figure shows the data flow in the five phases.

Figure 5 Data flow

Matrix b can be split to implement partial parallelism. The following figure shows the pipeline parallelism in this example.

Figure 6 Parallelism

Implement the CopyIn function (stage 1).

Call AllocTensor to allocate a1Local and b1Local from the queues of A1 and B1.
Call DataCopy to move matrices a and b to the local memory, and convert the data format from ND to NZ.
The DataCopy command can move height x 16 elements at a time, so you need to run the command repeatedly for width/16 times. Set the parameters of DataCopy as follows:
- If blockCount is set to height, the total number of movement times is the height.
- If blockLen is set to 1, 16 half-type numbers are moved at a time.
- If srcStride is set to width/16 – 1, one row needs to be skipped each time a block is moved in the source matrix.
- If dstStride is set to 0, each block of the destination matrix is stored continuously in the memory.
- In each iteration, the start address of the source matrix is shifted by 16 elements, and the start address of the destination matrix is shifted by 16 x height elements.
The following figure shows the format conversion. The blue part is moved in the first cycle, and the green part is moved in the second cycle. In the figure, the width is 32, occupying two fractals, and the height is 32, occupying two fractals. A total of four 16 x 16 fractals are moved.

Figure 7 ND to NZ conversion

Note: The preceding format conversion from ND to NZ is only an example. You can select a proper conversion mode as required.
Call EnQue to place a1Local and b1Local in the queues of A1 and B1 respectively.

The code is as follows:

        
         
           
           
             __aicore__ inline void CopyND2NZ(const LocalTensor<half>& dst, const GlobalTensor<half>& src, const uint16_t height, const uint16_t width)
{
    for (int i = 0; i < width / 16; ++i) {
        int srcOffset = i * 16;
        int dstOffset = i * 16 * height;
        AscendC::DataCopy(dst[dstOffset], src[srcOffset], { height, 1, uint16_t(width / 16 - 1), 0 });
    }
}
__aicore__ inline void CopyIn()
{
    AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
    AscendC::LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
    CopyND2NZ(a1Local, aGM, m, k);
    CopyND2NZ(b1Local, bGM, k, n);
    inQueueA1.EnQue(a1Local);
    inQueueB1.EnQue(b1Local);
}

            

          

        
       

Implement the SplitA function (stage 2).

Call DeQue to obtain a1Local from the A1 queue.
Call AllocTensor to allocate a2Local from the A2 queue.
Call LoadData to move matrix a to A2 and convert matrix a from the NZ format to the ZZ format.
The following figure shows the movement and format conversion. In the figure, k is 32, occupying kBlocks (k/16 = 2) fractals. m is 32, occupying mBlocks (m/16 = 2) fractals. A total of four 16 x 16 fractals are moved. In this example, the LoadData API is called once to move two 16 x 16 fractals, and the LoadData API is called twice. The first call moves two fractals in the blue part, and the second call moves two fractals in the green part.
Set the parameters of LoadData in a single iteration as follows (in this example, two fractals need to be moved, namely the blue part or the green part):
- repeatTimes indicates the number of iterations for data processing. Because LoadData processes a fractal in each iteration, repeatTimes can also be understood as the number of fractals to be moved. In this example, it indicates the number of fractals in the k-axis direction. If this parameter is set to kBlocks, kBlocks fractals are moved.
- srcStride indicates the interval between the start addresses of the fractals of the source operands between adjacent iterations. The following uses the fractals in blue as an example: In the source operand matrix on the left of the following figure, the interval between the start addresses of the first and second blue fractals is mBlocks fractals in this example.
- The default value of dstGap is used, and the two fractals of the destination matrix are stored continuously.
- ifTranspose is set to false, each block is in Z format before and after fractal movement. In this case, the transpose function needs to be disabled.
- For each iteration, the start address offset of the source matrix is 16 x 16, and the start address offset of the destination matrix is 16 x k.
Figure 8 NZ to ZZ conversion

Call EnQue to place the compute result of a2Local in the A2 queue.

The code is as follows:

          
                   __aicore__ inline void SplitA()
    {
        int srcOffset = 0;
        int dstOffset = 0;
        AscendC::LocalTensor<half> a1Local = inQueueA1.DeQue<half>();
        AscendC::LocalTensor<half> a2Local = inQueueA2.AllocTensor<half>();

        // transform nz to zz
        for (int i = 0; i < mBlocks; ++i) {
            AscendC::LoadData2DParams loadDataParams;
            loadDataParams.repeatTimes = kBlocks;
            loadDataParams.srcStride = mBlocks;
            loadDataParams.ifTranspose = false;

            AscendC::LoadData(a2Local[dstOffset], a1Local[srcOffset], loadDataParams);

            srcOffset += 16 * 16;
            dstOffset += k * 16;
        }

        inQueueA2.EnQue<half>(a2Local);
        inQueueA1.FreeTensor(a1Local);
    }

Implement the SplitB function (stage 2).

Two parameters need to be input for SplitB: b1Local obtained from the B1 queue by calling DeQue and the iteration variable index.
Call AllocTensor to allocate b2Local from the B2 queue.
Call LoadData to move matrix b to B2 and convert the format from NZ to ZN.
The following figure shows the movement and format conversion. In the figure, k is 32, occupying kBlocks (k/16 = 2) fractals. n is 32, occupying nBlocks (n/16 = 2) fractals. A total of four 16 x 16 fractals are moved. In this example, the LoadData API is called once to move two 16 x 16 fractals, and the LoadData API is called twice. The first call moves two fractals in the blue part, and the second call moves two fractals in the green part.
Set the parameters of LoadData in a single iteration as follows (in this example, two fractals need to be moved, namely the blue part or the green part):
- repeatTimes indicates the number of iterations for data processing. Because LoadData processes a fractal in each iteration, repeatTimes can also be understood as the number of fractals to be moved. In this example, it indicates the number of fractals in the k-axis direction. If this parameter is set to kBlocks, kBlocks fractals are moved.
- srcStride indicates the interval between the start addresses of the fractals of the source operands between adjacent iterations. The following uses the fractals in blue as an example: In the source operand matrix on the left of the following figure, the interval between the start addresses of the first and second blue fractals is one fractal in this example.
- dstGap is set to the default value 0, and the two fractals of the destination matrix are stored continuously.
- ifTranspose is set to true, each block is in Z format before fractal movement and needs to be in N format after movement. In this case, the transpose function needs to be enabled.
- For each iteration, the start address of the source matrix needs to be offset by k x n/2.
Figure 9 NZ to ZN conversion
Call EnQue to place the compute result of b2Local in the B2 queue.

The code is as follows:

        
                 __aicore__ inline void SplitB(const AscendC::LocalTensor<half>& b1Local, const int bSplitIdx)
    {
        AscendC::LocalTensor<half> b2Local = inQueueB2.AllocTensor<half>();

        // transform nz to zn
        AscendC::LoadData2DParams loadDataParams;
        loadDataParams.repeatTimes = kBlocks;
        loadDataParams.srcStride = 1;
        loadDataParams.ifTranspose = true;

        AscendC::LoadData(b2Local, b1Local[bSplitIdx * bSize / 2], loadDataParams);

        inQueueB2.EnQue<half>(b2Local);
    }

Implement the Compute function to compute the core cube (stage 3).

The Compute function needs to pass the parameter a2Local, which is obtained from the A2 queue by calling DeQue.
Call AllocTensor to allocate c1Local from the CO1 queue.
Call DeQue to obtain b2Local from B2.
Call Mmad to perform matrix multiplication.
Call EnQue to place the compute result of c1Local in the CO1 queue.

The code is as follows:

        
                 __aicore__ inline void Compute(const AscendC::LocalTensor<half>& a2Local)
    {
        AscendC::LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
        AscendC::LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();

        AscendC::MmadParams mmadParams;
        mmadParams.m = m;
        mmadParams.n = n / 2;
        mmadParams.k = k;
        AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams);

        outQueueCO1.EnQue<float>(c1Local);
        inQueueB2.FreeTensor(b2Local);
    }

Implement the Aggregate function to aggregate data (stage 4).

Two parameters need to be input for Aggregate: c2Local obtained from the CO2 queue by calling AllocTensor and the iteration variable index.
Call DeQue to obtain the c1Local from the CO1 queue.
Call DataCopy to move the result matrix from CO1 to CO2.
Set the parameters of DataCopy as follows:
- If blockCount is set to 1 and blockLen is set to 2, two fractals are moved consecutively without format conversion.
- If blockMode is set to BlockMode::BLOCK_MODE_MATRIX, data needs to be moved by fractal.
- The offset of the c2Local start address is set to index * cSize / 2.

The code is as follows:

        
                 __aicore__ inline void Aggregate(const AscendC::LocalTensor<float>& c2Local, const int bSplitIdx)
    {
        AscendC::LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();

        AscendC::DataCopyParams dataCopyParams;
        dataCopyParams.blockCount = 1;
        dataCopyParams.blockLen = 2;
        AscendC::DataCopyEnhancedParams enhancedParams;
        enhancedParams.blockMode = AscendC::BlockMode::BLOCK_MODE_MATRIX;
        AscendC::DataCopy(c2Local[bSplitIdx * cSize / 2], c1Local, dataCopyParams, enhancedParams);

        outQueueCO1.FreeTensor(c1Local);
    }

Implement the CopyOut function (stage 5).

Call DeQue to obtain c2Local from CO2.
Call DataCopy to move the result matrix from CO2 to the global memory, and convert the format from NZ to ND.
Move one fractal at a time. The number of data elements to be moved is m x 16. Set the parameters of DataCopy as follows:
- If blockCount is set to m, the total number of movement times is the m.
- If blockLen is set to 2, 16 data elements are moved at a time by running the DataCopy command. There are two blocks in total.
- If srcStride is set to 0, there is no gap between two moves.
- If dstStride is set to (nBlocks - 1) * 2, two blocks are the interval between two moves.
- In each iteration, the destination matrix is offset by 16, and the source matrix is offset by m x 16.
The following figure shows the format conversion. The data in blue is moved in the first iteration, and the data in green is moved in the second iteration.

Figure 10 NZ to ND conversion

The code is as follows:

        
                 __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> c2Local = outQueueCO2.DeQue<float>();

        // transform nz to nd
        for (int i = 0; i < nBlocks; ++i) {
            AscendC::DataCopy(cGM[i * 16], c2Local[i * m * 16], { m, 2, 0, uint16_t((nBlocks - 1) * 2) });
        }

        outQueueCO2.FreeTensor(c2Local);
    }

Parent topic: Cube Programming (Basic APIs)