Coupled Architecture

This section describes how to use basic APIs to perform matrix multiplication in the coupled architecture.

Programming Paradigm

The cube programming paradigm divides the operator implementation process into five basic tasks: CopyIn, Split, Compute, Aggregate, and CopyOut. CopyIn moves data in, Split tiles the data, Compute performs the computation with matrix instructions, Aggregate gathers the partial results, and CopyOut moves data out.

Figure 1 Basic task design of cube programming

The inter-task communication and the processing flow are explained below; a minimal sketch of the queue handshake follows the list.

  1. CopyIn task
    1. Call DataCopy to copy data from GlobalTensor to LocalTensor.
    2. Call EnQue to place LocalTensor in the A1/B1 queue.
  2. Split task
    1. Call DeQue to obtain the LocalTensor from the A1/B1 queue.
    2. Call the Ascend C API to transfer the LocalTensor from A1/B1 to A2/B2.
    3. Call EnQue to place the split LocalTensor in the A2/B2 queue.
  3. Compute task
    1. Call DeQue to obtain the LocalTensor from the A2/B2 queue.
    2. Call the Ascend C API to compute the matrix.
    3. Call EnQue to place the computation result (LocalTensor) in the CO1 queue.
  4. Aggregate task
    1. Call DeQue to obtain the LocalTensor from the CO1 queue.
    2. Call the Ascend C API to copy the result matrix from CO1 to the CO2 buffer.
    3. Call EnQue to place the aggregated LocalTensor in the CO2 queue.
  5. CopyOut task
    1. Call DeQue to obtain the LocalTensor from the CO2 queue.
    2. Call DataCopy to copy data from the LocalTensor to the GlobalTensor.
Figure 2 Cube programming queue
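
The following minimal sketch shows the queue handshake that connects adjacent tasks: the producer allocates a tile, fills it, and enqueues it; the consumer dequeues, uses, and frees it. The queue and tensor names anticipate the Matmul example developed later in this section, and the DataCopy call is illustrative only.

// Producer side (for example, CopyIn): allocate, fill, enqueue.
AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
AscendC::DataCopy(a1Local, aGM, aSize);  // fill the tile from Global Memory
inQueueA1.EnQue(a1Local);                // publish the tile to the next task

// Consumer side (for example, Split): dequeue, use, free.
AscendC::LocalTensor<half> a1Work = inQueueA1.DeQue<half>();
// ... process a1Work (for example, LoadData into A2) ...
inQueueA1.FreeTensor(a1Work);            // return the buffer to the queue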

Development Process

The following figure shows the process of implementing cube operators based on Ascend C.

Figure 3 Cube operator implementation process
  • Analyze the operator: Analyze the mathematical expression, input, output, and computation logic implementation of operators, and specify the Ascend C APIs to be called.
  • Define the kernel function: Define the entry point function of the Ascend C operator.
  • Implement the operator class based on the cube programming paradigm: Complete the internal implementation of the kernel function, and call the private member functions CopyIn, SplitA, SplitB, Compute, Aggregate, and CopyOut to complete the five-level pipeline operation of the cube operator.

The following uses the Matmul operator as an example. The code framework of the Matmul operator is as follows. For details about the complete code, see Mmad sample.

#include "kernel_operator.h"

// Implement the operator class based on the programming paradigm.
class KernelMatmul {
public:
    __aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c)
    {
        // ...
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        SplitA();
        AscendC::LocalTensor<half> b1Local = inQueueB1.DeQue<half>();
        AscendC::LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
        AscendC::LocalTensor<float> c2Local = outQueueCO2.AllocTensor<float>();
        // split matrix b into 2 parts, [32, 16] and [32, 16]
        for (int i = 0; i < 2; ++i) {
            SplitB(b1Local, i);
            Compute(a2Local);
            Aggregate(c2Local, i);
        }
        inQueueB1.FreeTensor(b1Local);
        inQueueA2.FreeTensor(a2Local);
        outQueueCO2.EnQue<float>(c2Local);
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        // ...
    }
    __aicore__ inline void SplitA()
    {
        // ...
    }
    __aicore__ inline void SplitB(const AscendC::LocalTensor<half>& b1Local, const int bSplitIdx)
    {
        // ...
    }
    __aicore__ inline void Compute(const AscendC::LocalTensor<half>& a2Local)
    {
        // ...
    }
    __aicore__ inline void Aggregate(const AscendC::LocalTensor<float>& c2Local, const int bSplitIdx)
    {
        // ...
    }
    __aicore__ inline void CopyOut()
    {
        // ...
    }
private:
    // ...

};

// Kernel function definition
extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c)
{
    KernelMatmul op;
    op.Init(a, b, c);
    op.Process();
}

Operator Analysis

Before developing the operator code, analyze the mathematical expression, input, output, and compute logic implementation of the operator, and specify the Ascend C APIs to be called.

  1. Specify the mathematical expression and computation logic of an operator.
    The Matmul operator performs a matrix multiplication, and its mathematical expression is as follows. Matrix a of shape [m, k] is multiplied by matrix b of shape [k, n] to obtain matrix c of shape [m, n]. For convenience, m = k = n = 32 in this sample. (A plain C++ reference implementation is sketched after this list.)
    c = a * b

    If the data to be processed is too large to fit in A2 and B2 at once, it must be divided, transferred to A2 and B2 block by block, computed separately, and then aggregated. The following computation logic illustrates the Split and Aggregate phases. Decide whether to split and aggregate based on the size of the data to be processed.

    Computation logic:
    1. Transfer input data matrices a and b to the local memories A1 and B1 respectively.
    2. Move matrix a from A1 to A2. Divide matrix b into part1 and part2, each of shape [k, n/2], and transfer it to B2 one part at a time.
    3. Multiply matrix a separately by part1 and part2 of matrix b to obtain part1 and part2 of matrix c, each of shape [m, n/2]. Store the computation results in CO1.
    4. Copy part1 and part2 of matrix c to CO2 for aggregation.
    5. Move the aggregated output data out of CO2.
  2. Specify the input and output.
    • The Matmul operator has two inputs, a and b, and one output, c.
    • In this sample, the operator input data type is half (float16), and the operator output data type is float32.
    • The shapes of matrices a, b, and c are all [32, 32].
    • The input and output data format of the operator is ND.
  3. Define the kernel function name and parameters.
    • You can customize the kernel function name. In this example, the kernel function is named matmul_custom.
    • Based on the analysis of the operator input and output, the kernel function has three parameters: a, b, and c. a and b are the memory addresses of the inputs in the Global Memory, and c is the memory address of the output in the Global Memory.
  4. Analyze constraints.

    The hardware architecture has format constraints on the input and output of matrix multiplication. Therefore, the format conversion process needs to be added to the operator implementation.

    • When matrices a and b are transferred to A1 and B1, the matrices a and b in ND format are converted to the NZ format.
    • When matrix a is transferred from A1 to A2, matrix a in the NZ format is converted to the ZZ format. When matrix b is transferred from B1 to B2, matrix b in the NZ format is converted to the ZN format.
    • When the computation result is moved out of CO2, the matrix c in NZ format is converted to the ND format.
    • For details about the data layout format, see Format.
  5. Specify the APIs required for operator implementation.
    • Transfer data between the Global Memory and the Local Memory using the data transfer APIs in the Ascend C API Reference; see DataCopy.
    • Convert the matrix data format using the data conversion APIs in the Ascend C API Reference; see LoadData.
    • Perform the matrix multiplication using the matrix computation APIs in the Ascend C API Reference; see Mmad.
    • The tensor data structure used in computation is managed by the queue, involving calling APIs such as EnQue and DeQue.
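
For reference, the mathematical expression analyzed in step 1 corresponds to the following plain C++ loop nest. This is a host-side sketch for checking results against the kernel output, not Ascend C code; the inputs are represented as float because standard C++ has no half type, while the kernel itself takes half inputs and produces float32 output.

// Host-side reference: c[m][n] = a[m][k] * b[k][n], accumulated in float.
void MatmulReference(const float* a, const float* b, float* c, int m, int n, int k)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}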

Based on the analysis, the computation flowchart and design specifications of the Ascend C Matmul operator are as follows.

Figure 4 Computation process of the Matmul operator
Table 1 Design specifications of the Ascend C Matmul operator

OpType: Matmul

Operator inputs:
  • a: shape (m, k) = (32, 32), data type half, format ND
  • b: shape (k, n) = (32, 32), data type half, format ND

Operator output:
  • c: shape (m, n) = (32, 32), data type float32, format ND

Kernel function name: matmul_custom

Main APIs:
  • DataCopy: data transfer API
  • LoadData: matrix data format conversion API
  • Mmad: matrix multiplication API
  • EnQue, DeQue, and others: queue management APIs

Operator implementation file: matmul_custom.cpp

Kernel Function Definition

Define kernel functions based on the rules described in Kernel Function.

  1. Define the function prototype.

    In this sample, the function name is matmul_custom (the kernel function name can be customized). Based on the analysis of the operator input and output in Operator Analysis, the function has three parameters: a, b, and c, where a and b indicate the input buffers and c indicates the output buffer. According to the rules described in Kernel Function, the function prototype is defined as follows. The __global__ function type qualifier identifies a kernel function that can be called with <<<>>>; the __aicore__ function type qualifier indicates that the kernel function is executed on the AI Core of the device. For convenience, the GM_ADDR macro is used for the input parameters; for its definition, see Kernel Function.

    extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c)
    {
    }
  2. Call the Init and Process functions of the operator class.
    The Init function of the operator class completes memory initialization, and the Process function completes the core logic of operator implementation. For details, see Operator Class Implementation.
    extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c)
    {
        KernelMatmul op;
        op.Init(a, b, c);
        op.Process();
    }
  3. Encapsulate the kernel function as matmul_custom_do for the main program to call. The #ifndef ASCENDC_CPU_DEBUG guard means that the wrapper is compiled only when the operator is built and run on the NPU; when the operator is built and run on the CPU, the matmul_custom function can be called directly. According to Defining and Calling Kernel Functions, calling the kernel function requires, in addition to the a, b, and c parameters, blockDim (the number of cores that execute the kernel function), l2ctrl (a reserved parameter, set to nullptr), and stream (the stream that maintains the execution order of asynchronous operations in the application) to specify the execution configuration. A hypothetical host-side launch is sketched after the code.
    #ifndef ASCENDC_CPU_DEBUG
    // call of kernel function
    void matmul_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* a, uint8_t* b, uint8_t* c)
    {
        matmul_custom<<<blockDim, l2ctrl, stream>>>(a, b, c);
    }
    #endif
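
    A minimal host-side launch might look as follows. This is a sketch only: the device buffers aDevice, bDevice, and cDevice and the AscendCL stream-management calls (aclrtCreateStream, aclrtSynchronizeStream) are assumptions about the surrounding host code, not part of this sample.

    // Hypothetical host-side call (NPU build); buffer allocation and error handling are omitted.
    uint32_t blockDim = 1;                // sketch assumes a single core
    aclrtStream stream = nullptr;
    aclrtCreateStream(&stream);           // AscendCL runtime call, assumed available
    matmul_custom_do(blockDim, nullptr, stream, aDevice, bDevice, cDevice);
    aclrtSynchronizeStream(stream);       // wait for the kernel to finish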

Operator Class Implementation

The kernel function calls the Init and Process functions of the operator class. This section describes the implementation of the operator class based on the programming paradigm. For details about the cube programming paradigm, see Programming Paradigm.

The operator class contains the initialization function Init, core processing function Process, and some private members used in implementation. The KernelMatmul operator class is defined as follows:
class KernelMatmul {
public:
    __aicore__ inline KernelMatmul(){}
    // Initialization function, which is used to initialize memory.
    __aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c){}
    // Core processing function, which implements the operator logic.
    // Call the private member functions CopyIn, SplitA, SplitB, Compute, Aggregate and CopyOut to complete the five-level pipeline operation of the cube operator.
    __aicore__ inline void Process(){}

private:
    __aicore__ inline void CopyND2NZ(const AscendC::LocalTensor<half>& dst, const AscendC::GlobalTensor<half>& src, const uint16_t height, const uint16_t width){}
    // Move-in function, which completes the CopyIn phase of the programming paradigm. Called once by the Process function.
    __aicore__ inline void CopyIn(){}
    // Move-in function, which completes the Split phase of the programming paradigm for matrix a. Called once by the Process function.
    __aicore__ inline void SplitA(){}
    // Move-in function, which completes the Split phase of the programming paradigm for matrix b. Called twice in a loop by the Process function to transfer the two parts of matrix b.
    __aicore__ inline void SplitB(const AscendC::LocalTensor<half>& b1Local, const int bSplitIdx){}
    // Compute function, which completes the Compute phase of the programming paradigm. Called twice in a loop by the Process function to multiply matrix a with the two parts of matrix b.
    __aicore__ inline void Compute(const AscendC::LocalTensor<half>& a2Local){}
    // Move-out function, which completes the Aggregate phase of the programming paradigm. Called twice in a loop by the Process function to move out the two parts of matrix c.
    __aicore__ inline void Aggregate(const AscendC::LocalTensor<float>& c2Local, const int bSplitIdx){}
    // Move-out function, which completes the CopyOut phase of the programming paradigm. Called once by the Process function.
    __aicore__ inline void CopyOut(){}

private:
    AscendC::TPipe pipe;  // Pipe memory management object, which manages the memory of the queue.
    AscendC::TQue<AscendC::QuePosition::A1, 1> inQueueA1;  // Input data queue. The value of QuePosition is A1.
    AscendC::TQue<AscendC::QuePosition::A2, 1> inQueueA2;  // Input data queue. The value of QuePosition is A2.
    AscendC::TQue<AscendC::QuePosition::B1, 1> inQueueB1;  // Input data queue. The value of QuePosition is B1.
    AscendC::TQue<AscendC::QuePosition::B2, 2> inQueueB2;  // Input data queue. The value of QuePosition is B2.
    AscendC::TQue<AscendC::QuePosition::CO1, 2> outQueueCO1;  // Output data queue. The value of QuePosition is CO1.
    AscendC::TQue<AscendC::QuePosition::CO2, 1> outQueueCO2;  // Output data queue. The value of QuePosition is CO2.
    // Object for managing the input and output Global Memory addresses. aGM and bGM are input, and cGM is output.
    AscendC::GlobalTensor<half> aGM, bGM;
    AscendC::GlobalTensor<float> cGM;

    uint16_t m = 32;
    uint16_t n = 32;
    uint16_t k = 32;
    uint16_t aSize, bSize, cSize, mBlocks, nBlocks, kBlocks;
};

KernelMatmul Constructor Implementation

Initialize the private member variables in the constructor. The code is as follows:

__aicore__ inline KernelMatmul()
{
    aSize = m * k;
    bSize = k * n;
    cSize = m * n;
    mBlocks = m / 16;
    nBlocks = n / 16;
    kBlocks = k / 16;
}

The shape of matrix a is [m, k], the shape of matrix b is [k, n], and the shape of matrix c is [m, n]. In this example, m, n, and k are all set to 32.

aSize, bSize, and cSize are the numbers of elements in matrices a, b, and c, respectively.

mBlocks, nBlocks, and kBlocks are the numbers of fractals spanned by m, n, and k, respectively. For the half type, the side length of a fractal is 16 elements. The formulas are as follows:

  • mBlocks = m / 16
  • nBlocks = n / 16
  • kBlocks = k / 16
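
With m = n = k = 32 as in this sample, these evaluate to aSize = bSize = cSize = 32 x 32 = 1024 elements and mBlocks = nBlocks = kBlocks = 32 / 16 = 2 fractals.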

For details about fractals, see Format.

Init Function Implementation

The Init function performs the following operations:

  • Sets the Global Memory address of the input and output GlobalTensors.
    For example, to set the Global Memory buffer address of input a, use the following statement:
    aGM.SetGlobalBuffer((__gm__ half*)a);

    In this example, the input parameters of the Init function are of the GM_ADDR (uint8_t*) type. Therefore, you need to cast each input parameter to its actual data type (__gm__ half*) before computing offsets on it.

  • Allocates memory for the input and output queues through the pipe memory management object.

    For example, allocating memory to the inQueueB2 queue can be implemented by using the following code snippet:

    pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half) / 2);

    In this example, matrix b is divided into two parts. When allocating memory to inQueueB2, you need to allocate two blocks of memory. The size of each memory block is half of the size of matrix b. The memory initialization of outQueueCO1 is similar.

The initialization function code is as follows:

__aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c)
{
    aGM.SetGlobalBuffer((__gm__ half*)a);
    bGM.SetGlobalBuffer((__gm__ half*)b);
    cGM.SetGlobalBuffer((__gm__ float*)c);
    pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(half));
    pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(half));
    pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(half));
    pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(half) / 2);
    pipe.InitBuffer(outQueueCO1, 2, cSize * sizeof(float) / 2);
    pipe.InitBuffer(outQueueCO2, 1, cSize * sizeof(float));
}
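
With the sample sizes, the allocations above are: inQueueA1, inQueueA2, and inQueueB1 each hold one 2048-byte block (1024 half elements); inQueueB2 holds two 1024-byte blocks, one per half of matrix b; outQueueCO1 holds two 2048-byte blocks (cSize x sizeof(float) / 2); and outQueueCO2 holds one 4096-byte block.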

Process Function Implementation

Based on the cube programming paradigm, the implementation of the kernel function is divided into five basic phases: CopyIn, Split, Compute, Aggregate, and CopyOut. Because matrix b is processed in two parts, matrices a and b are handled separately in the Split phase (SplitA and SplitB), and the Compute and Aggregate phases run once per part. These functions are called in the Process function as follows:

__aicore__ inline void Process()
{
    CopyIn();
    SplitA();
    AscendC::LocalTensor<half> b1Local = inQueueB1.DeQue<half>();
    AscendC::LocalTensor<half> a2Local = inQueueA2.DeQue<half>();
    AscendC::LocalTensor<float> c2Local = outQueueCO2.AllocTensor<float>();
    // split matrix b into 2 parts, [32, 16] and [32, 16]
    for (int i = 0; i < 2; ++i) {
        SplitB(b1Local, i);
        Compute(a2Local);
        Aggregate(c2Local, i);
    }
    inQueueB1.FreeTensor(b1Local);
    inQueueA2.FreeTensor(a2Local);
    outQueueCO2.EnQue<float>(c2Local);
    CopyOut();
}

In the two loop iterations, SplitB transfers the two parts of matrix b from inQueueB1 in turn, Compute multiplies matrix a by each part of matrix b, and Aggregate collects the two parts of matrix c. The following figure shows the data flow in the five phases.

Figure 5 Data flow

Splitting matrix b allows the pipeline stages to partially overlap. The following figure shows the pipeline parallelism in this example.

Figure 6 Parallelism
  1. Implement the CopyIn function (stage 1).
    1. Call AllocTensor to allocate a1Local and b1Local from the A1 and B1 queues.
    2. Call DataCopy to transfer matrices a and b to the Local Memory, and convert the data format from ND to NZ.

      The DataCopy command can move height x 16 elements at a time, so it must be called width/16 times in a loop. Set the parameters of DataCopy as follows:

      • blockCount is set to height: each call performs height block transfers.
      • blockLen is set to 1: each block transfer moves one data block, that is, 16 half-type numbers.
      • srcStride is set to width/16 - 1: after each block, the remaining width/16 - 1 blocks of the row are skipped in the source matrix.
      • dstStride is set to 0: the blocks of the destination matrix are stored contiguously in memory.
      • In each loop iteration, the start address of the source matrix advances by 16 elements, and the start address of the destination matrix advances by 16 x height elements.

      The following figure shows the format conversion. The blue part is moved in the first cycle, and the green part is moved in the second cycle. In the figure, the width is 32, occupying two fractals, and the height is 32, occupying two fractals. A total of four 16 x 16 fractals are moved.

      Figure 7 ND to NZ conversion

      Note: The preceding format conversion from ND to NZ is only an example. You can select a proper conversion mode based on the site requirements.

    3. Call EnQue to place a1Local and b1Local in the queues of A1 and B1 respectively.

    The code is as follows:

    __aicore__ inline void CopyND2NZ(const AscendC::LocalTensor<half>& dst, const AscendC::GlobalTensor<half>& src, const uint16_t height, const uint16_t width)
    {
        for (int i = 0; i < width / 16; ++i) {
            int srcOffset = i * 16;
            int dstOffset = i * 16 * height;
            AscendC::DataCopy(dst[dstOffset], src[srcOffset], { height, 1, uint16_t(width / 16 - 1), 0 });
        }
    }
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
        AscendC::LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
        CopyND2NZ(a1Local, aGM, m, k);
        CopyND2NZ(b1Local, bGM, k, n);
        inQueueA1.EnQue(a1Local);
        inQueueB1.EnQue(b1Local);
    }
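
    With width = height = 32 in this sample, the loop in CopyND2NZ runs twice per matrix: iteration 0 calls DataCopy(dst[0], src[0], {32, 1, 1, 0}) and moves the blue fractal column in the figure; iteration 1 calls DataCopy(dst[512], src[16], {32, 1, 1, 0}) and moves the green one. Each call transfers one 32 x 16 column of elements.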
  2. Implement the SplitA function (stage 2).
    1. Call DeQue to obtain a1Local from the queue of A1.
    2. Call AllocTensor to allocate a2Local from the queue of A2.
    3. Call LoadData to move matrix a to A2 and convert matrix a from the NZ format to the ZZ format.

      The following figure shows the transfer and format conversion. In the figure, k is 32, occupying kBlocks (k/16 = 2) fractals, and m is 32, occupying mBlocks (m/16 = 2) fractals. A total of four 16 x 16 fractals are moved. In this example, each LoadData call moves two 16 x 16 fractals, so LoadData is called twice: the first call moves the two fractals in the blue part, and the second call moves the two fractals in the green part.

      Set the parameters of LoadData for a single call as follows (each call moves two fractals: the blue part or the green part):
      • repeatTimes indicates the number of iterations for data processing. Because LoadData processes one fractal per iteration, repeatTimes can also be understood as the number of fractals to be transferred; here it is the number of fractals in the k-axis direction. Setting it to kBlocks transfers kBlocks fractals.
      • srcStride indicates the interval between the start addresses of the source fractals in adjacent iterations. Taking the blue fractals as an example: in the source operand matrix on the left of the following figure, the interval between the start addresses of the first and second blue fractals is mBlocks fractals in this example.
      • The default value of dstStride is used, so the two fractals of the destination matrix are stored contiguously.
      • ifTranspose is set to false: each block is in Z format both before and after the transfer, so the transpose function is disabled.
      • In each loop iteration, the start address of the destination matrix advances by 16 x k elements, and that of the source matrix advances by 16 x 16 elements.
      Figure 8 NZ to ZZ conversion
    4. Call EnQue to place a2Local in the A2 queue.

      The code is as follows:

          __aicore__ inline void SplitA()
          {
              int srcOffset = 0;
              int dstOffset = 0;
              AscendC::LocalTensor<half> a1Local = inQueueA1.DeQue<half>();
              AscendC::LocalTensor<half> a2Local = inQueueA2.AllocTensor<half>();
      
              // transform nz to zz
              for (int i = 0; i < mBlocks; ++i) {
                  AscendC::LoadData2DParams loadDataParams;
                  loadDataParams.repeatTimes = kBlocks;
                  loadDataParams.srcStride = mBlocks;
                  loadDataParams.ifTranspose = false;
      
                  AscendC::LoadData(a2Local[dstOffset], a1Local[srcOffset], loadDataParams);
      
                  srcOffset += 16 * 16;
                  dstOffset += k * 16;
              }
      
              inQueueA2.EnQue<half>(a2Local);
              inQueueA1.FreeTensor(a1Local);
          }
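
      With m = k = 32 in this sample, mBlocks = kBlocks = 2, so the loop runs twice with loadDataParams = {repeatTimes = 2, srcStride = 2, ifTranspose = false}: iteration 0 calls LoadData(a2Local[0], a1Local[0], ...) and iteration 1 calls LoadData(a2Local[512], a1Local[256], ...). Each call moves one 16 x 32 strip of matrix a (two fractals).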
  3. Implement the SplitB function (stage 2).
    1. SplitB takes two parameters: b1Local, which the Process function obtained from the B1 queue by calling DeQue, and the iteration variable bSplitIdx.
    2. Call AllocTensor to allocate b2Local from the queue of B2.
    3. Call LoadData to move matrix b to B2 and convert the format from NZ to ZN.

      The following figure shows the transfer and format conversion. In the figure, k is 32, occupying kBlocks (k/16 = 2) fractals, and n is 32, occupying nBlocks (n/16 = 2) fractals. A total of four 16 x 16 fractals are moved. In this example, each LoadData call moves two 16 x 16 fractals, so LoadData is called twice: the first call moves the two fractals in the blue part, and the second call moves the two fractals in the green part.

      Set the parameters of LoadData for a single call as follows (each call moves two fractals: the blue part or the green part):
      • repeatTimes indicates the number of iterations for data processing. Because LoadData processes one fractal per iteration, repeatTimes can also be understood as the number of fractals to be transferred; here it is the number of fractals in the k-axis direction. Setting it to kBlocks transfers kBlocks fractals.
      • srcStride indicates the interval between the start addresses of the source fractals in adjacent iterations. Taking the blue fractals as an example: in the source operand matrix on the left of the following figure, the interval between the start addresses of the first and second blue fractals is one fractal in this example.
      • The default value of dstStride is used, so the two fractals of the destination matrix are stored contiguously.
      • ifTranspose is set to true: each block is in Z format before the transfer and must be in N format after it, so the transpose function is enabled.
      • For the second part (bSplitIdx = 1), the start address of the source matrix is offset by bSize/2.
      Figure 9 NZ to ZN conversion
    4. Call EnQue to place b2Local in the B2 queue.

    The code is as follows:

        __aicore__ inline void SplitB(const AscendC::LocalTensor<half>& b1Local, const int bSplitIdx)
        {
            AscendC::LocalTensor<half> b2Local = inQueueB2.AllocTensor<half>();
    
            // transform nz to zn
            AscendC::LoadData2DParams loadDataParams;
            loadDataParams.repeatTimes = kBlocks;
            loadDataParams.srcStride = 1;
            loadDataParams.ifTranspose = true;
    
            AscendC::LoadData(b2Local, b1Local[bSplitIdx * bSize / 2], loadDataParams);
    
            inQueueB2.EnQue<half>(b2Local);
        }
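
    In this sample, SplitB is called twice from the Process function with loadDataParams = {repeatTimes = kBlocks = 2, srcStride = 1, ifTranspose = true}: the call with bSplitIdx = 0 reads from b1Local[0] and the call with bSplitIdx = 1 reads from b1Local[512] (bSize / 2 = 512). Each call moves one [k, n/2] = [32, 16] part of matrix b into B2.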
  4. Implement the Compute function to compute the core cube (stage 3).
    1. The Compute function takes the parameter a2Local, which the Process function obtained from the A2 queue by calling DeQue.
    2. Call AllocTensor to allocate c1Local from the CO1 queue.
    3. Call DeQue to obtain b2Local from B2.
    4. Call the Ascend C API Mmad to perform matrix multiplication.
    5. Call EnQue to place the computation result c1Local in the CO1 queue.

    The code is as follows:

        __aicore__ inline void Compute(const AscendC::LocalTensor<half>& a2Local)
        {
            AscendC::LocalTensor<half> b2Local = inQueueB2.DeQue<half>();
            AscendC::LocalTensor<float> c1Local = outQueueCO1.AllocTensor<float>();
    
            AscendC::MmadParams mmadParams;
            mmadParams.m = m;
            mmadParams.n = n / 2;
            mmadParams.k = k;
            AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams);
    
            outQueueCO1.EnQue<float>(c1Local);
            inQueueB2.FreeTensor(b2Local);
        }
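
    With the sample sizes, mmadParams is {m = 32, n = 16, k = 32}: each of the two Compute calls multiplies the full [32, 32] matrix a with one [32, 16] part of matrix b and writes one [32, 16] slice of matrix c into c1Local.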
  5. Implement the Aggregate function to aggregate data (stage 4).
    1. Aggregate takes two parameters: c2Local, which the Process function allocated from the CO2 queue by calling AllocTensor, and the iteration variable bSplitIdx.
    2. Call DeQue to obtain the c1Local from the CO1 queue.
    3. Call DataCopy to move the result matrix from CO1 to CO2.
      Set the parameters of DataCopy as follows:
      • blockCount is set to 1 and blockLen is set to 2: two fractals are moved consecutively, without format conversion.
      • blockMode is set to BlockMode::BLOCK_MODE_MATRIX: the data is moved fractal by fractal.
      • The start address of c2Local is offset by bSplitIdx * cSize / 2.

    The code is as follows:

        __aicore__ inline void Aggregate(const AscendC::LocalTensor<float>& c2Local, const int bSplitIdx)
        {
            AscendC::LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
    
            AscendC::DataCopyParams dataCopyParams;
            dataCopyParams.blockCount = 1;
            dataCopyParams.blockLen = 2;
            AscendC::DataCopyEnhancedParams enhancedParams;
            enhancedParams.blockMode = AscendC::BlockMode::BLOCK_MODE_MATRIX;
            AscendC::DataCopy(c2Local[bSplitIdx * cSize / 2], c1Local, dataCopyParams, enhancedParams);
    
            outQueueCO1.FreeTensor(c1Local);
        }
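
    With the sample sizes, each Aggregate call moves blockLen = 2 fractals (2 x 16 x 16 = 512 float elements, one [32, 16] part of matrix c) from CO1 into c2Local: the call with bSplitIdx = 0 writes to c2Local[0], and the call with bSplitIdx = 1 writes to c2Local[512] (cSize / 2 = 512).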
  6. Implement the CopyOut function (stage 5).
    1. Call DeQue to obtain c2Local from CO2.
    2. Call DataCopy to move the result matrix from CO2 to the Global Memory and convert the format from NZ to ND.
      Each loop iteration moves one fractal column of m x 16 elements, and the loop runs nBlocks times. Set the parameters of DataCopy as follows:
      • blockCount is set to m: each call performs m block transfers.
      • blockLen is set to 2: each transfer moves two data blocks, that is, 16 float elements (one row of the fractal column).
      • srcStride is set to 0: there is no gap between transfers in the source matrix.
      • dstStride is set to (nBlocks - 1) * 2: in the destination matrix, (nBlocks - 1) x 2 blocks are skipped between transfers to jump over the other fractal columns in the same row.
      • In each loop iteration, the start address of the destination matrix advances by 16 elements, and that of the source matrix advances by m x 16 elements.

      The following figure shows the format conversion. The data in blue is moved in the first iteration, and the data in green is moved in the second iteration.

      Figure 10 NZ to ND conversion

    The code is as follows:

        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<float> c2Local = outQueueCO2.DeQue<float>();
    
            // transform nz to nd
            for (int i = 0; i < nBlocks; ++i) {
                AscendC::DataCopy(cGM[i * 16], c2Local[i * m * 16], { m, 2, 0, uint16_t((nBlocks - 1) * 2) });
            }
    
            outQueueCO2.FreeTensor(c2Local);
        }
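
    With the sample sizes, nBlocks = 2, so the loop runs twice with the parameters {32, 2, 0, 2}: iteration 0 calls DataCopy(cGM[0], c2Local[0], ...) and iteration 1 calls DataCopy(cGM[16], c2Local[512], ...). Each call moves one [32, 16] fractal column of matrix c to the Global Memory in ND format.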