Programming Paradigms of Typical Operators

A programming paradigm describes the fixed workflow for implementing an operator kernel function. Following the paradigm lets you quickly set up the code framework of an operator implementation.

According to Abstract Hardware Architecture, the execution units inside the AI Core asynchronously and concurrently execute the received instructions. The execution units cooperate with each other to complete the entire operator execution process in a pipeline manner.

The following figure illustrates pipeline parallelism. The input data passes through three phases (T1, T2, and T3) before it is output. Multiple execution units work in parallel: each unit handles exactly one phase and applies it to every data slice. When a unit finishes a slice, it places the slice in a communication queue. The next unit retrieves the slice from the queue as soon as it is idle and continues processing. This resembles a production line in which each worker performs one fixed procedure and then hands the work to the next worker.

Figure 1 Pipeline parallelism
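
The queue handoff described above can be modeled on the host with plain C++. The following is a minimal sketch, not Ascend C code: the names PipelineTrace and RunThreeStagePipeline are hypothetical, and the model assumes that every stage needs exactly one tick per slice.

```cpp
#include <array>
#include <deque>
#include <vector>

// Host-side model only: three stages (T1, T2, T3) connected by FIFO queues.
// Each stage handles at most one slice per tick, so different stages can
// work on different slices during the same tick.
struct PipelineTrace {
    std::vector<int> outputOrder;  // Slice IDs in the order they leave T3.
    int ticks = 0;                 // Ticks needed to drain all slices.
};

PipelineTrace RunThreeStagePipeline(int sliceCount) {
    std::array<std::deque<int>, 3> inbox;  // inbox[s] feeds stage s.
    for (int id = 0; id < sliceCount; ++id) {
        inbox[0].push_back(id);
    }

    PipelineTrace trace;
    while (static_cast<int>(trace.outputOrder.size()) < sliceCount) {
        // Visit the last stage first so that a slice advances by exactly
        // one stage per tick.
        for (int s = 2; s >= 0; --s) {
            if (inbox[s].empty()) {
                continue;  // This stage is idle during the current tick.
            }
            int slice = inbox[s].front();
            inbox[s].pop_front();
            if (s == 2) {
                trace.outputOrder.push_back(slice);  // Slice leaves the pipeline.
            } else {
                inbox[s + 1].push_back(slice);       // Hand over to the next stage.
            }
        }
        ++trace.ticks;
    }
    return trace;
}
```

In this model, RunThreeStagePipeline(4) finishes in 6 ticks, whereas running the three phases back to back for each slice would take 12 ticks. The slices leave the pipeline in their original order.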

The Ascend C programming paradigm is a pipeline programming paradigm. It divides the processing inside the operator kernel into multiple pipeline tasks, uses queues (TQue) for communication and synchronization between tasks, and manages resources such as memory and events through a unified resource management module (TPipe).

The following sections describe this TPipe- and TQue-based programming paradigm in detail from the perspective of three typical operator types.

Vector programming paradigm
Cube programming paradigm
Fused operator programming paradigm

Vector Programming Paradigm

As shown in the above figure, the vector programming paradigm divides the operator implementation process into three basic tasks: CopyIn, Compute, and CopyOut.

CopyIn tasks move the input data from the global memory to the local memory (VECIN is the storage position of data copied in for vector computation). After the move completes, the data is enqueued.
Compute tasks dequeue the data, read it from the local memory, and perform vector computation. After the computation completes, the result is enqueued.
CopyOut tasks dequeue the result and move it from the local memory to the global memory (VECOUT is the storage position of vector computation results to be copied out).

VECIN and VECOUT are TPosition values. To manage physical memory at different levels, Ascend C uses abstract logical positions (TPosition) to represent each storage level, which replaces direct references to on-chip physical storage and hides the hardware architecture. In addition to VECIN and VECOUT, vector programming also uses VECCALC, which typically holds temporary variables. For details about the mapping between TPosition and the physical memory, see Table 1.

From a programming perspective, the process is illustrated in the following figure and in the pseudocode below.

      
       
         
         
AscendC::TPipe pipe;                                    // Create the global resource management object.
AscendC::TQue<AscendC::TPosition::VECIN, 1> queIn;      // Create the CopyIn queue.
AscendC::TQue<AscendC::TPosition::VECOUT, 1> queOut;    // Create the CopyOut queue.
// Initialization phase
pipe.InitBuffer(queIn, 2, 1024);                        // Allocate two 1024-byte buffers to enable DoubleBuffer: the data is split into two parts for pipeline parallelism.
pipe.InitBuffer(queOut, 2, 1024);
for-loop {
    // CopyIn phase
    {
        auto tensor = queIn.AllocTensor<half>();        // Allocate a 1024-byte tensor (512 half elements) from the queue.
        AscendC::DataCopy(tensor, gm, 512);             // Copy 512 half elements from the global memory to VECIN.
        queIn.EnQue(tensor);
    }
    // Compute phase
    {
        auto tensor = queIn.DeQue<half>();
        auto tensorOut = queOut.AllocTensor<half>();
        AscendC::Abs(tensorOut, tensor, 512);           // Compute the absolute value of 512 half elements.
        queIn.FreeTensor(tensor);                       // Release the input tensor.
        queOut.EnQue(tensorOut);
    }
    // CopyOut phase
    {
        auto tensor = queOut.DeQue<half>();
        AscendC::DataCopy(gmOut, tensor, 512);          // Copy 512 half elements from VECOUT to the global memory.
        queOut.FreeTensor(tensor);                      // Release the output tensor.
    }
}
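
The same three-task split can be tried out on the host without a device. The following is a minimal sketch in standard C++, not Ascend C code: Tile stands in for a LocalTensor, two std::queue objects stand in for the VECIN and VECOUT queues, and the function name AbsByTiles is hypothetical. The tasks run one after another here, so the sketch shows only the data handoff, not the parallelism.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Tile = std::vector<float>;  // Stand-in for a LocalTensor.

// Host-side model: computes |x| tile by tile using the CopyIn, Compute,
// and CopyOut split. gmIn models the input in global memory.
std::vector<float> AbsByTiles(const std::vector<float>& gmIn, std::size_t tileLen) {
    std::vector<float> gmOut;  // Models the output in global memory.
    if (tileLen == 0) {
        return gmOut;  // Nothing can be processed with an empty tile.
    }
    gmOut.reserve(gmIn.size());
    std::queue<Tile> queIn;   // Models the VECIN queue.
    std::queue<Tile> queOut;  // Models the VECOUT queue.

    for (std::size_t offset = 0; offset < gmIn.size(); offset += tileLen) {
        std::size_t len = std::min(tileLen, gmIn.size() - offset);

        // CopyIn: copy one tile from "global memory" and enqueue it.
        auto first = gmIn.begin() + static_cast<std::ptrdiff_t>(offset);
        queIn.push(Tile(first, first + static_cast<std::ptrdiff_t>(len)));

        // Compute: dequeue the tile, compute, and enqueue the result.
        Tile in = std::move(queIn.front());
        queIn.pop();
        Tile out(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = std::fabs(in[i]);
        }
        queOut.push(std::move(out));

        // CopyOut: dequeue the result and append it to "global memory".
        Tile result = std::move(queOut.front());
        queOut.pop();
        gmOut.insert(gmOut.end(), result.begin(), result.end());
    }
    return gmOut;
}
```

For example, AbsByTiles({-1.5f, 2.0f, -3.0f}, 2) processes two tiles (of two elements and one element) and returns {1.5f, 2.0f, 3.0f}.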

          

        

      
     

Resources such as the memory and events used to pass data between tasks are managed by TPipe. As shown in the following figure, TPipe provides the InitBuffer API to initialize queue memory. Call this API to allocate memory to a specified queue.

After the queue memory is initialized, call AllocTensor to allocate memory for a LocalTensor whenever one is needed. When the LocalTensor is no longer needed after computation, call FreeTensor to reclaim its memory.

Figure 2 Memory management
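
The allocate-and-reclaim cycle can be pictured with a small host-side model. The following is a sketch under stated assumptions, not the TPipe implementation: the class BufferPoolModel and its methods are hypothetical, and returning -1 when no block is free is a choice of this model rather than documented AllocTensor behavior.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the memory that backs one queue.
class BufferPoolModel {
public:
    // Models the idea of InitBuffer(queue, num, len): reserve `num` blocks
    // of `lenBytes` bytes up front.
    BufferPoolModel(std::size_t num, std::size_t lenBytes)
        : lenBytes_(lenBytes), inUse_(num, false) {}

    // Models AllocTensor: hand out the index of a free block,
    // or -1 when every block is in use (model-specific choice).
    int Alloc() {
        for (std::size_t i = 0; i < inUse_.size(); ++i) {
            if (!inUse_[i]) {
                inUse_[i] = true;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Models FreeTensor: return a block to the pool so it can be reused.
    void Free(int index) {
        if (index >= 0 && static_cast<std::size_t>(index) < inUse_.size()) {
            inUse_[static_cast<std::size_t>(index)] = false;
        }
    }

    std::size_t BlockBytes() const { return lenBytes_; }

private:
    std::size_t lenBytes_;      // Size of each block in bytes.
    std::vector<bool> inUse_;   // One flag per block.
};
```

With BufferPoolModel pool(2, 1024), two consecutive Alloc calls succeed and a third one fails until Free returns a block. This mirrors why two buffers per queue allow one slice to be copied while another is being computed.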

Memory for temporary variables is also managed by TPipe. Use the TBuf data structure to allocate memory for temporary variables at a specified TPosition. Memory allocated through TBuf can be used only for computation and cannot be enqueued or dequeued. For details about the API usage, see TBuf.

Parallel processing of data on a single core can be implemented by programming based on this paradigm. The data to be processed is divided into n slices. Each parallel task needs to process n data slices in sequence. Arrows between tasks indicate the dependency between data. For example, Compute can process the first data slice only after CopyIn has processed it.

Figure 3 Pipeline task diagram

The following figure shows how the pipeline tasks run. For a given data slice, CopyIn, Compute, and CopyOut depend on each other and must run in sequence. Different data slices, however, can be processed by different tasks at the same time, which achieves task parallelism and improves performance.

Figure 4 Running of pipeline tasks
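
The benefit of overlapping the tasks can be estimated with a simple recurrence: a task can start on slice i only after the previous task has finished slice i and after the task itself has finished slice i-1. The following host-side sketch applies this recurrence. The function names and the per-task costs are hypothetical, and the model assumes that the queues never run out of buffers.

```cpp
#include <algorithm>
#include <array>

// Total time when CopyIn, Compute, and CopyOut overlap across slices.
// cost[s] is the time task s needs for one slice (arbitrary time units).
int PipelinedTicks(int sliceCount, const std::array<int, 3>& cost) {
    std::array<int, 3> finish{0, 0, 0};  // Finish time of the latest slice on each task.
    for (int i = 0; i < sliceCount; ++i) {
        int readyAt = 0;  // Time at which slice i leaves the previous task.
        for (int s = 0; s < 3; ++s) {
            // Wait for both the data dependency and the task itself.
            finish[s] = std::max(finish[s], readyAt) + cost[s];
            readyAt = finish[s];
        }
    }
    return finish[2];
}

// Total time when each slice runs through all three tasks before the next
// slice starts.
int SerialTicks(int sliceCount, const std::array<int, 3>& cost) {
    return sliceCount * (cost[0] + cost[1] + cost[2]);
}
```

For four slices with costs {1, 2, 1}, PipelinedTicks returns 10 and SerialTicks returns 16. The slowest task (here Compute) determines the steady-state throughput of the pipeline.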

Cube Programming Paradigm

The following figure shows the typical data flow of Cube compute.

Similar to the vector programming paradigm, logical positions (TPosition) are used to express the data flow. The logical positions used in the Cube programming paradigm are defined as follows:

A1: logical memory for cube computation on the device that stores the left matrix. It maps to the L1 buffer of the AI Core.
B1: logical memory for cube computation on the device that stores the right matrix. It maps to the L1 buffer of the AI Core.
C1: logical memory for cube computation on the device that stores the bias data. It maps to the L1 buffer or Unified Buffer of the AI Core.
A2: logical memory for cube computation on the device that stores a small block of the left matrix (for example, a block split to fit the L0A buffer capacity). It maps to the L0A buffer of the AI Core.
B2: logical memory for cube computation on the device that stores a small block of the right matrix (for example, a block split to fit the L0B buffer capacity). It maps to the L0B buffer of the AI Core.
C2: logical memory for cube computation on the device that stores a small block of the bias data (for example, a block split to fit the BT buffer capacity). It maps to the BT buffer or L0C buffer of the AI Core.
CO1: logical memory for cube computation on the device that stores a small block of the computation result (for example, the result of one split block). It maps to the L0C buffer of the AI Core.
CO2: logical memory for cube computation on the device that stores the full computation result (for example, the final result for the original matrices). It maps to the global memory or the Unified Buffer of the AI Core.
VECIN: logical memory for vector computation on the device that stores the input data of vector computation. It maps to the Unified Buffer of the AI Core.
VECCALC: logical memory for vector computation on the device that stores temporary variables. It maps to the Unified Buffer of the AI Core.
VECOUT: logical memory for vector computation on the device that stores the output data of vector computation. It maps to the Unified Buffer of the AI Core.

For details about the mapping between TPosition and the physical memory, see Table 1.
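
The mapping listed above can be written down as a lookup function for quick reference. This is a host-side sketch only: the enum LogicalPosition and the function PhysicalMemoryOf are hypothetical names, not the AscendC::TPosition type, and the strings simply restate the list in this section.

```cpp
#include <string>

// Hypothetical mirror of the logical positions described in this section.
enum class LogicalPosition { A1, B1, C1, A2, B2, C2, CO1, CO2, VECIN, VECCALC, VECOUT };

// Returns the physical memory that this section associates with a position.
std::string PhysicalMemoryOf(LogicalPosition pos) {
    switch (pos) {
        case LogicalPosition::A1:
        case LogicalPosition::B1:
            return "L1 Buffer";
        case LogicalPosition::C1:
            return "L1 Buffer or Unified Buffer";
        case LogicalPosition::A2:
            return "L0A Buffer";
        case LogicalPosition::B2:
            return "L0B Buffer";
        case LogicalPosition::C2:
            return "BT Buffer or L0C Buffer";
        case LogicalPosition::CO1:
            return "L0C Buffer";
        case LogicalPosition::CO2:
            return "Global Memory or Unified Buffer";
        case LogicalPosition::VECIN:
        case LogicalPosition::VECCALC:
        case LogicalPosition::VECOUT:
            return "Unified Buffer";
    }
    return "unknown";  // Not reachable for valid enum values.
}
```

Reading the table this way makes the Cube data flow easier to follow: matrix data moves from the L1 buffer (A1, B1) to the L0A and L0B buffers (A2, B2), and results appear in the L0C buffer (CO1).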

The cube compute process also contains the CopyIn, Compute, and CopyOut phases. Considering the complexity of the process, encapsulated MatMul high-level APIs are provided to simplify the programming paradigm.

As shown in the preceding figure, the CopyIn phase corresponds to the SetTensorA, SetTensorB, and SetBias APIs, the Compute phase corresponds to the Iterate API, and the CopyOut phase corresponds to the GetTensorC API. For details, see the following example:

      
       
         
         
// When creating a Matmul object, pass in the type information of A, B, C, and Bias.
// The type information is defined by MatmulType and includes the logical memory position, data format, and data type.
typedef MatmulType<TPosition::GM, CubeFormat::ND, half> aType;
typedef MatmulType<TPosition::GM, CubeFormat::ND, half> bType;
typedef MatmulType<TPosition::GM, CubeFormat::ND, float> cType;
typedef MatmulType<TPosition::GM, CubeFormat::ND, float> biasType;
Matmul<aType, bType, cType, biasType> mm;

REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling); // Initialization
// CopyIn phase: Move data from the global memory to the local memory.
mm.SetTensorA(gm_a);    // Set the left matrix A.
mm.SetTensorB(gm_b);    // Set the right matrix B.
mm.SetBias(gm_bias);    // Set the bias.
// Compute phase: Perform the matrix multiplication.
while (mm.Iterate()) {
    // CopyOut phase: Copy data from the local memory to the global memory.
    mm.GetTensorC(gm_c);
}
// End the matrix multiplication.
mm.End();
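
To get a feel for how often the Iterate loop runs, it helps to count output blocks. The following host-side sketch assumes that each Iterate call produces one baseM x baseN block of the result and that the loop ends once the whole m x n result is covered. That behavior, the function names, and the parameters m and n are assumptions for illustration, so check the Matmul API reference before relying on them.

```cpp
#include <cstdint>

// Rounds a / b up to the next integer. Precondition: b > 0.
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) {
    return (a + b - 1) / b;
}

// Number of baseM x baseN blocks needed to cover an m x n result.
// Under the stated assumption, this equals the number of loop rounds.
constexpr uint32_t IterateRounds(uint32_t m, uint32_t n, uint32_t baseM, uint32_t baseN) {
    return CeilDiv(m, baseM) * CeilDiv(n, baseN);
}
```

For example, IterateRounds(256, 384, 128, 128) yields 6, and IterateRounds(100, 100, 64, 64) yields 4 because partially filled blocks at the edges still count as a round.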

          

        

      
     

Fused Operator Programming Paradigm

An operator that combines vector and cube computation is called a fused operator. Ascend C provides a fused operator programming paradigm so that you can express the data flow of a fused operator and quickly implement your own.

The fused operator data flow refers to the flow direction of the input and output of the fused operator between storage locations. Take a typical fused operator for the cube and vector computation as an example. The following figure shows the data flows between logical locations. (To simplify the description, bias is not listed.)

The output of the cube computation can be used as the input of the vector computation: CO2 -> VECIN.
The output of the vector computation can be used as the input of the cube computation: VECOUT -> A1 -> A2 or VECOUT -> B1 -> B2.

With the fused operator programming paradigm based on the MatMul high-level APIs, the preceding data flows are simplified as follows.

Figure 5 Fused operator programming paradigm

1. Initialize a MatMul object and move the input data from the global memory to the AI Cube core.
2. Perform the internal MatMul compute.
3. Move the MatMul compute result to the AI Vector core.
4. Perform the vector compute.
5. Move the output to the global memory.
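
When you develop a fused operator along these steps, a host-side reference is useful for checking results. The following is a minimal sketch in standard C++ that computes LeakyRelu(A x B) for row-major matrices. The function name and parameters are hypothetical, the bias is omitted as in the simplified data flow above, and LeakyRelu is taken as x for x >= 0 and alpha * x otherwise.

```cpp
#include <cstddef>
#include <vector>

// Reference result for a Matmul followed by LeakyRelu.
// a has shape m x k, b has shape k x n, both stored in row-major order.
std::vector<float> MatmulLeakyReluReference(const std::vector<float>& a,
                                            const std::vector<float>& b,
                                            std::size_t m, std::size_t k, std::size_t n,
                                            float alpha) {
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Cube part: dot product of row i of a and column j of b.
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            // Vector part: LeakyRelu applied to the matmul result.
            c[i * n + j] = (acc >= 0.0f) ? acc : alpha * acc;
        }
    }
    return c;
}
```

For a 1 x 2 matrix {1, -2}, a 2 x 1 matrix {3, 4}, and alpha = 0.5, the product is -5 and the reference returns {-2.5}.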

The sample code (pseudocode) of the entire process is as follows:

      
       
         
         
template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process()
{
    // Step 1: Initialize a MatMul object and copy the input data from the global memory to the AI Cube core.
    uint32_t computeRound = 0;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj);
    matmulObj.Init(&tiling);
    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.SetBias(biasGlobal);

    while (matmulObj.template Iterate<true>()) { // Step 2: Perform the internal MatMul compute.
        // Step 3: Move the MatMul compute result to the AI Vector core.
        reluOutLocal = reluOutQueue_.AllocTensor<cType>();
        matmulObj.template GetTensorC<true>(reluOutLocal, false, true);
        // Step 4: Perform the vector compute.
        AscendC::LeakyRelu(reluOutLocal, reluOutLocal, (cType)alpha, tiling.baseM * tiling.baseN);
        reluOutQueue_.EnQue(reluOutLocal);
        // Step 5: Move the output to the global memory.
        reluOutLocal = reluOutQueue_.DeQue<cType>();  // Take the result back from the queue before copying it out.
        ...
        AscendC::DataCopy(cGlobal[startOffset], reluOutLocal, copyParam);
        reluOutQueue_.FreeTensor(reluOutLocal);       // Release the tensor for the next round.

        computeRound++;
    }
    matmulObj.End();
}

          

        

      
     
