Programming Paradigm

The programming paradigm describes a fixed workflow for operator implementation. Following the paradigm lets you quickly set up the code framework of an operator.

According to Hardware Architecture Abstraction, the execution units in the AI Core execute the received instructions asynchronously and in parallel. As shown in the following figure, processing the input data into output data involves tasks in three phases (T1, T2, and T3), and multiple execution units process the data in parallel. Each execution unit focuses on a single task and applies it to all data slices, so pipeline parallelism works much like an assembly line in industrial production: each execution unit is a node on the line, and computing efficiency improves because the nodes work concurrently. After processing a data slice, execution unit 1 places the slice in a communication queue; when execution unit 2 is idle, it fetches the slice from the queue for further processing. This is analogous to a production line in which each worker completes only one fixed step and then hands the piece to the next worker.
Figure 1 Pipeline parallelism

The Ascend C programming paradigm is a pipeline programming paradigm. It divides the processing program in the operator core into multiple pipeline tasks, completes inter-task communication and synchronization through queues, and manages resources such as memories and events through a unified resource management module (Pipe).

Vector Programming Paradigm

As shown in the above figure, the vector programming paradigm divides the operator implementation process into three basic tasks: CopyIn, Compute, and CopyOut.

  • CopyIn tasks move the input data from global memory to local memory (VECIN is the logical position of data copied in for vector computation). The data is enqueued once the copy is complete.
  • Compute tasks perform vector computation. After the data is dequeued, the Compute task reads it from local memory, computes, and enqueues the result once the computation is complete.
  • CopyOut tasks move the computation result from local memory to global memory after the result is dequeued (VECOUT is the logical position of data copied out after vector computation).

VECIN and VECOUT above are TPosition values. To manage physical memories at different levels, Ascend C uses abstract logical positions (TPosition) to express each level of storage, decoupling the code from on-chip physical storage and hiding the hardware architecture. In addition to VECIN and VECOUT, VECCALC is also used in vector programming; it is generally the position used for temporary variables. The following table lists the mapping between TPosition and physical memory.

Table 1 Mapping between TPosition and physical memories

TPosition | Physical Memory
GM        | Global Memory
VECIN     | Unified Buffer
VECOUT    | Unified Buffer
VECCALC   | Unified Buffer

The following pseudocode illustrates this process from a programming perspective.

AscendC::TPipe pipe;   // Create the global resource manager.
AscendC::TQue<AscendC::QuePosition::VECIN, 1> queIn;   // Create a CopyIn queue.
AscendC::TQue<AscendC::QuePosition::VECOUT, 1> queOut; // Create a CopyOut queue.
// Initialization phase:
pipe.InitBuffer(queIn, 2, 1024);  // Allocate 2 buffers of 1024 bytes each to queIn (double buffering for pipeline parallelism).
pipe.InitBuffer(queOut, 2, 1024); // Allocate 2 buffers of 1024 bytes each to queOut.
for-loop {  // Process the data slices one by one.
    { // CopyIn phase
        auto tensor = queIn.AllocTensor<half>();   // Allocate a LocalTensor from the queue memory.
        AscendC::DataCopy(tensor, gm, len);        // Copy data from global memory to VECIN.
        queIn.EnQue(tensor);                       // Enqueue the copied-in data.
    }
    { // Compute phase
        auto tensor = queIn.DeQue<half>();         // Dequeue the input data.
        auto tensorOut = queOut.AllocTensor<half>();
        AscendC::Abs(tensorOut, tensor, 1024);     // Vector compute: element-wise absolute value.
        queIn.FreeTensor(tensor);                  // Free the input buffer.
        queOut.EnQue(tensorOut);                   // Enqueue the result.
    }
    { // CopyOut phase
        auto tensor = queOut.DeQue<half>();        // Dequeue the result.
        AscendC::DataCopy(gmOut, tensor, 1024);    // Copy the result from VECOUT to global memory.
        queOut.FreeTensor(tensor);                 // Free the output buffer.
    }
}

Resources such as the memory and events used for data copy between tasks are managed by the Pipe module. As shown in the following figure, TPipe provides the InitBuffer API for queue memory initialization; call it to allocate memory to a specified queue.

After the queue memory is initialized, call AllocTensor to allocate a LocalTensor when memory is required. When the LocalTensor is no longer needed after computation, call FreeTensor to reclaim its memory.

Figure 2 Memory management

Memory for temporary variables used during programming is also managed by Pipe. Temporary variables can use the TBuf data structure to allocate memory on a specified TPosition. Memory allocated through TBuf can only be used for computation; it cannot be enqueued or dequeued. For details about the API usage, see TBuf.
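As a minimal sketch of how a temporary variable might be obtained through TBuf (the buffer size and names below are illustrative assumptions, not values from this document):

AscendC::TPipe pipe;
AscendC::TBuf<AscendC::TPosition::VECCALC> calcBuf;          // Temporary buffer on the VECCALC position.
pipe.InitBuffer(calcBuf, 1024);                              // Assumed size: 1024 bytes of scratch memory.
AscendC::LocalTensor<half> tmpTensor = calcBuf.Get<half>();  // Obtain a temporary tensor; it is not enqueued or dequeued.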

Parallel processing of data on a single core can be implemented by programming based on this paradigm. The data to be processed is divided into n slices. Each parallel task (stage 1, stage 2, and stage 3) needs to process n data slices in sequence. Arrows between stages indicate the dependency between data. For example, stage 2 (Compute) can process the first data slice only after stage 1 (CopyIn) has processed it.

Figure 3 Pipeline task diagram

The following figure shows how the pipeline tasks run. Progress 1, 2, and 3 indicate the data slices being processed. As the diagram shows, for the same data slice, Stage 1, Stage 2, and Stage 3 depend on one another and must run in sequence; for different data slices, the stages can run in parallel, which improves performance.

Figure 4 Running of pipeline tasks

Cube Programming Paradigm

The following figure shows the typical data flow of Cube compute.

Similar to the vector programming paradigm, logical positions (TPosition) are used to express the data flow. The logical positions used in the Cube programming paradigm are defined as follows:

  • Storage location of the copy-in data: A1, used to store the entire matrix A, which is similar to the level-2 cache in the multi-level cache of the CPU.
  • Storage location of the copy-in data: B1, used to store the entire matrix B, which is similar to the level-2 cache in the multi-level cache of the CPU.
  • Storage location of the copy-in data: C1, used to store the entire matmul bias matrix, which is similar to the level-2 cache in the multi-level cache of the CPU.
  • Storage location of the copy-in data: A2, used to store the split smaller matrix A, which is similar to the level-1 cache in the multi-level cache of the CPU.
  • Storage location of the copy-in data: B2, used to store the split smaller matrix B, which is similar to the level-1 cache in the multi-level cache of the CPU.
  • Storage location of the copy-in data: C2, used to store the split smaller bias matrix, which is similar to the level-1 cache in the multi-level cache of the CPU.
  • Storage location of the result data: CO1, used to store the small-block result matrix C, which can be considered as Cube Out.
  • Storage location of the result data: CO2, used to store the entire result matrix C, which can be considered as Cube Out.
  • Storage location of the copy-in data: VECIN, used for vector computation. This position is used when data is copied in to the Vector Unit.
  • Storage location of the copy-in data: VECCALC, used for vector computation. This position is used when temporary variables are required for the computation.
  • Storage location of the copy-out data: VECOUT, used for vector computation. This position is used when the result is copied out from the Vector Unit.

The mapping between TPosition and physical memories is as follows.

Table 2 Mapping between TPosition and physical memories

TPosition | Physical Memory
GM        | Global Memory
VECIN     | Unified Buffer
VECCALC   | Unified Buffer
VECOUT    | Unified Buffer
A1        | L1 Buffer
A2        | L0A Buffer
B1        | L1 Buffer
B2        | L0B Buffer
C1        | Atlas Training Series Product: Unified Buffer
C2        | Atlas Training Series Product: L0C Buffer
CO1       | L0C Buffer
CO2       | Atlas Training Series Product: Unified Buffer

The Cube compute process also contains the CopyIn, Compute, and CopyOut phases. Because this process is complex, encapsulated MatMul high-level APIs are provided. The programming paradigm using these APIs is as follows.

For details, see the following example:

// During Matmul object creation, input the type information of parameters A, B, C, and Bias. The type information is defined by MatmulType, including the logical location of memory, data format, and data type.
typedef MatmulType<TPosition::GM, CubeFormat::ND, half> aType; 
typedef MatmulType<TPosition::GM, CubeFormat::ND, half> bType; 
typedef MatmulType<TPosition::GM, CubeFormat::ND, float> cType; 
typedef MatmulType<TPosition::GM, CubeFormat::ND, float> biasType; 
Matmul<aType, bType, cType, biasType> mm; 

REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling); // Initialization.
// CopyIn phase: Copy data from global memory to local memory.
mm.SetTensorA(gm_a);    // Set the left matrix A.
mm.SetTensorB(gm_b);    // Set the right matrix B.
mm.SetBias(gm_bias);    // Set the bias.
// Compute phase: Complete matrix multiplication computation.
while (mm.Iterate()) { 
    // CopyOut phase: Copy data from local memory to global memory.
    mm.GetTensorC(gm_c); 
}
// End the matrix multiplication operation.
mm.End();

Fusion Operator Programming Paradigm

An operator that mixes vector and cube compute is called a fusion operator. Ascend C provides a fusion operator programming paradigm so that developers can express the data flow of a fusion operator based on the paradigm and quickly implement their own fusion operators.

The data flow of a fusion operator refers to how its inputs and outputs move between storage locations. Take a typical Cube-plus-Vector fusion operator as an example. The following figure shows the data flow between logical positions. (To simplify the description, bias is omitted.)

  • The output of Cube can be used as the input of Vector: CO2 -> VECIN.
  • The output of Vector can be used as the input of Cube: VECOUT -> A1 -> A2 or VECOUT -> B1 -> B2.

With the fusion operator programming paradigm built on the MatMul high-level APIs, the preceding data flow is simplified as follows.
Figure 5 Fusion operator programming paradigm
  1. Initialize a MatMul object and copy the input data from the global memory to the AI Cube (AIC) core.
  2. Perform the internal MatMul compute.
  3. Copy the MatMul compute result to the AI Vector (AIV) core.
  4. Perform the vector compute.
  5. Copy the output to global memory.

The sample code (pseudocode) of the entire process is as follows:

template<typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process()
{
    // Step 1: Initialize a MatMul object and copy the input data from global memory to the AI Cube core.
    uint32_t computeRound = 0;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj);
    matmulObj.Init(&tiling);
    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.SetBias(biasGlobal);
    
    while (matmulObj.template Iterate<true>()) { // Step 2: Perform the internal MatMul compute.
        // Step 3: Copy the MatMul compute result to the AI Vector core.
        reluOutLocal = reluOutQueue_.AllocTensor<cType>();
        matmulObj.template GetTensorC<true>(reluOutLocal, false, true);
       // Step 4: Perform the vector compute.
        AscendC::LeakyRelu(reluOutLocal, reluOutLocal, (cType)alpha, tiling.baseM * tiling.baseN);
        reluOutQueue_.EnQue(reluOutLocal);
        // Step 5: Copy the output to global memory.
        reluOutQueue_.DeQue<cType>();
        ...
        AscendC::DataCopy(cGlobal[startOffset], reluOutLocal, copyParam);
        reluOutQueue_.FreeTensor(reluOutLocal);

        computeRound++;
    }
    matmulObj.End();
}

The Mystery Behind the Programming Model

The key elements of the parallel programming paradigm of Ascend C are: a group of parallel computing tasks, communication and synchronization between tasks through queues, and custom scheduling of parallel computing tasks and resources. This section describes the implementation principles of the programming model, helping you better understand the design and advantages of the programming model and facilitating subsequent in-depth development.

The programming paradigm of each parallel task stage is as follows:

  1. Obtain available local memory: call AllocTensor to allocate memory, or call DeQue to dequeue a data slice from the upstream queue.
  2. Complete compute or data copy.
  3. Call EnQue to enqueue the data processed in the previous step.
  4. Call FreeTensor to free the memory that is no longer needed.
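As a minimal sketch of a single stage that follows these four steps (the queue names, element type, and compute call below are illustrative assumptions, not identifiers from this document's samples):

auto srcTensor = inQueue.DeQue<half>();        // Step 1: dequeue the input produced by the upstream stage.
auto dstTensor = outQueue.AllocTensor<half>(); // Step 1: allocate memory for this stage's output.
AscendC::Abs(dstTensor, srcTensor, count);     // Step 2: perform the compute (Abs is a placeholder operation).
outQueue.EnQue(dstTensor);                     // Step 3: enqueue the result for the downstream stage.
inQueue.FreeTensor(srcTensor);                 // Step 4: free the memory that is no longer needed.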

Take the simplest vector programming paradigm as an example. When the preceding APIs are called, instructions are actually delivered to the instruction queues of the corresponding execution units, as shown in the following figure.

Figure 6 Vector programming paradigm instruction queue
  • EnQue/DeQue process:
    1. The Scalar Unit reads the operator instruction sequence.
    2. These instructions are sent to the instruction queues of the corresponding execution units.
    3. The execution units execute these instructions in parallel.
    4. EnQue/DeQue solves the read-after-write hazard on the memory.
      • When EnQue is called, the synchronization instruction set is issued, sending a signal that releases the corresponding wait.
      • When DeQue is called, the synchronization instruction wait is issued, which blocks until the data write is finished.
      • The wait instruction can proceed only after the set signal is received; otherwise it stays blocked.

    EnQue/DeQue mainly provides synchronization control for read-after-write dependencies between parallel execution units.

  • AllocTensor/FreeTensor process:
    1. The Scalar Unit reads the operator instruction sequence.
    2. These instructions are sent to the instruction queue of the corresponding execution unit.
    3. Execution units execute these instructions in parallel.
    4. AllocTensor/FreeTensor solves the write-after-read hazard on the memory.
      • When AllocTensor is called, the synchronization instruction wait is issued, which blocks until the memory has been read completely.
      • When FreeTensor is called, the synchronization instruction set is issued, signaling that the memory can be reused for writing.
      • The wait instruction can proceed only after the set signal is received; otherwise it stays blocked.

    AllocTensor/FreeTensor mainly provides synchronization control for write-after-read dependencies between parallel execution units.
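To make the encapsulation concrete, the following sketch shows roughly what such synchronization would look like if written by hand with low-level event APIs. This is an illustrative assumption rather than code from this document: the HardEvent pairs are chosen for a copy-in/compute dependency, the event ID 0 is a placeholder (real code would obtain an event ID from TPipe), and the queue APIs described above issue equivalent set/wait pairs for you.

AscendC::DataCopy(localIn, gmIn, len);                                   // Producer (MTE2) writes localIn.
AscendC::SetFlag<AscendC::HardEvent::MTE2_V>(static_cast<event_t>(0));   // set: write finished (roughly what EnQue issues).
AscendC::WaitFlag<AscendC::HardEvent::MTE2_V>(static_cast<event_t>(0));  // wait: Vector Unit blocks until the write is visible (roughly what DeQue issues).
AscendC::Abs(localOut, localIn, count);                                  // Consumer (Vector Unit) reads localIn.
AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(static_cast<event_t>(0));   // set: read finished, the buffer may be rewritten (roughly what FreeTensor issues).
AscendC::WaitFlag<AscendC::HardEvent::V_MTE2>(static_cast<event_t>(0));  // wait: the next copy-in blocks until the read completes (roughly what AllocTensor issues).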

In conclusion, asynchronous parallel programs require complex synchronization control. The Ascend C programming model encapsulates these processes and uses EnQue/DeQue/AllocTensor/FreeTensor to simplify programming and make the code easier to understand.