Asynchronous Scenario Processing

Overview

The Iterate and IterateAll APIs of Matmul support both synchronous and asynchronous modes in the MIX scenario (matrix computation together with vector computation). In the CUBE_ONLY scenario (matrix computation only), only the synchronous mode is supported.

In synchronous mode, the next operation can be performed only after the preceding operation finishes. In asynchronous mode, the next operation can be performed without waiting for the preceding operation to finish.

Synchronous and asynchronous Iterate and GetTensorC

Synchronous mode: After each iteration, GetTensorC is called to move the computed matrix C slice out, and the next computation starts only after that data movement finishes. As shown in the following figure, in matrix C, matrix block 2 is computed only after matrix block 1 is moved out, and matrix block 3 is computed only after matrix block 2 is moved out.

The key sample code for the synchronous mode of Iterate and GetTensorC is as follows:

          
while (mm.Iterate()) {
    mm.GetTensorC(gm_c);
}
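
The ordering that this loop enforces can be illustrated with a small host-side mock. This is only a sketch: MockMatmul and RunSynchronousLoop are hypothetical names that record call order, they are not part of the Matmul API, and the mock performs no real computation or data movement.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical host-side mock, not the Matmul API: it only records call order.
struct MockMatmul {
    int totalBlocks;
    int computed = 0;
    std::vector<std::string> log;

    explicit MockMatmul(int blocks) : totalBlocks(blocks) {}

    // Computes the next block; returns false once every block is done.
    bool Iterate() {
        if (computed == totalBlocks) {
            return false;
        }
        ++computed;
        log.push_back("compute " + std::to_string(computed));
        return true;
    }

    // Moves the most recently computed block out.
    void GetTensorC() {
        assert(computed > 0);
        log.push_back("move " + std::to_string(computed));
    }
};

// Runs the synchronous pattern and returns the recorded call order.
inline std::vector<std::string> RunSynchronousLoop(int blocks) {
    MockMatmul mm(blocks);
    while (mm.Iterate()) {
        mm.GetTensorC();
    }
    return mm.log;
}
```

Calling RunSynchronousLoop(3) records each compute step immediately followed by the move of the same block, which mirrors the block 1, block 2, block 3 ordering described for the synchronous mode.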

Asynchronous mode: Enable the asynchronous mode by setting the template parameters of the Iterate API. After Iterate is called, GetTensorC does not need to be called immediately to wait for the matrix C block to be moved. Other operations can be executed first, and GetTensorC can be called when the result is needed. This reduces synchronization time and increases parallelism, which suits scenarios with high requirements on computing performance.

In the asynchronous scenario, reserve a temporary space to cache the Iterate computation result; otherwise, the result will be overwritten. When GetTensorC is called, slices of matrix C are obtained from this temporary space. Set the temporary space by calling SetWorkspace, and call SetWorkspace before Iterate.

The key sample code for the asynchronous mode of Iterate and GetTensorC is as follows:

           
// workspace is the physical address of the temporary space; size is the size of
// matrix C (singleCoreM × singleCoreN).
mm.SetWorkspace(workspace, size);
// Asynchronous mode
mm.template Iterate<false>();
// ... Perform other operations.
auto mIter = Ceil(singleCoreM, baseM);
auto nIter = Ceil(singleCoreN, baseN);
for (int i = 0; i < mIter * nIter; ++i) {
    mm.template GetTensorC<false>(gm_c);
}
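
The loop bound in the sample is the number of baseM × baseN blocks needed to cover the singleCoreM × singleCoreN result. The following host-side sketch works through that arithmetic. It assumes that Ceil performs ceiling division; CeilDiv, BlockCount, and WorkspaceElements are hypothetical helper names, and the shape values used afterwards are illustrative only. The workspace is counted in elements here, because the unit of the size argument is not stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Ceil, assumed to round a / b up to an integer.
inline uint32_t CeilDiv(uint32_t a, uint32_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of GetTensorC calls needed to fetch every block after one
// asynchronous Iterate, matching mIter * nIter in the sample.
inline uint32_t BlockCount(uint32_t singleCoreM, uint32_t singleCoreN,
                           uint32_t baseM, uint32_t baseN) {
    return CeilDiv(singleCoreM, baseM) * CeilDiv(singleCoreN, baseN);
}

// Element count of the temporary space that caches the whole
// singleCoreM x singleCoreN result (unit is an assumption).
inline std::size_t WorkspaceElements(uint32_t singleCoreM, uint32_t singleCoreN) {
    return static_cast<std::size_t>(singleCoreM) * singleCoreN;
}
```

For example, with singleCoreM = 512, singleCoreN = 256, baseM = 128, and baseN = 64, BlockCount gives 4 × 4 = 16 GetTensorC calls, and WorkspaceElements gives 131072 elements.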

Synchronous and asynchronous IterateAll

Synchronous mode: Subsequent operations can be performed only after the IterateAll execution finishes.

The key sample code for the synchronous mode of IterateAll is as follows:

          
mm.SetTensorA(gm_a);    // Set the left matrix A.
mm.SetTensorB(gm_b);    // Set the right matrix B.
mm.SetBias(gm_bias);    // Set the bias.
mm.IterateAll(gm_c);
// Follow-up operations.
...

Asynchronous mode: Subsequent operations do not need to wait for the completion of IterateAll. If the result of IterateAll is required, call WaitIterateAll to wait for the result returned by the asynchronous IterateAll API.

The key sample code for the asynchronous mode of IterateAll is as follows:

           
AscendC::Matmul<aType, bType, cType, biasType> mm;
mm.SetTensorA(queryGm[tensorACoreOffset]);
mm.SetTensorB(keyGm[tensorBCoreOffset + sInnerStart * singleProcessSInnerSize *
    tilingData->attentionScoreOffsetStrideParams.matmulHead], true);
mm.SetTail(singleProcessSOuterSize, mmNNum);
mm.template IterateAll<false>(workspaceGm[tmp_block_idx * mmResUbSize * sInnerLoopTimes], 0, false, true);
// Perform other operations.
mm.WaitIterateAll();  // Wait for IterateAll to complete.
DataCopy(dstUB, GM);  // Copy data from GM to UB.

Application Scenarios

Synchronous Iterate and GetTensorC: MIX scenario (including matrix computation and vector computation) and CUBE_ONLY scenario (matrix computation only).
Asynchronous Iterate and GetTensorC: MIX scenario (including matrix computation and vector computation).
Synchronous IterateAll: MIX scenario (including matrix computation and vector computation) and CUBE_ONLY scenario (matrix computation only).
Asynchronous IterateAll: MIX scenario (including matrix computation and vector computation).

Restrictions

Asynchronous scenarios of Iterate and GetTensorC:
- Ensure that the size of the input matrix C address space is greater than or equal to baseM × baseN.
- Call SetWorkspace before Iterate.
- Three output modes are supported: output to VECIN only, output to the global memory only, and output to both the global memory and VECIN.
- When matrix C is extracted to VECIN, only the NZ data format is supported. When matrix C is extracted to the global memory, the ND or NZ data format is supported.

Asynchronous scenario of IterateAll:
- Ensure that the size of the input matrix C address space is greater than or equal to singleCoreM × singleCoreN.
- Data can only be continuously output to the global memory.
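
The size, format, and destination restrictions listed above can be written down as simple host-side checks. This is a minimal sketch only: the enums and function names are hypothetical and are not Matmul API symbols, and sizes are counted in elements as an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; none of these are Matmul API symbols.
enum class AsyncApi { IterateGetTensorC, IterateAll };
enum class Destination { VECIN, GlobalMemory };
enum class Format { ND, NZ };

// Minimum matrix C address space (in elements, an assumption) for each
// asynchronous API: one base block for Iterate/GetTensorC, the whole
// single-core result for IterateAll.
inline std::size_t MinOutputElements(AsyncApi api,
                                     std::size_t singleCoreM, std::size_t singleCoreN,
                                     std::size_t baseM, std::size_t baseN) {
    assert(baseM > 0 && baseN > 0);
    return api == AsyncApi::IterateGetTensorC ? baseM * baseN
                                              : singleCoreM * singleCoreN;
}

// Format rule for asynchronous Iterate and GetTensorC:
// VECIN accepts NZ only, global memory accepts ND or NZ.
inline bool IsFormatSupported(Destination dst, Format fmt) {
    if (dst == Destination::VECIN) {
        return fmt == Format::NZ;
    }
    return fmt == Format::ND || fmt == Format::NZ;
}

// Asynchronous IterateAll can only output to global memory.
inline bool IsDestinationSupported(AsyncApi api, Destination dst) {
    return api == AsyncApi::IterateGetTensorC || dst == Destination::GlobalMemory;
}
```

With illustrative values singleCoreM = 512, singleCoreN = 256, baseM = 128, and baseN = 64, the minimum is 8192 elements for asynchronous Iterate and GetTensorC, and 131072 elements for asynchronous IterateAll.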

Example

For a complete example of the asynchronous Iterate and GetTensorC scenario, see asynchronous scenario sample and matrix multiplication in Iterate asynchronous scenarios.
For a complete example of the asynchronous IterateAll scenario, see matrix multiplication in IterateAll asynchronous scenarios.
