Asynchronous scenario processing

Overview

The Iterate and IterateAll APIs of Matmul provide synchronous and asynchronous modes in the MIX scenario (including cube computation and vector computation). In the pure Cube scenario (including only cube computation), only the synchronous mode is supported.

In synchronous mode, the next operation can be performed only after the preceding operation finishes. In asynchronous mode, the next operation can be performed without waiting for the preceding operation to finish.
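
The difference between the two modes can be illustrated with a toy cost model. The following host-side C++ sketch is not Ascend C code and does not call the Matmul API. It assumes that each block of matrix C takes a fixed time to compute and a fixed time to move out, and that in asynchronous mode the two stages overlap perfectly. The function names and the cost model are hypothetical and serve only as an illustration; real overlap depends on the hardware and the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy cost model (hypothetical, not part of the Matmul API): every block of
// matrix C costs `compute` time units to calculate and `move` units to move out.

// Synchronous mode: block i+1 is computed only after block i has been moved out.
inline std::uint64_t SyncTotalTime(std::uint64_t blocks, std::uint64_t compute, std::uint64_t move) {
    return blocks * (compute + move);
}

// Idealized asynchronous mode: moving out block i overlaps with computing block i+1,
// so the steady-state cost per block is the slower of the two stages.
inline std::uint64_t AsyncTotalTime(std::uint64_t blocks, std::uint64_t compute, std::uint64_t move) {
    if (blocks == 0) {
        return 0;
    }
    return compute + (blocks - 1) * std::max(compute, move) + move;
}

// Usage: 3 blocks, 4 time units to compute and 2 to move each block.
inline void CheckToyModel() {
    assert(SyncTotalTime(3, 4, 2) == 18);   // 3 * (4 + 2)
    assert(AsyncTotalTime(3, 4, 2) == 14);  // 4 + 2 * max(4, 2) + 2
}
```

With a single block there is nothing to overlap, so both functions return the same value in this model.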

  • Synchronous and Asynchronous Iterate and GetTensorC
    • Synchronous mode: After each Iterate call, GetTensorC is called to move out a slice of matrix C, and the next computation starts only after that data movement finishes. As shown in the following figure, in matrix C, matrix block 2 is computed only after matrix block 1 is moved out, and matrix block 3 is computed only after matrix block 2 is moved out.

      The following is a key code example of the Iterate&GetTensorC synchronous mode:

      while (mm.Iterate()) {
          mm.GetTensorC(gm_c);
      }
      
    • Asynchronous mode: The asynchronous mode is enabled by setting the template parameter of the Iterate API. After Iterate is called, GetTensorC does not need to be called immediately to wait for the matrix C slice to be moved out. Other operations can run first, and GetTensorC is called only when the result is needed. This reduces synchronization time and increases parallelism, which suits performance-critical scenarios. Because the results are not fetched immediately, a temporary space must be reserved to cache the Iterate compute results. Otherwise, the results are overwritten. GetTensorC then obtains the slices of matrix C from this temporary space. Set the temporary space by calling SetWorkspace before Iterate.
      The following is a code example of the Iterate&GetTensorC asynchronous mode:
      // workspace is the physical address of the temporary space; size is the size of matrix C (singleCoreM x singleCoreN).
      mm.SetWorkspace(workspace, size);
      // Asynchronous mode
      mm.template Iterate<false>();
      ... // Perform other operations.
      // One GetTensorC call is needed for each baseM x baseN block of matrix C.
      auto mIter = Ceil(singleCoreM, baseM);
      auto nIter = Ceil(singleCoreN, baseN);
      for (int i = 0; i < mIter * nIter; ++i) {
          mm.template GetTensorC<false>(gm_c);
      }
      
  • Synchronous and Asynchronous IterateAll
    • Synchronous mode: Subsequent operations can be performed only after the IterateAll execution finishes.

      The following is a code example of the synchronous mode of IterateAll:

      mm.SetTensorA(gm_a);    // Set the left matrix A.
      mm.SetTensorB(gm_b);    // Set the right matrix B.
      mm.SetBias(gm_bias);    // Set the bias.
      mm.IterateAll(gm_c);
      // Follow-up operations
      ...
      
    • Asynchronous mode: Subsequent operations do not need to wait for IterateAll to finish. When the result of IterateAll is needed, call WaitIterateAll to wait for the asynchronous IterateAll call to complete.
      The following is a code example of the asynchronous mode of IterateAll:
      AscendC::Matmul<aType, bType, cType, biasType> mm;
      mm.SetTensorA(queryGm[tensorACoreOffset]);
      mm.SetTensorB(keyGm[tensorBCoreOffset + sInnerStart * singleProcessSInnerSize *
            tilingData->attentionScoreOffsetStrideParams.matmulHead], true);
      mm.SetTail(singleProcessSOuterSize, mmNNum);
      mm.template IterateAll<false>(workspaceGm[tmp_block_idx * mmResUbSize * sInnerLoopTimes], 0, false, true);
      // Perform other operations.
      mm.WaitIterateAll(); // Wait for IterateAll to complete.
      DataCopy(dstUB, GM);  // Copy data from GM to UB.
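
The loop bound in the asynchronous Iterate&GetTensorC example and the workspace size it relies on can be checked on the host. The following C++ sketch is an illustration only: CeilDiv is assumed to behave like the Ceil helper used in the kernel example (round-up integer division), and BlockCount and WorkspaceElements are hypothetical helper names. The source does not state whether the size passed to SetWorkspace is counted in elements or in bytes, so the sketch returns an element count.

```cpp
#include <cassert>
#include <cstdint>

// Round-up integer division. Assumed to match the Ceil helper in the kernel example.
inline std::uint32_t CeilDiv(std::uint32_t value, std::uint32_t divisor) {
    assert(divisor != 0);
    return (value + divisor - 1) / divisor;
}

// Number of GetTensorC calls needed to fetch every baseM x baseN block
// produced by one asynchronous Iterate on a singleCoreM x singleCoreN matrix C.
inline std::uint32_t BlockCount(std::uint32_t singleCoreM, std::uint32_t singleCoreN,
                                std::uint32_t baseM, std::uint32_t baseN) {
    return CeilDiv(singleCoreM, baseM) * CeilDiv(singleCoreN, baseN);
}

// Number of matrix C elements the temporary space must hold (singleCoreM x singleCoreN).
// Multiply by the element size if a byte count is needed (assumption).
inline std::uint64_t WorkspaceElements(std::uint32_t singleCoreM, std::uint32_t singleCoreN) {
    return static_cast<std::uint64_t>(singleCoreM) * singleCoreN;
}

// Usage: a 200 x 100 matrix C computed in 128 x 64 blocks.
inline void CheckBlockCount() {
    assert(BlockCount(200, 100, 128, 64) == 4);    // 2 row blocks x 2 column blocks
    assert(WorkspaceElements(200, 100) == 20000);  // 200 x 100 elements
}
```

Tail blocks count as full iterations, which is why the row and column counts are rounded up before they are multiplied.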
      

Use Case

  • Synchronous Iterate&GetTensorC: MIX scenario (matrix and vector computation) and pure Cube scenario (matrix computation only).
  • Asynchronous Iterate&GetTensorC: MIX scenario only (matrix and vector computation).
  • Synchronous IterateAll: MIX scenario (matrix and vector computation) and pure Cube scenario (matrix computation only).
  • Asynchronous IterateAll: MIX scenario only (matrix and vector computation).
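
The support rules in this list are the same for Iterate&GetTensorC and for IterateAll, so they reduce to one small lookup. The following host-side C++ sketch restates the list as code. The Scenario and Mode enums and the IsModeSupported function are hypothetical names introduced for illustration and are not part of the Matmul API.

```cpp
#include <cassert>

// Hypothetical enums that mirror the terms used in the list above.
enum class Scenario { Mix, PureCube };
enum class Mode { Sync, Async };

// Returns true if the given mode is listed as supported in the given scenario.
// The rule applies equally to Iterate&GetTensorC and to IterateAll.
inline bool IsModeSupported(Scenario scenario, Mode mode) {
    if (mode == Mode::Sync) {
        return true;  // Synchronous mode: MIX and pure Cube scenarios.
    }
    return scenario == Scenario::Mix;  // Asynchronous mode: MIX scenario only.
}

// Usage: asynchronous mode is rejected in the pure Cube scenario.
inline void CheckModeSupport() {
    assert(IsModeSupported(Scenario::Mix, Mode::Async));
    assert(!IsModeSupported(Scenario::PureCube, Mode::Async));
}
```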

Restrictions

  • Asynchronous Iterate&GetTensorC:
    • Ensure that the size of the input matrix C address space is greater than or equal to baseM x baseN.
    • Call SetWorkspace before Iterate.
    • The output can be written only to the VECIN, only to the global memory, or to both the global memory and VECIN.
    • When matrix C is copied to the VECIN, the data format can only be NZ. When matrix C is copied to the global memory, the data format can be ND or NZ.
  • Asynchronous IterateAll:
    • Ensure that the size of the input matrix C address space is greater than or equal to singleCoreM x singleCoreN.
    • The output can only be written contiguously to the global memory.
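
The restrictions above can be expressed as host-side checks that run before a kernel is launched. The following C++ sketch is an illustration under stated assumptions: the enums and function names are hypothetical, sizes are counted in elements (the source does not state the unit), and the format check covers only asynchronous Iterate&GetTensorC because the list gives no format rule for asynchronous IterateAll.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enums that mirror the terms used in the restrictions.
enum class AsyncApi { IterateGetTensorC, IterateAll };
enum class Destination { VecIn, GlobalMemory };
enum class Format { ND, NZ };

// Minimum size of the matrix C address space, counted in elements (assumption).
inline std::uint64_t MinCElements(AsyncApi api, std::uint32_t baseM, std::uint32_t baseN,
                                  std::uint32_t singleCoreM, std::uint32_t singleCoreN) {
    if (api == AsyncApi::IterateGetTensorC) {
        return static_cast<std::uint64_t>(baseM) * baseN;          // One block.
    }
    return static_cast<std::uint64_t>(singleCoreM) * singleCoreN;  // Whole single-core result.
}

// Asynchronous Iterate&GetTensorC may write to VECIN or global memory;
// asynchronous IterateAll may write only to global memory.
inline bool IsDestinationAllowed(AsyncApi api, Destination dst) {
    return api == AsyncApi::IterateGetTensorC || dst == Destination::GlobalMemory;
}

// Format rule for asynchronous Iterate&GetTensorC: NZ only for VECIN, ND or NZ for global memory.
inline bool IsFormatAllowedForAsyncIterate(Destination dst, Format fmt) {
    return dst == Destination::GlobalMemory || fmt == Format::NZ;
}

// Usage: 128 x 64 blocks inside a 256 x 128 single-core matrix C.
inline void CheckRestrictions() {
    assert(MinCElements(AsyncApi::IterateGetTensorC, 128, 64, 256, 128) == 8192);
    assert(MinCElements(AsyncApi::IterateAll, 128, 64, 256, 128) == 32768);
    assert(!IsDestinationAllowed(AsyncApi::IterateAll, Destination::VecIn));
    assert(!IsFormatAllowedForAsyncIterate(Destination::VecIn, Format::ND));
}
```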

Examples