GetBatchTensorC
Applicability
Product |
Supported |
|---|---|
√ |
|
√ |
|
x |
|
x |
|
x |
|
x |
Function
Obtains one slice of matrix C per call. This API works together with the asynchronous IterateNBatch API: after IterateNBatch has been called to start the iterative computation, each call to GetBatchTensorC returns a matrix slice of size std::max(batchA, batchB) × singleCoreM × singleCoreN.
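The slice size stated above can be sketched as a small host-side calculation. This is only an illustration: BatchSliceElements is a hypothetical helper, not part of the Matmul API, and it assumes that singleCoreM and singleCoreN are the single-core tiling sizes in elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the Matmul API): number of elements in the
// slice returned by one GetBatchTensorC call, following the documented size
// std::max(batchA, batchB) x singleCoreM x singleCoreN.
inline uint64_t BatchSliceElements(uint32_t batchA, uint32_t batchB,
                                   uint32_t singleCoreM, uint32_t singleCoreN)
{
    assert(batchA > 0 && batchB > 0);
    // Widen to 64 bits before multiplying to avoid overflow on large shapes.
    const uint64_t batches = std::max(batchA, batchB);
    return batches * singleCoreM * singleCoreN;
}
```

Multiplying the result by sizeof(DstT) would give a lower bound on the buffer that receives the slice, under the same assumptions.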
Prototype

    template <bool sync = true>
    __aicore__ inline GlobalTensor<DstT> GetBatchTensorC(uint32_t batchA, uint32_t batchB, bool enSequentialWrite = false)

    template <bool sync = true>
    __aicore__ inline void GetBatchTensorC(const LocalTensor<DstT>& c, uint32_t batchA, uint32_t batchB, bool enSequentialWrite = false)

Parameters
| Parameter | Description |
|---|---|
| sync | Only the asynchronous mode is supported. That is, this parameter can only be set to false. |

| Parameter | Input/Output | Description |
|---|---|---|
| batchA | Input | Number of batches of the left matrix. |
| batchB | Input | Number of batches of the right matrix. |
| enSequentialWrite | Input | This parameter is reserved and can be ignored. |
| c | Input | Address of matrix C in the local memory, which is used to store matrix slices. |
Returns
GlobalTensor<DstT>: the computed matrix slice. The overload that takes a LocalTensor returns no value and writes the slice to c instead.
Restrictions
- This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
- When matrix C slices are output to the local memory and the single-core size in the N direction (singleCoreN) is not 32-byte aligned, matrix C supports only the ND_ALIGN CubeFormat. In this case, the output data along the singleCoreN direction is automatically padded to a 32-byte boundary.
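The padding in the last restriction can be sketched as a row-length calculation. This is a minimal sketch, assuming that "padded to 32 bytes" means each row along singleCoreN is rounded up to the next multiple of 32 bytes. PaddedSingleCoreN and kAlignBytes are hypothetical names, not part of the Matmul API.

```cpp
#include <cassert>
#include <cstdint>

// Alignment unit from the restriction above.
constexpr uint32_t kAlignBytes = 32;

// Hypothetical helper (not part of the Matmul API): row length in elements
// after rounding the singleCoreN direction up to a 32-byte boundary.
// Assumes the element size in bytes divides 32 evenly.
inline uint32_t PaddedSingleCoreN(uint32_t singleCoreN, uint32_t elemBytes)
{
    assert(elemBytes > 0 && kAlignBytes % elemBytes == 0);
    const uint32_t elemsPerBlock = kAlignBytes / elemBytes;
    // Round up to the next multiple of elemsPerBlock.
    return (singleCoreN + elemsPerBlock - 1) / elemsPerBlock * elemsPerBlock;
}
```

Under this assumption, a local buffer for ND_ALIGN output would be sized from the padded row length rather than from singleCoreN.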
Examples

    // Calculate the number of loops required for multi-batch computation.
    int for_extent = tiling.ALayoutInfoB * tiling.ALayoutInfoN * g_lay / tiling.BatchNum;
    mm1.SetTensorA(gm_a[0], isTransposeAIn);
    mm1.SetTensorB(gm_b[0], isTransposeBIn);
    if (tiling.isBias) {
        mm1.SetBias(gm_bias[0]);
    }
    // Multi-batch Matmul computation (asynchronous: sync is set to false)
    mm1.template IterateNBatch<false>(for_extent, batchA, batchB, false);
    // ...other compute
    for (int i = 0; i < for_extent; ++i) {
        // Fetch one matrix C slice per iteration into the local tensor ubCmatrix.
        // batchA and batchB are passed as required by the prototype.
        mm1.template GetBatchTensorC<false>(ubCmatrix, batchA, batchB);
        // ...other compute
    }