GetTensorC
Applicability
| Product | Supported |
|---|---|
| | √ |
| | √ |
| | √ |
| | √ |
| | x |
| | x |
Function
Obtains one or two slices of matrix C after Iterate is called and directly outputs the result to the GM tensor or VECIN tensor. When ScheduleType in MatmulConfig is set to ScheduleType::INNER_PRODUCT, one slice of matrix C is obtained. When ScheduleType in MatmulConfig is set to ScheduleType::OUTER_PRODUCT, two slices of matrix C are obtained.
This API is used together with the Iterate API to obtain one or two matrix slices with the size of baseM × baseN based on the value of ScheduleType in MatmulConfig after the Iterate API is called to complete iterative computation.
Slices of matrix C can be obtained iteratively in either synchronous or asynchronous mode:
- Synchronous mode: After Iterate, execute GetTensorC and wait until matrix C slices are obtained.
- Asynchronous mode: After Iterate, GetTensorC does not need to be called immediately. Other logic can run first, and GetTensorC is called only when the result is needed. This reduces the time spent waiting on synchronization and increases parallelism, which makes this mode suitable for scenarios with high requirements on computing performance.
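The two calling patterns can be sketched on the host with a stand-in object. `MockMatmul`, `IterateAll`, `RunSync`, and `RunAsync` below are hypothetical names and are not part of the Ascend C API; the mock only models the bookkeeping (one slice produced per iteration, results cached until fetched) and performs no real computation.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical host-side stand-in for the Matmul object; NOT the Ascend C API.
// It tracks slice ids only and does no matrix computation.
class MockMatmul {
public:
    MockMatmul(int tilesM, int tilesN) : total_(tilesM * tilesN) {}

    // Models synchronous Iterate: one slice per call, false when finished.
    bool Iterate() {
        if (next_ >= total_) return false;
        pending_.push_back(next_++);
        return true;
    }

    // Models asynchronous Iterate: every slice is queued in the cache at once.
    void IterateAll() {
        while (next_ < total_) pending_.push_back(next_++);
    }

    // Models GetTensorC: returns the id of the oldest cached slice.
    int GetTensorC() {
        assert(!pending_.empty() && "GetTensorC called with no slice available");
        int id = pending_.front();
        pending_.pop_front();
        return id;
    }

private:
    int total_;
    int next_ = 0;
    std::deque<int> pending_;
};

// Synchronous pattern: fetch each slice right after it is computed.
std::vector<int> RunSync(int tilesM, int tilesN) {
    MockMatmul mm(tilesM, tilesN);
    std::vector<int> order;
    while (mm.Iterate()) order.push_back(mm.GetTensorC());
    return order;
}

// Asynchronous pattern: start all iterations, then fetch when results are needed.
std::vector<int> RunAsync(int tilesM, int tilesN) {
    MockMatmul mm(tilesM, tilesN);
    mm.IterateAll();
    // Other work could run here before any result is fetched.
    std::vector<int> order;
    for (int i = 0; i < tilesM * tilesN; ++i) order.push_back(mm.GetTensorC());
    return order;
}
```

In this model both patterns return the slices in the same order; the difference is only where the caller waits, which is the point of the asynchronous mode.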
Prototype
- Obtain matrix C and output it to VECIN.
template <bool sync = true> __aicore__ inline void GetTensorC(const LocalTensor<DstT>& co2Local, uint8_t enAtomic = 0, bool enSequentialWrite = false)
- Supported mode: synchronous
- Supported mode: asynchronous
- Obtain matrix C and output it to GM.
template <bool sync = true> __aicore__ inline void GetTensorC(const GlobalTensor<DstT>& gm, uint8_t enAtomic = 0, bool enSequentialWrite = false)
- Supported mode: synchronous
- Supported mode: asynchronous
- Obtain matrix C and output it to both GM and VECIN.
template <bool sync = true> __aicore__ inline void GetTensorC(const GlobalTensor<DstT> &gm, const LocalTensor<DstT> &co2Local, uint8_t enAtomic = 0, bool enSequentialWrite = false)
- Supported mode: synchronous
- Supported mode: asynchronous
- This API is not supported in CUBE_ONLY mode.
- For the Atlas 200I/500 A2 inference products, the matrix cannot be output to GM and VECIN at the same time.
- Obtains matrix C on the workspace for caching results in the asynchronous scenario. The subsequent use process is controlled by developers.
When matrix C is output to VECIN, the size of the Unified Buffer space allocated to VECIN affects Matmul computation. If the space is too small, the hardware computing power cannot be fully used. This API instead returns matrix C as cached on the workspace, and developers control how it is used afterward.
Note that during initialization, the logical position of matrix C should be set to TPosition::VECIN. After this API is called to obtain the cached matrix C, copy the matrix to the Unified Buffer manually.
template <bool sync = true> __aicore__ inline GlobalTensor<DstT> GetTensorC(uint8_t enAtomic = 0, bool enSequentialWrite = false)
- Supported mode: asynchronous
The doPad, height, width, srcGap, and dstGap parameters in the following API are to be deprecated and do not need to be passed. Retain their default values. The VECIN output prototype listed above is this same function with those parameters left at their defaults.
template <bool sync = true, bool doPad = false> __aicore__ inline void GetTensorC(const LocalTensor<DstT>& c, uint8_t enAtomic = 0, bool enSequentialWrite = false, uint32_t height = 0, uint32_t width = 0, uint32_t srcGap = 0, uint32_t dstGap = 0)
Parameters
| Parameter | Description |
|---|---|
| sync | Sets the synchronous or asynchronous mode: true enables synchronous mode, and false enables asynchronous mode. |
Returns
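None for the prototypes that take a destination tensor. The prototype that takes no destination tensor returns a GlobalTensor<DstT> that refers to matrix C cached on the workspace.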
Restrictions
- Ensure that the address space passed in for matrix C is at least baseM × baseN in size.
- In the asynchronous scenario, a temporary space is required to cache the Iterate computation result. When GetTensorC is called, slices of matrix C are obtained from the temporary space. The temporary space is set by calling SetWorkspace. Call SetWorkspace before Iterate.
- This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
Example
- Obtain matrix C and output it to VECIN.
    // Synchronous mode
    while (mm.Iterate()) {
        mm.GetTensorC(ubCmatrix);
    }
    // Asynchronous mode
    mm.template Iterate<false>();
    // Other operations
    for (int i = 0; i < singleM / baseM * singleN / baseN; ++i) {
        mm.template GetTensorC<false>(ubCmatrix);
        // Other operations
    }
- Obtain matrix C and output it to GM in synchronous mode.
    while (mm.Iterate()) {
        mm.GetTensorC(gm);
    }
- Obtain matrix C and output it to GM and VECIN in synchronous mode.
    while (mm.Iterate()) {
        mm.GetTensorC(gm, ubCmatrix);
    }
- Obtain matrix C on GM returned by the API and manually copy it to UB in asynchronous mode.
    // BaseM * BaseN = 128 * 256
    mm.SetTensorA(gmA);
    mm.SetTensorB(gmB);
    mm.SetTail(singleM, singleN, singleK);
    mm.template Iterate<false>();
    // Other operations
    for (int i = 0; i < singleM / baseM * singleN / baseN; ++i) {
        // Obtain the BaseM × BaseN data (128 × 256) calculated each time.
        GlobalTensor<T> global = mm.template GetTensorC<false>();
        for (int j = 0; j < 4; ++j) {
            // Allocate UB space for 64 × 128 elements.
            LocalTensor<half> local = que.AllocTensor<half>();
            // Copy the j-th chunk of the slice from GM to UB for subsequent vector operations.
            // The offset advances with j so that each pass copies a different chunk.
            DataCopy(local, global[64 * 128 * j], 64 * 128);
            // Other vector operations
        }
    }
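The inner loop of the last example moves one 128 × 256 slice to UB in four copies of 64 × 128 elements each. The chunk arithmetic can be sketched in plain C++ as follows. `Chunk` and `SplitSlice` are hypothetical helpers, not Ascend C APIs, and the sketch assumes that the returned GlobalTensor addresses the start of the current slice and that the chunks are contiguous.

```cpp
#include <cassert>
#include <vector>

// One contiguous chunk of a slice: starting element offset and element count.
struct Chunk {
    int offset;
    int count;
};

// Splits a slice of sliceElems elements into equal contiguous chunks.
// Hypothetical helper for illustration; not part of the Ascend C API.
std::vector<Chunk> SplitSlice(int sliceElems, int chunkElems) {
    assert(chunkElems > 0 && sliceElems % chunkElems == 0);
    std::vector<Chunk> chunks;
    for (int j = 0; j < sliceElems / chunkElems; ++j) {
        chunks.push_back({j * chunkElems, chunkElems});
    }
    return chunks;
}
```

For a 128 × 256 slice and 64 × 128-element chunks, `SplitSlice(128 * 256, 64 * 128)` yields four chunks starting at element offsets 0, 8192, 16384, and 24576, which together cover the slice exactly once.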
For more operator examples in asynchronous scenarios, see asynchronous scenario sample and matrix multiplication in Iterate asynchronous scenarios.
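The asynchronous examples loop `singleM / baseM * singleN / baseN` times, which matches the number of slices when singleM and singleN are exact multiples of baseM and baseN and one slice is obtained per call (ScheduleType::INNER_PRODUCT). The following plain C++ sketch compares that expression with a count that rounds each dimension up. `ExampleLoopBound` and `CeilSliceCount` are hypothetical helpers, and rounding up for tail blocks is an assumption that this section does not state.

```cpp
#include <cassert>

// Loop bound exactly as written in the examples. C++ evaluates it left to
// right as ((singleM / baseM) * singleN) / baseN using integer division.
int ExampleLoopBound(int singleM, int baseM, int singleN, int baseN) {
    assert(baseM > 0 && baseN > 0);
    return singleM / baseM * singleN / baseN;
}

// Hypothetical alternative: round each dimension up so that partial (tail)
// blocks are counted. This is an assumption, not documented behavior.
int CeilSliceCount(int singleM, int baseM, int singleN, int baseN) {
    assert(baseM > 0 && baseN > 0);
    int tilesM = (singleM + baseM - 1) / baseM;
    int tilesN = (singleN + baseN - 1) / baseN;
    return tilesM * tilesN;
}
```

With singleM = 256, singleN = 512, baseM = 128, and baseN = 256, both functions return 4. With singleM = 200 the example expression returns 2 while the rounded-up count is 4, so the two differ as soon as a dimension is not an exact multiple of its base size.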

