GetTensorC

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
AI Core of Atlas inference products	√
Vector Core of Atlas inference products	x
Atlas training products	x

Function

Obtains one or two slices of matrix C after Iterate is called and directly outputs the result to the GM tensor or VECIN tensor. When ScheduleType in MatmulConfig is set to ScheduleType::INNER_PRODUCT, one slice of matrix C is obtained. When ScheduleType in MatmulConfig is set to ScheduleType::OUTER_PRODUCT, two slices of matrix C are obtained.

This API is used together with the Iterate API to obtain one or two matrix slices with the size of baseM × baseN based on the value of ScheduleType in MatmulConfig after the Iterate API is called to complete iterative computation.

Slices of matrix C can be obtained iteratively in either synchronous or asynchronous mode.

Synchronous mode: After Iterate, execute GetTensorC and wait until matrix C slices are obtained.
Asynchronous mode: After Iterate, GetTensorC does not have to be called immediately. Other logic can run first, and GetTensorC is called only when the result is needed. This reduces the time spent waiting on synchronization and improves parallelism, which suits performance-critical scenarios.

Prototype

Obtain matrix C and output it to VECIN.

        
template <bool sync = true>
__aicore__ inline void GetTensorC(const LocalTensor<DstT>& co2Local, uint8_t enAtomic = 0, bool enSequentialWrite = false)

Supported modes: synchronous and asynchronous

Obtain matrix C and output it to GM.

        
template <bool sync = true>
__aicore__ inline void GetTensorC(const GlobalTensor<DstT>& gm, uint8_t enAtomic = 0, bool enSequentialWrite = false)

Supported modes: synchronous and asynchronous

Obtain matrix C and output it to both GM and VECIN.

        
template <bool sync = true>
__aicore__ inline void GetTensorC(const GlobalTensor<DstT>& gm, const LocalTensor<DstT>& co2Local, uint8_t enAtomic = 0, bool enSequentialWrite = false)

Supported modes: synchronous and asynchronous
This API is not supported in CUBE_ONLY mode.
For Atlas 200I/500 A2 inference products, matrix C cannot be output to GM and VECIN at the same time.

Obtain matrix C cached on the workspace in the asynchronous scenario. Developers control how it is used afterwards.

When matrix C is output to VECIN, the amount of Unified Buffer space allocated to VECIN affects Matmul computing. If the space is too small, the hardware compute capability cannot be fully used. This prototype therefore returns matrix C as cached on the workspace and leaves the subsequent handling to developers.

Note that during initialization, the logical position of matrix C must be set to TPosition::VECIN. After calling this API to obtain the cached matrix C, copy it to Unified Buffer manually.

        
template <bool sync = true>
__aicore__ inline GlobalTensor<DstT> GetTensorC(uint8_t enAtomic = 0, bool enSequentialWrite = false)

Supported mode: asynchronous

The doPad, height, width, srcGap, and dstGap parameters in the following prototype are to be deprecated. Do not pass them; keep their default values. The prototype that outputs to VECIN, shown above, is this same function called without the defaulted parameters.

     
template <bool sync = true, bool doPad = false>
__aicore__ inline void GetTensorC(const LocalTensor<DstT>& c, uint8_t enAtomic = 0, bool enSequentialWrite = false, uint32_t height = 0, uint32_t width = 0, uint32_t srcGap = 0, uint32_t dstGap = 0)

Parameters

**Table 1** Template parameters
Parameter	Description
sync	Selects synchronous or asynchronous mode: true enables synchronous mode and false enables asynchronous mode. For Atlas A3 training products / Atlas A3 inference products, asynchronous mode is supported. For Atlas A2 training products / Atlas A2 inference products, asynchronous mode is supported. For the AI Core of Atlas inference products, asynchronous mode is not supported. For Atlas 200I/500 A2 inference products, asynchronous mode is not supported.

**Table 2** API parameters
Parameter	Input/Output	Description
c/co2Local	Output	Destination on VECIN to which matrix C is extracted. The data format must be NZ. For Atlas A3 training products / Atlas A3 inference products, the supported data types are half, float, bfloat16_t, int32_t, and int8_t. For Atlas A2 training products / Atlas A2 inference products, the supported data types are half, bfloat16_t, float, int32_t, and int8_t. For the AI Core of Atlas inference products, the supported data types are half, float, int32_t, and int8_t. For Atlas 200I/500 A2 inference products, the supported data types are half, bfloat16_t, float, and int32_t.
gm	Output	Destination on GM to which matrix C is extracted. The data format can be ND or NZ. For Atlas A3 training products / Atlas A3 inference products, the supported data types are half, float, bfloat16_t, int32_t, and int8_t. For Atlas A2 training products / Atlas A2 inference products, the supported data types are half, bfloat16_t, float, int32_t, and int8_t. For the AI Core of Atlas inference products, the supported data types are half, float, int32_t, and int8_t. For Atlas 200I/500 A2 inference products, the supported data types are half, bfloat16_t, float, and int32_t.
enAtomic	Input	Specifies whether to enable the Atomic operation. Values: 0 (default): disables the Atomic operation. 1: enables the AtomicAdd (accumulation) operation. 2: enables the AtomicMax (maximum value calculation) operation. 3: enables the AtomicMin (minimum value calculation) operation. For the AI Core of Atlas inference products, the Atomic operation can be enabled only when the output position is GM. For Atlas 200I/500 A2 inference products, the Atomic operation can be enabled only when the output position is GM.
enSequentialWrite	Input	Specifies whether to enable continuous write mode. In continuous write mode, data is written to [baseM, baseN]; in discontinuous write mode, data is written to [singleCoreM, singleCoreN]. The default value is false (discontinuous write). Note: In discontinuous write mode, the offset is calculated from the iteration sequence, so developers do not need to handle it. To control the layout sequence, select continuous write mode and move the data based on a preset offset. For Atlas 200I/500 A2 inference products, only discontinuous write mode is supported.

Figure 1 Discontinuous write mode

Figure 2 Continuous write mode
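
The difference between the two write modes can be illustrated with a small host-side offset calculation. This is a minimal sketch in plain C++, not Ascend C, and not the library's implementation. It assumes an ND row-major destination, that slices are visited row by row with the N direction advancing fastest, and that singleCoreN is a multiple of baseN. TileShape, DiscontinuousOffset, and ContinuousOffset are hypothetical names introduced only for this illustration. In continuous write mode the developer chooses where each slice goes; packing slices back to back is one possible choice.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the Matmul API.
struct TileShape {
    uint32_t singleCoreM;
    uint32_t singleCoreN;
    uint32_t baseM;
    uint32_t baseN;
};

// Discontinuous write: slice i lands at its own position inside a
// row-major [singleCoreM, singleCoreN] destination.
// Assumption: slices are visited row by row, N direction fastest.
inline uint32_t DiscontinuousOffset(const TileShape& t, uint32_t i) {
    assert(t.baseN > 0 && t.singleCoreN % t.baseN == 0);
    const uint32_t tilesPerRow = t.singleCoreN / t.baseN;
    const uint32_t tileRow = i / tilesPerRow;
    const uint32_t tileCol = i % tilesPerRow;
    return tileRow * t.baseM * t.singleCoreN + tileCol * t.baseN;
}

// Continuous write: each slice is a dense [baseM, baseN] block.
// Assumption: the developer packs slices back to back.
inline uint32_t ContinuousOffset(const TileShape& t, uint32_t i) {
    return i * t.baseM * t.baseN;
}
```

For example, with singleCoreM = 256, singleCoreN = 512, baseM = 128, and baseN = 256, the fourth slice (i = 3) starts at element 65792 under the discontinuous layout and at element 98304 under the back-to-back packing.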

Returns

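None for the prototypes that take an output tensor. The prototype without an output tensor parameter returns a GlobalTensor<DstT> that refers to matrix C cached on the workspace.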

Restrictions

Ensure that the size of the input matrix C address space is greater than or equal to baseM × baseN.
In the asynchronous scenario, a temporary space is required to cache the Iterate computation result. When GetTensorC is called, slices of matrix C are obtained from the temporary space. The temporary space is set by calling SetWorkspace. Call SetWorkspace before Iterate.
This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
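
The call ordering required in the asynchronous scenario (SetWorkspace, then Iterate, then one GetTensorC per slice) can be mimicked with a toy host-side model. The sketch below is plain C++, not Ascend C, and says nothing about how the real Matmul object works internally. AsyncMatmulModel and IterateAsync are hypothetical names, the stored values are placeholders, and returning false or nullptr on misuse is a modeling choice rather than documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (hypothetical): mimics only the call ordering of the
// asynchronous scenario, not the real Matmul computation.
class AsyncMatmulModel {
public:
    AsyncMatmulModel(std::size_t sliceCount, std::size_t sliceElems)
        : sliceCount_(sliceCount), sliceElems_(sliceElems) {
        assert(sliceElems_ > 0);
    }

    // Register the temporary space that caches the Iterate results.
    void SetWorkspace(std::vector<float>* workspace) { workspace_ = workspace; }

    // Model of asynchronous Iterate: caches every slice in the workspace.
    // Returns false if no workspace was registered beforehand.
    bool IterateAsync() {
        if (workspace_ == nullptr) {
            return false;
        }
        workspace_->assign(sliceCount_ * sliceElems_, 0.0f);
        for (std::size_t s = 0; s < sliceCount_; ++s) {
            for (std::size_t e = 0; e < sliceElems_; ++e) {
                // Placeholder result: every element holds its slice index.
                (*workspace_)[s * sliceElems_ + e] = static_cast<float>(s);
            }
        }
        next_ = 0;
        iterated_ = true;
        return true;
    }

    // Model of GetTensorC: returns the next cached slice,
    // or nullptr when no cached slice is left.
    const float* GetTensorC() {
        if (!iterated_ || next_ >= sliceCount_) {
            return nullptr;
        }
        const float* slice = workspace_->data() + next_ * sliceElems_;
        ++next_;
        return slice;
    }

private:
    std::size_t sliceCount_;
    std::size_t sliceElems_;
    std::vector<float>* workspace_ = nullptr;
    std::size_t next_ = 0;
    bool iterated_ = false;
};
```

In this model, calling IterateAsync before SetWorkspace fails, and each later GetTensorC call hands back the next cached slice until none remain.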

Example

Obtain matrix C and output it to VECIN.

        
// Synchronous mode
while (mm.Iterate()) {
    mm.GetTensorC(ubCmatrix);
}

// Asynchronous mode
mm.template Iterate<false>();
// Other operations
for (int i = 0; i < singleM / baseM * singleN / baseN; ++i) {
    mm.template GetTensorC<false>(ubCmatrix);
    // Other operations
}
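
The asynchronous loop above runs once per baseM × baseN slice. Its bound, singleM / baseM * singleN / baseN, is evaluated left to right with integer division, so it equals the slice count only when singleM and singleN are exact multiples of baseM and baseN. The following host-side sketch in plain C++ computes the count with ceiling division instead. It assumes one slice per GetTensorC call (ScheduleType::INNER_PRODUCT) and that a partial tail block along M or N still takes one iteration. CeilDiv and SliceCount are hypothetical helper names, not part of the Matmul API.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for positive integers.
inline uint32_t CeilDiv(uint32_t a, uint32_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Hypothetical helper: number of baseM x baseN slices needed to cover a
// singleM x singleN result, counting partial tail blocks as full slices.
inline uint32_t SliceCount(uint32_t singleM, uint32_t singleN,
                           uint32_t baseM, uint32_t baseN) {
    return CeilDiv(singleM, baseM) * CeilDiv(singleN, baseN);
}
```

With singleM = 256, singleN = 384, baseM = 128, and baseN = 256, SliceCount gives 4, whereas the expression 256 / 128 * 384 / 256 evaluates to 3.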

Obtain matrix C and output it to GM in synchronous mode.

        
while (mm.Iterate()) {
    mm.GetTensorC(gm);
}

Obtain matrix C and output it to GM and VECIN in synchronous mode.

        
while (mm.Iterate()) {
    mm.GetTensorC(gm, ubCmatrix);
}

Obtain matrix C on GM returned by the API and manually copy it to UB in asynchronous mode.

        
// baseM × baseN = 128 × 256
mm.SetTensorA(gmA);
mm.SetTensorB(gmB);
mm.SetTail(singleM, singleN, singleK);
mm.template Iterate<false>();
// Other operations
for (int i = 0; i < singleM / baseM * singleN / baseN; ++i) {
    // Obtain the baseM × baseN (128 × 256) slice computed in this iteration.
    GlobalTensor<T> global = mm.template GetTensorC<false>();
    // Copy the slice to UB in four 64 × 128 chunks.
    // T is assumed to be half here, matching the UB allocation below.
    for (int j = 0; j < 4; ++j) {
        LocalTensor<half> local = que.AllocTensor<half>(); // Allocate UB space for 64 × 128 elements.
        DataCopy(local, global[64 * 128 * j], 64 * 128); // Copy chunk j from GM to UB for subsequent vector operations.
        // Other vector operations
    }
}
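
In the example above, each 128 × 256 slice is moved to UB in four 64 × 128 chunks. The chunk arithmetic can be checked with a small host-side sketch in plain C++, with offsets taken relative to the start of the slice returned by GetTensorC. ChunkOffsets is a hypothetical helper introduced for this illustration, and the sketch assumes the slice size is an exact multiple of the chunk size.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: element offsets of UB-sized chunks within one
// slice, relative to the start of that slice.
inline std::vector<uint32_t> ChunkOffsets(uint32_t sliceElems, uint32_t chunkElems) {
    assert(chunkElems > 0 && sliceElems % chunkElems == 0);
    std::vector<uint32_t> offsets;
    for (uint32_t off = 0; off < sliceElems; off += chunkElems) {
        offsets.push_back(off);
    }
    return offsets;
}
```

For a 128 × 256 slice and 64 × 128 chunks this yields four offsets: 0, 8192, 16384, and 24576.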

For more operator examples in asynchronous scenarios, see asynchronous scenario sample and matrix multiplication in Iterate asynchronous scenarios.
