GetTensorC

Function Usage

Obtains one or two baseM x baseN slices of matrix C after Iterate is called, and writes the result directly to a GM tensor or a VECIN tensor. The number of slices depends on ScheduleType in the MatmulConfig parameter: ScheduleType::INNER_PRODUCT yields one slice of matrix C, and ScheduleType::OUTER_PRODUCT yields two slices.

This API is used together with the Iterate API: after Iterate completes an iterative computation, GetTensorC retrieves the resulting slice or slices.
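The slice count follows from the ScheduleType chosen when the Matmul object is configured. A minimal, hedged sketch of selecting the schedule through MatmulConfig is shown below; the helper `GetNormalConfig()` and the `scheduleType` member are assumptions based on this description, so verify them against the MatmulConfig definition in your CANN version.

```cpp
// Hedged sketch: derive a config that requests OUTER_PRODUCT scheduling,
// so that each GetTensorC call returns two baseM x baseN slices.
constexpr MatmulConfig GetOuterProductCfg()
{
    MatmulConfig cfg = GetNormalConfig();           // assumed helper
    cfg.scheduleType = ScheduleType::OUTER_PRODUCT; // assumed field name
    return cfg;
}
constexpr static MatmulConfig MM_CFG = GetOuterProductCfg();
```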

There are synchronous and asynchronous modes for obtaining the slices of matrix C.

  • Synchronous mode: After Iterate, call GetTensorC and wait until the matrix C slices are obtained.
  • Asynchronous mode: After Iterate, GetTensorC does not need to be called immediately for synchronous waiting. Other logic can be executed first, and GetTensorC can be called when the result is needed. Asynchronous mode reduces synchronization waiting time and improves the degree of parallelism, making it ideal for scenarios with high computing-performance requirements.

Prototype

  • Obtain matrix C and output it to VECIN.
    template <bool sync = true>
    __aicore__ inline void GetTensorC(const LocalTensor<DstT>& co2Local, uint8_t enAtomic = 0, bool enSequentialWrite = false)
    
    • Supported modes: synchronous and asynchronous
  • Obtain matrix C and output it to GM.
    template <bool sync = true>
    __aicore__ inline void GetTensorC(const GlobalTensor<DstT>& gm, uint8_t enAtomic = 0, bool enSequentialWrite = false)
    
    • Supported modes: synchronous and asynchronous
  • Obtain matrix C and output it to both GM and VECIN.
    template <bool sync = true>
    __aicore__ inline void GetTensorC(const GlobalTensor<DstT> &gm, const LocalTensor<DstT> &co2Local, uint8_t enAtomic = 0, bool enSequentialWrite = false)
    
    • Supported modes: synchronous and asynchronous
    • This API is not supported in CUBE_ONLY mode.

The doPad, height, width, srcGap, and dstGap parameters in the following API are deprecated and do not need to be passed; retain their default values. When the defaults are kept, this prototype is equivalent to the VECIN output prototype above.

template <bool sync = true, bool doPad = false>
__aicore__ inline void GetTensorC(const LocalTensor<DstT>& c, uint8_t enAtomic = 0, bool enSequentialWrite = false, uint32_t height = 0, uint32_t width = 0, uint32_t srcGap = 0, uint32_t dstGap = 0)

Parameters

Table 1 Parameters in the template

| Parameter | Description |
| --- | --- |
| sync | Sets the synchronous or asynchronous mode. true: synchronous mode; false: asynchronous mode. |

Table 2 API parameters

| Parameter | Input/Output | Description |
| --- | --- | --- |
| c/co2Local | Output | Extracts matrix C to VECIN. The data format must be NZ. |
| gm | Output | Extracts matrix C to GM. The data format can be ND or NZ. |
| enAtomic | Input | Whether to enable an atomic operation. Values: 0 (default): disables atomic operations; 1: enables AtomicAdd (accumulation); 2: enables AtomicMax (maximum value calculation); 3: enables AtomicMin (minimum value calculation). |
| enSequentialWrite | Input | Whether to enable continuous write mode (writes to [baseM, baseN] for continuous write and to [singleCoreM, singleCoreN] for discontinuous write). The default value is false (discontinuous write). Note: In discontinuous write mode, the offset is calculated based on the iteration order and can be ignored by developers. To control the layout order, select continuous write mode and move the data based on the preset offset. |
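As an illustration of enAtomic, here is a hedged sketch that accumulates each computed slice into an existing GM buffer instead of overwriting it; the names `mm` and `gmC` are assumed to be set up as in the examples below.

```cpp
// Hedged sketch: enAtomic = 1 selects AtomicAdd, so each slice of matrix C
// is added to the existing contents of gmC. enSequentialWrite keeps its
// default (false, discontinuous write).
while (mm.Iterate()) {
    mm.GetTensorC(gmC, /* enAtomic = */ 1);
}
```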

Figure 1 Discontinuous write mode
Figure 2 Continuous write mode
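To contrast the two write modes shown in the figures, a hedged sketch follows; tensor names are illustrative.

```cpp
// Discontinuous write (default): each baseM x baseN slice lands at its
// position within the [singleCoreM, singleCoreN] output; the offset is
// derived from the iteration order automatically.
mm.GetTensorC(gmC);

// Continuous write: slices are written back to back as [baseM, baseN]
// blocks, so the caller controls the final layout by moving data from a
// known offset afterwards.
mm.GetTensorC(gmC, 0, /* enSequentialWrite = */ true);
```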

Returns

None

Availability

Precautions

  • Ensure that the address space provided for matrix C is greater than or equal to baseM x baseN.
  • In the asynchronous scenario, a temporary space is required to cache the Iterate computation result; GetTensorC then fetches the matrix C slices from this temporary space. Set the temporary space by calling SetWorkspace, and call SetWorkspace before Iterate.
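The asynchronous flow described above can be sketched as follows; the workspace tensor name `workspaceGm` is illustrative, and the SetWorkspace signature should be checked against the API reference.

```cpp
// Hedged sketch of the asynchronous flow: the workspace caches the Iterate
// results until GetTensorC fetches them.
mm.SetWorkspace(workspaceGm);        // must precede Iterate in asynchronous mode
mm.template Iterate<false>();        // non-blocking iterative computation
// ... other logic can run here ...
mm.template GetTensorC<false>(ubCmatrix); // fetch a slice when it is needed
```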

Example

  • Obtain matrix C and output it to VECIN.
    // Synchronous mode
    while (mm.Iterate()) {
        mm.GetTensorC(ubCmatrix);
    }
    
    // Asynchronous mode
    mm.template Iterate<false>();
    // ...
    for (int i = 0; i < singleM / baseM * singleN / baseN; ++i) {
        mm.template GetTensorC<false>(ubCmatrix);
        // ... other computation
    }
    
  • Obtain matrix C and output it to GM in synchronous mode.
    while (mm.Iterate()) {   
        mm.GetTensorC(gm); 
    }
    
  • Obtain matrix C and output it to GM and VECIN in synchronous mode.
    while (mm.Iterate()) {   
        mm.GetTensorC(gm, ubCmatrix); 
    }
    
  • Obtain the GM address of matrix C returned by the API and manually copy the data to UB in asynchronous mode.
    // baseM x baseN = 128 x 256
    mm.SetTensorA(gmA);
    mm.SetTensorB(gmB);
    mm.SetTail(singleM, singleN, singleK);
    mm.template Iterate<false>();
    // ...
    for (int i = 0; i < singleM / baseM * singleN / baseN; ++i) {
        // Obtain the baseM x baseN block (128 x 256) computed in each iteration.
        GlobalTensor<T> global = mm.template GetTensorC<false>();
        for (int j = 0; j < 4; ++j) {
            LocalTensor<half> local = que.AllocTensor<half>(); // Allocate UB space of size 64 x 128.
            DataCopy(local, global[64 * 128 * j], 64 * 128); // Copy one 64 x 128 chunk from GM to UB for subsequent vector operations.
            // ... vector operations
        }
    }