GetTensorC

Applicability

Product | Supported
Atlas A3 training products / Atlas A3 inference products | √
Atlas A2 training products / Atlas A2 inference products | √
Atlas 200I/500 A2 inference products | √
Atlas inference product's AI Core | √
Atlas inference product's Vector Core | x
Atlas training products | x

Function

Obtains one or two slices of matrix C after Iterate is called and directly outputs the result to the GM tensor or VECIN tensor. When ScheduleType in MatmulConfig is set to ScheduleType::INNER_PRODUCT, one slice of matrix C is obtained. When ScheduleType in MatmulConfig is set to ScheduleType::OUTER_PRODUCT, two slices of matrix C are obtained.

This API is used together with the Iterate API: after Iterate completes one iterative computation, GetTensorC obtains one or two matrix slices of size baseM × baseN, depending on the value of ScheduleType in MatmulConfig.

The slices of matrix C can be obtained iteratively in synchronous or asynchronous mode.

  • Synchronous mode: After Iterate, execute GetTensorC and wait until matrix C slices are obtained.
  • Asynchronous mode: After Iterate, GetTensorC does not need to be called immediately. Other logic can be executed first, and GetTensorC is called only when the result is needed. This reduces the time spent waiting for synchronization and increases parallelism, which suits scenarios with high requirements on computing performance.

Prototype

  • Obtain matrix C and output it to VECIN.
    template <bool sync = true>
    __aicore__ inline void GetTensorC(const LocalTensor<DstT>& co2Local, uint8_t enAtomic = 0, bool enSequentialWrite = false)
    
    • Supported mode: synchronous
    • Supported mode: asynchronous
  • Obtain matrix C and output it to GM.
    template <bool sync = true>
    __aicore__ inline void GetTensorC(const GlobalTensor<DstT>& gm, uint8_t enAtomic = 0, bool enSequentialWrite = false)
    
    • Supported mode: synchronous
    • Supported mode: asynchronous
  • Obtain matrix C and output it to GM and VECIN.
    template <bool sync = true>
    __aicore__ inline void GetTensorC(const GlobalTensor<DstT> &gm, const LocalTensor<DstT> &co2Local, uint8_t enAtomic = 0, bool enSequentialWrite = false)
    
    • Supported mode: synchronous
    • Supported mode: asynchronous
    • This API is not supported in CUBE_ONLY mode.
    • For the Atlas 200I/500 A2 inference products, the matrix cannot be output to GM and VECIN at the same time.
  • Obtain the matrix C slice cached on the workspace in the asynchronous scenario. The subsequent use process is controlled by developers.

    When matrix C is output to VECIN, the amount of Unified Buffer space allocated to VECIN affects Matmul computing. If it is too small, the hardware computing power cannot be fully used. This API instead returns the matrix C slice cached on the workspace, and developers control how it is used afterwards.

    Note that during initialization, the logical position of matrix C should be set to TPosition::VECIN. After this API is called to obtain the cached matrix C, copy the data to Unified Buffer manually, as in the last example in the Example section.

    template <bool sync = true>
    __aicore__ inline GlobalTensor<DstT> GetTensorC(uint8_t enAtomic = 0, bool enSequentialWrite = false)
    
    • Supported mode: asynchronous

The doPad, height, width, srcGap, and dstGap parameters in the following API are to be deprecated and do not need to be passed. Retain their default values. The prototype that outputs to VECIN, listed above, is this prototype with those parameters left at their default values.

template <bool sync = true, bool doPad = false>
__aicore__ inline void GetTensorC(const LocalTensor<DstT>& c, uint8_t enAtomic = 0, bool enSequentialWrite = false, uint32_t height = 0, uint32_t width = 0, uint32_t srcGap = 0, uint32_t dstGap = 0)

Parameters

Table 1 Parameters in the template

Parameter

Description

sync

Sets the synchronous or asynchronous mode. The value true enables the synchronous mode, and false enables the asynchronous mode.

For the Atlas A3 training products / Atlas A3 inference products, the asynchronous mode is supported.

For the Atlas A2 training products / Atlas A2 inference products, the asynchronous mode is supported.

For the Atlas inference product's AI Core, the asynchronous mode is not supported.

For the Atlas 200I/500 A2 inference products, the asynchronous mode is not supported.

Table 2 API parameters

Parameter

Input/Output

Description

c/co2Local

Output

Extracts matrix C to VECIN. The data format must be NZ.

For the Atlas A3 training products / Atlas A3 inference products, the supported data types are half, float, bfloat16_t, int32_t, and int8_t.

For the Atlas A2 training products / Atlas A2 inference products, the supported data types are half, bfloat16_t, float, int32_t, and int8_t.

For the Atlas inference product's AI Core, the supported data types are half, float, int32_t, and int8_t.

For the Atlas 200I/500 A2 inference products, the supported data types are half, bfloat16_t, float, and int32_t.

gm

Output

Extracts matrix C to GM. The data format can be ND or NZ.

For the Atlas A3 training products / Atlas A3 inference products, the supported data types are half, float, bfloat16_t, int32_t, and int8_t.

For the Atlas A2 training products / Atlas A2 inference products, the supported data types are half, bfloat16_t, float, int32_t, and int8_t.

For the Atlas inference product's AI Core, the supported data types are half, float, int32_t, and int8_t.

For the Atlas 200I/500 A2 inference products, the supported data types are half, bfloat16_t, float, and int32_t.

enAtomic

Input

Specifies whether to enable an Atomic operation and, if so, which one.

Values:

0 (default): disables the Atomic operation.

1: enables the AtomicAdd (accumulation) operation.

2: enables the AtomicMax (maximum value calculation) operation.

3: enables the AtomicMin (minimum value calculation) operation.

For the Atlas inference product's AI Core, the Atomic operation can be enabled only when the output position is GM.

For the Atlas 200I/500 A2 inference products, the Atomic operation can be enabled only when the output position is GM.

enSequentialWrite

Input

Specifies whether to enable the continuous write mode. In continuous write mode, each slice is written to a [baseM, baseN] block. In discontinuous write mode, each slice is written to its position in [singleCoreM, singleCoreN]. The default value is false (discontinuous write).

Note: In discontinuous write mode, the offset is calculated from the iteration sequence, so developers do not need to handle it. To control the layout themselves, developers can select the continuous write mode and move the data based on an offset of their own choosing.

For the Atlas 200I/500 A2 inference products, only the discontinuous write mode is supported.

Figure 1 Discontinuous write mode
Figure 2 Continuous write mode
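
The effect of enAtomic and enSequentialWrite on the destination can be illustrated with a host-side sketch. This is plain C++, not Ascend C: Shape, DstIndex, Combine, and WriteSlice are hypothetical names introduced here, the destination is assumed to be row-major (ND), and in continuous write mode the sketch writes at the start of whatever destination the caller provides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side illustration only; none of these names belong to the Ascend C API.
struct Shape {
    std::size_t singleCoreN;  // row length of the full single-core result
    std::size_t baseM;        // rows in one slice
    std::size_t baseN;        // columns in one slice
};

// Destination index of element (r, c) of the slice at block position (mIdx, nIdx).
// Assumption: the destination is row-major (ND).
inline std::size_t DstIndex(const Shape& s, std::size_t mIdx, std::size_t nIdx,
                            std::size_t r, std::size_t c, bool enSequentialWrite)
{
    if (enSequentialWrite) {
        // Continuous write: a dense [baseM, baseN] block at the start of the destination.
        return r * s.baseN + c;
    }
    // Discontinuous write: the slice keeps its position inside [singleCoreM, singleCoreN].
    return (mIdx * s.baseM + r) * s.singleCoreN + nIdx * s.baseN + c;
}

// Combines the value already at the destination with the newly computed value.
inline float Combine(float oldValue, float newValue, uint8_t enAtomic)
{
    switch (enAtomic) {
        case 1: return oldValue + newValue;           // AtomicAdd
        case 2: return std::max(oldValue, newValue);  // AtomicMax
        case 3: return std::min(oldValue, newValue);  // AtomicMin
        default: return newValue;                     // 0: plain overwrite
    }
}

// Writes one baseM x baseN slice into dst according to the two flags.
inline void WriteSlice(std::vector<float>& dst, const std::vector<float>& slice, const Shape& s,
                       std::size_t mIdx, std::size_t nIdx, uint8_t enAtomic, bool enSequentialWrite)
{
    assert(slice.size() == s.baseM * s.baseN);
    for (std::size_t r = 0; r < s.baseM; ++r) {
        for (std::size_t c = 0; c < s.baseN; ++c) {
            const std::size_t idx = DstIndex(s, mIdx, nIdx, r, c, enSequentialWrite);
            assert(idx < dst.size());
            dst[idx] = Combine(dst[idx], slice[r * s.baseN + c], enAtomic);
        }
    }
}
```

With singleCoreN = 4 and 2 × 2 slices, the slice at block position (1, 1) lands at indices 10, 11, 14, and 15 in discontinuous mode, and at indices 0 to 3 in continuous mode.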

Returns

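None, except for the prototype that takes no tensor argument, which returns a GlobalTensor<DstT> referring to the matrix C slice cached on the workspace.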

Restrictions

  • Ensure that the size of the address space passed in for matrix C is greater than or equal to baseM × baseN.
  • In the asynchronous scenario, a temporary space is required to cache the Iterate computation result. When GetTensorC is called, slices of matrix C are obtained from the temporary space. The temporary space is set by calling SetWorkspace. Call SetWorkspace before Iterate.
  • This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
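
The sizes involved in these restrictions can be checked on the host before launching a kernel. The helpers below are a minimal sketch in plain C++, not Ascend C. CeilDiv, SliceCount, and MinDstBytes are hypothetical names. The sketch assumes that a tail smaller than a base block still produces one slice, and that the first restriction counts elements of the output data type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side helpers for illustration; they are not part of the Ascend C API.

// Ceiling division for positive integers.
inline uint32_t CeilDiv(uint32_t a, uint32_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of baseM x baseN slices that cover a singleM x singleN block.
// Assumption: a tail that does not fill a whole base block still yields one slice.
inline uint32_t SliceCount(uint32_t singleM, uint32_t singleN, uint32_t baseM, uint32_t baseN)
{
    return CeilDiv(singleM, baseM) * CeilDiv(singleN, baseN);
}

// Smallest destination size, in bytes, that holds one baseM x baseN slice.
// Assumption: the restriction counts elements of the output data type.
inline std::size_t MinDstBytes(uint32_t baseM, uint32_t baseN, std::size_t elemBytes)
{
    return static_cast<std::size_t>(baseM) * baseN * elemBytes;
}
```

When singleM and singleN are exact multiples of baseM and baseN, SliceCount equals (singleM / baseM) * (singleN / baseN), the loop bound used in the asynchronous examples in this section.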

Example

  • Obtain matrix C and output it to VECIN.
    // Synchronous mode
    while (mm.Iterate()) {   
        mm.GetTensorC(ubCmatrix); 
    }
    
    // Asynchronous mode
    mm.template Iterate<false>();
    // Other operations
    for (int i = 0; i < (singleM / baseM) * (singleN / baseN); ++i) {
        mm.template GetTensorC<false>(ubCmatrix); 
        // Other operations
    }
    
  • Obtain matrix C and output it to GM in synchronous mode.
    while (mm.Iterate()) {   
        mm.GetTensorC(gm); 
    }
    
  • Obtain matrix C and output it to GM and VECIN in synchronous mode.
    while (mm.Iterate()) {   
        mm.GetTensorC(gm, ubCmatrix); 
    }
    
  • Obtain matrix C on GM returned by the API and manually copy it to UB in asynchronous mode.
    // baseM × baseN = 128 × 256
    mm.SetTensorA(gmA);
    mm.SetTensorB(gmB);
    mm.SetTail(singleM, singleN, singleK);
    mm.template Iterate<false>();
    // Other operations
    for (int i = 0; i < (singleM / baseM) * (singleN / baseN); ++i) {
        // Obtain the baseM × baseN (128 × 256) slice computed in this iteration.
        // T is the data type of matrix C; it is assumed to be half in this example.
        GlobalTensor<T> global = mm.template GetTensorC<false>();
        for (int j = 0; j < 4; ++j) {
            // Allocate a UB buffer of 64 × 128 elements.
            LocalTensor<half> local = que.AllocTensor<half>();
            // Copy the j-th 64 × 128 chunk of the slice from GM to UB for subsequent vector operations.
            DataCopy(local, global[64 * 128 * j], 64 * 128);
            // Other vector operations
        }
    }
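
The inner loop of the last example moves one 128 × 256 slice to UB in four 64 × 128 chunks. The chunk arithmetic can be checked with a host-side sketch in plain C++ (not Ascend C). ChunkOffsets is a hypothetical helper, and the sketch assumes that the slice is contiguous and that the chunk size divides it exactly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side illustration only; ChunkOffsets is not part of the Ascend C API.
// Returns the element offset of each fixed-size chunk that tiles one slice.
inline std::vector<std::size_t> ChunkOffsets(std::size_t sliceElems, std::size_t chunkElems)
{
    assert(chunkElems > 0 && sliceElems % chunkElems == 0);
    std::vector<std::size_t> offsets;
    for (std::size_t off = 0; off < sliceElems; off += chunkElems) {
        offsets.push_back(off);
    }
    return offsets;
}
```

For a slice of 128 × 256 elements and chunks of 64 × 128 elements, this yields four offsets: 0, 8192, 16384, and 24576.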
    

For more operator examples in asynchronous scenarios, see asynchronous scenario sample and matrix multiplication in Iterate asynchronous scenarios.