Iterate

Applicability

| Product | Supported |
| --- | --- |
| Atlas A3 training products / Atlas A3 inference products | √ |
| Atlas A2 training products / Atlas A2 inference products | √ |
| Atlas 200I/500 A2 inference products | √ |
| Atlas inference product's AI Core | √ |
| Atlas inference product's Vector Core | x |
| Atlas training products | x |

Function

Each call to Iterate computes one block of matrix C of size baseM × baseN. The API tracks the iteration progress internally and, after each call, offsets the start addresses of matrices A and B. By default, iteration proceeds along the M axis first and then the N axis; setting the tiling parameter iterateOrder switches this to the N axis first and then the M axis. If the input data is not aligned and remainders exist, the computation result of the remainders is output in the last iteration.
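The block walk described above can be modeled on the host. The following is a minimal sketch, not the Matmul API: `Block` and `WalkBlocks` are hypothetical names, and it assumes that "M axis first" means the M block index advances fastest and that remainder (tail) blocks sit at the last block index along each axis.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Host-side model only: it lists the blocks of C in the order the text describes.
struct Block {
    int mOffset;  // row offset of the block inside C
    int nOffset;  // column offset of the block inside C
    int rows;     // baseM, or the M remainder for a tail block
    int cols;     // baseN, or the N remainder for a tail block
};

// Assumption: with mFirst == true the M block index advances fastest;
// with mFirst == false the N block index advances fastest.
std::vector<Block> WalkBlocks(int singleCoreM, int singleCoreN,
                              int baseM, int baseN, bool mFirst = true) {
    const int mBlocks = (singleCoreM + baseM - 1) / baseM;  // ceiling division
    const int nBlocks = (singleCoreN + baseN - 1) / baseN;
    std::vector<Block> order;
    order.reserve(static_cast<std::size_t>(mBlocks) * nBlocks);
    for (int step = 0; step < mBlocks * nBlocks; ++step) {
        const int mi = mFirst ? step % mBlocks : step / nBlocks;
        const int ni = mFirst ? step / mBlocks : step % nBlocks;
        order.push_back({mi * baseM, ni * baseN,
                         std::min(baseM, singleCoreM - mi * baseM),
                         std::min(baseN, singleCoreN - ni * baseN)});
    }
    return order;
}
```

For example, `WalkBlocks(5, 4, 2, 2)` yields six blocks; under the stated assumptions the third one starts at row 4 and has a single row, because 5 is not a multiple of baseM = 2.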

The result matrix C of a single Iterate matrix multiplication is stored in the CO1 memory. The following two methods are available to obtain the computation result in the CO1 memory:

  • API-managed CO1: The Matmul API allocates and releases the CO1 memory that stores the matrix multiplication result, so you do not need to manage it manually. After calling the Iterate function prototype for API-managed CO1, call GetTensorC to move the computation result out of CO1.
  • User-managed CO1: You control when the matrix multiplication results are moved out. For example, you can cache the results of several iterations in the CO1 memory and then move out multiple blocks of matrix C of size baseM × baseN at once. In this scenario, allocate the CO1 buffer in advance. Each call to the Iterate function prototype for user-managed CO1 writes the result of one Iterate operation to that buffer. To move the result out of CO1, call the Fixpipe API, and release the CO1 memory afterwards. For details, see the matrix multiplication with user-managed CO1.

Prototype

  • API-managed CO1
    template <bool sync = true> __aicore__ inline bool Iterate(bool enPartialSum = false)
    
  • User-managed CO1
    template <bool sync = true, typename T> __aicore__ inline bool Iterate(bool enPartialSum, const LocalTensor<T>& localCmatrix)
    
    • For the Atlas inference product's AI Core, user-managed CO1 is not supported.
    • For Atlas 200I/500 A2 inference products, user-managed CO1 is not supported.

Parameters

Table 1 Template parameters

| Parameter | Description |
| --- | --- |
| sync | Selects how the slices of matrix C are obtained iteratively: true for the synchronous mode and false for the asynchronous mode. The synchronous mode is used by default. For details about the modes and how to use them, see GetTensorC. |
| T | Data type of the LocalTensor in the CO1 memory allocated by the user, that is, the data type of matrix C output by matrix multiplication. The supported data types are float and int32_t. |

Table 2 Parameters of the function for API-managed CO1

| Parameter | Input/Output | Description |
| --- | --- | --- |
| enPartialSum | Input | Whether to accumulate the matrix multiplication result to the existing CO1 data. The default value is false. L0C accumulation is supported only when the C matrix satisfies singleCoreM==baseM && singleCoreN==baseN. For Atlas 200I/500 A2 inference products, this parameter can only be set to false. |

Table 3 Parameters of the function for user-managed CO1

| Parameter | Input/Output | Description |
| --- | --- | --- |
| enPartialSum | Input | Whether to accumulate the matrix multiplication result to the existing CO1 data. L0C accumulation is supported only when the C matrix satisfies singleCoreM==baseM && singleCoreN==baseN. |
| localCmatrix | Output | The LocalTensor memory on CO1 allocated by the user, which is used to store the matrix multiplication result. |
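The effect of enPartialSum can be illustrated with a host-side model. This is a sketch under stated assumptions rather than the API's implementation: `MultiplyBlock` is a hypothetical helper, `co1` is an ordinary row-major vector standing in for one baseM × baseN block in CO1 (the real CO1 layout differs), and `baseK` is an illustrative name for the inner dimension.

```cpp
#include <vector>

// Host-side model only; not part of the Matmul API.
// a is baseM x baseK, b is baseK x baseN, co1 is baseM x baseN (all row-major).
void MultiplyBlock(const std::vector<float>& a,
                   const std::vector<float>& b,
                   std::vector<float>& co1,
                   int baseM, int baseN, int baseK, bool enPartialSum) {
    for (int m = 0; m < baseM; ++m) {
        for (int n = 0; n < baseN; ++n) {
            // Start from the existing CO1 value only when accumulating.
            float acc = enPartialSum ? co1[m * baseN + n] : 0.0f;
            for (int k = 0; k < baseK; ++k) {
                acc += a[m * baseK + k] * b[k * baseN + n];
            }
            co1[m * baseN + n] = acc;
        }
    }
}
```

In this model, calling `MultiplyBlock` with `enPartialSum` set to false overwrites whatever `co1` held, while a second call with true adds the new product to the stored values.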

Returns

false: All data on a single core is computed.

true: Data is still in iterative computation.

Restrictions

  • This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
  • For the Iterate function of user-managed CO1, when creating a Matmul object, you must set matrix C's logical memory location to TPosition::CO1, the data layout format to CubeFormat::NZ, and the data type to float or int32_t.

Example

The following is a simple call example in synchronous mode and asynchronous mode. For more complete operator examples, see asynchronous scenario sample, matrix multiplication in Iterate asynchronous scenarios, and operator sample for independently managing CO1.

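// Synchronous mode: each Iterate call computes one baseM x baseN block;
// the loop ends when Iterate returns false.
while (mm.Iterate()) {
    mm.GetTensorC(ubCmatrix);
}

// Asynchronous mode: start the iterations, then fetch the blocks later.
mm.template Iterate<false>();
// ... other computations ...
// One GetTensorC call per block (assumes singleM and singleN are multiples of baseM and baseN).
for (int i = 0; i < (singleM / baseM) * (singleN / baseN); ++i) {
    mm.template GetTensorC<false>(ubCmatrix);
}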
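The loop bound in the asynchronous example uses truncating integer division, which undercounts when singleM or singleN is not a multiple of the base block size. A minimal sketch of a rounded-up count follows; `CeilDiv` and `BlockCount` are hypothetical helpers, not part of the Matmul API, and the sketch assumes one GetTensorC call is expected per block, tail blocks included.

```cpp
// Hypothetical helpers for illustration; not part of the Matmul API.
constexpr int CeilDiv(int a, int b) { return (a + b - 1) / b; }

// Number of baseM x baseN blocks on one core, counting partial tail blocks.
constexpr int BlockCount(int singleM, int singleN, int baseM, int baseN) {
    return CeilDiv(singleM, baseM) * CeilDiv(singleN, baseN);
}
```

With singleM = 200, singleN = 256, and 128 × 128 base blocks, `BlockCount` returns 4, whereas truncating division gives 2.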