Iterate
Applicability
| Product | Supported |
|---|---|
| | √ |
| | √ |
| | √ |
| | √ |
| | x |
| | x |
Function
Each call to Iterate computes one block of matrix C of size baseM × baseN. The API tracks the iteration progress internally and, after each call, offsets the start addresses of matrices A and B. By default the iteration proceeds along the M axis first and then the N axis; setting the tiling parameter iterateOrder switches this to the N axis first and then the M axis. If the input data is not aligned and remainders exist, the computation result of the remainders is output in the last iteration.
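The iteration order described above can be sketched on the host side as follows. This is a minimal illustration, not part of the Matmul API: the helper name IterationSequence and its mFirst flag are hypothetical, and the sketch assumes that "M axis first" means the M block index advances fastest and that a remainder along an axis forms one extra, smaller block at the end of that axis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side helper, not part of the Matmul API.
// Returns the (mIdx, nIdx) base-block indices in the assumed visiting order.
// Assumption: "M axis first" means the M block index advances fastest.
std::vector<std::pair<int32_t, int32_t>> IterationSequence(
    int32_t singleCoreM, int32_t singleCoreN,
    int32_t baseM, int32_t baseN, bool mFirst = true)
{
    assert(baseM > 0 && baseN > 0);
    // Ceiling division: a remainder block still takes one iteration.
    const int32_t mBlocks = (singleCoreM + baseM - 1) / baseM;
    const int32_t nBlocks = (singleCoreN + baseN - 1) / baseN;

    std::vector<std::pair<int32_t, int32_t>> order;
    order.reserve(static_cast<std::size_t>(mBlocks) * static_cast<std::size_t>(nBlocks));

    const int32_t outer = mFirst ? nBlocks : mBlocks;
    const int32_t inner = mFirst ? mBlocks : nBlocks;
    for (int32_t o = 0; o < outer; ++o) {
        for (int32_t i = 0; i < inner; ++i) {
            // The inner loop index belongs to the axis that is walked first.
            order.emplace_back(mFirst ? i : o, mFirst ? o : i);
        }
    }
    return order;
}
```

For example, IterationSequence(100, 64, 32, 32) yields 8 blocks: four along M (the last one covering the 4-row remainder) for each of the two along N.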
The result matrix C of a single Iterate call is stored in the CO1 memory. There are two ways to obtain the computation result from CO1:
- API-managed CO1: You do not need to allocate or release the CO1 memory that stores the matrix multiplication results; the Matmul API handles this internally. After calling the Iterate function prototype for API-managed CO1, call GetTensorC to move the computation result out of CO1.
- User-managed CO1: You control when the matrix multiplication results are moved out. For example, you can cache the results of several iterations in the CO1 memory and then move out multiple baseM × baseN blocks of matrix C at once. In this scenario, you must allocate the CO1 buffer in advance. After the Iterate function prototype for user-managed CO1 is called, the result of that Iterate call is written to the CO1 buffer you allocated. To move the result out, call the Fixpipe API, and release the CO1 memory afterwards. For details, see the matrix multiplication with user-managed CO1.
Prototype
- API-managed CO1
  `template <bool sync = true> __aicore__ inline bool Iterate(bool enPartialSum = false)`
- User-managed CO1
  `template <bool sync = true, typename T> __aicore__ inline bool Iterate(bool enPartialSum, const LocalTensor<T>& localCmatrix)`
  - For the AI Core of Atlas inference products, user-managed CO1 is not supported.
  - For Atlas 200I/500 A2 inference products, user-managed CO1 is not supported.
Parameters
Template parameters:

| Parameter | Description |
|---|---|
| sync | Slices of matrix C can be obtained iteratively in synchronous or asynchronous mode. This parameter selects the mode: true for synchronous mode and false for asynchronous mode. Synchronous mode is the default. For details about the modes and how to use them, see GetTensorC. |
| T | Data type of the LocalTensor in the CO1 memory allocated by the user, that is, the data type of matrix C output by matrix multiplication. The supported data types are float and int32_t. |

Parameters of the API-managed CO1 prototype:

| Parameter | Input/Output | Description |
|---|---|---|
| enPartialSum | Input | Whether to accumulate the matrix multiplication result onto the existing CO1 data. The default value is false. For L0C accumulation, the C matrix must satisfy singleCoreM==baseM && singleCoreN==baseN. |

Parameters of the user-managed CO1 prototype:

| Parameter | Input/Output | Description |
|---|---|---|
| enPartialSum | Input | Whether to accumulate the matrix multiplication result onto the existing CO1 data. For L0C accumulation, the C matrix must satisfy singleCoreM==baseM && singleCoreN==baseN. |
| localCmatrix | Output | The LocalTensor memory on CO1 allocated by the user, which is used to store the matrix multiplication result. |
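The effect of enPartialSum can be pictured with a plain host-side model of one block multiplication. This is only a sketch of the accumulate-versus-overwrite semantics described in the tables above: the function MultiplyBlock, its row-major layout, and its use of std::vector are assumptions made for illustration and do not reflect how the Matmul API stores data in CO1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of one block multiply; not the Matmul API.
// a is M x K, b is K x N, c is M x N, all row-major (an assumption of this sketch).
// enPartialSum == false overwrites c; true adds the product to what c already holds.
void MultiplyBlock(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c,
                   std::size_t M, std::size_t N, std::size_t K,
                   bool enPartialSum)
{
    assert(a.size() == M * K && b.size() == K * N && c.size() == M * N);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            // Start from the existing value only when accumulation is requested.
            float acc = enPartialSum ? c[m * N + n] : 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += a[m * K + k] * b[k * N + n];
            }
            c[m * N + n] = acc;
        }
    }
}
```

Calling the model twice on the same output buffer, first with enPartialSum set to false and then to true, leaves twice the product in the buffer, whereas two calls with false leave a single product.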
Returns
- false: All data on the single core has been computed.
- true: Iterative computation is still in progress.
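The return value is what drives a synchronous loop. The following host-side mock illustrates that contract under stated assumptions: MockMatmul and CountFetchedBlocks are hypothetical names, the mock performs no computation, and it assumes that a return value of true means one more block was produced by the call while false means no blocks remain.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a Matmul object; it only tracks iteration progress.
class MockMatmul {
public:
    MockMatmul(int32_t singleCoreM, int32_t singleCoreN, int32_t baseM, int32_t baseN)
    {
        assert(baseM > 0 && baseN > 0);
        // Ceiling division so that remainder blocks are counted as iterations.
        const int32_t mBlocks = (singleCoreM + baseM - 1) / baseM;
        const int32_t nBlocks = (singleCoreN + baseN - 1) / baseN;
        total_ = mBlocks * nBlocks;
    }

    // Assumed contract: true if this call produced a block, false once all are done.
    bool Iterate()
    {
        if (done_ >= total_) {
            return false;
        }
        ++done_;
        return true;
    }

    int32_t BlocksComputed() const { return done_; }

private:
    int32_t total_ = 0;
    int32_t done_ = 0;
};

// Drives the mock like a synchronous loop and counts how many blocks are fetched.
int32_t CountFetchedBlocks(MockMatmul& mm)
{
    int32_t fetched = 0;
    while (mm.Iterate()) {
        ++fetched;  // a real kernel would call GetTensorC at this point
    }
    return fetched;
}
```

With singleCoreM = 100, singleCoreN = 64, and baseM = baseN = 32, the loop body runs 8 times before Iterate returns false.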
Restrictions
- This API is not supported when enableMixDualMaster (dual-master mode) is set to true.
- For the Iterate function of user-managed CO1, when creating a Matmul object, you must set matrix C's logical memory location to TPosition::CO1, the data layout format to CubeFormat::NZ, and the data type to float or int32_t.
Example
The following is a simple call example in synchronous mode and asynchronous mode. For more complete operator examples, see asynchronous scenario sample, matrix multiplication in Iterate asynchronous scenarios, and operator sample for independently managing CO1.

    // Synchronous mode
    while (mm.Iterate()) {
        mm.GetTensorC(ubCmatrix);
    }
    // Asynchronous mode
    mm.template Iterate<false>();
    // ...... Other computations
    for (int i = 0; i < singleM/baseM*singleN/baseN; ++i) {
        mm.template GetTensorC<false>(ubCmatrix);
    }