Enabling the Asynchronous Iterate or IterateAll API to Avoid AIC/AIV Synchronization Dependency

[Priority] High

[Description] In hybrid programming of AI Cube (AIC) and AI Vector (AIV), each call to Matmul Iterate or IterateAll makes AIV send a message to AIC to start the Matmul computation. In the synchronous mode, Iterate<true>, every call sends a message, as shown in Figure 1. In the asynchronous mode, Iterate<false>, a message is sent only on the first call and subsequent calls send none, as shown in Figure 2. This reduces the interaction between AIC and AIV and therefore the inter-core communication overhead. For this reason, the asynchronous Iterate<false>() or IterateAll<false>() API is recommended in hybrid programming. (Note: The asynchronous API requires a workspace to be set.)

Figure 1 Message sending in synchronous mode
Figure 2 Message sending in asynchronous mode
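The difference between the two modes can be illustrated with a small host-side model. This is only a sketch, not Ascend C device code: IterateCalls and MessagesSent are hypothetical helper names, and the model assumes that each Iterate call produces one baseM x baseN block of C, so that the number of calls on one core is ceil(singleCoreM/baseM) * ceil(singleCoreN/baseN).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the Matmul API): number of Iterate calls on
// one core, assuming each call yields one baseM x baseN block of matrix C.
inline uint32_t IterateCalls(uint32_t singleCoreM, uint32_t singleCoreN,
                             uint32_t baseM, uint32_t baseN) {
    assert(baseM > 0 && baseN > 0);
    uint32_t mBlocks = (singleCoreM + baseM - 1) / baseM;  // ceiling division
    uint32_t nBlocks = (singleCoreN + baseN - 1) / baseN;  // ceiling division
    return mBlocks * nBlocks;
}

// Hypothetical helper: messages AIV sends to AIC under the model described
// above. Synchronous mode sends one message per call; asynchronous mode sends
// a single message on the first call only.
inline uint32_t MessagesSent(bool sync, uint32_t iterateCalls) {
    if (iterateCalls == 0) {
        return 0;  // no call, no message
    }
    return sync ? iterateCalls : 1;
}
```

Under these assumptions, singleCoreM = singleCoreN = 256 with baseM = baseN = 128 gives 4 Iterate calls, which means 4 messages in synchronous mode and 1 message in asynchronous mode.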

[Negative Example]

The synchronous Iterate API is used in hybrid programming.

TQueBind<TPosition::CO2, TPosition::VECIN>  qVecIn;
TQueBind<TPosition::VECIN, TPosition::VECOUT>  qVecOut;
mm.SetTensorA(gmA);
mm.SetTensorB(gmB);
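float scalar = 2.0f;  // scalar type matches the float tensors passed to Muls

// Synchronous mode (default): AIV sends a message to AIC on every Iterate call.
while (mm.template Iterate()) {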
    auto cInUB = qVecIn.AllocTensor<float>();
    mm.GetTensorC(cInUB);
    qVecIn.EnQue(cInUB);
    cInUB = qVecIn.DeQue<float>();
    auto cOutUB = qVecOut.AllocTensor<float>();
    Muls(cOutUB, cInUB, scalar, baseM*baseN);
    qVecIn.FreeTensor(cInUB);
    ...
}

[Positive Example]

The asynchronous Iterate API is used in hybrid programming.

TQueBind<TPosition::CO2, TPosition::VECIN>  qVecIn;
TQueBind<TPosition::VECIN, TPosition::VECOUT>  qVecOut;
mm.SetTensorA(gmA);
mm.SetTensorB(gmB);
// workspace: physical address of the temporary space.
// size: memory occupied by matrix C of shape singleCoreM x singleCoreN,
//       that is, singleCoreM*singleCoreN*sizeof(float).
mm.SetWorkspace(workspace, size);
float scalar = 2.0f;  // scalar type matches the float tensors passed to Muls

// Asynchronous mode: AIV sends a message to AIC only on the first Iterate call.
while (mm.template Iterate<false>()) {
    auto cInUB = qVecIn.AllocTensor<float>();
    mm.GetTensorC(cInUB);
    qVecIn.EnQue(cInUB);
    cInUB = qVecIn.DeQue<float>();
    auto cOutUB = qVecOut.AllocTensor<float>();
    Muls(cOutUB, cInUB, scalar, baseM*baseN);
    qVecIn.FreeTensor(cInUB);
    ...
}
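The workspace size described in the SetWorkspace comment can be computed on the host as in the sketch below. WorkspaceBytes is a hypothetical helper name, not part of the Matmul API, and the formula is the one given in that comment: singleCoreM*singleCoreN*sizeof(float).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: memory occupied by a float matrix C of shape
// singleCoreM x singleCoreN, following the formula in the SetWorkspace comment.
inline std::size_t WorkspaceBytes(uint32_t singleCoreM, uint32_t singleCoreN) {
    assert(singleCoreM > 0 && singleCoreN > 0);
    // Widen before multiplying so large shapes do not overflow 32 bits.
    return static_cast<std::size_t>(singleCoreM) * singleCoreN * sizeof(float);
}
```

For example, with singleCoreM = 256, singleCoreN = 128, and a 4-byte float, the result is 131072.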