Separated Mode

This section describes how to use basic APIs to perform matrix multiplication in separated mode.

In separated mode, the programming paradigm for implementing the matrix multiplication operator with basic APIs is the same as in coupled mode. However, the implementation differs because the hardware architecture differs. This section describes only the differences. For the complete code, see the Mmad sample.

Differences in the CopyIn phase

Coupled mode

In coupled mode, the CopyIn phase moves data from the GM to A1/B1 (L1 Buffer). The DataCopy API can move data directly from the GM to the L1 Buffer, or from the GM to the UB and then to the L1 Buffer. If ND2NZ conversion is required, you can either implement it yourself or use the real-time format conversion provided by the DataCopy API. Note that the real-time conversion uses the UB as temporary space.

The following example implements ND2NZ conversion by calling the GM-to-A1/B1 DataCopy instruction once for each 16-column block of the source matrix.

    __aicore__ inline void CopyND2NZ(const AscendC::LocalTensor<half>& dst, const AscendC::GlobalTensor<half>& src,
        const uint16_t height, const uint16_t width)
    {
        for (int i = 0; i < width / 16; ++i) {
            int srcOffset = i * 16;
            int dstOffset = i * 16 * height;
            AscendC::DataCopy(dst[dstOffset], src[srcOffset], { height, 1, uint16_t(width / 16 - 1), 0 });
        }
    }
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
        AscendC::LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();

        CopyND2NZ(a1Local, aGM, m, k);
        CopyND2NZ(b1Local, bGM, k, n);

        inQueueA1.EnQue(a1Local);
        inQueueB1.EnQue(b1Local);
    }
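To make the resulting layout easier to check, the index mapping performed by CopyND2NZ can be reproduced on the host. The following is a minimal sketch in plain C++ that calls no Ascend C API. The names NdToNzReference and kBlock are hypothetical and are not part of the sample. The sketch assumes, as the sample does, that width is a multiple of 16 and that one DataCopy block holds 16 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference only. kBlock (hypothetical name) is the 16-element
// block width that CopyND2NZ uses for half data.
constexpr std::size_t kBlock = 16;

// Rearranges a row-major (ND) height x width matrix into the layout that
// CopyND2NZ writes to A1/B1: 16-column strips stored one after another,
// each strip holding `height` rows of 16 contiguous elements.
template <typename T>
std::vector<T> NdToNzReference(const std::vector<T>& src, std::size_t height, std::size_t width)
{
    assert(width % kBlock == 0);
    assert(src.size() == height * width);
    std::vector<T> dst(src.size());
    for (std::size_t i = 0; i < width / kBlock; ++i) {   // one DataCopy call per strip
        const std::size_t srcOffset = i * kBlock;
        const std::size_t dstOffset = i * kBlock * height;
        for (std::size_t r = 0; r < height; ++r) {       // blockCount = height
            for (std::size_t c = 0; c < kBlock; ++c) {   // blockLen = 1 block of 16 elements
                // Source rows are `width` elements apart; destination rows are contiguous.
                dst[dstOffset + r * kBlock + c] = src[srcOffset + r * width + c];
            }
        }
    }
    return dst;
}
```

Comparing the output of such a reference with the data read back from the device is one way to validate the DataCopy parameters when the matrix shape changes.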

Separated mode

In separated mode, data cannot be directly moved to A1 or B1 (L1 Buffer) through VECIN, VECCALC, or VECOUT (UB). However, the real-time format conversion function provided by the DataCopy API can be used to complete format conversion without using UB as the temporary space.

The following is an example:

    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
        AscendC::LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();

        AscendC::Nd2NzParams nd2nzA1Params;
        nd2nzA1Params.ndNum = 1;
        nd2nzA1Params.nValue = m;
        nd2nzA1Params.dValue = k;
        nd2nzA1Params.srcNdMatrixStride = 0;
        nd2nzA1Params.srcDValue = k;
        nd2nzA1Params.dstNzC0Stride = CeilCubeBlock(m) * CUBE_BLOCK;
        nd2nzA1Params.dstNzNStride = 1;
        nd2nzA1Params.dstNzMatrixStride = 0;
        AscendC::DataCopy(a1Local, aGM, nd2nzA1Params);

        AscendC::Nd2NzParams nd2nzB1Params;
        nd2nzB1Params.ndNum = 1;
        nd2nzB1Params.nValue = k;
        nd2nzB1Params.dValue = n;
        nd2nzB1Params.srcNdMatrixStride = 0;
        nd2nzB1Params.srcDValue = n;
        nd2nzB1Params.dstNzC0Stride = CeilCubeBlock(k) * CUBE_BLOCK;
        nd2nzB1Params.dstNzNStride = 1;
        nd2nzB1Params.dstNzMatrixStride = 0;
        AscendC::DataCopy(b1Local, bGM, nd2nzB1Params);

        inQueueA1.EnQue(a1Local);
        inQueueB1.EnQue(b1Local);
    }
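The example above sets dstNzC0Stride to CeilCubeBlock(m) * CUBE_BLOCK, but neither name is defined in this section. The following sketch shows one plausible definition, assuming that CUBE_BLOCK is 16 and that CeilCubeBlock rounds up to a whole number of 16-row blocks. Both definitions are assumptions, and AlignedRows is a hypothetical helper added only for illustration. See the Mmad sample for the actual definitions.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value: the cube block edge length in elements.
constexpr uint32_t CUBE_BLOCK = 16;

// Assumed definition: number of CUBE_BLOCK-sized blocks needed to cover len.
constexpr uint32_t CeilCubeBlock(uint32_t len)
{
    return (len + CUBE_BLOCK - 1) / CUBE_BLOCK;
}

// Hypothetical helper: row count rounded up to a multiple of CUBE_BLOCK,
// which is the expression assigned to dstNzC0Stride in the example.
constexpr uint32_t AlignedRows(uint32_t rows)
{
    return CeilCubeBlock(rows) * CUBE_BLOCK;
}
```

Under these assumptions, a matrix with 30 rows yields a stride of 32, and a matrix whose row count is already a multiple of 16 keeps its row count as the stride.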

Differences in the Aggregate and CopyOut phases

Coupled mode

In coupled mode, the matrix multiplication result is stored in CO1 (L0C Buffer) and is then moved to the GM through CO2 (UB). NZ2ND conversion must be completed manually along the CO1 -> CO2 -> GM path. In the following example, the Aggregate phase moves the NZ-format data from CO1 to CO2, and the CopyOut phase (CO2 -> GM) calls DataCopy in a for loop to complete the format conversion.

         
    __aicore__ inline void Aggregate(const AscendC::LocalTensor<float>& c2Local, const int bSplitIdx)
    {
        AscendC::LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();

        AscendC::DataCopyParams dataCopyParams;
        dataCopyParams.blockCount = 1;
        dataCopyParams.blockLen = 2;
        AscendC::DataCopyEnhancedParams enhancedParams;
        enhancedParams.blockMode = AscendC::BlockMode::BLOCK_MODE_MATRIX;
        AscendC::DataCopy(c2Local[bSplitIdx * cSize / 2], c1Local, dataCopyParams, enhancedParams);

        outQueueCO1.FreeTensor(c1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> c2Local = outQueueCO2.DeQue<float>();

        // transform nz to nd
        for (int i = 0; i < nBlocks; ++i) {
            AscendC::DataCopy(cGM[i * 16], c2Local[i * m * 16], { m, 2, 0, uint16_t((nBlocks - 1) * 2) });
        }

        outQueueCO2.FreeTensor(c2Local);
    }
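The index mapping performed by the CopyOut loop can also be reproduced on the host. The following is a minimal sketch in plain C++ that calls no Ascend C API. The names NzToNdReference and kBlock are hypothetical. The sketch assumes, as the sample does, that the result is stored as nBlocks strips of 16 columns each, so that n equals nBlocks * 16.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference only. kBlock (hypothetical name) is the 16-element
// strip width: blockLen = 2 in the sample is two 32-byte blocks of float.
constexpr std::size_t kBlock = 16;

// Rearranges an NZ-format m x (nBlocks * 16) matrix into row-major (ND)
// order, mirroring the per-strip DataCopy calls in CopyOut.
template <typename T>
std::vector<T> NzToNdReference(const std::vector<T>& src, std::size_t m, std::size_t nBlocks)
{
    const std::size_t n = nBlocks * kBlock;
    assert(src.size() == m * n);
    std::vector<T> dst(src.size());
    for (std::size_t i = 0; i < nBlocks; ++i) {          // one DataCopy call per strip
        const std::size_t srcOffset = i * m * kBlock;
        const std::size_t dstOffset = i * kBlock;
        for (std::size_t r = 0; r < m; ++r) {            // blockCount = m
            for (std::size_t c = 0; c < kBlock; ++c) {   // 16 elements per row of the strip
                // Source rows are contiguous; destination rows are n elements apart,
                // which corresponds to dstStride = (nBlocks - 1) * 2 blocks.
                dst[dstOffset + r * n + c] = src[srcOffset + r * kBlock + c];
            }
        }
    }
    return dst;
}
```

Such a reference is mainly useful for checking the stride arguments: the destination stride in the sample skips the other nBlocks - 1 strips of each output row.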

Separated mode

In separated mode, the matrix multiplication result can be written directly from CO1 (L0C Buffer) to the GM through Fixpipe. Fixpipe also provides real-time NZ2ND conversion, so no manual format conversion is needed. In the following example, the Aggregate phase is omitted and CopyOut writes the result directly.

    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
        AscendC::FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = n;
        fixpipeParams.mSize = m;
        fixpipeParams.srcStride = m;
        fixpipeParams.dstStride = n;

        fixpipeParams.ndNum = 1;
        fixpipeParams.srcNdStride = 0;
        fixpipeParams.dstNdStride = 0;
        AscendC::Fixpipe(cGM, c1Local, fixpipeParams);
        outQueueCO1.FreeTensor(c1Local);
    }
