Separated Architecture
For the separated architecture, implementing the matrix multiplication operator with basic APIs follows the same programming paradigm as the coupled architecture; only the implementation details differ with the hardware architecture. This section describes only the differences. For the full code, see the Mmad sample.
- Differences in the CopyIn phase
- Coupled architecture
In the CopyIn phase, that is, moving data from GM to A1/B1 (L1 Buffer), the coupled architecture can use the DataCopy API to move data from GM directly to the L1 Buffer, or from GM to UB and then to the L1 Buffer. If ND2NZ format conversion is required, you can either implement it yourself or use the format conversion provided by the DataCopy API; note that the latter consumes temporary UB space.
The following example uses the GM -> A1/B1 data move instruction to implement the ND2NZ conversion manually.
```cpp
__aicore__ inline void CopyND2NZ(const AscendC::LocalTensor<half>& dst, const AscendC::GlobalTensor<half>& src,
                                 const uint16_t height, const uint16_t width)
{
    for (int i = 0; i < width / 16; ++i) {
        int srcOffset = i * 16;
        int dstOffset = i * 16 * height;
        AscendC::DataCopy(dst[dstOffset], src[srcOffset], { height, 1, uint16_t(width / 16 - 1), 0 });
    }
}

__aicore__ inline void CopyIn()
{
    AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
    AscendC::LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();
    CopyND2NZ(a1Local, aGM, m, k);
    CopyND2NZ(b1Local, bGM, k, n);
    inQueueA1.EnQue(a1Local);
    inQueueB1.EnQue(b1Local);
}
```
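To make the address arithmetic in CopyND2NZ concrete, the following host-side sketch reproduces the same ND -> NZ relayout with plain loops (this is illustrative C++ only, not Ascend C API code; the element type stands in for half). Each 16-element-wide column block of the source matrix is copied as one contiguous height x 16 slab in the destination, which is exactly what `dstOffset = i * 16 * height` and the source stride of `width / 16 - 1` blocks express.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side reference of the ND -> NZ relayout (illustrative sketch).
// C0 = 16 half elements = one 32-byte fractal row.
std::vector<uint16_t> Nd2NzRef(const std::vector<uint16_t>& src, int height, int width)
{
    const int c0 = 16;                       // elements per 32-byte block for half
    std::vector<uint16_t> dst(height * width);
    for (int i = 0; i < width / c0; ++i) {   // one pass per 16-wide column block
        for (int r = 0; r < height; ++r) {
            for (int c = 0; c < c0; ++c) {
                // matches dstOffset = i * 16 * height and srcOffset = i * 16
                dst[i * c0 * height + r * c0 + c] = src[r * width + i * c0 + c];
            }
        }
    }
    return dst;
}
```

For a 2 x 32 row-major source, element (row 0, col 16) lands at destination offset 32, the start of the second column-block slab.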
- Separated architecture
In the separated architecture, only the direct path from GM to A1/B1 (L1 Buffer) is available, and data cannot pass through the UB. The format conversion provided by the DataCopy API can be used to complete the ND2NZ conversion without consuming temporary UB space.
The following is an example:
```cpp
__aicore__ inline void CopyIn()
{
    AscendC::LocalTensor<half> a1Local = inQueueA1.AllocTensor<half>();
    AscendC::LocalTensor<half> b1Local = inQueueB1.AllocTensor<half>();

    AscendC::Nd2NzParams nd2nzA1Params;
    nd2nzA1Params.ndNum = 1;
    nd2nzA1Params.nValue = m;
    nd2nzA1Params.dValue = k;
    nd2nzA1Params.srcNdMatrixStride = 0;
    nd2nzA1Params.srcDValue = k;
    nd2nzA1Params.dstNzC0Stride = CeilCubeBlock(m) * CUBE_BLOCK;
    nd2nzA1Params.dstNzNStride = 1;
    nd2nzA1Params.dstNzMatrixStride = 0;
    AscendC::DataCopy(a1Local, aGM, nd2nzA1Params);

    AscendC::Nd2NzParams nd2nzB1Params;
    nd2nzB1Params.ndNum = 1;
    nd2nzB1Params.nValue = k;
    nd2nzB1Params.dValue = n;
    nd2nzB1Params.srcNdMatrixStride = 0;
    nd2nzB1Params.srcDValue = n;
    nd2nzB1Params.dstNzC0Stride = CeilCubeBlock(k) * CUBE_BLOCK;
    nd2nzB1Params.dstNzNStride = 1;
    nd2nzB1Params.dstNzMatrixStride = 0;
    AscendC::DataCopy(b1Local, bGM, nd2nzB1Params);

    inQueueA1.EnQue(a1Local);
    inQueueB1.EnQue(b1Local);
}
```
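The `dstNzC0Stride` value is the matrix height rounded up to a multiple of the 16 x 16 cube block, i.e. the padded row count of one NZ column block. Assuming the sample's `CUBE_BLOCK = 16` and a `CeilCubeBlock` helper that performs a ceiling division (both taken from the Mmad sample; treat the exact definitions as assumptions), the computation can be sketched on the host as:

```cpp
#include <cassert>
#include <cstdint>

// Assumed to mirror the sample's helpers: CUBE_BLOCK is the 16-element cube
// edge, CeilCubeBlock rounds a length up to whole cube blocks.
constexpr uint16_t CUBE_BLOCK = 16;

constexpr uint16_t CeilCubeBlock(uint16_t len)
{
    return (len + CUBE_BLOCK - 1) / CUBE_BLOCK;  // number of blocks covering len
}

// dstNzC0Stride = CeilCubeBlock(height) * CUBE_BLOCK: padded rows per NZ block.
constexpr uint16_t DstNzC0Stride(uint16_t height)
{
    return CeilCubeBlock(height) * CUBE_BLOCK;
}
```

For example, a matrix with m = 31 rows is padded to 32 rows in the NZ layout, so `dstNzC0Stride` is 32.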
- Differences in the Aggregate and CopyOut phases
- Coupled architecture
In the coupled architecture, the matrix multiplication result is stored in CO1 (L0C Buffer) and is finally moved to GM through CO2 (UB). In addition, the NZ2ND format conversion must be completed manually along the CO1 -> CO2 -> GM path. In the following example, the NZ-format data is moved from CO1 to CO2 in the Aggregate phase; in the CO2 -> GM phase, DataCopy is called in a for loop to complete the format conversion.
```cpp
__aicore__ inline void Aggregate(const AscendC::LocalTensor<float>& c2Local, const int bSplitIdx)
{
    AscendC::LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
    AscendC::DataCopyParams dataCopyParams;
    dataCopyParams.blockCount = 1;
    dataCopyParams.blockLen = 2;
    AscendC::DataCopyEnhancedParams enhancedParams;
    enhancedParams.blockMode = AscendC::BlockMode::BLOCK_MODE_MATRIX;
    AscendC::DataCopy(c2Local[bSplitIdx * cSize / 2], c1Local, dataCopyParams, enhancedParams);
    outQueueCO1.FreeTensor(c1Local);
}

__aicore__ inline void CopyOut()
{
    AscendC::LocalTensor<float> c2Local = outQueueCO2.DeQue<float>();
    // transform nz to nd
    for (int i = 0; i < nBlocks; ++i) {
        AscendC::DataCopy(cGM[i * 16], c2Local[i * m * 16], { m, 2, 0, uint16_t((nBlocks - 1) * 2) });
    }
    outQueueCO2.FreeTensor(c2Local);
}
```
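The NZ2ND loop in CopyOut above can be read as the inverse of the CopyIn relayout: each iteration writes one 16-float-wide column block (blockLen 2 x 32 bytes) back into its row-major position, with the destination stride `(nBlocks - 1) * 2` skipping over the other column blocks of the same row. The following host-side sketch (illustrative C++ only, not Ascend C API code) spells out that index mapping:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side reference of the NZ -> ND relayout (illustrative sketch).
// bw = 16 floats = blockLen 2 x 32-byte blocks per row of a column block.
std::vector<float> Nz2NdRef(const std::vector<float>& src, int m, int nBlocks)
{
    const int bw = 16;
    const int n = nBlocks * bw;               // full ND row width
    std::vector<float> dst(m * n);
    for (int i = 0; i < nBlocks; ++i) {       // one DataCopy call per column block
        for (int r = 0; r < m; ++r) {
            for (int c = 0; c < bw; ++c) {
                // dst stride (nBlocks - 1) * 2 blocks skips the other column blocks
                dst[r * n + i * bw + c] = src[i * m * bw + r * bw + c];
            }
        }
    }
    return dst;
}
```

Applying `Nz2NdRef` to the output of an ND -> NZ relayout recovers the original row-major matrix, which is a quick way to sanity-check the strides.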
- Separated architecture
In the separated architecture, the matrix multiplication result can be written directly from CO1 (L0C Buffer) to GM through the FixPipe unit. In addition, FixPipe provides the NZ2ND conversion, simplifying format conversion. In the following example, the Aggregate phase is omitted and CopyOut is used directly.
```cpp
__aicore__ inline void CopyOut()
{
    AscendC::LocalTensor<float> c1Local = outQueueCO1.DeQue<float>();
    AscendC::FixpipeParamsV220 fixpipeParams;
    fixpipeParams.nSize = n;
    fixpipeParams.mSize = m;
    fixpipeParams.srcStride = m;
    fixpipeParams.dstStride = n;
    fixpipeParams.ndNum = 1;
    fixpipeParams.srcNdStride = 0;
    fixpipeParams.dstNdStride = 0;
    AscendC::Fixpipe(cGM, c1Local, fixpipeParams);
    outQueueCO1.FreeTensor(c1Local);
}
```