Inaccurate Operators with Matmul High-Level APIs

This section describes how to preliminarily demarcate and locate precision issues of operators with Matmul high-level APIs. Unless otherwise specified, the following uses the Atlas A2 training products / Atlas A2 inference products as an example.

Perform the following steps to locate the fault:

Perform debugging on the CPU and observe the error messages.
Check whether Matmul tiling is modified properly.
Check whether the operator function is correct when only the Matmul API is called for hiding Vector computation.
Check whether the operator function is correct during single-core execution.
Check whether the Matmul API is correctly used.
Check whether the golden script used for operator debugging is correct.

Perform debugging on the CPU and observe the error messages.

After the operator code is developed, use the CPU debugging project for kernel launch to debug the operator. During debugging on the CPU, if an error is reported during compilation or execution, obvious error messages are displayed in logs. Based on the error messages, you can quickly locate the code corresponding to the problem. This method can be used to quickly locate the cause of basic parameter problems, such as invalid address access caused by incorrect DataCopy parameter settings, incorrect operator tiling parameter settings, and invalid memory access.

Example:

The following is the code snippet of the kernel function of the Matmul operator. Based on the matrices A and B and the tiling information on the Global Memory, the code snippet computes the address offset of the data to be used by each core, creates a Matmul object, and computes the Matmul result.

         
          
            
            
              extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace, GM_ADDR tilingGm)
{
    using A_T = half;
    using B_T = half;
    using C_T = float;
 
    AscendC::TPipe pipe;
    TCubeTiling tiling;
    CopyTiling(&tiling, tilingGm);
 
    AscendC::GlobalTensor<A_T> aGlobal;
    AscendC::GlobalTensor<B_T> bGlobal;
    AscendC::GlobalTensor<C_T> cGlobal;
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ A_T *>(a), tiling.M * tiling.Ka);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ B_T *>(b), tiling.Ka * tiling.N);
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ C_T *>(c), tiling.M * tiling.N);
 
    int offsetA = 0;
    int offsetB = 0;
    int offsetC = 0;
    bool isTransA = false;
    bool isTransB = true;
 
    int tailM = 0;
    int tailN = 0;
    CalcGMOffset(GetBlockIdx(), tiling, offsetA, offsetB, offsetC, tailM, tailN, isTransA, isTransB);
 
    auto gmA = aGlobal[offsetA];
    auto gmB = bGlobal[offsetB];
    auto gmC = cGlobal[offsetC];
 
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, A_T>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, B_T>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, C_T>> mm;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling);

    mm.SetTensorA(gmA, isTransA);
    mm.SetTensorB(gmB, isTransB);
    mm.SetTail(tailM, tailN);
    mm.IterateAll(gmC);
    mm.End();
}

             

           

         
        

The following shows the output of the preceding code during debugging on the CPU. The path in the following example is for reference only.

         
              [ASSERT] $HOME/Ascend/xxxxx/include/ascendc/highlevel_api/lib/matmul/matmul_client.h:268: Assertion `isTransposeB <= B_TYPE::isTrans && "It is not allowed to do B transpose when matmul B transpose is not defined."'
[ASSERT] $HOME/Ascend/xxxxx/include/ascendc/highlevel_api/lib/matmul/matmul_client.h:268: Assertion `isTransposeB <= B_TYPE::isTrans && "It is not allowed to do B transpose when matmul B transpose is not defined."'

The preceding example shows an inaccurate operator. Therefore, the CPU is used to debug the operator. After the operator is running on the CPU, the error message indicates that the transpose of matrix B is not defined. Check the configuration code of matrix B. It is found that B_TYPE::isTrans of matrix B is not set when the Matmul object is defined. However, isTransB is set to true for the SetTensorB API. As a result, an error is reported during execution. Therefore, the root cause of this problem is that the value of isTransB set in SetTensorB is inconsistent with B_TYPE.

Check whether Matmul tiling is modified properly.

Generally, in the operator tiling implementation containing Matmul, the GetTiling API is called to obtain the Matmul tiling result, which is of the TCubeTiling type. In this case, the tiling values are valid. In some cases, if you customize a group of TCubeTiling parameter values or change some of the values based on the TCubeTiling structure returned by the GetTiling API, such modification must meet the constraints between parameters.

To obtain all tiling parameter values, you need to print logs related to tiling parameters. Set the log environment variables and obtain the MatmulTiling parameter value. The commands for setting environment variables are as follows:

       
            export ASCEND_GLOBAL_LOG_LEVEL=1
export ASCEND_SLOG_PRINT_TO_STDOUT=1

Search for the keyword MatmulTiling in the log and check whether the tiling value is valid by referring to TCubeTiling constraints. If a constraint is not met, modify the corresponding parameters to ensure that the values of the TCubeTiling parameters in the group are valid.

       
        
          
          
            cat test_tiling.log |grep MatmulTiling // test_tiling.log is the example name of the log file.
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.864 [matmul_tiling_base.cpp:697][PrintTilingDataInfo] MatmulTiling: M             = 1024
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.870 [matmul_tiling_base.cpp:698][PrintTilingDataInfo] MatmulTiling: N             = 640
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.873 [matmul_tiling_base.cpp:699][PrintTilingDataInfo] MatmulTiling: Ka            = 256
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.876 [matmul_tiling_base.cpp:700][PrintTilingDataInfo] MatmulTiling: Kb            = 256
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.879 [matmul_tiling_base.cpp:701][PrintTilingDataInfo] MatmulTiling: singleCoreM   = 512
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.882 [matmul_tiling_base.cpp:702][PrintTilingDataInfo] MatmulTiling: singleCoreN   = 640
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.884 [matmul_tiling_base.cpp:703][PrintTilingDataInfo] MatmulTiling: singleCoreK   = 256
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.887 [matmul_tiling_base.cpp:704][PrintTilingDataInfo] MatmulTiling: baseM         = 256
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.890 [matmul_tiling_base.cpp:705][PrintTilingDataInfo] MatmulTiling: baseN         = 128
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.893 [matmul_tiling_base.cpp:706][PrintTilingDataInfo] MatmulTiling: baseK         = 64
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.896 [matmul_tiling_base.cpp:707][PrintTilingDataInfo] MatmulTiling: depthA1       = 10
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.899 [matmul_tiling_base.cpp:708][PrintTilingDataInfo] MatmulTiling: depthB1       = 2
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.902 [matmul_tiling_base.cpp:709][PrintTilingDataInfo] MatmulTiling: depthAL1CacheUB     = 0
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.905 [matmul_tiling_base.cpp:710][PrintTilingDataInfo] MatmulTiling: depthBL1CacheUB     = 0
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.908 [matmul_tiling_base.cpp:711][PrintTilingDataInfo] MatmulTiling: stepM         = 2
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.912 [matmul_tiling_base.cpp:712][PrintTilingDataInfo] MatmulTiling: stepN         = 1
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.915 [matmul_tiling_base.cpp:713][PrintTilingDataInfo] MatmulTiling: isBias        = 1
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.917 [matmul_tiling_base.cpp:714][PrintTilingDataInfo] MatmulTiling: transLength   = 0
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.920 [matmul_tiling_base.cpp:715][PrintTilingDataInfo] MatmulTiling: iterateOrder  = 0
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.923 [matmul_tiling_base.cpp:716][PrintTilingDataInfo] MatmulTiling: shareMode     = 0
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.926 [matmul_tiling_base.cpp:717][PrintTilingDataInfo] MatmulTiling: usedL1Size    = 295424
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.929 [matmul_tiling_base.cpp:718][PrintTilingDataInfo] MatmulTiling: usedL0CSize   = 131072
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.932 [matmul_tiling_base.cpp:719][PrintTilingDataInfo] MatmulTiling: usedUBSize    = 0
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.935 [matmul_tiling_base.cpp:720][PrintTilingDataInfo] MatmulTiling: batchM        = 1
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.938 [matmul_tiling_base.cpp:721][PrintTilingDataInfo] MatmulTiling: batchN        = 1
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.941 [matmul_tiling_base.cpp:722][PrintTilingDataInfo] MatmulTiling: singleBatchM  = 1
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.943 [matmul_tiling_base.cpp:723][PrintTilingDataInfo] MatmulTiling: singleBatchN  = 1
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.946 [matmul_tiling_base.cpp:724][PrintTilingDataInfo] MatmulTiling: stepKa        = 4
[INFO] ASCENDCKERNEL(1202803,ascendc_kernels_bbit):2024-10-12-08:53:59.636.949 [matmul_tiling_base.cpp:725][PrintTilingDataInfo] MatmulTiling: stepKb        = 1

           

         

       
      

For example, check the value of each parameter according to the preceding TCubeTiling parameters and the TCubeTiling constraints. The value of depthA1 should be equal to stepM × stepKa or stepM × stepKa × 2. However, the actual value of depthA1 is 10, not 8 (stepM × stepKa) or 16 (stepM × stepKa × 2), which does not meet the constraint. Therefore, the value of depthA1 needs to be corrected.

Check whether the operator function is correct when only the Matmul API is called for hiding Vector computation.

The code of a fused operator contains both Matmul APIs and Vector computing APIs. Delete the Vector computing APIs from the operator code and retain only the Matmul APIs to quickly check whether the Matmul APIs are incorrectly used, which causes the inaccuracy of the fused operator. The troubleshooting procedure is as follows: Modify the operator code logic, delete the Vector computing code, and modify the golden script accordingly. After the adaptation is complete, execute the operator on the CPU or NPU and check whether the operator result is correct. If the result is correct, the Matmul APIs in the code are used correctly. To locate the operator inaccuracy, check the Vector computation. If the operator result is incorrect, check whether the Matmul APIs are used correctly.

Example:

Take the fused operator matmul_leakyrelu as an example. After the operator is executed, the inaccuracy problem shown below occurs.

         
              data index: 000195, expected: -0.693000019, actual: -69.300003052, rdiff: -99.000000
data index: 000196, expected: -0.209000006, actual: -20.899999619, rdiff: -99.000000
data index: 000197, expected: -0.517000020, actual: -51.700000763, rdiff: -99.000000
data index: 000200, expected: -0.193000004, actual: -19.300001144, rdiff: -99.000000
data index: 000202, expected: -0.684000015, actual: -68.400001526, rdiff: -99.000000
data index: 000204, expected: -0.422000021, actual: -42.200000763, rdiff: -98.999992
data index: 000209, expected: -0.109000005, actual: -10.900000572, rdiff: -99.000000
error ratio: 0.4517, tolerance: 0.0001
[ERROR] result error

Modify the operator code and comment out the LeakyRelu API computation. In addition, modify the corresponding memory allocation and involved synchronization code. Then, comment out the LeakyReLU computation in the golden script. The following is an example.

The following is the code snippet of the operator kernel function.

         
          
            
            
              template <typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::Process(AscendC::TPipe *pipe)
{
    uint32_t computeRound = 0;

    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.SetBias(biasGlobal);
    while (matmulObj.template Iterate<true>()) {
        MatmulCompute();
        // LeakyReluCompute(); // Comment out the LeakyReluCompute Vector computation.
        CopyOut(computeRound);
        computeRound++;
    }
    matmulObj.End();
}
 
template <typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::MatmulCompute()
{
    reluOutLocal = reluOutQueue_.AllocTensor<cType>();
    matmulObj.template GetTensorC<true>(reluOutLocal, false, true);
    reluOutQueue_.EnQue(reluOutLocal); // Moves the output of reluOutLocal in the LeakyReluCompute() API here.
}
 
template <typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::LeakyReluCompute()
{
    LeakyRelu(reluOutLocal, reluOutLocal, (cType)0.1, tiling.baseM * tiling.baseN);
    reluOutQueue_.EnQue(reluOutLocal);
}
 
template <typename aType, typename bType, typename cType, typename biasType>
__aicore__ inline void MatmulLeakyKernel<aType, bType, cType, biasType>::CopyOut(uint32_t count)
{
    reluOutQueue_.DeQue<cType>();
    const uint32_t roundM = tiling.singleCoreM / tiling.baseM;
    const uint32_t roundN = tiling.singleCoreN / tiling.baseN;
    uint32_t startOffset = (count % roundM * tiling.baseM * tiling.N + count / roundM * tiling.baseN);
    AscendC::DataCopyParams copyParam = {(uint16_t)tiling.baseM, (uint16_t)(tiling.baseN * sizeof(cType) / AscendC::DEFAULT_C0_SIZE), 0,
                                (uint16_t)((tiling.N - tiling.baseN) * sizeof(cType) / AscendC::DEFAULT_C0_SIZE)};
    DataCopy(cGlobal[startOffset], reluOutLocal, copyParam);
    reluOutQueue_.FreeTensor(reluOutLocal);
}

             

           

         
        

The following is a code snippet of the golden script.

         
              def gen_golden_data():
    M = 1024
    N = 640
    K = 256
 
    input_a = np.random.randint(-10, 10, [M, K]).astype(np.float16)
    input_b = np.random.randint(-10, 10, [K, N]).astype(np.float16)
    input_bias = np.random.randint(-10, 10, [N]).astype(np.float32)
    alpha = 0.001
    golden = (np.matmul(input_a.astype(np.float32), input_b.astype(np.float32)) + input_bias).astype(np.float32)
    # golden = np.where(golden >= 0, golden, golden * alpha) # Same as that of the kernel. The corresponding LeakyRelu computation must be commented out during golden data generation.
    os.system("mkdir -p input")
    os.system("mkdir -p output")
    input_a.tofile("./input/x1_gm.bin")
    input_b.tofile("./input/x2_gm.bin")
    input_bias.tofile("./input/bias.bin")
    golden.tofile("./output/golden.bin")

After the LeakyReLU computation is deleted, run the test case. You can find that the operator result is correct through comparison.

         
              -- Installing: $HOME/samples/Precision_Check_Guide/samples-master/operator/MatmulLeakyReluCustomSample/KernelLaunch/MatmulLeakyReluInvocation_cube_vec/out/bin/ascendc_kernels_bbit
8901941eee314bcd64d24ff5f8d21247  output/golden.bin
8901941eee314bcd64d24ff5f8d21247  output/output.bin
error ratio: 0.0000, tolerance: 0.0001
test pass

Then you can determine that the Matmul API has been correctly used in the operator code and the correct Matmul API computation result has been obtained. You need to continue to locate the issue during the usage of the LeakyRelu API in the LeakyReluCompute function.

Check whether the operator function is correct during single-core execution.

Verify the operator function on a single core to quickly determine whether the computation result of the Matmul API does not meet the expectation or the Matmul API is incorrectly called in the operator code. The Matmul API implements the single-core compute logic. Therefore, the computation result on a single core is correct, but that across multiple cores is incorrect. This indicates that the Matmul API on a single core is used and computed correctly. Check whether the code logic related to multi-core tiling is correct. For example, check whether the input and output address offsets of each core are correct and whether the tail block address of each core is correctly set. If the operator is inaccurate in the single-core scenario, check whether the Matmul API is correctly used. For details, see Step 5.

Note: In the operator tiling implementation that contains Matmul, the multi-core tiling of Matmul needs to use MultiCoreMatmulTiling to construct a multi-core tiling object, and the SetDim API is used to set the number of cores used for Matmul computation. Note: This core number is set only in the multi-core scenario and is used to compute tiling parameters. The following two cases demonstrate the operators in mixed mode. For details about the SetDim setting rules, see the rules for setting the number of cores in the mixed scenario.

Case 1: In the multi-core tiling scenario, the output address offset is incorrect.

Take Matmul with M = 512, N = 1024, and K = 512 as an example. In the operator code in mixed mode, the number of AIC cores is set to 4 and the number of AIV cores is set to 8. Because the separation mode is used here, SetDim is set to 8. When this operator is executed in the multi-core scenario, the computation result is inaccurate.

The following is the code snippet for operator tiling.

         
          
            
            
              uint8_t *GenerateTiling(const char *socVersion)
{
    int M = 512;
    int N = 1024;
    int K = 512;
 
    TPosition leftPosition = TPosition::GM;
    CubeFormat leftFormat = CubeFormat::ND;
    DataType leftDtype = DataType::DT_FLOAT16;
    bool isTransA = false;
 
    TPosition rightPosition = TPosition::GM;
    CubeFormat rightFormat = CubeFormat::ND;
    DataType rightDtype = DataType::DT_FLOAT16;
    bool isTransB = false;
 
    TPosition resultPosition = TPosition::GM;
    CubeFormat resultFormat = CubeFormat::ND;
    DataType resultDtype = DataType::DT_FLOAT;
 
    bool isBias = false;
 
    int usedCoreNum = 8;
    int32_t baseM = 128;
    int32_t baseN = 256;
 
    optiling::TCubeTiling tilingData;
    auto ascendcPlatform = platform_ascendc::PlatformAscendCManager::GetInstance(socVersion);
    MultiCoreMatmulTiling tilingApi(*ascendcPlatform);
 
    tilingApi.SetDim(usedCoreNum); // Set the AIV cores to 8.
    tilingApi.SetAType(leftPosition, leftFormat, leftDtype, isTransA);
    tilingApi.SetBType(rightPosition, rightFormat, rightDtype, isTransB);
    tilingApi.SetCType(resultPosition, resultFormat, resultDtype);
 
    tilingApi.SetOrgShape(M, N, K);
    tilingApi.SetShape(M, N, K);
    tilingApi.SetFixSplit(baseM, baseN, -1);
    tilingApi.SetBias(isBias);
    tilingApi.SetBufferSpace(-1, -1, -1);
 
    int64_t res = tilingApi.GetTiling(tilingData);
    if (res == -1) {
        std::cout << "gen tiling failed" << std::endl;
    }
    return GetTilingBuf(&tilingData);
}

             

           

         
        

The following is the code snippet of the operator kernel function.

         
          
            
            
              __aicore__ inline void CalcGMOffset(int blockIdx, const TCubeTiling &tiling, int &offsetA, int &offsetB, int &offsetC,
                                    int &tailM, int &tailN, bool isTransA, bool isTransB)
{
    uint32_t mSingleBlocks = CeilDiv(tiling.M, tiling.singleCoreM);
    uint32_t mCoreIndx = blockIdx % mSingleBlocks;
    uint32_t nCoreIndx = blockIdx / mSingleBlocks;
 
    offsetA = mCoreIndx * tiling.Ka * tiling.singleCoreM;
    if (isTransA) {
        offsetA = mCoreIndx * tiling.singleCoreM;
    }
    offsetB = nCoreIndx * tiling.singleCoreN;
    if (isTransB) {
        offsetB = nCoreIndx * tiling.Kb * tiling.singleCoreN;
    }
    offsetC = mCoreIndx * tiling.singleCoreN * tiling.singleCoreM + nCoreIndx * tiling.singleCoreN; //The tiling.singleCoreN parameter is incorrect. Change it to tiling.N.
 
    tailM = tiling.M - mCoreIndx * tiling.singleCoreM;
    tailM = tailM < tiling.singleCoreM ? tailM : tiling.singleCoreM;
 
    tailN = tiling.N - nCoreIndx * tiling.singleCoreN;
    tailN = tailN < tiling.singleCoreN ? tailN : tiling.singleCoreN;
}
 
extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace,
                                                    GM_ADDR tilingGm)
{
    using A_T = half;
    using B_T = half;
    using C_T = float;
 
    AscendC::TPipe pipe;
    TCubeTiling tiling;
    CopyTiling(&tiling, tilingGm);
 
    AscendC::GlobalTensor<A_T> aGlobal;
    AscendC::GlobalTensor<B_T> bGlobal;
    AscendC::GlobalTensor<C_T> cGlobal;
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ A_T *>(a), tiling.M * tiling.Ka);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ B_T *>(b), tiling.Ka * tiling.N);
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ C_T *>(c), tiling.M * tiling.N);
 
    int offsetA = 0;
    int offsetB = 0;
    int offsetC = 0;
    bool isTransA = false;
    bool isTransB = false;
 
    int tailM = 0;
    int tailN = 0;
    CalcGMOffset(GetBlockIdx(), tiling, offsetA, offsetB, offsetC, tailM, tailN, isTransA, isTransB);
 
    auto gmA = aGlobal[offsetA];
    auto gmB = bGlobal[offsetB];
    auto gmC = cGlobal[offsetC];
 
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, A_T>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, B_T>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, C_T>> mm;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling);

    mm.SetTensorA(gmA, isTransA);
    mm.SetTensorB(gmB, isTransB);
    mm.SetTail(tailM, tailN);
    mm.IterateAll(gmC);
    mm.End();
}

             

           

         
        

The precision verification fails after operator execution.

         
              data index: 000609, expected: 12979.000000000, actual: 0.000000000, rdiff: 1.000000
data index: 000610, expected: 12931.000000000, actual: 0.000000000, rdiff: 1.000000
data index: 000611, expected: 13120.000000000, actual: 0.000000000, rdiff: 1.000000
data index: 000612, expected: 12275.000000000, actual: 0.000000000, rdiff: 1.000000
error ratio: 0.8750, tolerance: 0.0001
[ERROR] result error

Modify the test script and operator tiling code, and verify the operator execution result on a single core to quickly demarcate the fault. The details are as follows:

Modify the operator debugging code to start only a single core. In the CPU debugging code, set blockDim in the ICPU_RUN_KF macro API to 1 (number of AIC and AIV combinations). In the operator tiling implementation on a single core, set the number of AIC cores to 1 and the number of AIV cores to 2, and set SetDim to 2. The code is shown below.

The following is the code snippet of the debugging script.

         
              uint32_t blockDim = 1;
ICPU_RUN_KF(matmul_custom, blockDim, a, b, c, workspace, tiling);

The following is the code snippet for operator tiling.

         
              int usedCoreNum = 2;
tilingApi.SetDim(usedCoreNum);

After the single-core scenario is used, execute the operator:

         
              -- Installing: $HOME/samples/Precision_Check_Guide/samples-master/operator/MatmulCustomSample/KernelLaunch/MatmulInvocationNeo-muticore/out/bin/ascendc_kernels_bbit
efaf4dc1e484bc3778cac65f56244e59  output/golden.bin
efaf4dc1e484bc3778cac65f56244e59  output/output.bin
error ratio: 0.0000, tolerance: 0.0001
test pass

According to the preceding comparison, the single-core computation is verified to be correct. In this case, the precision problem is related to the multi-core logic.

Check the input and output address offsets after multi-core splitting. Analyze the CalcGMOffset function. It is found that the offset address of matrix C offsetC is incorrectly calculated. The correct offset should be mCoreIndx * tiling.N * tiling.singleCoreM + nCoreIndx * tiling.singleCoreN. After offsetC is changed to the correct offset address, execute the operator. The comparison result is correct.

Note: In the preceding single-core scenario, the number of AIC cores is 1 and the number of AIV cores is 2. To further verify the modification, do not introduce any multi-core splitting. Change both the number of AIC cores and the number of AIV cores to 1. The following is a code modification example:

After the REGIST_MATMUL_OBJ API is called in the kernel function, the AIV core whose BlockIdx is not 0 exits based on the judgment code.

The following is the code snippet of the operator kernel function.

           
            
              
              
                extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace,
                                                    GM_ADDR tilingGm)
{
    using A_T = half;
    using B_T = half;
    using C_T = float;
 
    AscendC::TPipe pipe;
    TCubeTiling tiling;
    CopyTiling(&tiling, tilingGm);
 
    AscendC::GlobalTensor<A_T> aGlobal;
    AscendC::GlobalTensor<B_T> bGlobal;
    AscendC::GlobalTensor<C_T> cGlobal;
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ A_T *>(a), tiling.M * tiling.Ka);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ B_T *>(b), tiling.Ka * tiling.N);
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ C_T *>(c), tiling.M * tiling.N);
 
    int offsetA = 0;
    int offsetB = 0;
    int offsetC = 0;
    bool isTransA = false;
    bool isTransB = false;
 
    int tailM = 0;
    int tailN = 0;
    CalcGMOffset(GetBlockIdx(), tiling, offsetA, offsetB, offsetC, tailM, tailN, isTransA, isTransB);
 
    auto gmA = aGlobal[offsetA];
    auto gmB = bGlobal[offsetB];
    auto gmC = cGlobal[offsetC];
 
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, A_T>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, B_T>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, C_T>> mm;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling);
    if (GetBlockIdx() == 1) {
        return;
    }
     
    mm.SetTensorA(gmA, isTransA);
    mm.SetTensorB(gmB, isTransB);
    mm.SetTail(tailM, tailN);
    mm.IterateAll(gmC);
    mm.End();
}

               

             

           
          

Set blockDim of ICPU_RUN_KF in the operator debugging script and usedCoreNum of SetDim in operator tiling to 1.

The following is the code snippet for operator debugging.

           
                uint32_t blockDim = 1;
ICPU_RUN_KF(matmul_custom, blockDim, a, b, c, workspace, tiling);

The following is the code snippet for operator tiling.

           
                int usedCoreNum = 1;
tilingApi.SetDim(usedCoreNum);

Case 2: The tail block is incorrectly set.

In the multi-core scenario, if the value of singleCoreM/singleCoreN/singleCoreK of the last core is different from that of the previous core, the SetTail API needs to be called on the last core (tail core) to change the value of singleCoreM/singleCoreN/singleCoreK to the corresponding value of the actual tail core. If these parameters are not set for the tail core or the parameter values are incorrect, the multi-core accuracy is affected but the single-core accuracy is normal.

         
              data index: 100254, expected: 13605.000000000, actual: 13137.000000000, rdiff: 0.034399
data index: 101277, expected: 13268.000000000, actual: 13419.000000000, rdiff: 0.011381
data index: 102300, expected: 13509.000000000, actual: 13114.000000000, rdiff: 0.029240
data index: 103323, expected: 13526.000000000, actual: 13400.000000000, rdiff: 0.009315
error ratio: 0.0010, tolerance: 0.0001
[ERROR] result error

The following is the code snippet of the operator kernel function.

          
           
             
             
               __aicore__ inline void CalcGMOffset(int blockIdx, const TCubeTiling &tiling, int &offsetA, int &offsetB, int &offsetC,
                                    int &tailM, int &tailN, bool isTransA, bool isTransB)
{
    uint32_t mSingleBlocks = CeilDiv(tiling.M, tiling.singleCoreM);
    uint32_t mCoreIndx = blockIdx % mSingleBlocks;
    uint32_t nCoreIndx = blockIdx / mSingleBlocks;
 
    offsetA = mCoreIndx * tiling.Ka * tiling.singleCoreM;
    if (isTransA) {
        offsetA = mCoreIndx * tiling.singleCoreM;
    }
    offsetB = nCoreIndx * tiling.singleCoreN;
    if (isTransB) {
        offsetB = nCoreIndx * tiling.Kb * tiling.singleCoreN;
    }
    offsetC = mCoreIndx * tiling.N * tiling.singleCoreM + nCoreIndx * tiling.singleCoreN;
 
    // M and N computations corresponding to the tail core. The computation method here is correct.
    tailM = tiling.M - mCoreIndx * tiling.singleCoreM;
    tailM = tailM < tiling.singleCoreM ? tailM : tiling.singleCoreM;
 
    tailN = tiling.N - nCoreIndx * tiling.singleCoreN;
    tailN = tailN < tiling.singleCoreN ? tailN : tiling.singleCoreN;
}
 
extern "C" __global__ __aicore__ void matmul_custom(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace,
                                                    GM_ADDR tilingGm)
{
    using A_T = half;
    using B_T = half;
    using C_T = float;
 
    AscendC::TPipe pipe;
    TCubeTiling tiling;
    CopyTiling(&tiling, tilingGm);
 
    AscendC::GlobalTensor<A_T> aGlobal;
    AscendC::GlobalTensor<B_T> bGlobal;
    AscendC::GlobalTensor<C_T> cGlobal;
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ A_T *>(a), tiling.M * tiling.Ka);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ B_T *>(b), tiling.Ka * tiling.N);
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ C_T *>(c), tiling.M * tiling.N);
 
    int offsetA = 0;
    int offsetB = 0;
    int offsetC = 0;
    bool isTransA = false;
    bool isTransB = false;
 
    int tailM = 0;
    int tailN = 0;
    CalcGMOffset(GetBlockIdx(), tiling, offsetA, offsetB, offsetC, tailM, tailN, isTransA, isTransB);
 
    auto gmA = aGlobal[offsetA];
    auto gmB = bGlobal[offsetB];
    auto gmC = cGlobal[offsetC];
 
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, A_T>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, B_T>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, C_T>> mm;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling);

    mm.SetTensorA(gmA, isTransA);
    mm.SetTensorB(gmB, isTransB);
    // mm.SetTail(tailM, tailN); Set the tail core. If the tail block is not updated, the single-core accuracy is normal, but the multi-core accuracy is affected.
    mm.IterateAll(gmC);
    mm.End();
}

              

            

          
         

Check whether the Matmul API is correctly used.
After the preceding steps are performed, you can determine whether the problem is caused by Matmul APIs. If the operator accuracy is affected due to incorrect usage of the Matmul APIs, check whether the APIs are correctly used based on the usage description and constraints of each Matmul API.
- Case 1: The API constraints are not complied with.
  When the IterateBatch API is called in the Matmul MDL template, the operator fails to be executed. This is because the API constraints are not met. The IterateBatch API supports only the Norm template.
  
  To solve this problem, read the constraints of each Matmul API carefully and check whether the APIs used for operator implementation meet the constraints.
- Case 2: The template constraints are not complied with.
  When the doMTE2Preload pre-loading template is enabled, if the K direction is not fully loaded and the template constraints are not met, the accuracy comparison fails.
  
  In addition to the constraints of the function APIs, the constraints of the template parameters must also be met. Check the use of the template parameters.
Check whether the golden script used for operator debugging is correct.
The golden script of operators is implemented based on the function logic of custom operators and is used to check whether the operator execution result is correct. Therefore, the logic of the golden script must be the same as the implementation logic of the operators. If the golden script is incorrectly implemented, the accuracy comparison fails. In this case, the golden data is unreliable.

Therefore, during operator accuracy demarcation and locating, you need to check the correctness of the golden script based on the logic of the custom operators, especially for operators with complex compute logic.

Parent topic: FAQ