Normal Accuracy

A correct compute result for operator running is the basis of performance optimization. To evaluate the correctness of the compute result, you need to compare the known correct output with the actual result. After each optimization iteration, check whether the new performance optimization result meets the accuracy evaluation standard.

The following describes the factors that affect accuracy:

  • Synchronization insertion: The parallel computing architecture features synchronization, and correct synchronization insertion is required by data dependency.
  • Offset address compute: During AI Core parallelism compute, correct compute of the offset address in the memory is critical to ensure the correctness of the compute result.
  • Computation of floating-point numbers: Although the accuracy is normal, the values may not be totally consistent bitwise. The compute of floating-point numbers does not conform to the commutative and associative laws. In addition, different hardware supports floating-point numbers differently, which may lead to different accuracy results.

To prevent unnecessary inaccuracy, you must also know the rule to be strictly followed during encoding, that is, do not modify kernel parameters.

Correctly Inserting Synchronization

  • Intra-core synchronization

    The AI Core contains multiple pipelines such as MTE1, MTE2, MTE3, Cube, Vector, and Scalar. By default, the auto sync compilation option is enabled in the Ascend C framework. The compiler can insert synchronization normally. The Ascend C programming model also helps you control the synchronization of pipelines. For details about the pipeline types, synchronization types, compiler auto sync restrictions, and when to manually insert synchronization, see Synchronization Control.

  • Inter-core synchronization

    The foregoing is all about intra-core synchronization. If an operator uses multi-core synchronization (see Multi-core Synchronization), the number of logical cores (BlockDim) must be less than or equal to the number of cores for running the operator. Otherwise, an error will occur during synchronization insertion to the framework, causing the kernel to stop responding.

    [Negative Example]

    1
    2
    3
    4
    5
    6
    // In the code with multi-core synchronization, BlockDim is greater than CoreNum.
    // For example, this verification is not performed during tiling compute.
    FlashAttentionScoreApiTiling(tilingData);
    FlashAttentionScoreGetTensorSize(tilingData);
    CoreNum = ascendcPlatform.GetCoreNum();
    context->SetBlockDim(CoreNum + 1);
    

    [Positive Example]

    1
    2
    3
    4
    5
    FlashAttentionScoreApiTiling(tilingData);
    FlashAttentionScoreGetTensorSize(tilingData);
    // If a multi-core synchronization instruction is used in the kernel, BlockDim set by the host must be less than or equal to CoreNum.
    CoreNum = ascendcPlatform.GetCoreNum();
    context->SetBlockDim(CoreNum);
    

Correctly Calculating the Offset Address

When multi-core computing is enabled for an operator, the compute amount of a single core needs to be determined during tiling, based on which the address offset on the kernel is performed.

The following introduces the allocation scheme: The total data length TOTAL_LENGTH is 8 x 2048 elements. The data is evenly allocated to eight cores, and the size (BLOCK_LENGTH) of data processed by each core is 2048. x + BLOCK_LENGTH * GetBlockIdx() is the memory offset address of input x in the global memory of the single-core processing program. After the offset address is obtained, call SetGlobalBuffer of the GlobalTensor class to set the start address and length of the global memory of the core. For details, see Figure 1.

1
xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
Figure 1 Multi-core parallel processing

Computation of Floating-point Numbers

  • Each floating-point arithmetic operation involves a certain amount of rounding. Therefore, the execution order of arithmetic operations is important. If A, B, and C are floating-point values, (A+B)+C cannot be exactly equal to A+(B+C) as in mathematical compute. In parallel computing, you may change the order of operations, so parallel results may not match sequential results. The precision difference caused by this situation is inherent in floating-point value compute.
  • The data types supported by instructions of some AI Processors are limited. If any data type fails to meet requirements, you are advised to convert it to a higher precision for compute and then convert the result back to the target precision, to prevent accuracy drop. For example, a vector compute API does not support bfloat16 compute. You need to use the Cast API to convert the data type to float, perform compute, and then convert the data type back to bfloat16 by using this API again.

    [Positive Example]

    1
    2
    3
    4
    5
    6
    7
    // dst = src0 + src1. Specifically, src0, src1, and dst are of the bfloat16 type, while tmp0, tmp1, and tmp2 are of the float type.
    ...
    Cast(tmp0Tensor, src0Tensor, RoundMode::CAST_NONE, computeSize);
    Cast(tmp1Tensor, src1Tensor, RoundMode::CAST_NONE, computeSize);
    Add(tmp2Tensor, tmp0Tensor, tmp1Tensor, computeSize);
    Cast(dstTensor, tmp2Tensor, RoundMode::CAST_FLOOR, computeSize);
    ...
    
  • Ascend AI Processors support binary floating-point representation in compliance with the IEEE 754 standard, with some exceptions. These exceptions may lead to different results from the IEEE 754 values computed on the host system. Use the composite instruction API Axpy as an example. After each element in the source operand is multiplied by a scalar, the product is added to the corresponding element in the destination operand. The composite instruction combines the multiplication and addition operations into a single instruction for execution. This compute result may be slightly different from the result obtained by single-instructions that perform the two operations separately. You need to consider this precise difference when using such APIs.

Forbidding Modifying Kernel Parameters

This means it is not allowed to assign values to or modify function parameters. In the following example of FlashAttentionKernel function, its parameters such as query, key, and tilingData are pointers and cannot be modified. The pointer content cannot be modified for operator input parameters (however, it can be modified for operator output parameters). To implement static compilation, do not modify the tilingData pointer itself or the content it points to.

1
2
3
__aicore__ __global__ void FlashAttentionKernel(__gm__ uint8_t* query, __gm__ uint8_t* key, ..., __gm__ uint8_t* attention,..., __gm__ uint8_t* tilingData) {
    ......
}

[Negative Example]

1
2
3
4
5
// Re-assigning a value to the kernel function parameter and modifying TilingData are not allowed. The following is a negative example:
query = tmpQueryPtr;
key = tmpKeyPtr;
tilingData = tmpTilingDataPtr; 
tilingData[0] = 2;

[Positive Example]

1
2
3
4
5
6
7
// Perform only the read operation on the input parameter.
inputQueryGMTensor.SetGlobalBuffer(query);

//The attention pointer of the output parameter is read-only, but the memory to which it points can be read and written.
outputAttentionGMTensor.SetGlobalBuffer(attention);
...
DataCopy(outputAttentionGMTensor, outputAttentionLocalTensor, count);