Normal Accuracy

A correct computation result is the prerequisite for performance optimization. To evaluate correctness, compare the known correct output with the actual result. After each optimization iteration, check that the optimized operator still meets the accuracy evaluation standard.

The following describes the aspects that affect accuracy:

  • Synchronization insertion: The parallel computing architecture requires synchronization between pipelines, and synchronization must be inserted correctly according to the data dependencies.
  • Offset address computation: When computation is parallelized across AI Cores, each core must compute its memory offset address correctly to ensure a correct computation result.
  • Floating-point computation: Even when the accuracy is acceptable, the results may not be bitwise identical. Floating-point arithmetic does not obey the commutative and associative laws, and different hardware supports floating-point numbers differently, which can lead to different accuracy results.

To prevent unnecessary accuracy loss, you must also strictly follow one coding rule: do not modify kernel function parameters.

Correctly Inserting Synchronization

The AI Core contains multiple pipelines such as MTE1, MTE2, MTE3, Cube, Vector, and Scalar. By default, the auto sync compilation option is enabled in the Ascend C framework, so the compiler inserts the required synchronization automatically. The Ascend C programming model also lets you control pipeline synchronization yourself. For details about the pipeline types, synchronization types, compiler auto sync restrictions, and when to manually insert synchronization, see Synchronization Control.

The foregoing concerns intra-core synchronization. If an operator uses multi-core synchronization (see Multi-core Synchronization), the logical BlockDim must be less than or equal to the number of cores running the operator. Otherwise, the framework inserts incorrect synchronization and the kernel hangs.

[Negative Example]

// In code that uses multi-core synchronization, BlockDim is set greater than CoreNum.
// For example, this happens if the check is skipped in the tiling computation. CoreNum = MAX_AICORE_NUM.
FlashAttentionScoreApiTiling(tilingData);
FlashAttentionScoreGetTensorSize(tilingData);
context->SetBlockDim(MAX_AICORE_NUM + 1);

[Positive Example]

FlashAttentionScoreApiTiling(tilingData);
FlashAttentionScoreGetTensorSize(tilingData);
// If a multi-core synchronization instruction is used in the kernel, BlockDim set by the host must be less than or equal to CoreNum.
context->SetBlockDim(MAX_AICORE_NUM);
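As a minimal host-side sketch in plain C++ (MAX_AICORE_NUM is hard-coded here for illustration; real tiling code would query the platform information for the actual core count), the required check amounts to clamping the requested block dimension:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical platform limit; real host code would query the platform
// information for the actual core count instead of hard-coding it.
constexpr uint32_t MAX_AICORE_NUM = 8;

// Multi-core synchronization requires BlockDim <= CoreNum, so clamp the
// requested block dimension before calling SetBlockDim.
uint32_t ClampBlockDim(uint32_t requested) {
    return std::min(requested, MAX_AICORE_NUM);
}
```

Performing the clamp in the tiling function keeps the invariant in one place, so every SetBlockDim call site stays safe.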

Correctly Calculating the Offset Address

When multi-core computing is enabled for an operator, the computation amount of a single core must be determined during tiling, and the address offset in the kernel is computed based on it. For details about how to correctly calculate the offset address, see the corresponding documentation.

The following describes an example allocation scheme: the total data length TOTAL_LENGTH is 8 x 2048. The data is evenly distributed across eight cores, so each core processes BLOCK_LENGTH = 2048 elements. The memory offset of input x in global memory for a single core is x + BLOCK_LENGTH * GetBlockIdx(). After obtaining the offset address, call SetGlobalBuffer of the GlobalTensor class to set the start address and length of the core's global memory. For details, see Figure 1.

xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
Figure 1 Multi-core parallel processing
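The offset arithmetic above can be sketched in plain C++ (TOTAL_LENGTH, CORE_NUM, and BLOCK_LENGTH mirror the values in the text; BlockOffset is a hypothetical helper standing in for the kernel-side expression):

```cpp
#include <cstdint>

// Sizes taken from the text: 8 cores, 8 * 2048 elements in total.
constexpr uint32_t CORE_NUM = 8;
constexpr uint32_t TOTAL_LENGTH = 8 * 2048;
constexpr uint32_t BLOCK_LENGTH = TOTAL_LENGTH / CORE_NUM;  // 2048 per core

// Element offset of a core's slice in global memory; stands in for the
// kernel-side expression x + BLOCK_LENGTH * GetBlockIdx().
uint32_t BlockOffset(uint32_t blockIdx) {
    return BLOCK_LENGTH * blockIdx;
}
```

An off-by-one in this expression shifts every core's slice, which typically shows up as a result that is correct on core 0 but wrong on all others.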

Using Float Computation

  • Each floating-point arithmetic operation involves some rounding, so the execution order of operations matters. If A, B, and C are floating-point values, (A+B)+C is not necessarily equal to A+(B+C), unlike in exact mathematical computation. Parallel computing may change the order of operations, so parallel results may not match sequential results. The precision difference caused by reordering is inherent in floating-point computation.
  • The data types supported by instructions of some AI Processors are limited. If any data type fails to meet requirements, you are advised to convert it to a higher precision for computation and then convert the result back to the target precision, to prevent accuracy drop. For example, a vector computation API does not support bfloat16 computation. You need to use the Cast API to convert the data type to float, perform computation, and then convert the data type back to bfloat16 by using this API again.

    [Positive Example]

    // dst = src0 + src1, bfloat16 type. The tmp0/tmp1/tmp2 temporary buffers are allocated as float.
    ...
    Cast(tmp0Tensor, src0Tensor, RoundMode::CAST_NONE, computeSize);
    Cast(tmp1Tensor, src1Tensor, RoundMode::CAST_NONE, computeSize);
    Add(tmp2Tensor, tmp0Tensor, tmp1Tensor, computeSize);
    Cast(dstTensor, tmp2Tensor, RoundMode::CAST_FLOOR, computeSize);
    ...
  • Ascend AI Processors support binary floating-point representation in compliance with the IEEE 754 standard, with some exceptions. These exceptions may lead to results that differ from the IEEE 754 values computed on the host system. Take the composite instruction API Axpy as an example: each element in the source operand is multiplied by a scalar, and the product is added to the corresponding element in the destination operand. The composite instruction fuses the multiplication and addition into a single instruction, so its result may differ slightly from that of single instructions performing the two operations separately. Consider this precision difference when using such APIs.
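The non-associativity described in the first bullet can be demonstrated with ordinary doubles on the host (SumLeft and SumRight are illustrative helpers, not Ascend C APIs):

```cpp
// Floating-point addition is not associative: every operation rounds the
// intermediate result, so the grouping of operands can change the sum.
double SumLeft(double a, double b, double c)  { return (a + b) + c; }
double SumRight(double a, double b, double c) { return a + (b + c); }

// With a = 1e16, b = -1e16, c = 1.0:
//   SumLeft:  (a + b) + c = 0.0 + 1.0 = 1.0
//   SumRight: b + c rounds back to -1e16, because 1.0 is smaller than the
//   spacing between adjacent doubles near 1e16, so a + (b + c) = 0.0
```

The same mechanism explains why a multi-core reduction that accumulates partial sums in a different order than a single-core loop can produce a bitwise-different (yet equally valid) result.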

Forbidding Modifying Kernel Parameters

Kernel function parameters must not be reassigned or otherwise modified. In the following FlashAttentionKernel example, parameters such as query, key, and tilingData are pointers that must not be reassigned. For operator input parameters, the pointed-to content must not be modified either (for operator output parameters, the pointed-to content may be written). To keep static compilation valid, do not modify the tilingData pointer or the content it points to.

__aicore__ __global__ void FlashAttentionKernel(__gm__ uint8_t* query, __gm__ uint8_t* key, ..., __gm__ uint8_t* attention, ..., __gm__ uint8_t* tilingData) {
    ...
}

[Negative Example]

// Re-assign values to the kernel function parameters and modify TilingData: forbidden.
query = tmpQueryPtr;
key = tmpKeyPtr;
tilingData = tmpTilingDataPtr;
tilingData[0] = MAX_CORE_NUM;

[Positive Example]

// Perform only the read operation on the input parameter.
inputQueryGMTensor.SetGlobalBuffer(query);

// The attention pointer of the output parameter is read-only, but the memory it points to can be read and written.
outputAttentionGMTensor.SetGlobalBuffer(attention);
...
DataCopy(outputAttentionGMTensor, outputAttentionLocalTensor, 512);
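As an illustrative sketch of this read-only discipline in plain C++ (TilingData and LoadTiling are hypothetical names, not the structure generated by the tiling framework), copy the tiling content into a local variable and modify only the copy:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical tiling layout for illustration only; the real structure is
// generated by the tiling framework.
struct TilingData {
    uint32_t blockLength;
    uint32_t tileNum;
};

// Read the tiling content through the (read-only) kernel parameter into a
// local copy; any later modification happens on the copy, never on the
// parameter pointer or the memory it points to.
TilingData LoadTiling(const uint8_t* tilingPtr) {
    TilingData local;
    std::memcpy(&local, tilingPtr, sizeof(TilingData));
    return local;
}
```

Declaring the parameter as a pointer to const makes the compiler enforce the rule, so an accidental write through tilingPtr fails at build time instead of corrupting the kernel at run time.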