Hardware Constraints

This section describes hardware constraints and the recommended solution for each. It applies to the following product models:

  • Atlas A3 training products/Atlas A3 inference products
  • Atlas A2 training products/Atlas A2 inference products

Table 1 Hardware constraints and recommended solutions

Category

Hardware Constraint

Recommended Solution

Memory access (L0 Buffer/L1 Buffer/UB)

The minimum access granularities/alignment requirements of each storage unit are as follows:

Unified Buffer: 32-byte aligned

L1 Buffer: 32-byte aligned

L0A Buffer/L0B Buffer: 512-byte aligned

L0C Buffer: 64-byte aligned

BiasTable Buffer: 64-byte aligned

FixPipe Buffer: 128-byte aligned

  • Account for these alignment constraints whenever you move data.
  • For the UB, some non-aligned scenarios can be handled with the non-aligned movement API or with workarounds such as moving in redundant data and discarding it during move-out. For details, see Non-Alignment Scenario.
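The "include redundant data during move-in" technique comes down to rounding the transfer length up to the alignment of the target buffer. The following is a minimal host-side sketch of that arithmetic, assuming the 32-byte UB alignment listed above. The names AlignUp and PaddingBytes are hypothetical helpers, not Ascend C APIs.

```cpp
#include <cassert>
#include <cstdint>

// UB alignment taken from the table row above (in bytes).
constexpr uint32_t kUbAlignBytes = 32;

// Hypothetical helper: round `bytes` up to the next multiple of `align`.
inline uint32_t AlignUp(uint32_t bytes, uint32_t align) {
    assert(align != 0);
    return (bytes + align - 1) / align * align;
}

// Hypothetical helper: number of redundant bytes that must be moved in
// (and later discarded) so that the transfer length is aligned.
inline uint32_t PaddingBytes(uint32_t bytes, uint32_t align) {
    return AlignUp(bytes, align) - bytes;
}
```

For example, a 100-byte payload would be moved in as 128 bytes, of which the trailing 28 bytes are redundant and are dropped during move-out.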

Memory access (UB)

Bank access conflicts can occur in the UB during vector computation access and data movement access.

Stagger the addresses in software, based on the chip requirements, to avoid bank conflicts. For details, see Avoiding Bank Conflicts in the Unified Buffer.

Memory access (GM)

When multiple cores concurrently access the GM at the same address, the access is serialized by hardware.

Serialized access to the same address adds queuing time, and performance decreases by about 10% to 20%.

Stagger multi-core access (by adjusting the data access sequence and modifying the tiling policy) so that the first load brings the data into the L2 cache and subsequent accesses benefit from it. For details, see Avoiding Same-Address Access.
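One way to adjust the data access sequence is to rotate the starting tile by the core index, so that at any given step the cores read different tiles. This is a host-side sketch of that idea only. StaggeredTileOrder is a hypothetical helper, and the rotation scheme is an illustrative assumption rather than the method described in Avoiding Same-Address Access.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: the order in which core `coreIdx` visits `numTiles`
// shared tiles. Core c starts at tile (c % numTiles) and wraps around, so
// cores with different (coreIdx % numTiles) never touch the same tile at
// the same step.
std::vector<uint32_t> StaggeredTileOrder(uint32_t coreIdx, uint32_t numTiles) {
    assert(numTiles != 0);
    std::vector<uint32_t> order(numTiles);
    for (uint32_t step = 0; step < numTiles; ++step) {
        order[step] = (coreIdx + step) % numTiles;
    }
    return order;
}
```

With four tiles, core 1 would visit tiles 1, 2, 3, 0. If there are more cores than tiles, some cores still share a starting tile, so the tiling policy may also need to change.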

Memory access (GM)

Optimal bandwidth is achieved when more than 16 KB of data is moved at a time.

Testing shows that bandwidth is best when each move transfers more than 16 KB, so move data in large blocks where possible. (The exact size varies with the chip.)

For details, see Transferring a Large Data Block at a Time.

Memory access (GM-->L1)

In DataCopy, the interval between adjacent data blocks of the source operand (from the tail of the previous data block to the head of the next data block) must not exceed 65535. The unit is a data block (32 bytes).

If the interval between the tail of the previous data block and the head of the next data block exceeds 65535, split the instruction into multiple instructions.
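The limit can be checked on the host before the copy is issued. The sketch below assumes a uniform gap between contiguous chunks and takes the simplest split, one instruction per chunk, when the gap is too large. GapFitsOneInstruction and InstructionsNeeded are hypothetical helpers, and other DataCopy parameter limits are not modeled.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kBlockBytes = 32;       // unit of the interval, from the row above
constexpr uint64_t kMaxGapBlocks = 65535;  // maximum interval, in 32-byte units

// Hypothetical helper: true if a gap of `gapBytes` between adjacent chunks
// can be expressed by a single strided copy instruction.
inline bool GapFitsOneInstruction(uint64_t gapBytes) {
    assert(gapBytes % kBlockBytes == 0);  // the gap is expressed in whole 32-byte units
    return gapBytes / kBlockBytes <= kMaxGapBlocks;
}

// Hypothetical helper: number of copy instructions needed to move
// `chunkCount` chunks separated by a uniform gap of `gapBytes`.
inline uint64_t InstructionsNeeded(uint64_t chunkCount, uint64_t gapBytes) {
    if (chunkCount == 0) return 0;
    if (chunkCount == 1 || GapFitsOneInstruction(gapBytes)) return 1;
    return chunkCount;  // gap too large: issue one instruction per chunk
}
```

A gap of 65535 × 32 bytes still fits one instruction, whereas a gap of 65536 × 32 bytes forces a split.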

Memory access (GM)

Data is moved with 128-byte, 256-byte, or 512-byte alignment. If the data is not aligned, the length is rounded up.

Ensure that the inner axis of the tiling is 128-byte, 256-byte, or 512-byte aligned.

For details, see Using 512-Byte Alignment for the GM Address.
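When choosing a tiling, it can help to compute how many elements the inner axis must hold for its byte length to reach the chosen alignment. This is a host-side sketch under the assumption that the alignment is a multiple of the element size. PaddedInnerElems is a hypothetical helper, not part of any tiling API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest inner-axis length (in elements) that is at
// least `innerElems` and whose byte length is a multiple of `alignBytes`.
inline uint64_t PaddedInnerElems(uint64_t innerElems, uint64_t elemBytes,
                                 uint64_t alignBytes) {
    assert(elemBytes != 0 && alignBytes != 0);
    assert(alignBytes % elemBytes == 0);  // assumption stated in the text above
    const uint64_t bytes = innerElems * elemBytes;
    const uint64_t paddedBytes = (bytes + alignBytes - 1) / alignBytes * alignBytes;
    return paddedBytes / elemBytes;
}
```

For 2-byte elements and 512-byte alignment, an inner axis of 100 elements would be padded to 256 elements.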

Cube

The depth of the MTE1 and MMAD instruction queues is 32.

These instruction queues fill up easily. A full queue blocks the issue of other instructions and stalls the pipeline.

Moving data from the L1 Buffer to the L0 Buffer with Load2D requires 32 instructions, whereas Load3D requires only one. Load3D is therefore recommended.

iCache

The iCache hardware specification is limited to 32 KB.

Split the Tiling_key or use template functions to reduce the size of the code segment. For details, see Tiling Template Programming.

iCache

When multiple cores concurrently access the iCache at the same address, the access is serialized by hardware.

In small-shape scenarios, start as few cores as possible to minimize concurrent access to the same address by multiple cores.

DCache

The DCache hardware specification is limited to 32 KB.

None

Scalar

When scalar data is written to the GM, the data is cached in the DCache. The hardware does not guarantee consistency between the DCache and the GM, so you must ensure it yourself.

Use DataCacheCleanAndInvalid to ensure consistency.

Cube

The L0C Buffer capacity is 128 KB.

None

Cube

The BiasTable Buffer is 1 KB.

None

Cube

In cube computation scenarios, the float computing power is 1/4 of the half computing power.

None

Cube

In the cube output channel quantization scenario, quantization from int32_t to bfloat16_t is not supported.

  • On the AIV, convert from int32_t to float and then from float to bfloat16_t.
  • The AIC supports channel quantization from float to bfloat16_t.

Vector

For the Reduce API, half data performs worse than float data.

When half data is written back to the UB, it is not 32-byte aligned, which degrades performance. In this scenario, convert half data to float before computation.

Vector

The Exp/Ln API takes the same time to process the same amount of half or float data.

The float data type has been optimized internally, so the two data types perform similarly. Choose the precision type that suits your needs.

Pipeline synchronization (intra-core)

If set/wait synchronization calls are not matched, the residual status persists and affects subsequent operators.

Use the twin debugging/mssanitizer tool to identify such issues in advance.

Pipeline synchronization (inter-core)

The CrossCoreSetFlag counter has a limit. If the count exceeds 15, reverse synchronization is required; otherwise, the system may hang.

Use the twin debugging/mssanitizer tool to detect in advance scenarios where the limit is exceeded.
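The counter limit can be reasoned about with a small host-side model. The sketch below assumes that each set increments a counter and that each matching wait decrements it. That is a simplification for illustration, not a description of the hardware. FlagCounterModel is a hypothetical class, and it does not call CrossCoreSetFlag.

```cpp
#include <cassert>
#include <cstdint>

// Limit taken from the row above.
constexpr uint32_t kMaxPendingSets = 15;

// Hypothetical host-side model of one inter-core flag counter.
class FlagCounterModel {
public:
    // True if another set can be issued without exceeding the limit.
    // When this returns false, a reverse synchronization would be needed first.
    bool CanSet() const { return pending_ < kMaxPendingSets; }

    // Model one set: the counter goes up by one.
    void Set() {
        assert(CanSet());
        ++pending_;
    }

    // Model one matching wait: the counter goes down by one (assumption).
    void Wait() {
        assert(pending_ > 0);
        --pending_;
    }

    uint32_t Pending() const { return pending_; }

private:
    uint32_t pending_ = 0;
};
```

In this model, a producer that calls Set() 15 times without any Wait() on the consumer side reaches the limit and must let the consumer catch up before it sets the flag again.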

General API restrictions

General restrictions on overlapping source and destination operand addresses when using Ascend C APIs.

When using the high-dimensional tensor sharding compute APIs (basic APIs), you can save memory by defining one tensor that is shared by the source and destination operands (address overlapping). The following restrictions apply:

  • In a single iteration, the source operand must completely overlap the destination operand. Partial overlapping is not supported.
  • Across iterations, the destination operand of an earlier iteration must not overlap the source operand of a later iteration. For example, if the destination operand of the Nth iteration is the source operand of the (N+1)th iteration, the Nth iteration may overwrite the value of the source operand and produce an unexpected result. As an exception, for some two-operand compute APIs (Add, Sub, Mul, Max, Min, AddRelu, and SubRelu) with data type half, int32_t, or float, the destination operand of an earlier iteration can overlap the source operand of a later iteration. This applies only when the destination operand overlaps the second source operand, and src1RepStride or dstRepStride must be 0.
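The single-iteration rule can be checked on the host before a kernel is launched. This is a minimal sketch, assuming that the source and destination operands cover the same number of bytes in one iteration. ClassifyOverlap and AllowedInSingleIteration are hypothetical helpers, and the cross-iteration rule is not modeled.

```cpp
#include <cassert>
#include <cstdint>

enum class Overlap { kNone, kFull, kPartial };

// Hypothetical helper: classify how two equally sized byte ranges
// [srcAddr, srcAddr + bytes) and [dstAddr, dstAddr + bytes) overlap.
inline Overlap ClassifyOverlap(uint64_t srcAddr, uint64_t dstAddr, uint64_t bytes) {
    if (bytes == 0) return Overlap::kNone;
    if (srcAddr == dstAddr) return Overlap::kFull;
    const uint64_t lo = srcAddr < dstAddr ? srcAddr : dstAddr;
    const uint64_t hi = srcAddr < dstAddr ? dstAddr : srcAddr;
    // The ranges intersect only if the higher one starts inside the lower one.
    return (hi - lo < bytes) ? Overlap::kPartial : Overlap::kNone;
}

// Within one iteration, complete overlap or no overlap is acceptable;
// partial overlap is not supported.
inline bool AllowedInSingleIteration(uint64_t srcAddr, uint64_t dstAddr, uint64_t bytes) {
    return ClassifyOverlap(srcAddr, dstAddr, bytes) != Overlap::kPartial;
}
```

For 256-byte operands, a destination at the same address as the source is a complete overlap, and a destination 128 bytes past the source is a partial overlap that the check rejects.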