Hardware Constraints
This section describes the hardware constraints and recommended solutions. The corresponding product models are as follows:
Atlas A3 training products/Atlas A3 inference products
Atlas A2 training products/Atlas A2 inference products
Category |
Hardware Constraint |
Recommended Solution |
|---|---|---|
Memory access (L0 Buffer/L1 Buffer/UB) |
The minimum access granularities/alignment requirements of each storage unit are as follows: Unified Buffer: 32-byte aligned; L1 Buffer: 32-byte aligned; L0A Buffer/L0B Buffer: 512-byte aligned; L0C Buffer: 64-byte aligned; BiasTable Buffer: 64-byte aligned; FixPipe Buffer: 128-byte aligned |
|
Memory access (UB) |
UB bank access conflicts (vector computation access/movement access). |
Stagger the addresses in the software implementation, based on the chip requirements, to resolve bank conflicts. For details, see Avoiding Bank Conflicts in the Unified Buffer. |
Memory access (GM) |
When multiple cores concurrently access the GM at the same address, the access is serialized by hardware. |
Same-address access is serialized by hardware, and the resulting queuing time degrades performance by about 10% to 20%. Stagger multi-core access (by adjusting the data access sequence and modifying the tiling policy) so that cores do not access the same address at the same time; once the first load has brought the data into the L2 cache, subsequent accesses are faster. For details, see Avoiding Same-Address Access. |
Memory access (GM) |
Optimal bandwidth is achieved when a single transfer moves more than 16 KB of data. |
Testing shows that bandwidth is best when each transfer moves more than 16 KB, so move a large data block in each transfer. (The exact size varies with the chip.) For details, see Transferring a Large Data Block at a Time. |
Memory access (GM-->L1) |
In DataCopy, the interval between adjacent data blocks of the source operand (from the tail of the previous data block to the head of the next data block) must not exceed 65535. The unit is a data block (32 bytes). |
If the interval between the tail of the previous data block and the head of the next data block exceeds 65535, split the instruction into multiple DataCopy instructions. |
Memory access (GM) |
Data movement is 128-byte, 256-byte, or 512-byte aligned. If the data is not aligned, the length is rounded up. |
Ensure that the inner axis of the tiling is 128-byte, 256-byte, or 512-byte aligned. For details, see Using 512-Byte Alignment for the GM Address. |
Cube |
The depth of the MTE1 and MMAD instruction queues is 32. |
These instruction queues fill up easily, which blocks the delivery of other instructions and stalls the pipeline. Moving data from the L1 Buffer to the L0 Buffer with Load2D requires issuing 32 instructions, whereas Load3D needs only one instruction for the same data movement. Load3D is therefore recommended. |
iCache |
The iCache hardware specification is limited to 32 KB. |
Split the Tiling_key or use template functions to reduce code segments. For details, see Tiling Template Programming. |
iCache |
When multiple cores concurrently access the iCache at the same address, the access is serialized by hardware. |
In small-shape scenarios, start as few cores as possible to minimize concurrent access to the same address by multiple cores. |
DCache |
The DCache hardware specification is limited to 32 KB. |
None |
Scalar |
When scalar data is written to the GM, the data is cached in the DCache. The hardware does not guarantee consistency between the DCache and the GM, so the developer must ensure it. |
Use DataCacheCleanAndInvalid to ensure consistency. |
Cube |
The L0C Buffer capacity is 128 KB. |
None |
Cube |
The BiasTable Buffer is 1 KB. |
None |
Cube |
In cube computation scenarios, the float computing power is 1/4 of the half computing power. |
None |
Cube |
In the cube output channel quantization scenario, quantization from int32_t to bfloat16_t is not supported. |
|
Vector |
For the Reduce API, the performance of half data is poorer than that of float data. |
When half data is written back to the UB, it is not 32-byte aligned, which degrades performance. In this scenario, convert half data to float data for computation. |
Vector |
The time taken by the Exp/Ln API to process the same amount of half/float data is the same. |
The float data type has been optimized internally, so the performance of the two data types is similar. You can select a proper precision type as needed. |
Pipeline synchronization (intra-core) |
If set and wait synchronization calls are not matched, the leftover state persists and affects subsequent operators. |
Use the twin debugging/mssanitizer tool to identify such issues in advance. |
Pipeline synchronization (inter-core) |
The CrossCoreSetFlag counter has a limit. If the count exceeds 15, reverse synchronization is required. Otherwise, the system may hang. |
Use the twin debugging/mssanitizer tool to report errors in advance in scenarios where the limit is exceeded. |
General API restrictions |
General restrictions on the overlapping of the source and destination operand addresses when using Ascend C APIs |
To save memory space when using high-dimensional tensor sharding compute APIs of basic APIs, you can define a tensor shared by the source and destination operands (by address overlapping). Pay attention to the following restrictions when using this:
|
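
The alignment and DataCopy interval constraints in the table above can be checked before a kernel is launched. The following is a minimal sketch in plain host-side C++, not Ascend C API code. The helper names (`AlignUp`, `IsAligned`, `CopyInstructionCount`) and the constants are hypothetical and introduced only for illustration; the numeric values are taken from the table. The splitting rule (one copy per data block when the interval is too large) is an assumption about one possible way to split the instruction.

```cpp
#include <cassert>
#include <cstdint>

// Alignment requirements in bytes, copied from the table above.
constexpr uint64_t kUbAlign = 32;     // Unified Buffer
constexpr uint64_t kL0abAlign = 512;  // L0A Buffer/L0B Buffer
constexpr uint64_t kL0cAlign = 64;    // L0C Buffer

// Largest allowed interval between adjacent source data blocks in one
// DataCopy instruction, in units of 32-byte data blocks.
constexpr uint64_t kMaxSrcGapBlocks = 65535;

// Round `bytes` up to the next multiple of `align`.
// `align` must be a power of two (true for every value in the table).
constexpr uint64_t AlignUp(uint64_t bytes, uint64_t align) {
    return (bytes + align - 1) & ~(align - 1);
}

// True if an address or length is a multiple of `align` (a power of two).
constexpr bool IsAligned(uint64_t value, uint64_t align) {
    return (value & (align - 1)) == 0;
}

// Number of copy instructions needed for `blockCount` data blocks separated
// by `gapBlocks` (tail of one block to head of the next, in 32-byte units).
// Assumption: when the interval exceeds the limit, each data block is moved
// by its own instruction.
inline uint32_t CopyInstructionCount(uint32_t blockCount, uint64_t gapBlocks) {
    assert(blockCount > 0);
    return gapBlocks <= kMaxSrcGapBlocks ? 1u : blockCount;
}
```

For example, a 100-byte tile placed in the Unified Buffer would be padded to `AlignUp(100, kUbAlign)` bytes, and a strided copy whose interval is 65536 data blocks would be issued as one instruction per data block under the assumption stated above.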