Static Tensor Programming
In pipe-based operator development, the pipe (TPipe class) manages resources such as device memory in a unified manner. Developers do not need to handle memory management, the DoubleBuffer pipeline, or synchronization; they only need to write operators according to the compute flow. However, this convenience incurs some runtime overhead, such as TPipe creation and InitBuffer.
For this reason, Ascend C provides the static tensor programming mode. Compared with the pipe-based programming mode, this mode skips the initialization of TPipe memory management (on the order of hundreds of nanoseconds), which reduces runtime overhead and helps developers achieve ultimate performance. It is also more flexible: developers directly construct a LocalTensor with a specified address and storage location and pass it to APIs for computation and data movement. The trade-off is higher development complexity, because developers must manage DoubleBuffer pipelining and synchronization themselves. In addition, only the basic APIs of Ascend C can be used, not the full API set.
The comparison between the two programming modes is as follows:

- For details about the restrictions on static tensor programming, see Constraints.
- For details about the complete examples involved in this section, see static tensor programming sample.
Programming Paradigm
- AI Core consists of multiple memory units, such as the Unified Buffer for vector computation and the L1 Buffer, L0A Buffer, L0B Buffer, and L0C Buffer for cube computation. Developers can manage all memory resources on the AI Core. When creating a tensor and assigning it an address, developers must keep track of memory sizes and memory reuse relationships, and ensure that the assigned address is valid.
- AI Core supports multiple instruction pipeline types, such as the vector, cube, and scalar computation pipelines, as well as the MTE1, MTE2, and MTE3 movement pipelines. All pipelines are executed in parallel, and their dependencies are coordinated through synchronization events. Developers call the movement or computation APIs provided by Ascend C to write operators and insert corresponding synchronization events based on data dependencies to achieve optimal performance.
The following figure shows a typical vector operator. Developers first tile the data according to the size of the computation workload, and then insert synchronization events based on the data dependencies within the core.
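The tiling step mentioned above can be sketched in host-side C++. This is only an illustration under stated assumptions: `TilingPlan`, `MakeTilingPlan`, and `TileOffset` are hypothetical helpers rather than Ascend C APIs, the sizes are example values, and the total length is assumed to be a nonzero multiple of the core count.

```cpp
#include <cstdint>

// Hypothetical host-side helper (not an Ascend C API): splits a 1-D workload
// across cores, then splits each core's share into fixed-size tiles.
struct TilingPlan {
    uint32_t blockLength;  // elements handled by one core
    uint32_t loopCount;    // tiles per core, including a partial last tile
    uint32_t tailLength;   // elements in the last tile
};

constexpr TilingPlan MakeTilingPlan(uint32_t totalLength, uint32_t blockNum, uint32_t tileLength)
{
    // Assumption: totalLength is a nonzero multiple of blockNum,
    // so every core receives the same share.
    const uint32_t blockLength = totalLength / blockNum;
    const uint32_t loopCount = (blockLength + tileLength - 1) / tileLength;  // ceiling division
    const uint32_t tailLength = blockLength - (loopCount - 1) * tileLength;
    return TilingPlan{blockLength, loopCount, tailLength};
}

// Element offset in global memory of tile `tileIdx` on core `blockIdx`.
constexpr uint32_t TileOffset(const TilingPlan& plan, uint32_t blockIdx, uint32_t tileIdx,
                              uint32_t tileLength)
{
    return blockIdx * plan.blockLength + tileIdx * tileLength;
}

// Example values: 16384 elements, 8 cores, 512-element tiles.
constexpr TilingPlan kPlan = MakeTilingPlan(16384, 8, 512);
static_assert(kPlan.blockLength == 2048, "each core handles 2048 elements");
static_assert(kPlan.loopCount == 4, "four tiles per core");
static_assert(kPlan.tailLength == 512, "the last tile is a full tile");
```

In this sketch, `loopCount` corresponds to the loop bound used by the kernel loops later in this section, and `TileOffset` corresponds to the `i * TILE_LENGTH` indexing into global memory once the per-core base offset is applied.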

Memory Management
In static tensor programming mode, developers can create tensors in either of the following ways:
- Allocate tensors by specifying the hardware location through LocalMemAllocator.
LocalMemAllocator is a simple linear memory allocator. Developers call the Alloc method to allocate memory; addresses start from 0 and advance linearly in call order. It does not provide memory release or other memory management capabilities. LocalMemAllocator can simplify operator writing when bank conflicts are not a concern or during initial functional development. During later performance optimization, developers can switch to specifying addresses directly through the LocalTensor constructor.
- Create tensors using the LocalTensor constructor. This method is recommended for scenarios requiring ultimate performance.
Developers can use the LocalTensor constructor to specify memory addresses directly, which gives them full control over memory (in effect, nothing is allocated or freed). When using this method, specify addresses as required without exceeding the physical memory limit, and ensure that any memory reuse preserves functional correctness. This method is recommended when ultimate performance depends on avoiding bank conflicts or reusing memory.
```cpp
// Method 1: Use LocalMemAllocator to allocate memory.
AscendC::LocalMemAllocator<AscendC::Hardware::UB> ubAllocator;
AscendC::LocalTensor<float> xLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();
AscendC::LocalTensor<float> yLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();
AscendC::LocalTensor<float> zLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();

// Method 2: Use the LocalTensor constructor to construct a tensor.
AscendC::LocalTensor<float> xLocalPing(AscendC::TPosition::VECCALC, xAddrPing, TILE_LENGTH);
AscendC::LocalTensor<float> yLocalPing(AscendC::TPosition::VECCALC, yAddrPing, TILE_LENGTH);
AscendC::LocalTensor<float> zLocalPing(AscendC::TPosition::VECCALC, zAddrPing, TILE_LENGTH);
```
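To make the allocation order concrete, the following host-side sketch models a linear allocator with the properties described above: addresses start at 0, grow in call order, and are never released. `LinearAllocatorModel` and `NthAllocAddress` are hypothetical stand-ins, not the Ascend C `LocalMemAllocator`; the 32-byte alignment and the capacity values are assumptions made for the example.

```cpp
#include <cstdint>

constexpr uint32_t kInvalidAddr = 0xFFFFFFFFu;
constexpr uint32_t kAssumedAlignBytes = 32;  // assumption for this sketch

constexpr uint32_t AlignUp(uint32_t bytes)
{
    return (bytes + kAssumedAlignBytes - 1) / kAssumedAlignBytes * kAssumedAlignBytes;
}

// Hypothetical host-side model of a linear (bump) allocator.
class LinearAllocatorModel {
public:
    constexpr explicit LinearAllocatorModel(uint32_t capacityBytes)
        : capacity_(capacityBytes), next_(0) {}

    // Returns the byte address of the new region, or kInvalidAddr if it does
    // not fit. There is deliberately no Free(): addresses only move forward.
    constexpr uint32_t Alloc(uint32_t elemSize, uint32_t count)
    {
        const uint32_t bytes = AlignUp(elemSize * count);
        if (bytes > capacity_ - next_) {
            return kInvalidAddr;
        }
        const uint32_t addr = next_;
        next_ += bytes;
        return addr;
    }

private:
    uint32_t capacity_;
    uint32_t next_;
};

// Address returned by the (n+1)-th allocation of `count` elements of
// `elemSize` bytes from a fresh allocator.
constexpr uint32_t NthAllocAddress(uint32_t capacityBytes, uint32_t elemSize, uint32_t count,
                                   uint32_t n)
{
    LinearAllocatorModel alloc(capacityBytes);
    uint32_t addr = kInvalidAddr;
    for (uint32_t i = 0; i <= n; ++i) {
        addr = alloc.Alloc(elemSize, count);
    }
    return addr;
}

// Three float tiles of 512 elements land at 0, 2048, and 4096 bytes.
static_assert(NthAllocAddress(8192, sizeof(float), 512, 0) == 0, "first tile");
static_assert(NthAllocAddress(8192, sizeof(float), 512, 1) == 2048, "second tile");
static_assert(NthAllocAddress(8192, sizeof(float), 512, 2) == 4096, "third tile");
```

Because the model never releases memory, the only way to reuse a region is to pick addresses by hand, which is what constructing a LocalTensor with an explicit address provides.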
Synchronization Management
As described in the hardware architecture above, asynchronous parallel execution inside the AI Core involves multiple pipelines (including vector computation, cube computation, data copy-in, and data copy-out). When a data dependency exists between these pipelines, the corresponding synchronization events must be inserted. In static tensor programming mode, developers insert synchronization manually using SetFlag/WaitFlag(ISASI) and PipeBarrier(ISASI), and manage the event types and event IDs themselves. Note that event IDs 6 and 7 cannot be used, because they may conflict with internal event IDs and lead to undefined behavior. In addition, because the low-level synchronization APIs SetFlag, WaitFlag, and PipeBarrier are ISASI APIs tied to the hardware architecture, compatibility across hardware versions cannot be guaranteed.
Based on the data dependency and the instruction execution order, synchronization dependencies fall into two types: forward synchronization (a dependency within one loop iteration) and backward synchronization (a dependency between loop iterations).
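The event-ID restriction can be enforced at compile time so that a reserved ID never reaches SetFlag or WaitFlag by accident. The sketch below is a plain C++ illustration, not an Ascend C facility: `IsUserEventId` and `UserEventId` are hypothetical names, and the code assumes event IDs are small integers numbered from 0, with 6 and 7 reserved as stated above.

```cpp
#include <cstdint>

// Assumption for this sketch: IDs 0..5 are available to developers,
// while 6 and 7 are reserved for internal use.
constexpr int32_t kFirstReservedEventId = 6;

constexpr bool IsUserEventId(int32_t id)
{
    return id >= 0 && id < kFirstReservedEventId;
}

// Wrapper that refuses to compile when instantiated with a reserved ID.
template <int32_t Id>
struct UserEventId {
    static_assert(IsUserEventId(Id), "event IDs 6 and 7 are reserved");
    static constexpr int32_t value = Id;
};

// UserEventId<0>::value and UserEventId<5>::value compile;
// UserEventId<6>::value triggers the static_assert above.
static_assert(UserEventId<0>::value == 0, "ID 0 is usable");
static_assert(UserEventId<5>::value == 5, "ID 5 is usable");
```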
- Forward synchronization
Between the current data copy-in and computation, insert the MTE2_V synchronization event (the vector computation pipeline waits for the MTE2 movement pipeline) to ensure that computation starts only after the data has been copied in. Between the current computation and copy-out, insert the V_MTE3 synchronization event (the MTE3 movement pipeline waits for the vector computation pipeline) to ensure that data is copied out only after computation is complete.
- Backward synchronization
Between the previous computation and the current data copy-in, insert the V_MTE2 synchronization event (the MTE2 movement pipeline waits for the vector computation pipeline) to ensure that the current data is copied in only after the previous computation is complete. This prevents the incoming data from overwriting input data that the previous iteration has not finished computing. Between the previous data copy-out and the current computation, insert the MTE3_V synchronization event (the vector computation pipeline waits for the MTE3 movement pipeline) to ensure that the current data is computed only after the previous result has been copied out. This prevents the new result from overwriting data that the previous iteration has not finished copying out.
```cpp
AscendC::LocalTensor<float> xLocal(AscendC::TPosition::VECCALC, xAddr, TILE_LENGTH);
AscendC::LocalTensor<float> yLocal(AscendC::TPosition::VECCALC, yAddr, TILE_LENGTH);
AscendC::LocalTensor<float> zLocal(AscendC::TPosition::VECCALC, zAddr, TILE_LENGTH);
for (int i = 0; i < loopCount; i++) {
    // dependency of PIPE_V & PIPE_MTE2 caused by xLocal/yLocal between 2 sequential loops
    if (i != 0) {
        AscendC::WaitFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID0);
    }
    AscendC::DataCopy(xLocal, xGm[i * TILE_LENGTH], TILE_LENGTH);
    AscendC::DataCopy(yLocal, yGm[i * TILE_LENGTH], TILE_LENGTH);
    // dependency of PIPE_MTE2 & PIPE_V caused by xLocal/yLocal in one single loop
    AscendC::SetFlag<AscendC::HardEvent::MTE2_V>(EVENT_ID0);
    AscendC::WaitFlag<AscendC::HardEvent::MTE2_V>(EVENT_ID0);
    if (i != 0) {
        // dependency of PIPE_MTE3 & PIPE_V caused by zLocal between 2 sequential loops
        AscendC::WaitFlag<AscendC::HardEvent::MTE3_V>(EVENT_ID0);
    }
    AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
    if (i != (loopCount - 1)) {
        // dependency of PIPE_V & PIPE_MTE2 caused by xLocal/yLocal between 2 sequential loops
        AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID0);
    }
    // dependency of PIPE_V & PIPE_MTE3 caused by zLocal in one single loop
    AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(EVENT_ID0);
    AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(EVENT_ID0);
    AscendC::DataCopy(zGm[i * TILE_LENGTH], zLocal, TILE_LENGTH);
    if (i != (loopCount - 1)) {
        // dependency of PIPE_MTE3 & PIPE_V caused by zLocal between 2 sequential loops
        AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(EVENT_ID0);
    }
}
```
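One way to sanity-check a hand-written synchronization scheme before running it on hardware is to replay the SetFlag/WaitFlag order on the host. The sketch below is a simplified model, not an Ascend C facility: `FlagTrace` and `SingleBufferLoopIsBalanced` are hypothetical names, only one event ID per event type is modeled, and the check covers program order only (every wait has an earlier unmatched set, and no set is left pending at the end).

```cpp
// Event types used by the single-buffer loop above (host-side model only).
enum Event : int { V_MTE2 = 0, MTE2_V, MTE3_V, V_MTE3, kEventCount };

struct FlagTrace {
    int pending[kEventCount] = {};  // outstanding sets per event type
    bool ok = true;                 // false once a wait had nothing to consume

    constexpr void Set(Event e) { ++pending[e]; }

    constexpr void Wait(Event e)
    {
        if (pending[e] == 0) {
            ok = false;  // on hardware this wait would never be released
        } else {
            --pending[e];
        }
    }

    constexpr bool Balanced() const
    {
        for (int e = 0; e < kEventCount; ++e) {
            if (pending[e] != 0) {
                return false;  // a set without a matching wait
            }
        }
        return ok;
    }
};

// Replays the flag sequence of the loop above. With `guarded == false` the
// `i != 0` and `i != loopCount - 1` conditions are dropped.
constexpr bool SingleBufferLoopIsBalanced(int loopCount, bool guarded)
{
    FlagTrace t{};
    for (int i = 0; i < loopCount; ++i) {
        const bool first = (i == 0);
        const bool last = (i == loopCount - 1);
        if (!guarded || !first) { t.Wait(V_MTE2); }  // backward: V -> MTE2
        t.Set(MTE2_V);                               // forward: copy-in -> compute
        t.Wait(MTE2_V);
        if (!guarded || !first) { t.Wait(MTE3_V); }  // backward: MTE3 -> V
        if (!guarded || !last) { t.Set(V_MTE2); }
        t.Set(V_MTE3);                               // forward: compute -> copy-out
        t.Wait(V_MTE3);
        if (!guarded || !last) { t.Set(MTE3_V); }
    }
    return t.Balanced();
}

static_assert(SingleBufferLoopIsBalanced(4, true), "guarded loop is balanced");
static_assert(!SingleBufferLoopIsBalanced(4, false), "unguarded loop is not");
```

In this model, removing the guards makes the first iteration wait on a flag that was never set and leaves the last iteration's sets unconsumed, which is why the example skips the backward waits when `i == 0` and the backward sets when `i == loopCount - 1`.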
Pipeline Optimization
In the TPipe-based programming paradigm, developers only need to set the number of buffers to 2 during InitBuffer to automatically enable DoubleBuffer. However, in static tensor programming mode, developers need to manually enable DoubleBuffer. The following is an example. For details about the complete example, see the DoubleBuffer example in static tensor programming sample.
```cpp
// ping
AscendC::LocalTensor<float> xLocalPing(AscendC::TPosition::VECCALC, xAddrPing, TILE_LENGTH);
AscendC::LocalTensor<float> yLocalPing(AscendC::TPosition::VECCALC, yAddrPing, TILE_LENGTH);
AscendC::LocalTensor<float> zLocalPing(AscendC::TPosition::VECCALC, zAddrPing, TILE_LENGTH);
// pong
AscendC::LocalTensor<float> xLocalPong(AscendC::TPosition::VECCALC, xAddrPong, TILE_LENGTH);
AscendC::LocalTensor<float> yLocalPong(AscendC::TPosition::VECCALC, yAddrPong, TILE_LENGTH);
AscendC::LocalTensor<float> zLocalPong(AscendC::TPosition::VECCALC, zAddrPong, TILE_LENGTH);

// double buffer
AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID0);
AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID1);
for (int i = 0; i < loopCount; i++) {
    int32_t eventID = (i % 2 == 0 ? EVENT_ID0 : EVENT_ID1);
    AscendC::LocalTensor<float> &xLocal = (i % 2 == 0 ? xLocalPing : xLocalPong);
    AscendC::LocalTensor<float> &yLocal = (i % 2 == 0 ? yLocalPing : yLocalPong);
    AscendC::LocalTensor<float> &zLocal = (i % 2 == 0 ? zLocalPing : zLocalPong);
    // dependency of PIPE_MTE3 & PIPE_MTE2 caused by xLocal/yLocal between 2 sequential loops
    AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(eventID);
    AscendC::DataCopy(xLocal, xGm[i * TILE_LENGTH], TILE_LENGTH);
    AscendC::DataCopy(yLocal, yGm[i * TILE_LENGTH], TILE_LENGTH);
    // dependency of PIPE_MTE2 & PIPE_V caused by xLocal/yLocal in one single loop
    AscendC::SetFlag<AscendC::HardEvent::MTE2_V>(eventID);
    AscendC::WaitFlag<AscendC::HardEvent::MTE2_V>(eventID);
    AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
    // dependency of PIPE_V & PIPE_MTE3 caused by zLocal in one single loop
    AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(eventID);
    AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(eventID);
    AscendC::DataCopy(zGm[i * TILE_LENGTH], zLocal, TILE_LENGTH);
    // dependency of PIPE_MTE3 & PIPE_MTE2 caused by zLocal between 2 sequential loops
    AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(eventID);
}
AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID0);
AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID1);
```
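The example above uses `xAddrPing`, `xAddrPong`, and related addresses without showing how they are chosen. One possible layout, sketched here as plain C++ constants, places the six tiles back to back and checks the plan at compile time. The tile length, the 32-byte alignment, and the Unified Buffer capacity (`kAssumedUbBytes`) are assumptions for illustration only; the real capacity depends on the hardware, and this simple layout makes no attempt to avoid bank conflicts.

```cpp
#include <cstdint>

constexpr uint32_t TILE_LENGTH = 512;                         // elements per tile (example value)
constexpr uint32_t TILE_BYTES = TILE_LENGTH * sizeof(float);  // 2048 bytes
constexpr uint32_t kAssumedUbBytes = 192 * 1024;              // assumption: check the target SoC

// Six regions laid out back to back: x/y/z for ping, then x/y/z for pong.
constexpr uint32_t xAddrPing = 0 * TILE_BYTES;
constexpr uint32_t yAddrPing = 1 * TILE_BYTES;
constexpr uint32_t zAddrPing = 2 * TILE_BYTES;
constexpr uint32_t xAddrPong = 3 * TILE_BYTES;
constexpr uint32_t yAddrPong = 4 * TILE_BYTES;
constexpr uint32_t zAddrPong = 5 * TILE_BYTES;

// True if two equally sized regions starting at `a` and `b` share any byte.
constexpr bool Overlaps(uint32_t a, uint32_t b, uint32_t bytes)
{
    return a < b + bytes && b < a + bytes;
}

static_assert(TILE_BYTES % 32 == 0, "tiles are assumed to be 32-byte aligned");
static_assert(zAddrPong + TILE_BYTES <= kAssumedUbBytes, "layout must fit in the buffer");
static_assert(!Overlaps(xAddrPing, xAddrPong, TILE_BYTES), "ping and pong must not alias");
static_assert(!Overlaps(zAddrPing, xAddrPong, TILE_BYTES), "adjacent regions must not alias");
```

Keeping the checks next to the constants means that a change to `TILE_LENGTH` that would overflow the assumed capacity fails at compile time rather than corrupting data at run time.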
The following figure shows the pipeline when DoubleBuffer is disabled and enabled. In most cases, the DoubleBuffer mechanism can effectively improve the utilization ratio of the Vector Unit and reduce the operator execution time. For details, see DoubleBuffer.
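The effect of DoubleBuffer can be estimated with a small timing model. The sketch below is a rough host-side approximation, not a measurement: `Makespan` is a hypothetical function, the stage times are arbitrary units, each pipeline (copy-in, compute, copy-out) is assumed to process one tile at a time, and a buffer set is assumed to become free only after its copy-out finishes, as in the MTE3_MTE2 scheme of the example above.

```cpp
#include <cstdint>

constexpr int kMaxTiles = 64;  // upper bound on tiles handled by this sketch

constexpr int64_t Max(int64_t a, int64_t b) { return a > b ? a : b; }

// Time at which the last tile finishes copy-out, given `buffers` buffer sets.
constexpr int64_t Makespan(int tiles, int buffers, int64_t tIn, int64_t tCompute, int64_t tOut)
{
    int64_t outEnd[kMaxTiles] = {};  // copy-out finish time of each tile
    int64_t inEnd = 0;               // when the copy-in pipeline becomes idle
    int64_t vEnd = 0;                // when the vector pipeline becomes idle
    int64_t lastOut = 0;             // when the copy-out pipeline becomes idle
    for (int i = 0; i < tiles && i < kMaxTiles; ++i) {
        // Tile i reuses the buffer set of tile i - buffers.
        const int64_t bufferFree = (i >= buffers) ? outEnd[i - buffers] : 0;
        inEnd = Max(inEnd, bufferFree) + tIn;  // copy-in waits for a free buffer
        vEnd = Max(inEnd, vEnd) + tCompute;    // compute waits for copy-in
        lastOut = Max(vEnd, lastOut) + tOut;   // copy-out waits for compute
        outEnd[i] = lastOut;
    }
    return lastOut;
}

// Four tiles with copy-in = 2, compute = 3, copy-out = 2 time units.
static_assert(Makespan(4, 1, 2, 3, 2) == 28, "one buffer set: fully serial");
static_assert(Makespan(4, 2, 2, 3, 2) == 17, "two buffer sets: stages overlap");
```

Under these assumptions, the second buffer set lets the copy-in of one tile overlap with the compute and copy-out of the previous tile, reducing the modeled time from 28 to 17 units. Actual gains depend on the real stage times and on which pipeline is the bottleneck.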


Constraints
Comply with the following constraints when using the static tensor programming mode:
- Developers cannot use framework APIs such as TPipe, TQue, TQueBind, and TBufPool. Using these APIs together with the static tensor programming mode may result in undefined behavior.
- Only some APIs can be used. For details about the supported APIs, see Supported APIs. APIs that are not in the list internally depend on TPipe to allocate event IDs, which may conflict with the event IDs defined by developers.
- Developers need to manually insert synchronization events using SetFlag/WaitFlag(ISASI) and PipeBarrier(ISASI). The event types and event IDs are managed by developers. However, note that event IDs 6 and 7 cannot be used (as they may conflict with internal event IDs, leading to undefined behavior).
- Because the low-level synchronization APIs SetFlag, WaitFlag, and PipeBarrier are ISASI APIs tied to the hardware architecture, operator compatibility across hardware versions cannot be guaranteed.
- At the kernel entry, developers need to manually call the InitSocState API to initialize the global status register. The global status register is in an indeterminate state at that point, so if this API is not called, undefined behavior may occur during operator execution. In TPipe framework programming, this initialization is performed by TPipe, and developers do not need to handle it.
Supported APIs

| API Category | API |
|---|---|
| Basic API > scalar computation | ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue |
| Basic API > vector computation > Basic arithmetic | Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, VectorPadding, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu |
| Basic API > vector computation > logical computation | Not, And, Or |
| Basic API > vector computation > compound computation | Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu |
| Basic API > vector computation > comparison and selection | Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask |
| Basic API > vector computation > type conversion | Cast |
| Basic API > vector computation > reduction computation | WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetReduceMaxMinCount |
| Basic API > vector computation > data conversion | Transpose, TransDataTo5HD |
| Basic API > vector computation > data filling | Duplicate |
| Basic API > vector computation > sorting and combination | ProposalConcat, ProposalExtract, RpSort16, MrgSort4, GetMrgSortResult |
| Basic API > vector computation > discretization and aggregation | Gather, Scatter |
| Basic API > vector computation > mask operation | SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask |
| Basic API > vector computation > quantization setting | SetDeqScale |
| Basic API > data movement > DataCopy | Basic data movement |
| Basic API > synchronization control > intra-core synchronization | SetFlag/WaitFlag, PipeBarrier |
| Basic API > buffer control | DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad |
| Basic API > system variable access | GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, CheckLocalMemoryIA |
| Basic API > atomic operation | SetAtomicAdd, SetAtomicNone |
| Basic API > cube computation | InitConstValue, LoadData, SetAippFunctions, LoadImageToLocal, LoadUnzipIndex, LoadDataUnzip, SetLoadDataBoundary, SetLoadDataPaddingValue, Mmad |

| API Category | API | Remarks |
|---|---|---|
| Basic API > scalar computation | ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue, ToBfloat16, ToFloat | - |
| Basic API > vector computation > Basic arithmetic | Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu | - |
| Basic API > vector computation > logical computation | Not, And, Or, ShiftLeft, ShiftRight | - |
| Basic API > vector computation > compound computation | Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu | - |
| Basic API > vector computation > comparison and selection | Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask | - |
| Basic API > vector computation > type conversion | Cast | - |
| Basic API > vector computation > reduction computation | WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetAccVal, GetReduceMaxMinCount | - |
| Basic API > vector computation > data conversion | Transpose, TransDataTo5HD | - |
| Basic API > vector computation > data filling | Duplicate, Brcb | - |
| Basic API > vector computation > sorting and combination | Sort32, MrgSort, GetMrgSortResult | - |
| Basic API > vector computation > discretization and aggregation | Gather, Gatherb | - |
| Basic API > vector computation > mask operation | SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask | - |
| Basic API > vector computation > quantization setting | SetDeqScale | - |
| Basic API > data movement > DataCopy | Basic data movement | Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported. |
| Basic API > data movement > DataCopy | Enhanced data movement | Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported. |
| Basic API > data movement > DataCopy | Slice data movement | - |
| Basic API > data movement > DataCopy | ND2NZ movement with channel conversion, NZ2ND movement with channel conversion, Activation movement with channel quantization | Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported. |
| Basic API > data movement | Copy, DataCopyPad, SetPadValue | - |
| Basic API > synchronization control > intra-core synchronization | SetFlag/WaitFlag, PipeBarrier, DataSyncBarrier | - |
| Basic API > synchronization control > inter-core synchronization | CrossCoreSetFlag, CrossCoreWaitFlag | - |
| Basic API > buffer control | DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad, GetICachePreloadStatus | - |
| Basic API > system variable access | GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, GetSubBlockNum, GetSubBlockIdx, GetSystemCycle, CheckLocalMemoryIA | - |
| Basic API > atomic operation | SetAtomicAdd, SetAtomicType, SetAtomicNone, SetAtomicMax, SetAtomicMin, SetStoreAtomicConfig, GetStoreAtomicConfig | - |
| Basic API > cube computation | Mmad, MmadWithSparse, SetHF32Mode, SetHF32TransMode, SetMMLayoutTransform, SetFixPipeConfig, SetFixpipeNz2ndFlag, SetFixpipePreQuantFlag, InitConstValue, LoadData, LoadDataWithTranspose, SetAippFunctions, LoadImageToLocal, LoadDataWithSparse, SetFmatrix, SetLoadDataBoundary, SetLoadDataRepeat, SetLoadDataPaddingValue, Fixpipe | - |
| Utils API > C++ standard library > algorithm | max, min, index_sequence | - |
| Utils API > C++ standard library > container functions | tuple, get, make_tuple | - |
| Utils API > C++ standard library > type features | is_convertible, is_base_of, is_same, enable_if, conditional | - |

| API Category | API | Remarks |
|---|---|---|
| Basic API > scalar computation | ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue, ToBfloat16, ToFloat | - |
| Basic API > vector computation > Basic arithmetic | Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu | - |
| Basic API > vector computation > logical computation | Not, And, Or, ShiftLeft, ShiftRight | - |
| Basic API > vector computation > compound computation | Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu | - |
| Basic API > vector computation > comparison and selection | Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask | - |
| Basic API > vector computation > type conversion | Cast | - |
| Basic API > vector computation > reduction computation | WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetAccVal, GetReduceMaxMinCount | - |
| Basic API > vector computation > data conversion | Transpose, TransDataTo5HD | - |
| Basic API > vector computation > data filling | Duplicate, Brcb | - |
| Basic API > vector computation > sorting and combination | Sort32, MrgSort, GetMrgSortResult | - |
| Basic API > vector computation > discretization and aggregation | Gather, Gatherb | - |
| Basic API > vector computation > mask operation | SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask | - |
| Basic API > vector computation > quantization setting | SetDeqScale | - |
| Basic API > data movement > DataCopy | Basic data movement | Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported. |
| Basic API > data movement > DataCopy | Enhanced data movement | Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported. |
| Basic API > data movement > DataCopy | Slice data movement | - |
| Basic API > data movement > DataCopy | ND2NZ movement with channel conversion, NZ2ND movement with channel conversion, Activation movement with channel quantization | Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported. |
| Basic API > data movement | Copy, DataCopyPad, SetPadValue | - |
| Basic API > synchronization control > intra-core synchronization | SetFlag/WaitFlag, PipeBarrier, DataSyncBarrier | - |
| Basic API > synchronization control > inter-core synchronization | CrossCoreSetFlag, CrossCoreWaitFlag | - |
| Basic API > buffer control | DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad, GetICachePreloadStatus | - |
| Basic API > system variable access | GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, GetSubBlockNum, GetSubBlockIdx, GetSystemCycle, CheckLocalMemoryIA | - |
| Basic API > atomic operation | SetAtomicAdd, SetAtomicType, SetAtomicNone, SetAtomicMax, SetAtomicMin, SetStoreAtomicConfig, GetStoreAtomicConfig | - |
| Basic API > cube computation | Mmad, MmadWithSparse, SetHF32Mode, SetHF32TransMode, SetMMLayoutTransform, SetFixPipeConfig, SetFixpipeNz2ndFlag, SetFixpipePreQuantFlag, InitConstValue, LoadData, LoadDataWithTranspose, SetAippFunctions, LoadImageToLocal, LoadDataWithSparse, SetFmatrix, SetLoadDataBoundary, SetLoadDataRepeat, SetLoadDataPaddingValue, Fixpipe | - |
| Utils API > C++ standard library > algorithm | max, min, index_sequence | - |
| Utils API > C++ standard library > container functions | tuple, get, make_tuple | - |
| Utils API > C++ standard library > type features | is_convertible, is_base_of, is_same, enable_if, conditional | - |
| High-level API > C++ standard library > type features | is_convertible, is_base_of, is_same, enable_if, conditional | - |
| High-level API > template library functions > type_traits | is_convertible, is_base_of, is_same, enable_if, conditional | - |