Static Tensor Programming

In pipe-based operator development, the pipe (the TPipe class) manages resources such as device memory in a unified manner. Developers do not need to handle memory management, DoubleBuffer pipelining, or synchronization themselves; they only need to write the operator according to its compute flow. However, this convenience carries some runtime overhead, such as TPipe creation and InitBuffer.

To remove this overhead, Ascend C provides the static tensor programming mode. Compared with the pipe-based mode, it skips the initialization of TPipe memory management (on the order of hundreds of nanoseconds), which reduces runtime overhead and helps developers reach peak performance. The mode is also more flexible: developers directly construct a LocalTensor with a specified address and storage location and pass it to the computation and data movement APIs. The trade-off is higher development complexity, because developers must manage DoubleBuffer pipelining and synchronization themselves. In addition, only the basic APIs of Ascend C can be used, not the full API set.

The comparison between the two programming modes is as follows:

For details about the restrictions on static tensor programming, see Constraints.
For details about the complete examples involved in this section, see static tensor programming sample.

Programming Paradigm

An AI Core contains multiple memory units, such as the Unified Buffer for vector computation and the L1 Buffer, L0A Buffer, L0B Buffer, and L0C Buffer for cube computation. Developers manage all of these memory resources themselves. When creating a tensor and assigning its address, developers are responsible for the memory size and for any memory reuse between tensors, and must ensure that the assigned address is valid.
An AI Core supports multiple instruction pipelines: the vector, cube, and scalar computation pipelines, and the MTE1, MTE2, and MTE3 data movement pipelines. The pipelines execute in parallel, and dependencies between them are coordinated through synchronization events. Developers write operators by calling the data movement and computation APIs provided by Ascend C, and insert synchronization events according to the data dependencies to achieve optimal performance.

The following figure shows a typical vector operator. Developers first tile the data based on the size of the computation workload, and then insert synchronization events based on the data dependencies within the core.
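The tiling step can be sketched on the host as plain C++. This is a minimal illustration, not part of Ascend C: `TilingPlan`, `MakeTilingPlan`, and `TileOffset` are hypothetical names, and the sketch assumes that the total length divides evenly across cores and tiles (tail handling is omitted).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side helper (not an Ascend C API): splits a 1-D workload
// evenly across cores, then into fixed-size tiles on each core.
struct TilingPlan {
    uint32_t blockLength;  // elements handled by each core
    uint32_t loopCount;    // tiles processed by each core
};

inline TilingPlan MakeTilingPlan(uint32_t totalLength, uint32_t blockNum, uint32_t tileLength)
{
    // Assumption: only the evenly divisible case is covered here.
    assert(blockNum > 0 && tileLength > 0);
    assert(totalLength % blockNum == 0);
    const uint32_t blockLength = totalLength / blockNum;
    assert(blockLength % tileLength == 0);
    return TilingPlan{blockLength, blockLength / tileLength};
}

// Element offset in global memory of tile `i` on core `blockIdx`.
inline uint32_t TileOffset(const TilingPlan &plan, uint32_t blockIdx, uint32_t i, uint32_t tileLength)
{
    return blockIdx * plan.blockLength + i * tileLength;
}
```

For example, 16384 elements on 8 cores with a tile length of 128 give 2048 elements and 16 loop iterations per core.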

Memory Management

In static tensor programming mode, developers can create tensors in either of the following ways:

Allocate tensors by specifying the hardware location through LocalMemAllocator.
LocalMemAllocator is a simple linear memory allocator. Developers call its Alloc method to allocate memory; addresses start from 0 and increase linearly in call order. It does not provide memory release or other memory management capabilities. LocalMemAllocator simplifies operator writing when bank conflicts are not a concern or during initial functional development. During later performance optimization, developers can switch to assigning addresses directly with the LocalTensor constructor.
Create tensors using the LocalTensor constructor. This method is recommended for scenarios requiring ultimate performance.
Developers use the LocalTensor constructor to specify memory addresses directly, which gives them full control over memory (in effect, nothing is allocated or freed). Choose each address as required, without exceeding the physical memory limit, and make sure that any memory reuse does not break functional correctness. Use this method when peak performance depends on avoiding bank conflicts or on reusing memory.

    // Method 1: Use LocalMemAllocator to allocate memory.
    AscendC::LocalMemAllocator<AscendC::Hardware::UB> ubAllocator;
    AscendC::LocalTensor<float> xLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();
    AscendC::LocalTensor<float> yLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();
    AscendC::LocalTensor<float> zLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();

    // Method 2 (alternative to Method 1): Use the LocalTensor constructor to place each tensor
    // at a developer-chosen address (xAddrPing, yAddrPing, zAddrPing).
    AscendC::LocalTensor<float> xLocalPing(AscendC::TPosition::VECCALC, xAddrPing, TILE_LENGTH);
    AscendC::LocalTensor<float> yLocalPing(AscendC::TPosition::VECCALC, yAddrPing, TILE_LENGTH);
    AscendC::LocalTensor<float> zLocalPing(AscendC::TPosition::VECCALC, zAddrPing, TILE_LENGTH);
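To make the address arithmetic concrete, the following host-side sketch models a linear allocator of the kind described above. It is not the Ascend C LocalMemAllocator: the class name `LinearAllocatorModel`, the capacity supplied by the caller, and the rounding of each region to 32 bytes are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a linear (bump) allocator. NOT the Ascend C LocalMemAllocator.
class LinearAllocatorModel {
public:
    explicit LinearAllocatorModel(uint32_t capacityBytes) : capacity_(capacityBytes) {}

    // Returns the byte offset of a new region that holds `count` elements of T.
    // Offsets start at 0 and grow in call order; nothing is ever released.
    template <typename T>
    uint32_t Alloc(uint32_t count)
    {
        // Assumption: regions are rounded up to a 32-byte boundary.
        const uint32_t raw = count * static_cast<uint32_t>(sizeof(T));
        const uint32_t bytes = (raw + 31u) / 32u * 32u;
        assert(bytes <= capacity_ - next_);  // stay within the modeled buffer
        const uint32_t addr = next_;
        next_ += bytes;
        return addr;
    }

    uint32_t Used() const { return next_; }

private:
    uint32_t capacity_;
    uint32_t next_ = 0;
};
```

With three 128-element float tensors the model returns byte offsets 0, 512, and 1024. Offsets computed this way are the kind of values a developer would choose by hand when switching to the LocalTensor constructor, assuming the constructor address is a byte offset.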

Synchronization Management

As described above, the asynchronous parallel execution inside an AI Core involves multiple pipelines (vector computation, cube computation, data copy-in, and data copy-out). When a data dependency exists between pipelines, a corresponding synchronization event must be inserted. In static tensor programming mode, developers insert synchronization manually using SetFlag/WaitFlag (ISASI) and PipeBarrier (ISASI), and manage the event types and event IDs themselves. Event IDs 6 and 7 must not be used, because they may conflict with internally used event IDs and lead to undefined behavior. In addition, because SetFlag, WaitFlag, and PipeBarrier are low-level synchronization APIs tied to the hardware architecture (ISASI), compatibility across hardware versions cannot be guaranteed.

Based on the data dependency and the instruction execution order, synchronization dependencies fall into two types: forward synchronization (dependencies within one loop iteration) and backward synchronization (dependencies between consecutive loop iterations).

Forward synchronization
Between the copy-in and the computation of the current data, insert the MTE2_V synchronization event (the vector computation pipeline waits for the MTE2 movement pipeline) so that computation starts only after the data has been copied in. Between the computation and the copy-out of the current data, insert the V_MTE3 synchronization event (the MTE3 movement pipeline waits for the vector computation pipeline) so that data is copied out only after computation is complete.
Backward synchronization
Between the computation of the previous data and the copy-in of the current data, insert the V_MTE2 synchronization event (the MTE2 movement pipeline waits for the vector computation pipeline) so that the current data is copied in only after the previous computation is complete. This prevents the current data from overwriting input data that the previous computation has not finished using. Between the copy-out of the previous data and the computation of the current data, insert the MTE3_V synchronization event (the vector computation pipeline waits for the MTE3 movement pipeline) so that the current data is computed only after the previous result has been copied out. This prevents the current result from overwriting a previous result that has not yet been copied out.

When this synchronization logic is used within the Pipe programming framework, the framework encapsulates it in EnQue/DeQue/AllocTensor/FreeTensor. To learn how to control synchronization manually in static tensor programming, refer to Programming Model Design Principles.

    AscendC::LocalTensor<float> xLocal(AscendC::TPosition::VECCALC, xAddr, TILE_LENGTH);
    AscendC::LocalTensor<float> yLocal(AscendC::TPosition::VECCALC, yAddr, TILE_LENGTH);
    AscendC::LocalTensor<float> zLocal(AscendC::TPosition::VECCALC, zAddr, TILE_LENGTH);
    for (int i = 0; i < loopCount; i++) {
        // dependency of PIPE_V & PIPE_MTE2 caused by xLocal/yLocal between 2 sequential loops
        if (i != 0) {
            AscendC::WaitFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID0);
        }
        AscendC::DataCopy(xLocal, xGm[i * TILE_LENGTH], TILE_LENGTH);
        AscendC::DataCopy(yLocal, yGm[i * TILE_LENGTH], TILE_LENGTH);
        // dependency of PIPE_MTE2 & PIPE_V caused by xLocal/yLocal in one single loop
        AscendC::SetFlag<AscendC::HardEvent::MTE2_V>(EVENT_ID0);
        AscendC::WaitFlag<AscendC::HardEvent::MTE2_V>(EVENT_ID0);
        if (i != 0) {
            // dependency of PIPE_MTE3 & PIPE_V caused by zLocal between 2 sequential loops
            AscendC::WaitFlag<AscendC::HardEvent::MTE3_V>(EVENT_ID0);
        }
        AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
        if (i != (loopCount - 1)) {
            // dependency of PIPE_V & PIPE_MTE2 caused by xLocal/yLocal between 2 sequential loops
            AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID0);
        }
        // dependency of PIPE_V & PIPE_MTE3 caused by zLocal in one single loop
        AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(EVENT_ID0);
        AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(EVENT_ID0);
        AscendC::DataCopy(zGm[i * TILE_LENGTH], zLocal, TILE_LENGTH);
        if (i != (loopCount - 1)) {
            // dependency of PIPE_MTE3 & PIPE_V caused by zLocal between 2 sequential loops
            AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(EVENT_ID0);
        }
    }
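A common mistake with manual synchronization is an unmatched SetFlag or WaitFlag. The host-side sketch below replays the flag sequence of the loop above in issue order and checks that every wait is preceded by a set on the same event and that no flag is left set at the end. `FlagLedger` and `CheckSingleBufferLoop` are hypothetical helpers, not Ascend C APIs; the model tracks only EVENT_ID0 and ignores actual pipeline timing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side checker (not part of Ascend C).
enum class Event { MTE2_V, V_MTE3, V_MTE2, MTE3_V, Count };

class FlagLedger {
public:
    void Set(Event e) { ++pending_[Index(e)]; }

    void Wait(Event e)
    {
        if (pending_[Index(e)] == 0) {
            ok_ = false;  // a wait with no earlier set would never be released
            return;
        }
        --pending_[Index(e)];
    }

    // True if no wait was unmatched and no flag is still set.
    bool Balanced() const
    {
        if (!ok_) {
            return false;
        }
        for (int p : pending_) {
            if (p != 0) {
                return false;
            }
        }
        return true;
    }

private:
    static std::size_t Index(Event e) { return static_cast<std::size_t>(e); }
    std::array<int, static_cast<std::size_t>(Event::Count)> pending_{};
    bool ok_ = true;
};

// Replays the SetFlag/WaitFlag calls of the single-buffer loop in issue order.
inline bool CheckSingleBufferLoop(int loopCount)
{
    assert(loopCount > 0);
    FlagLedger ledger;
    for (int i = 0; i < loopCount; i++) {
        if (i != 0) ledger.Wait(Event::V_MTE2);              // before copy-in
        ledger.Set(Event::MTE2_V);                           // after copy-in
        ledger.Wait(Event::MTE2_V);
        if (i != 0) ledger.Wait(Event::MTE3_V);              // before compute
        if (i != loopCount - 1) ledger.Set(Event::V_MTE2);   // after compute
        ledger.Set(Event::V_MTE3);
        ledger.Wait(Event::V_MTE3);
        if (i != loopCount - 1) ledger.Set(Event::MTE3_V);   // after copy-out
    }
    return ledger.Balanced();
}
```

Removing either of the `i != (loopCount - 1)` guards in this model leaves a flag set after the last iteration, which the check reports as unbalanced.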

Pipeline Optimization

In the TPipe-based programming paradigm, developers only need to set the number of buffers to 2 in InitBuffer to enable DoubleBuffer automatically. In static tensor programming mode, developers must implement DoubleBuffer manually, as in the following example. For the complete example, see the DoubleBuffer example in static tensor programming sample.

    // ping
    AscendC::LocalTensor<float> xLocalPing(AscendC::TPosition::VECCALC, xAddrPing, TILE_LENGTH);
    AscendC::LocalTensor<float> yLocalPing(AscendC::TPosition::VECCALC, yAddrPing, TILE_LENGTH);
    AscendC::LocalTensor<float> zLocalPing(AscendC::TPosition::VECCALC, zAddrPing, TILE_LENGTH);
    // pong
    AscendC::LocalTensor<float> xLocalPong(AscendC::TPosition::VECCALC, xAddrPong, TILE_LENGTH);
    AscendC::LocalTensor<float> yLocalPong(AscendC::TPosition::VECCALC, yAddrPong, TILE_LENGTH);
    AscendC::LocalTensor<float> zLocalPong(AscendC::TPosition::VECCALC, zAddrPong, TILE_LENGTH);

    // double buffer: pre-set both MTE3_MTE2 flags so that the first wait on each buffer passes immediately
    AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID0);
    AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID1);
    for (int i = 0; i < loopCount; i++) {
        int32_t eventID = (i % 2 == 0 ? EVENT_ID0 : EVENT_ID1);
        AscendC::LocalTensor<float> &xLocal = (i % 2 == 0 ? xLocalPing : xLocalPong);
        AscendC::LocalTensor<float> &yLocal = (i % 2 == 0 ? yLocalPing : yLocalPong);
        AscendC::LocalTensor<float> &zLocal = (i % 2 == 0 ? zLocalPing : zLocalPong);
        // dependency of PIPE_MTE3 & PIPE_MTE2 caused by xLocal/yLocal between 2 sequential loops
        AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(eventID);
        AscendC::DataCopy(xLocal, xGm[i * TILE_LENGTH], TILE_LENGTH);
        AscendC::DataCopy(yLocal, yGm[i * TILE_LENGTH], TILE_LENGTH);

        // dependency of PIPE_MTE2 & PIPE_V caused by xLocal/yLocal in one single loop
        AscendC::SetFlag<AscendC::HardEvent::MTE2_V>(eventID);
        AscendC::WaitFlag<AscendC::HardEvent::MTE2_V>(eventID);
        AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
        // dependency of PIPE_V & PIPE_MTE3 caused by zLocal in one single loop
        AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(eventID);
        AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(eventID);
        AscendC::DataCopy(zGm[i * TILE_LENGTH], zLocal, TILE_LENGTH);
        // dependency of PIPE_MTE3 & PIPE_MTE2 caused by zLocal between 2 sequential loops
        AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(eventID);
    }
    // consume the outstanding MTE3_MTE2 flag of each buffer so that no event is left pending
    AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID0);
    AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID1);

The following figure shows the pipeline when DoubleBuffer is disabled and enabled. In most cases, the DoubleBuffer mechanism can effectively improve the utilization ratio of the Vector Unit and reduce the operator execution time. For details, see DoubleBuffer.
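The effect can be estimated with a simple host-side timing model. The sketch below is an illustration under stated assumptions, not a measurement: each stage has a fixed cost per tile in arbitrary time units, each pipeline processes one tile at a time, and the only waits are the synchronization events used in the two loops in this section. `StageCost`, `SingleBufferTime`, and `DoubleBufferTime` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Per-tile cost of each stage, in arbitrary time units (assumed constant).
struct StageCost {
    int copyIn;
    int compute;
    int copyOut;
};

// Single buffer: copy-in waits for the previous compute (V_MTE2),
// compute waits for the previous copy-out (MTE3_V).
inline int SingleBufferTime(int loopCount, StageCost c)
{
    assert(loopCount > 0);
    int inEnd = 0;
    int compEnd = 0;
    int outEnd = 0;
    for (int i = 0; i < loopCount; i++) {
        inEnd = std::max(inEnd, compEnd) + c.copyIn;
        compEnd = std::max(inEnd, outEnd) + c.compute;
        outEnd = compEnd + c.copyOut;
    }
    return outEnd;
}

// Double buffer: copy-in of tile i waits only for the copy-out of tile i-2,
// which used the same ping or pong buffers (MTE3_MTE2).
inline int DoubleBufferTime(int loopCount, StageCost c)
{
    assert(loopCount > 0);
    int inEnd = 0;
    int compEnd = 0;
    int outEndPrev = 0;      // copy-out end time of tile i-1
    int outEndPrevPrev = 0;  // copy-out end time of tile i-2
    for (int i = 0; i < loopCount; i++) {
        inEnd = std::max(inEnd, outEndPrevPrev) + c.copyIn;
        compEnd = std::max(inEnd, compEnd) + c.compute;
        const int outEnd = std::max(compEnd, outEndPrev) + c.copyOut;
        outEndPrevPrev = outEndPrev;
        outEndPrev = outEnd;
    }
    return outEndPrev;
}
```

With four tiles and a cost of 2 units per stage, this model gives 18 units for the single-buffer loop and 14 units for the DoubleBuffer loop.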

Constraints

Comply with the following constraints when using the static tensor programming mode:

Developers cannot use framework APIs such as TPipe, TQue, TQueBind, and TBufPool. Using these APIs together with the static tensor programming mode may result in undefined behavior.
Only some APIs can be used. For details about the supported APIs, see Supported APIs. APIs that are not in the list internally depend on TPipe to allocate event IDs, which may conflict with the event IDs defined by developers.
Developers need to manually insert synchronization events using SetFlag/WaitFlag(ISASI) and PipeBarrier(ISASI). The event types and event IDs are managed by developers. However, note that event IDs 6 and 7 cannot be used (as they may conflict with internal event IDs, leading to undefined behavior).
Because SetFlag, WaitFlag, and PipeBarrier are low-level synchronization APIs tied to the hardware architecture (ISASI), operator compatibility across hardware versions cannot be guaranteed.
At the kernel entry, developers must call the InitSocState API manually to initialize the global status registers. These registers are in an undetermined state at kernel entry, so if the API is not called, undefined behavior may occur during operator execution. In TPipe-based programming, TPipe performs this initialization and developers do not need to handle it.
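The event ID restriction can be enforced with a small guard. This is a sketch under stated assumptions: it treats IDs 0 through 5 as usable, which follows from the restriction on 6 and 7 only if event IDs are numbered 0 to 7. `IsUserEventId` and `PingPongEventId` are hypothetical helpers, not Ascend C APIs.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: event IDs are numbered from 0 and the first reserved ID is 6.
constexpr int32_t kFirstReservedEventId = 6;

// True if `id` is outside the range that may collide with internal event IDs.
constexpr bool IsUserEventId(int32_t id)
{
    return id >= 0 && id < kFirstReservedEventId;
}

// Picks one of two alternating event IDs for a ping-pong loop and
// checks that the result stays in the usable range.
inline int32_t PingPongEventId(int32_t loopIndex, int32_t baseId = 0)
{
    assert(loopIndex >= 0);
    const int32_t id = baseId + (loopIndex % 2);
    assert(IsUserEventId(id));
    return id;
}

static_assert(IsUserEventId(0) && IsUserEventId(5), "IDs 0..5 are treated as usable");
static_assert(!IsUserEventId(6) && !IsUserEventId(7), "IDs 6 and 7 must not be used");
```

A kernel could pass the returned value wherever the examples in this section pass `eventID`, so that an accidental use of ID 6 or 7 is caught in debug builds.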

Supported APIs

**Table 1** Supported APIs for the AI Core of the Atlas inference products
API Category	API
Basic API > scalar computation	ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue
Basic API > vector computation > Basic arithmetic	Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, VectorPadding, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu
Basic API > vector computation > logical computation	Not, And, Or
Basic API > vector computation > compound computation	Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu
Basic API > vector computation > comparison and selection	Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask
Basic API > vector computation > type conversion	Cast
Basic API > vector computation > reduction computation	WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetReduceMaxMinCount
Basic API > vector computation > data conversion	Transpose, TransDataTo5HD
Basic API > vector computation > data filling	Duplicate
Basic API > vector computation > sorting and combination	ProposalConcat, ProposalExtract, RpSort16, MrgSort4, GetMrgSortResult
Basic API > vector computation > discretization and aggregation	Gather, Scatter
Basic API > vector computation > mask operation	SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask
Basic API > vector computation > quantization setting	SetDeqScale
Basic API > data movement > DataCopy	Basic data movement
Basic API > synchronization control > intra-core synchronization	SetFlag/WaitFlag, PipeBarrier
Basic API > buffer control	DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad
Basic API > system variable access	GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, CheckLocalMemoryIA
Basic API > atomic operation	SetAtomicAdd, SetAtomicNone
Basic API > cube computation	InitConstValue, LoadData, SetAippFunctions, LoadImageToLocal, LoadUnzipIndex, LoadDataUnzip, SetLoadDataBoundary, SetLoadDataPaddingValue, Mmad

**Table 2** Supported APIs for the Atlas A2 training products / Atlas A2 inference products
API Category	API	Remarks
Basic API > scalar computation	ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue, ToBfloat16, ToFloat	-
Basic API > vector computation > Basic arithmetic	Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu	-
Basic API > vector computation > logical computation	Not, And, Or, ShiftLeft, ShiftRight	-
Basic API > vector computation > compound computation	Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu	-
Basic API > vector computation > comparison and selection	Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask	-
Basic API > vector computation > type conversion	Cast	-
Basic API > vector computation > reduction computation	WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetAccVal, GetReduceMaxMinCount	-
Basic API > vector computation > data conversion	Transpose, TransDataTo5HD	-
Basic API > vector computation > data filling	Duplicate, Brcb	-
Basic API > vector computation > sorting and combination	Sort32, MrgSort, GetMrgSortResult	-
Basic API > vector computation > discretization and aggregation	Gather, Gatherb	-
Basic API > vector computation > mask operation	SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask	-
Basic API > vector computation > quantization setting	SetDeqScale	-
Basic API > data movement > DataCopy	Basic data movement	Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.
	Enhanced data movement	Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.
	Slice data movement	-
	ND2NZ movement with channel conversion, NZ2ND movement with channel conversion, activation movement with channel quantization	Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.
Basic API > data movement	Copy, DataCopyPad, SetPadValue	-
Basic API > synchronization control > intra-core synchronization	SetFlag/WaitFlag, PipeBarrier, DataSyncBarrier	-
Basic API > synchronization control > inter-core synchronization	CrossCoreSetFlag, CrossCoreWaitFlag	-
Basic API > buffer control	DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad, GetICachePreloadStatus	-
Basic API > system variable access	GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, GetSubBlockNum, GetSubBlockIdx, GetSystemCycle, CheckLocalMemoryIA	-
Basic API > atomic operation	SetAtomicAdd, SetAtomicType, SetAtomicNone, SetAtomicMax, SetAtomicMin, SetStoreAtomicConfig, GetStoreAtomicConfig	-
Basic API > cube computation	Mmad, MmadWithSparse, SetHF32Mode, SetHF32TransMode, SetMMLayoutTransform, SetFixPipeConfig, SetFixpipeNz2ndFlag, SetFixpipePreQuantFlag, InitConstValue, LoadData, LoadDataWithTranspose, SetAippFunctions, LoadImageToLocal, LoadDataWithSparse, SetFmatrix, SetLoadDataBoundary, SetLoadDataRepeat, SetLoadDataPaddingValue, Fixpipe	-
Utils API > C++ standard library > algorithm	max, min, index_sequence	-
Utils API > C++ standard library > container functions	tuple, get, make_tuple	-
Utils API > C++ standard library > type features	is_convertible, is_base_of, is_same, enable_if, conditional	-

**Table 3** Supported APIs for the Atlas A3 training products / Atlas A3 inference products
API Category	API	Remarks
Basic API > scalar computation	ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue, ToBfloat16, ToFloat	-
Basic API > vector computation > Basic arithmetic	Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu	-
Basic API > vector computation > logical computation	Not, And, Or, ShiftLeft, ShiftRight	-
Basic API > vector computation > compound computation	Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu	-
Basic API > vector computation > comparison and selection	Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask	-
Basic API > vector computation > type conversion	Cast	-
Basic API > vector computation > reduction computation	WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetAccVal, GetReduceMaxMinCount	-
Basic API > vector computation > data conversion	Transpose, TransDataTo5HD	-
Basic API > vector computation > data filling	Duplicate, Brcb	-
Basic API > vector computation > sorting and combination	Sort32, MrgSort, GetMrgSortResult	-
Basic API > vector computation > discretization and aggregation	Gather, Gatherb	-
Basic API > vector computation > mask operation	SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask	-
Basic API > vector computation > quantization setting	SetDeqScale	-
Basic API > data movement > DataCopy	Basic data movement	Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.
	Enhanced data movement	Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.
	Slice data movement	-
	ND2NZ movement with channel conversion, NZ2ND movement with channel conversion, activation movement with channel quantization	Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.
Basic API > data movement	Copy, DataCopyPad, SetPadValue	-
Basic API > synchronization control > intra-core synchronization	SetFlag/WaitFlag, PipeBarrier, DataSyncBarrier	-
Basic API > synchronization control > inter-core synchronization	CrossCoreSetFlag, CrossCoreWaitFlag	-
Basic API > buffer control	DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad, GetICachePreloadStatus	-
Basic API > system variable access	GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, GetSubBlockNum, GetSubBlockIdx, GetSystemCycle, CheckLocalMemoryIA	-
Basic API > atomic operation	SetAtomicAdd, SetAtomicType, SetAtomicNone, SetAtomicMax, SetAtomicMin, SetStoreAtomicConfig, GetStoreAtomicConfig	-
Basic API > cube computation	Mmad, MmadWithSparse, SetHF32Mode, SetHF32TransMode, SetMMLayoutTransform, SetFixPipeConfig, SetFixpipeNz2ndFlag, SetFixpipePreQuantFlag, InitConstValue, LoadData, LoadDataWithTranspose, SetAippFunctions, LoadImageToLocal, LoadDataWithSparse, SetFmatrix, SetLoadDataBoundary, SetLoadDataRepeat, SetLoadDataPaddingValue, Fixpipe	-
Utils API > C++ standard library > algorithm	max, min, index_sequence	-
Utils API > C++ standard library > container functions	tuple, get, make_tuple	-
Utils API > C++ standard library > type features	is_convertible, is_base_of, is_same, enable_if, conditional	-
High-level API > C++ standard library > type features	is_convertible, is_base_of, is_same, enable_if, conditional	-
High-level API > template library functions > type_traits	is_convertible, is_base_of, is_same, enable_if, conditional	-
