Static Tensor Programming

In pipe-based operator development, the pipe (TPipe class) manages resources such as device memory in a unified manner. Developers do not need to handle memory management, DoubleBuffer pipelining, or synchronization themselves; they only need to write operators according to the compute flow. However, this convenience brings some runtime overhead (such as TPipe creation and InitBuffer).

For this reason, Ascend C provides the static tensor programming mode. Compared with the pipe-based programming mode, this mode skips the TPipe memory management initialization (about hundreds of nanoseconds), which reduces runtime overhead and helps developers reach peak performance. It also offers greater flexibility: developers directly construct a LocalTensor with a specified address and storage location and pass it to the computation and data movement APIs. The cost is higher development complexity, because developers must manage DoubleBuffer pipelining and synchronization themselves. In addition, only a subset of the Ascend C APIs, mainly the basic APIs, can be used.

The comparison between the two programming modes is as follows:

Programming Paradigm

  • AI Core consists of multiple memory units, such as the Unified Buffer for vector computation and the L1 Buffer, L0A Buffer, L0B Buffer, and L0C Buffer for cube computation. Developers can manage all memory resources on the AI Core. When creating a tensor and assigning its address, developers need to manage the memory size and any memory reuse relationships, and ensure that the assigned address is valid.
  • AI Core supports multiple instruction pipeline types, such as the vector, cube, and scalar computation pipelines, as well as the MTE1, MTE2, and MTE3 movement pipelines. All pipelines are executed in parallel, and their dependencies are coordinated through synchronization events. Developers call the movement or computation APIs provided by Ascend C to write operators and insert corresponding synchronization events based on data dependencies to achieve optimal performance.

The following figure shows a typical vector operator. Developers first tile the data based on the size of the workload, and then insert synchronization events based on the data dependencies within the core.
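
As a rough illustration of the tiling step, the following host-side sketch splits a one-dimensional workload evenly across cores and then into fixed-size tiles. `TilingPlan`, `ComputeTiling`, and the even-split assumption are hypothetical and are not part of Ascend C; the resulting `loopCount` plays the role of the loop bound used in the kernel examples in this section.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tiling result for one core (not an Ascend C type).
struct TilingPlan {
    uint32_t blockLength;  // elements handled by each core
    uint32_t loopCount;    // number of full tiles per core
    uint32_t tailLength;   // leftover elements after the full tiles
};

// Assumption: totalLength divides evenly across blockNum cores.
inline TilingPlan ComputeTiling(uint32_t totalLength, uint32_t blockNum, uint32_t tileLength)
{
    assert(blockNum > 0 && tileLength > 0);
    assert(totalLength % blockNum == 0);
    TilingPlan plan{};
    plan.blockLength = totalLength / blockNum;
    plan.loopCount = plan.blockLength / tileLength;
    plan.tailLength = plan.blockLength % tileLength;
    return plan;
}
```

For example, 16384 elements on 8 cores with a tile of 512 elements gives 4 loops per core and no tail. A nonzero `tailLength` would need one extra, shorter iteration, which the examples in this section omit.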

Memory Management

In static tensor programming mode, developers can create tensors in either of the following ways:

  • Allocate tensors by specifying the hardware location through LocalMemAllocator.

    LocalMemAllocator is a simple linear memory allocator. Developers call its Alloc method to allocate memory; addresses start at 0 and increase linearly in call order. It does not support releasing memory or any other memory management capability. LocalMemAllocator simplifies operator writing when bank conflicts are not a concern or during the initial functional development of an operator. During later performance optimization, developers can switch to specifying addresses directly through the LocalTensor constructor.

  • Create tensors using the LocalTensor constructor. This method is recommended for scenarios requiring ultimate performance.

    Developers can use the LocalTensor constructor to specify memory addresses directly and thus manage memory entirely on their own (in effect, no allocation or release step is needed). When using this method, choose addresses as required without exceeding the physical memory limit, and make sure that any memory reuse preserves functional correctness. This method is recommended when avoiding bank conflicts or reusing memory is needed to reach peak performance.

    // Method 1: Use LocalMemAllocator to allocate memory.
    AscendC::LocalMemAllocator<AscendC::Hardware::UB> ubAllocator;
    AscendC::LocalTensor<float> xLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();
    AscendC::LocalTensor<float> yLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();
    AscendC::LocalTensor<float> zLocalPing = ubAllocator.Alloc<float, TILE_LENGTH>();

    // Method 2: Use the LocalTensor constructor to construct a tensor.
    AscendC::LocalTensor<float> xLocalPing(AscendC::TPosition::VECCALC, xAddrPing, TILE_LENGTH);
    AscendC::LocalTensor<float> yLocalPing(AscendC::TPosition::VECCALC, yAddrPing, TILE_LENGTH);
    AscendC::LocalTensor<float> zLocalPing(AscendC::TPosition::VECCALC, zAddrPing, TILE_LENGTH);
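
The two approaches can be compared with a small host-side model. The sketch below is not Ascend C code: `LinearAllocatorModel` is a hypothetical stand-in that only mimics the behavior described above (addresses start at 0, grow in call order, and are never released), and `FitsAndDisjoint` is a hypothetical check a developer might run on hand-picked addresses before passing them to the LocalTensor constructor. Addresses are assumed to be byte offsets, the capacity is a parameter because the Unified Buffer size depends on the hardware, and alignment requirements are not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a linear (bump) allocator; not the Ascend C class.
class LinearAllocatorModel {
public:
    explicit LinearAllocatorModel(uint32_t capacityBytes) : capacity_(capacityBytes) {}

    // Returns the byte offset of a block holding `count` elements of type T.
    template <typename T>
    uint32_t Alloc(uint32_t count)
    {
        const uint32_t bytes = count * static_cast<uint32_t>(sizeof(T));
        assert(bytes <= capacity_ - next_);  // nothing is released, so free space only shrinks
        const uint32_t addr = next_;
        next_ += bytes;
        return addr;
    }

    uint32_t Used() const { return next_; }

private:
    uint32_t capacity_;
    uint32_t next_ = 0;
};

// A hand-placed buffer: byte offset and size in bytes.
struct Region {
    uint32_t addr;
    uint32_t bytes;
};

// True if both regions fit within capacityBytes and do not overlap.
inline bool FitsAndDisjoint(Region a, Region b, uint32_t capacityBytes)
{
    const bool fits = a.addr + a.bytes <= capacityBytes && b.addr + b.bytes <= capacityBytes;
    const bool disjoint = a.addr + a.bytes <= b.addr || b.addr + b.bytes <= a.addr;
    return fits && disjoint;
}
```

With a tile of 2048 float elements, three consecutive `Alloc<float>(2048)` calls on the model return offsets 0, 8192, and 16384. With hand-picked addresses, the same layout can be chosen freely, for example to separate ping and pong buffers, as long as a check such as `FitsAndDisjoint` holds for every pair of buffers that must not share memory.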

Synchronization Management

As described above, the AI Core runs multiple pipelines asynchronously and in parallel (including vector computation, cube computation, data copy-in, and data copy-out). When a data dependency exists between these pipelines, a corresponding synchronization event must be inserted. In static tensor programming mode, developers insert synchronization manually using SetFlag/WaitFlag(ISASI) and PipeBarrier(ISASI), and manage the event types and event IDs themselves. Event IDs 6 and 7 must not be used, because they may conflict with internally used event IDs and lead to undefined behavior. In addition, because SetFlag, WaitFlag, and PipeBarrier are low-level synchronization APIs tied to the hardware architecture (ISASI), compatibility across hardware versions cannot be guaranteed.

Based on the data dependency and the instruction execution order, synchronization falls into two types: forward synchronization (dependency within one loop iteration) and backward synchronization (dependency between consecutive loop iterations).

  • Forward synchronization

    Between the current data copy-in and computation, insert the MTE2_V synchronization event (the vector computation pipeline waits for the MTE2 movement pipeline) to ensure that computation starts only after the data has been copied in. Between the current computation and copy-out, insert the V_MTE3 synchronization event (the MTE3 movement pipeline waits for the vector computation pipeline) to ensure that data is copied out only after computation is complete.

  • Backward synchronization

    Between the previous computation and the current data copy-in, insert the V_MTE2 synchronization event (the MTE2 movement pipeline waits for the vector computation pipeline) to ensure that the current data is copied in only after the previous computation is complete. This prevents the current data from overwriting input data that the previous iteration has not finished computing. Between the previous data copy-out and the current computation, insert the MTE3_V synchronization event (the vector computation pipeline waits for the MTE3 movement pipeline) to ensure that the current data is computed only after the previous result has been copied out. This prevents the current result from overwriting output data that the previous iteration has not finished copying out.

In the TPipe programming framework, this synchronization logic is encapsulated by EnQue/DeQue/AllocTensor/FreeTensor. Refer to Programming Model Design Principles to learn how to control synchronization manually when using static tensor programming.

    AscendC::LocalTensor<float> xLocal(AscendC::TPosition::VECCALC, xAddr, TILE_LENGTH);
    AscendC::LocalTensor<float> yLocal(AscendC::TPosition::VECCALC, yAddr, TILE_LENGTH);
    AscendC::LocalTensor<float> zLocal(AscendC::TPosition::VECCALC, zAddr, TILE_LENGTH);
    for (int i = 0; i < loopCount; i++) {
        // dependency of PIPE_V & PIPE_MTE2 caused by xLocal/yLocal between 2 sequential loops
        if (i != 0) {
            AscendC::WaitFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID0);
        }
        AscendC::DataCopy(xLocal, xGm[i * TILE_LENGTH], TILE_LENGTH);
        AscendC::DataCopy(yLocal, yGm[i * TILE_LENGTH], TILE_LENGTH);
        // dependency of PIPE_MTE2 & PIPE_V caused by xLocal/yLocal in one single loop
        AscendC::SetFlag<AscendC::HardEvent::MTE2_V>(EVENT_ID0);
        AscendC::WaitFlag<AscendC::HardEvent::MTE2_V>(EVENT_ID0);
        if (i != 0) {
            // dependency of PIPE_MTE3 & PIPE_V caused by zLocal between 2 sequential loops
            AscendC::WaitFlag<AscendC::HardEvent::MTE3_V>(EVENT_ID0);
        }
        AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
        if (i != (loopCount - 1)) {
            // dependency of PIPE_V & PIPE_MTE2 caused by xLocal/yLocal between 2 sequential loops
            AscendC::SetFlag<AscendC::HardEvent::V_MTE2>(EVENT_ID0);
        }
        // dependency of PIPE_V & PIPE_MTE3 caused by zLocal in one single loop
        AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(EVENT_ID0);
        AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(EVENT_ID0);
        AscendC::DataCopy(zGm[i * TILE_LENGTH], zLocal, TILE_LENGTH);
        if (i != (loopCount - 1)) {
            // dependency of PIPE_MTE3 & PIPE_V caused by zLocal between 2 sequential loops
            AscendC::SetFlag<AscendC::HardEvent::MTE3_V>(EVENT_ID0);
        }
    }
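
The `if (i != 0)` and `if (i != (loopCount - 1))` guards in the loop above keep every SetFlag matched with exactly one WaitFlag across iterations. The host-side sketch below replays the flag calls of that loop in program order and checks the pairing. `FlagLedger` and `LoopFlagsBalanced` are hypothetical helpers, not Ascend C APIs, and the model assumes only that each WaitFlag must consume one earlier SetFlag of the same event; it does not model pipeline timing.

```cpp
#include <map>
#include <string>

// Hypothetical bookkeeping: +1 for each SetFlag, -1 for each WaitFlag.
struct FlagLedger {
    std::map<std::string, int> pending;
    bool waitedWithoutSet = false;

    void Set(const std::string& event) { ++pending[event]; }
    void Wait(const std::string& event)
    {
        if (--pending[event] < 0) {
            waitedWithoutSet = true;  // a wait was issued before any matching set
        }
    }
    bool Balanced() const
    {
        if (waitedWithoutSet) {
            return false;
        }
        for (const auto& entry : pending) {
            if (entry.second != 0) {
                return false;  // a set was never consumed
            }
        }
        return true;
    }
};

// Replays the flag calls of the single-buffer loop. With guarded == false,
// the first-iteration and last-iteration guards are dropped.
inline bool LoopFlagsBalanced(int loopCount, bool guarded)
{
    FlagLedger flags;
    for (int i = 0; i < loopCount; ++i) {
        if (!guarded || i != 0) flags.Wait("V_MTE2");              // backward sync
        flags.Set("MTE2_V");                                       // forward sync
        flags.Wait("MTE2_V");
        if (!guarded || i != 0) flags.Wait("MTE3_V");              // backward sync
        if (!guarded || i != loopCount - 1) flags.Set("V_MTE2");
        flags.Set("V_MTE3");                                       // forward sync
        flags.Wait("V_MTE3");
        if (!guarded || i != loopCount - 1) flags.Set("MTE3_V");
    }
    return flags.Balanced();
}
```

In this model, `LoopFlagsBalanced(4, true)` holds, while dropping the guards makes the first iteration wait on a flag that was never set and leaves flags set by the last iteration unconsumed.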

Pipeline Optimization

In the TPipe-based programming paradigm, developers enable DoubleBuffer simply by setting the number of buffers to 2 in InitBuffer. In static tensor programming mode, developers must implement DoubleBuffer manually. The following is an example. For the complete code, see the DoubleBuffer example in the static tensor programming sample.

    // ping
    AscendC::LocalTensor<float> xLocalPing(AscendC::TPosition::VECCALC, xAddrPing, TILE_LENGTH);
    AscendC::LocalTensor<float> yLocalPing(AscendC::TPosition::VECCALC, yAddrPing, TILE_LENGTH);
    AscendC::LocalTensor<float> zLocalPing(AscendC::TPosition::VECCALC, zAddrPing, TILE_LENGTH);
    // pong
    AscendC::LocalTensor<float> xLocalPong(AscendC::TPosition::VECCALC, xAddrPong, TILE_LENGTH);
    AscendC::LocalTensor<float> yLocalPong(AscendC::TPosition::VECCALC, yAddrPong, TILE_LENGTH);
    AscendC::LocalTensor<float> zLocalPong(AscendC::TPosition::VECCALC, zAddrPong, TILE_LENGTH);

    // double buffer: pre-set both events so that the first WaitFlag on each buffer set passes immediately
    AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID0);
    AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID1);
    for (int i = 0; i < loopCount; i++) {
        int32_t eventID = (i % 2 == 0 ? EVENT_ID0 : EVENT_ID1);
        AscendC::LocalTensor<float> &xLocal = (i % 2 == 0 ? xLocalPing : xLocalPong);
        AscendC::LocalTensor<float> &yLocal = (i % 2 == 0 ? yLocalPing : yLocalPong);
        AscendC::LocalTensor<float> &zLocal = (i % 2 == 0 ? zLocalPing : zLocalPong);
        // dependency of PIPE_MTE3 & PIPE_MTE2 caused by xLocal/yLocal between 2 loops that use the same buffer set
        AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(eventID);
        AscendC::DataCopy(xLocal, xGm[i * TILE_LENGTH], TILE_LENGTH);
        AscendC::DataCopy(yLocal, yGm[i * TILE_LENGTH], TILE_LENGTH);

        // dependency of PIPE_MTE2 & PIPE_V caused by xLocal/yLocal in one single loop
        AscendC::SetFlag<AscendC::HardEvent::MTE2_V>(eventID);
        AscendC::WaitFlag<AscendC::HardEvent::MTE2_V>(eventID);
        AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
        // dependency of PIPE_V & PIPE_MTE3 caused by zLocal in one single loop
        AscendC::SetFlag<AscendC::HardEvent::V_MTE3>(eventID);
        AscendC::WaitFlag<AscendC::HardEvent::V_MTE3>(eventID);
        AscendC::DataCopy(zGm[i * TILE_LENGTH], zLocal, TILE_LENGTH);
        // dependency of PIPE_MTE3 & PIPE_MTE2 caused by zLocal between 2 loops that use the same buffer set
        AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(eventID);
    }
    // consume the outstanding SetFlag of each event ID so that every SetFlag has a matching WaitFlag
    AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID0);
    AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(EVENT_ID1);

The following figure shows the pipeline with DoubleBuffer disabled and enabled. In most cases, DoubleBuffer effectively improves the utilization of the Vector Unit and reduces the operator execution time. For details, see DoubleBuffer.
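
The effect can be estimated with a simple timing model. The sketch below is a host-side approximation under stated assumptions, not a measurement: `StageCost` and `ModelPipelineTime` are hypothetical names, each pipeline (MTE2, vector, MTE3) processes tiles in order, every tile has the same cost in arbitrary time units, and a buffer set becomes free for the next copy-in only when its copy-out has finished, as with the MTE3_MTE2 event in the example above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-tile costs in arbitrary time units.
struct StageCost {
    int64_t copyIn;   // MTE2: GM -> local memory
    int64_t compute;  // vector pipeline
    int64_t copyOut;  // MTE3: local memory -> GM
};

// Returns the finish time of the last copy-out for `tiles` tiles
// when `buffers` buffer sets are used in rotation (1 = no DoubleBuffer).
inline int64_t ModelPipelineTime(int tiles, int buffers, StageCost cost)
{
    assert(tiles >= 0 && buffers >= 1);
    std::vector<int64_t> outDone(static_cast<std::size_t>(tiles), 0);
    int64_t mte2Free = 0;
    int64_t vecFree = 0;
    int64_t mte3Free = 0;
    for (int i = 0; i < tiles; ++i) {
        // The buffer set is reusable once the tile that last used it is copied out.
        const int64_t slotFree = (i >= buffers) ? outDone[i - buffers] : 0;
        const int64_t inDone = std::max(mte2Free, slotFree) + cost.copyIn;
        mte2Free = inDone;
        const int64_t vecDone = std::max(inDone, vecFree) + cost.compute;
        vecFree = vecDone;
        outDone[i] = std::max(vecDone, mte3Free) + cost.copyOut;
        mte3Free = outDone[i];
    }
    return tiles == 0 ? 0 : outDone[tiles - 1];
}
```

For 4 tiles with costs of 2, 3, and 2 units, the model gives 28 units with one buffer set and 17 units with two, because copy-in of the next tile overlaps with computation and copy-out of the current one. Real gains depend on the actual ratio of movement time to computation time.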

Constraints

Comply with the following constraints when using the static tensor programming mode:

  • Developers cannot use framework APIs such as TPipe, TQue, TQueBind, and TBufPool. Using these APIs together with the static tensor programming mode may result in undefined behavior.
  • Only some APIs can be used. For details about the supported APIs, see Supported APIs. APIs that are not in the list internally depend on TPipe to allocate event IDs, which may conflict with the event IDs defined by developers.
  • Developers need to manually insert synchronization events using SetFlag/WaitFlag(ISASI) and PipeBarrier(ISASI). The event types and event IDs are managed by developers. However, note that event IDs 6 and 7 cannot be used (as they may conflict with internal event IDs, leading to undefined behavior).
  • Because the bottom-layer synchronization APIs SetFlag, WaitFlag, and PipeBarrier (related to the ISASI hardware architecture) need to be used, operator cross-hardware version compatibility cannot be ensured.
  • At the kernel entry, developers need to manually call the InitSocState API to initialize the global status register. Because the global status register is in an uncertain state, if this API is not called, undefined behavior may occur during operator execution. In TPipe framework programming, the initialization process is completed by TPipe, and developers do not need to pay attention to it.
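
The event ID restriction can be checked at compile time with a small helper. This is a sketch under assumptions: `IsUserEventIdAllowed` and `UserEventId` are hypothetical names, and the total of eight event IDs (0 to 7) is assumed here for illustration; only the reservation of IDs 6 and 7 comes from the constraints above.

```cpp
#include <cstdint>

constexpr int32_t kAssumedEventIdCount = 8;   // assumption: IDs 0..7 exist
constexpr int32_t kFirstReservedEventId = 6;  // IDs 6 and 7 must not be used

// True if the ID is in the assumed range and below the reserved IDs.
constexpr bool IsUserEventIdAllowed(int32_t id)
{
    return id >= 0 && id < kAssumedEventIdCount && id < kFirstReservedEventId;
}

// Compile-time wrapper: UserEventId<6>::value fails to compile.
template <int32_t Id>
struct UserEventId {
    static_assert(IsUserEventIdAllowed(Id), "event IDs 6 and 7 are reserved");
    static constexpr int32_t value = Id;
};
```

A kernel could then pass `UserEventId<0>::value` wherever it would otherwise write a literal event ID, so that an accidental use of a reserved ID is rejected during compilation.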

Supported APIs

Table 1 Supported APIs for the AI Core of Atlas inference products

API Category

API

Basic API > scalar computation

ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue

Basic API > vector computation > basic arithmetic

Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, VectorPadding, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu

Basic API > vector computation > logical computation

Not, And, Or

Basic API > vector computation > compound computation

Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu

Basic API > vector computation > comparison and selection

Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask

Basic API > vector computation > type conversion

Cast

Basic API > vector computation > reduction computation

WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetReduceMaxMinCount

Basic API > vector computation > data conversion

Transpose, TransDataTo5HD

Basic API > vector computation > data filling

Duplicate

Basic API > vector computation > sorting and combination

ProposalConcat, ProposalExtract, RpSort16, MrgSort4, GetMrgSortResult

Basic API > vector computation > discretization and aggregation

Gather, Scatter

Basic API > vector computation > mask operation

SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask

Basic API > vector computation > quantization setting

SetDeqScale

Basic API > data movement > DataCopy

Basic data movement

Basic API > synchronization control > intra-core synchronization

SetFlag/WaitFlag, PipeBarrier

Basic API > buffer control

DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad

Basic API > system variable access

GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, CheckLocalMemoryIA

Basic API > atomic operation

SetAtomicAdd, SetAtomicNone

Basic API > cube computation

InitConstValue, LoadData, SetAippFunctions, LoadImageToLocal, LoadUnzipIndex, LoadDataUnzip, SetLoadDataBoundary, SetLoadDataPaddingValue, Mmad

Table 2 Supported APIs for the Atlas A2 training products / Atlas A2 inference products

API Category

API

Remarks

Basic API > scalar computation

ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue, ToBfloat16, ToFloat

-

Basic API > vector computation > basic arithmetic

Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu

-

Basic API > vector computation > logical computation

Not, And, Or, ShiftLeft, ShiftRight

-

Basic API > vector computation > compound computation

Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu

-

Basic API > vector computation > comparison and selection

Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask

-

Basic API > vector computation > type conversion

Cast

-

Basic API > vector computation > reduction computation

WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetAccVal, GetReduceMaxMinCount

-

Basic API > vector computation > data conversion

Transpose, TransDataTo5HD

-

Basic API > vector computation > data filling

Duplicate, Brcb

-

Basic API > vector computation > sorting and combination

Sort32, MrgSort, GetMrgSortResult

-

Basic API > vector computation > discretization and aggregation

Gather, Gatherb

-

Basic API > vector computation > mask operation

SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask

-

Basic API > vector computation > quantization setting

SetDeqScale

-

Basic API > data movement > DataCopy

Basic data movement

Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.

Enhanced data movement

Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.

Slice data movement

-

ND2NZ movement with channel conversion

NZ2ND movement with channel conversion

Activation movement with channel quantization

Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.

Basic API > data movement

Copy, DataCopyPad, SetPadValue

-

Basic API > synchronization control > intra-core synchronization

SetFlag/WaitFlag, PipeBarrier, DataSyncBarrier

-

Basic API > synchronization control > inter-core synchronization

CrossCoreSetFlag, CrossCoreWaitFlag

-

Basic API > buffer control

DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad, GetICachePreloadStatus

-

Basic API > system variable access

GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, GetSubBlockNum, GetSubBlockIdx, GetSystemCycle, CheckLocalMemoryIA

-

Basic API > atomic operation

SetAtomicAdd, SetAtomicType, SetAtomicNone, SetAtomicMax, SetAtomicMin, SetStoreAtomicConfig, GetStoreAtomicConfig

-

Basic API > cube computation

Mmad, MmadWithSparse, SetHF32Mode, SetHF32TransMode, SetMMLayoutTransform, SetFixPipeConfig, SetFixpipeNz2ndFlag, SetFixpipePreQuantFlag, InitConstValue, LoadData, LoadDataWithTranspose, SetAippFunctions, LoadImageToLocal, LoadDataWithSparse, SetFmatrix, SetLoadDataBoundary, SetLoadDataRepeat, SetLoadDataPaddingValue, Fixpipe

-

Utils API > C++ standard library > algorithm

max, min, index_sequence

-

Utils API > C++ standard library > container functions

tuple, get, make_tuple

-

Utils API > C++ standard library > type features

is_convertible, is_base_of, is_same, enable_if, conditional

-

Table 3 Supported APIs for the Atlas A3 training products / Atlas A3 inference products

API Category

API

Remarks

Basic API > scalar computation

ScalarGetCountOfValue, ScalarCountLeadingZero, ScalarCast, CountBitsCntSameAsSignBit, ScalarGetSFFValue, ToBfloat16, ToFloat

-

Basic API > vector computation > basic arithmetic

Exp, Ln, Abs, Reciprocal, Sqrt, Rsqrt, Relu, Add, Sub, Mul, Div, Max, Min, BilinearInterpolation, Adds, Muls, Maxs, Mins, LeakyRelu

-

Basic API > vector computation > logical computation

Not, And, Or, ShiftLeft, ShiftRight

-

Basic API > vector computation > compound computation

Axpy, CastDeq, AddRelu, AddReluCast, AddDeqRelu, SubRelu, SubReluCast, MulAddDst, MulCast, FusedMulAdd, FusedMulAddRelu

-

Basic API > vector computation > comparison and selection

Compare, Compare (result stored in the register), CompareScalar, GetCmpMask, SetCmpMask, Select, GatherMask

-

Basic API > vector computation > type conversion

Cast

-

Basic API > vector computation > reduction computation

WholeReduceMax, WholeReduceMin, WholeReduceSum, BlockReduceMax, BlockReduceMin, BlockReduceSum, PairReduceSum, RepeatReduceSum, GetAccVal, GetReduceMaxMinCount

-

Basic API > vector computation > data conversion

Transpose, TransDataTo5HD

-

Basic API > vector computation > data filling

Duplicate, Brcb

-

Basic API > vector computation > sorting and combination

Sort32, MrgSort, GetMrgSortResult

-

Basic API > vector computation > discretization and aggregation

Gather, Gatherb

-

Basic API > vector computation > mask operation

SetMaskCount, SetMaskNorm, SetVectorMask, ResetMask

-

Basic API > vector computation > quantization setting

SetDeqScale

-

Basic API > data movement > DataCopy

Basic data movement

Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.

Enhanced data movement

Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.

Slice data movement

-

ND2NZ movement with channel conversion

NZ2ND movement with channel conversion

Activation movement with channel quantization

Data movement of the VECIN/VECCALC/VECOUT -> TSCM channel is not supported.

Basic API > data movement

Copy, DataCopyPad, SetPadValue

-

Basic API > synchronization control > intra-core synchronization

SetFlag/WaitFlag, PipeBarrier, DataSyncBarrier

-

Basic API > synchronization control > inter-core synchronization

CrossCoreSetFlag, CrossCoreWaitFlag

-

Basic API > buffer control

DataCachePreload, DataCacheCleanAndInvalid, ICachePreLoad, GetICachePreloadStatus

-

Basic API > system variable access

GetBlockNum, GetBlockIdx, GetDataBlockSizeInBytes, GetArchVersion, GetTaskRatio, InitSocState, GetProgramCounter, GetSubBlockNum, GetSubBlockIdx, GetSystemCycle, CheckLocalMemoryIA

-

Basic API > atomic operation

SetAtomicAdd, SetAtomicType, SetAtomicNone, SetAtomicMax, SetAtomicMin, SetStoreAtomicConfig, GetStoreAtomicConfig

-

Basic API > cube computation

Mmad, MmadWithSparse, SetHF32Mode, SetHF32TransMode, SetMMLayoutTransform, SetFixPipeConfig, SetFixpipeNz2ndFlag, SetFixpipePreQuantFlag, InitConstValue, LoadData, LoadDataWithTranspose, SetAippFunctions, LoadImageToLocal, LoadDataWithSparse, SetFmatrix, SetLoadDataBoundary, SetLoadDataRepeat, SetLoadDataPaddingValue, Fixpipe

-

Utils API > C++ standard library > algorithm

max, min, index_sequence

-

Utils API > C++ standard library > container functions

tuple, get, make_tuple

-

Utils API > C++ standard library > type features

is_convertible, is_base_of, is_same, enable_if, conditional

-

High-level API > C++ standard library > type features

is_convertible, is_base_of, is_same, enable_if, conditional

-

High-level API > template library functions > type_traits

is_convertible, is_base_of, is_same, enable_if, conditional

-