Axpy

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	×
Atlas training products	√

Function

Multiplies each element in the source operand src by a scalar and then adds the result to the corresponding element in the destination operand dst. The formula is as follows:

$\mathrm{dst}_i = \mathrm{src}_i \times \mathrm{scalarValue} + \mathrm{dst}_i$
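The element-wise behavior described above can be sketched on the host in plain C++. This is only a reference model for checking expected values, not Ascend C code; the helper name AxpyReference and the use of std::vector<float> are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference model (hypothetical helper, not part of Ascend C):
// for the first `count` elements, dst[i] = src[i] * scalarValue + dst[i].
inline void AxpyReference(std::vector<float>& dst, const std::vector<float>& src,
                          float scalarValue, std::size_t count)
{
    assert(count <= dst.size() && count <= src.size());
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = src[i] * scalarValue + dst[i];
    }
}
```

Note that dst is both read and written, so its previous contents affect the result. Elements at index count and beyond are left unchanged.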

Prototype

Computation of the first n data elements of a tensor

template <typename T, typename U>
__aicore__ inline void Axpy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const U& scalarValue, const int32_t& count)

High-dimensional tensor sharding computation

Bitwise mask mode

template <typename T, typename U, bool isSetMask = true>
__aicore__ inline void Axpy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const U& scalarValue, uint64_t mask[], const uint8_t repeatTime, const UnaryRepeatParams& repeatParams)

Contiguous mask mode

template <typename T, typename U, bool isSetMask = true>
__aicore__ inline void Axpy(const LocalTensor<T>& dst, const LocalTensor<U>& src, const U& scalarValue, uint64_t mask, const uint8_t repeatTime, const UnaryRepeatParams& repeatParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the destination operand. For the data type constraints on the destination and source operands, see Table 3. The supported data types are half and float on all supported products: Atlas training products, the Atlas inference product's AI Core, Atlas A2 training products / Atlas A2 inference products, Atlas A3 training products / Atlas A3 inference products, and Atlas 200I/500 A2 inference products.
U	Data type of the source operand. The supported data types are half and float on all supported products: Atlas training products, the Atlas inference product's AI Core, Atlas A2 training products / Atlas A2 inference products, Atlas A3 training products / Atlas A3 inference products, and Atlas 200I/500 A2 inference products.
isSetMask	Specifies whether the mask is set inside the API. true: the API sets the mask from its mask parameter. false: the mask is set outside the API. In this case the developer must call the SetVectorMask API to set the mask value, and the mask argument of this API must be the placeholder MASK_PLACEHOLDER.

**Table 2** Parameters
Parameter	Input/Output	Meaning
dst	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
src	Input	Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
scalarValue	Input	Source operand (scalar). The data type of scalarValue must be the same as that of src.
count	Input	Number of elements involved in the computation.
mask/mask[]	Input	Controls which elements participate in computation in each iteration. Bitwise mode: controls participation bit by bit. A bit value of 1 means the corresponding element participates in computation, while 0 means it does not. The mask value is an array whose length and element value range depend on the operand data type. When the operand is 16-bit, the array length is 2, mask[0] and mask[1] ∈ [0, 2⁶⁴ - 1], and they cannot both be 0. When the operand is 32-bit, the array length is 1 and mask[0] ∈ (0, 2⁶⁴ - 1]. When the operand is 64-bit, the array length is 1 and mask[0] ∈ (0, 2³² - 1]. For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation. Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range depends on the operand data type, because the maximum number of elements processed in each repeat varies with the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. When the operand is 64-bit, mask ∈ [1, 32].
repeatTime	Input	Number of repeat iterations. The vector compute unit reads 256 bytes of contiguous data for computation each time. To process the input data, the data needs to be read and computed over multiple repeats. repeatTime indicates the number of repeats. For details about this parameter, see High-dimensional Sharding APIs.
repeatParams	Input	Parameters that control the operand address strides. This parameter is of the UnaryRepeatParams type, including the address stride of the same DataBlock between adjacent iterations of the operand and the address stride of different DataBlocks within the same iteration of the operand. For details about the address stride of the operand between adjacent iterations, see repeatStride. For details about the address stride of the operand between different data blocks in a single iteration, see dataBlockStride.
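To make the two mask modes concrete, the following host-side sketch models them for a 32-bit operand, where the bitwise mask is a single uint64_t. It assumes that bit 0 corresponds to the first element of an iteration (consistent with the 0b1000 remark in the mask description) and that a contiguous mask of n selects the first n elements. The helper names ContiguousToBitwise and IsElementActive are hypothetical and are not Ascend C APIs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert a contiguous mask (1..64 leading elements)
// into the equivalent bitwise mask for a 32-bit operand.
inline uint64_t ContiguousToBitwise(uint32_t n)
{
    assert(n >= 1 && n <= 64);
    // Shifting a 64-bit value by 64 is undefined, so handle n == 64 separately.
    return (n == 64) ? ~0ULL : ((1ULL << n) - 1ULL);
}

// Hypothetical helper: does element `index` (0..63) of one iteration participate?
// Assumption: bit 0 of the mask maps to the first element.
inline bool IsElementActive(uint64_t bitMask, uint32_t index)
{
    assert(index < 64);
    return ((bitMask >> index) & 1ULL) != 0;
}
```

Under these assumptions, a bitwise mask of 8 activates only the element at index 3 (the fourth element), and a contiguous mask of 64 is equivalent to a bitwise mask with all 64 bits set.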

**Table 3** Data type restrictions
src Data Type	scalar Data Type	dst Data Type	PAR (elements per iteration)	Availability
half	half	half	128	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products; Atlas inference product's AI Core; Atlas 200I/500 A2 inference products
float	float	float	64	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products; Atlas inference product's AI Core; Atlas 200I/500 A2 inference products
half	half	float	64	Atlas training products; Atlas A2 training products / Atlas A2 inference products; Atlas A3 training products / Atlas A3 inference products; Atlas inference product's AI Core; Atlas 200I/500 A2 inference products
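One way to read the PAR column is that an iteration covers 256 bytes and the element count is limited by the wider of the source and destination types. The sketch below reproduces the three PAR values under that assumption. HalfBits is a 16-bit stand-in for half (standard C++ has no half type), and ElementsPerRepeat is a hypothetical helper, not an Ascend C API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 16-bit stand-in for the half type (assumption for this host-side sketch).
using HalfBits = uint16_t;

// Bytes covered by one iteration, as stated in the repeatTime description.
constexpr std::size_t kBytesPerRepeat = 256;

// Hypothetical helper: elements per iteration, limited by the wider operand type.
template <typename Dst, typename Src>
constexpr std::size_t ElementsPerRepeat()
{
    return kBytesPerRepeat / (sizeof(Dst) > sizeof(Src) ? sizeof(Dst) : sizeof(Src));
}

static_assert(ElementsPerRepeat<HalfBits, HalfBits>() == 128, "half src, half dst");
static_assert(ElementsPerRepeat<float, float>() == 64, "float src, float dst");
static_assert(ElementsPerRepeat<float, HalfBits>() == 64, "half src, float dst");
```

The mixed half-to-float case yields 64 because the float destination fills the 256 bytes first, so only 64 half source elements are consumed per iteration.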

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
For details about the constraints on operand address overlapping, see General Address Overlapping Restrictions.

When a high-dimensional tensor sharding computation API is used with src and scalarValue of type half and dst of type float, each iteration must process the same number of source and destination elements. One iteration writes 64 float elements (eight data blocks), and 64 half elements occupy only four data blocks, so each iteration reads only the first four data blocks of the source operand. Take this into account when setting the repeat stride and mask parameters and when checking for address overlap.
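The effect of this restriction on the stride settings can be checked with a small host-side address model. It assumes that a data block is 32 bytes (256 bytes per iteration split into eight blocks) and that dataBlockStride and repeatStride are counted in data blocks. ElementOffset is a hypothetical helper written for illustration only.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBlockBytes = 32;  // assumed size of one data block

// Hypothetical helper: offset (in elements, from the tensor start) of the k-th
// element handled in repeat r, for an operand whose elements are elemBytes wide.
// blkStride and repStride are given in data blocks.
inline std::size_t ElementOffset(std::size_t r, std::size_t k, std::size_t elemBytes,
                                 std::size_t blkStride, std::size_t repStride)
{
    const std::size_t perBlock = kBlockBytes / elemBytes;  // elements in one block
    const std::size_t block = k / perBlock;                // block index in this repeat
    const std::size_t inBlock = k % perBlock;              // position inside the block
    return (r * repStride + block * blkStride) * perBlock + inBlock;
}
```

With a float dst (4 bytes, repeat stride 8) and a half src (2 bytes, repeat stride 4), element 63 of repeat 0 and element 0 of repeat 1 land at offsets 63 and 64 in both tensors, so consecutive repeats are contiguous. Under the same model, a source repeat stride of 8 would start repeat 1 at offset 128 and skip 64 half elements.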

Example

This example shows only part of the code used in the computation process (Compute). To run the sample code, copy the code snippet and replace the corresponding part of the Compute function in the complete sample template of More Examples.

Example of high-dimensional tensor sharding computation (contiguous mask mode)

// repeatTime = 4, mask = 128: 128 elements per repeat, 512 elements in total
// srcLocal, scalar, and dstLocal are all of the half data type.
// dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
// dstRepStride, srcRepStride = 8, no gap between repeats
AscendC::Axpy(dstLocal, srcLocal, (half)2.0, 128, 4, { 1, 1, 8, 8 });

// srcLocal and scalar are of type half, while dstLocal is of type float.
// repeatTime = 8, mask = 64: 64 elements per repeat, 512 elements in total
// dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
// dstRepStride = 8, srcRepStride = 4, no gap between repeats
// Each iteration uses only the first four data blocks of the source operand.
AscendC::Axpy(dstLocal, srcLocal, (half)2.0, 64, 8, { 1, 1, 8, 4 });

Example of high-dimensional tensor sharding computation (bitwise mask mode)

uint64_t mask[2] = { 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF };
// repeatTime = 4, 128 elements per repeat, 512 elements in total. The data type is half.
// dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
// dstRepStride, srcRepStride = 8, no gap between repeats
AscendC::Axpy(dstLocal, srcLocal, (half)2.0, mask, 4, { 1, 1, 8, 8 });

Example of computing the first n data elements of a tensor

// srcLocal, scalar, and dstLocal are all of the half data type; 512 elements are computed.
AscendC::Axpy(dstLocal, srcLocal, (half)2.0, 512);
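When the mask/repeat overloads are used instead of the count overload, the caller has to derive repeatTime and, if the element count is not a multiple of the per-iteration element count, a mask for the remaining elements. The sketch below shows that arithmetic on the host. RepeatPlan and PlanRepeats are hypothetical names, and the split into full repeats plus one tail call is an assumption about how a caller might organize the work, not a description of how the API behaves internally.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch.
struct RepeatPlan {
    uint32_t fullRepeats;  // repeats in which every element of the iteration is active
    uint32_t tailMask;     // contiguous mask for one extra repeat, 0 if no tail remains
};

// Hypothetical helper: split `count` elements into full repeats and a tail.
inline RepeatPlan PlanRepeats(uint32_t count, uint32_t elementsPerRepeat)
{
    assert(elementsPerRepeat > 0);
    return { count / elementsPerRepeat, count % elementsPerRepeat };
}
```

For 512 half elements at 128 elements per repeat, the plan is four full repeats and no tail, which matches the repeatTime = 4, mask = 128 example above. Because repeatTime is a uint8_t in the prototypes, a plan with more than 255 full repeats would have to be issued as several calls.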

More Examples

Complete example 1: srcLocal, scalar, and dstLocal are all of type half.

#include "kernel_operator.h"
class KernelAxpy {
public:
    __aicore__ inline KernelAxpy() {}
    __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half*)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueSrc, 1, 512 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, 512);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
 
        AscendC::Duplicate(dstLocal, (half)0.0, 512);
        AscendC::Axpy(dstLocal, srcLocal, (half)2.0, 512);
 
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
};
extern "C" __global__ __aicore__ void kernel_vec_ternary_scalar_Axpy_half_2_half(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
{
    KernelAxpy op;
    op.Init(srcGm, dstGm);
    op.Process();
}

Result example:

Input (srcGm):
[1. 1. 1. 1. 1. 1. ... 1.]
Output (dstGm):
[2. 2. 2. 2. 2. 2. ... 2.]

Complete example 2: srcLocal and scalar are of type half, while dstLocal is of type float.

#include "kernel_operator.h"
class KernelAxpy {
public:
    __aicore__ inline KernelAxpy() {}
    __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half*)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ float*)dstGm);
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(float));
        pipe.InitBuffer(inQueueSrc, 1, 512 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, 512);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
 
        AscendC::Duplicate(dstLocal, 0.0f, 512);
        AscendC::Axpy(dstLocal, srcLocal, (half)2.0, 64, 8, { 1, 1, 8, 4 });
 
        outQueueDst.EnQue<float>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal;
    AscendC::GlobalTensor<float> dstGlobal;
};
extern "C" __global__ __aicore__ void kernel_vec_ternary_scalar_Axpy_half_2_float(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
{
    KernelAxpy op;
    op.Init(srcGm, dstGm);
    op.Process();
}

Result example:

Input (srcGm):
[1. 1. 1. 1. 1. 1. ... 1.]
Output (dstGm):
[2. 2. 2. 2. 2. 2. ... 2.]
