More Samples
Template Sample
This section provides a template for the two-operand instruction samples so that you can quickly run the reference samples given for specific instructions.
Use the following template as the code framework; to try a specific instruction, copy its sample snippet and replace the content in bold (the code in the Compute function).
#include "kernel_operator.h"

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        // Bind the global memory addresses and allocate one 512-element
        // int16_t buffer for each queue.
        src0Global.SetGlobalBuffer((__gm__ int16_t*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ int16_t*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ int16_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 512 * sizeof(int16_t));
        pipe.InitBuffer(inQueueSrc1, 1, 512 * sizeof(int16_t));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(int16_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        // Copy both source operands from global memory to local memory.
        AscendC::LocalTensor<int16_t> src0Local = inQueueSrc0.AllocTensor<int16_t>();
        AscendC::LocalTensor<int16_t> src1Local = inQueueSrc1.AllocTensor<int16_t>();
        AscendC::DataCopy(src0Local, src0Global, 512);
        AscendC::DataCopy(src1Local, src1Global, 512);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<int16_t> src0Local = inQueueSrc0.DeQue<int16_t>();
        AscendC::LocalTensor<int16_t> src1Local = inQueueSrc1.DeQue<int16_t>();
        AscendC::LocalTensor<int16_t> dstLocal = outQueueDst.AllocTensor<int16_t>();
        // Replace the following call with the sample snippet of the specific instruction.
        AscendC::Add(dstLocal, src0Local, src1Local, 512);
        outQueueDst.EnQue<int16_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        // Copy the result from local memory back to global memory.
        AscendC::LocalTensor<int16_t> dstLocal = outQueueDst.DeQue<int16_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<int16_t> src0Global, src1Global, dstGlobal;
};

extern "C" __global__ __aicore__ void add_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelAdd op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}
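Before substituting instruction-specific snippets, it can help to know what the unmodified template computes. The following plain-Python sketch models only the arithmetic (not the AscendC API; queues and data copies are omitted), using the same input values as the result examples in More Samples:

```python
# Plain-Python model of what the unmodified template computes: an element-wise
# add of 512 int16_t values (queues and data copies omitted).
def add_simple(src0, src1):
    assert len(src0) == len(src1) == 512
    return [a + b for a, b in zip(src0, src1)]

src0 = list(range(1, 513))     # [1, 2, ..., 512]
src1 = list(range(513, 1025))  # [513, 514, ..., 1024]
dst = add_simple(src0, src1)   # [514, 516, 518, ..., 1536]
```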
More Samples
The following examples show how to use the high-dimensional tensor sharding computation APIs of the two-operand instructions to perform more flexible operations and implement more advanced functions.
To run a sample, copy its code snippet and use it to replace the code in bold in the Compute function of the template provided in Template Sample.
- Use the contiguous mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data calculation.
  uint64_t mask = 64; // Only the first 64 elements are calculated in each iteration.
  AscendC::Add(dstLocal, src0Local, src1Local, mask, 4, { 1, 1, 1, 8, 8, 8 });
  Result example:
  Input (src0Local): [1 2 3 ... 512]
  Input (src1Local): [513 514 515 ... 1024]
  Output (dstLocal): [514 516 518 ... 640 undefined ... undefined 770 772 774 ... 896 undefined ... undefined 1026 1028 1030 ... 1152 undefined ... undefined 1282 1284 1286 ... 1408 undefined ... undefined]
- Use the bitwise mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data calculation.
  uint64_t mask[2] = { UINT64_MAX, 0 }; // mask[0] is all ones and mask[1] is 0, so only the first 64 elements are calculated in each iteration.
  AscendC::Add(dstLocal, src0Local, src1Local, mask, 4, { 1, 1, 1, 8, 8, 8 });
  Result example:
  Input (src0Local): [1 2 3 ... 512]
  Input (src1Local): [513 514 515 ... 1024]
  Output (dstLocal): [514 516 518 ... 640 undefined ... undefined 770 772 774 ... 896 undefined ... undefined 1026 1028 1030 ... 1152 undefined ... undefined 1282 1284 1286 ... 1408 undefined ... undefined]
- Set the repeatStride parameter of a high-dimensional tensor sharding computation API to implement discontinuous data calculation.
  uint64_t mask = 128;
  // repeatTimes is set to 2, indicating that two iterations are required.
  // src0BlkStride and src1BlkStride are set to 1, indicating that the source data blocks read in each iteration are contiguous (an interval of one data block).
  // src0RepStride is set to 16, indicating that the src0 start addresses of adjacent iterations are 16 data blocks apart.
  AscendC::Add(dstLocal, src0Local, src1Local, mask, 2, { 1, 1, 1, 8, 16, 8 });
  Result example:
  Input (src0Local): [1 2 3 ... 512]
  Input (src1Local): [513 514 515 ... 1024]
  Output (dstLocal): [514 516 518 ... 768 898 900 902 ... 1150 1152 undefined ... undefined]
- Set the dataBlockStride and repeatStride parameters of a high-dimensional tensor sharding computation API to implement discontinuous data calculation.
  uint64_t mask = 128;
  // repeatTimes is set to 2, indicating that two iterations are required.
  // src0BlkStride is set to 2, indicating that the src0 data blocks read in each iteration are two data blocks apart.
  // src0RepStride is set to 16, indicating that the src0 start addresses of adjacent iterations are 16 data blocks apart.
  AscendC::Add(dstLocal, src0Local, src1Local, mask, 2, { 1, 2, 1, 8, 16, 8 });
  Result example:
  Input (src0Local): [1 2 3 ... 512]
  Input (src1Local): [513 514 515 ... 1024]
  Output (dstLocal): [514 516 518 ... 544 562 564 566 ... 592 610 612 614 ... 640 658 660 662 ... 688 706 708 710 ... 736 754 756 758 ... 784 802 804 806 ... 832 850 852 854 ... 880 898 900 902 ... 928 946 948 950 ... 976 994 996 998 ... 1024 1042 1044 1046 ... 1072 1090 1092 1094 ... 1120 1138 1140 1142 ... 1168 1186 1188 1190 ... 1216 1234 1236 1238 ... 1264 undefined ... undefined]
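The addressing in the samples above can be reproduced with a small plain-Python model. This is a conceptual sketch, not the AscendC API: masked_strided_add and BLOCK_ELEMS are names invented here, and the model assumes int16_t data (one 32-byte data block holds 16 elements) and the counter mask mode. The bitwise sample (mask[0] = UINT64_MAX, mask[1] = 0) selects the same first 64 elements per iteration and therefore yields the same output as the first sample.

```python
# Conceptual model of how mask, dataBlockStride, and repeatStride select
# int16_t elements (a sketch, NOT the AscendC API; masked_strided_add and
# BLOCK_ELEMS are names invented here).
BLOCK_ELEMS = 16  # one 32-byte data block holds 16 int16_t elements

def masked_strided_add(src0, src1, mask, repeat_times,
                       dst_blk, src0_blk, src1_blk,
                       dst_rep, src0_rep, src1_rep):
    """dst[...] = src0[...] + src1[...] under the stride rules; positions
    that are never written stay None ("undefined" in the result examples)."""
    dst = [None] * len(src0)
    for r in range(repeat_times):
        for j in range(mask):  # counter mask: first `mask` elements per repeat
            blk, off = divmod(j, BLOCK_ELEMS)
            di = (r * dst_rep + blk * dst_blk) * BLOCK_ELEMS + off
            s0 = (r * src0_rep + blk * src0_blk) * BLOCK_ELEMS + off
            s1 = (r * src1_rep + blk * src1_blk) * BLOCK_ELEMS + off
            dst[di] = src0[s0] + src1[s1]
    return dst

src0 = list(range(1, 513))     # [1, 2, ..., 512]
src1 = list(range(513, 1025))  # [513, 514, ..., 1024]

# First sample: mask = 64, repeatTimes = 4, strides { 1, 1, 1, 8, 8, 8 }.
d1 = masked_strided_add(src0, src1, 64, 4, 1, 1, 1, 8, 8, 8)
# d1[0] == 514, d1[63] == 640, d1[64] is None, d1[128] == 770

# Fourth sample: mask = 128, repeatTimes = 2, strides { 1, 2, 1, 8, 16, 8 }.
d4 = masked_strided_add(src0, src1, 128, 2, 1, 2, 1, 8, 16, 8)
# d4[15] == 544, then src0 skips a block: d4[16] == 562; d4[255] == 1264
```

Under this model, a repeat covers mask destination elements; the block strides space out the data blocks read within one repeat, and the repeat strides space out the start addresses of adjacent repeats, which is exactly the pattern visible in the "undefined" gaps of the result examples.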
Parent topic: Two-Operand Instructions