Common Parameters

This section describes the high-dimensional tensor sharding computation APIs in the basic vector computation APIs. If you do not need to use such APIs, skip this section.
In this section, repeatTimes, dataBlockStride, repeatStride, and mask are generic names. The actual parameter names in a specific API may differ.
For example, the data block stride (address stride between adjacent data blocks in a single iteration) refers to dstBlkStride and srcBlkStride in one-operand instructions, and dstBlkStride, src0BlkStride, and src1BlkStride in two-operand instructions.

You can find the meaning of each parameter in the parameter description of a specific API.

Using high-dimensional tensor sharding computation APIs can make full use of hardware advantages and enable you to control the iteration execution of instructions and the address stride of operands.

Vector computation is implemented by the Vector Unit. The source and destination operands of vector computation are stored in the Unified Buffer (UB). In each iteration, the Vector Unit fetches eight data blocks (each data block has consecutive internal addresses, with a length of 32 bytes) from the UB for computation, and writes the computation results into the corresponding eight data blocks. The following figure shows the Exp computation of eight data blocks in a single iteration.

Figure 1 Exp computation of eight data blocks in a single iteration

Vector computation APIs allow you to configure the number of iterations by setting repeatTimes to control the execution of multiple iterations. If repeatTimes is set to 2, the Vector Unit performs two iterations and computes the result of 2 x 8 (eight data blocks in each iteration) x 32 bytes (32 bytes in each data block) = 512 bytes. If the data type is half, 256 elements are computed. The following figure shows the Exp computation in two iterations. The value of repeatTimes cannot exceed 255.
Figure 2 Exp computation in two iterations
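
The arithmetic above can be checked with a small host-side sketch. It is plain C++ rather than an Ascend C API: the helper names BytesProcessed and ElementsProcessed are hypothetical, and uint16_t stands in for any 2-byte type such as half.

```cpp
#include <cstddef>
#include <cstdint>

// Constants taken from the text: 8 data blocks per iteration, 32 bytes per data block.
constexpr std::size_t kBlocksPerIteration = 8;
constexpr std::size_t kBytesPerBlock = 32;
// Upper bound on repeatTimes stated in the text.
constexpr std::size_t kMaxRepeatTimes = 255;

// Bytes covered by `repeatTimes` iterations (hypothetical helper, not an Ascend C API).
constexpr std::size_t BytesProcessed(std::size_t repeatTimes) {
    return repeatTimes * kBlocksPerIteration * kBytesPerBlock;
}

// Number of elements of type T covered by `repeatTimes` iterations.
template <typename T>
constexpr std::size_t ElementsProcessed(std::size_t repeatTimes) {
    return BytesProcessed(repeatTimes) / sizeof(T);
}
```

Under these assumptions, BytesProcessed(2) gives 512 and ElementsProcessed<std::uint16_t>(2) gives 256, matching the repeatTimes = 2 case described above.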
Within a single iteration, use the mask parameter to control which data elements are involved in the computation. The following figure shows how to use the mask bitwise mode to control which elements are involved in the Abs computation. 1 indicates that the element is involved in the computation, and 0 indicates that the element is not involved in the computation.
Figure 3 Mask operation using the mask parameter
The Vector Unit also supports vector computation with stride, which is configured by using dataBlockStride (address stride between adjacent data blocks in a single iteration) and repeatStride (address stride of the same data block in adjacent iterations).
- dataBlockStride
  To control the data processing stride in a single iteration, you can set dataBlockStride. The following figure shows the non-contiguous scenario in a single iteration. In the example, dataBlockStride of the source operand is set to 2, indicating that the address stride between adjacent data blocks in a single iteration is two data blocks.
  Figure 4 Non-contiguous scenario in a single iteration
- repeatStride
  When repeatTimes is set to a number greater than 1, multiple iterations are required to complete the vector computation. In this case, you can set repeatStride as required.
  
  The following figure shows the scenario where multiple iterations are non-contiguous. In the example, repeatStride of the source operand and destination operand is set to 9, indicating that the interval between the start addresses of the same data block in adjacent iterations is nine data blocks. "The same data block" means the data block at the same position within each iteration. For example, src1 and src9 in the following figure are in adjacent iterations, and each is the first data block of its iteration. The interval between them is the value of repeatStride.
  
  Figure 5 Non-contiguous scenario between multiple iterations
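
The two strides combine into a single offset rule. The following is a minimal host-side sketch, assuming both strides are expressed in units of 32-byte data blocks as in the examples above. BlockOffset and ByteOffset are hypothetical helper names, not Ascend C APIs, and indices are 0-based.

```cpp
#include <cstddef>

// Size of one data block in bytes, as stated in the text.
constexpr std::size_t kDataBlockBytes = 32;

// Start offset, in data blocks, of data block `blk` (0..7) in iteration `iter` (0-based),
// relative to the start address of the operand.
constexpr std::size_t BlockOffset(std::size_t iter, std::size_t blk,
                                  std::size_t dataBlockStride,
                                  std::size_t repeatStride) {
    return iter * repeatStride + blk * dataBlockStride;
}

// The same offset expressed in bytes.
constexpr std::size_t ByteOffset(std::size_t iter, std::size_t blk,
                                 std::size_t dataBlockStride,
                                 std::size_t repeatStride) {
    return BlockOffset(iter, blk, dataBlockStride, repeatStride) * kDataBlockBytes;
}
```

With dataBlockStride = 2, the second data block of the first iteration starts two data blocks after the first one. With repeatStride = 9, the first data block of the second iteration starts nine data blocks (288 bytes) after the first data block of the first iteration.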

The following describes the configurations of dataBlockStride, repeatStride, and mask and provides examples.

dataBlockStride

dataBlockStride refers to the address stride between adjacent data blocks in the same iteration.

- For contiguous computation, set dataBlockStride to 1. The eight data blocks in the same iteration are then processed contiguously.
- For non-contiguous computation, set dataBlockStride to a value greater than 1. For example, when dataBlockStride is 2, one data block is skipped between two adjacent data blocks read in the same iteration, as shown in the following figure.
Figure 6 Examples of different values of dataBlockStride
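
One consequence of a larger dataBlockStride is that a single iteration spans a wider address range. The sketch below estimates that span on the host. It assumes dataBlockStride is at least 1 and is counted in data blocks. IterationSpanBlocks and SkippedBlocks are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Number of data blocks accessed in one iteration, as stated in the text.
constexpr std::size_t kBlocksInOneIteration = 8;

// Number of data blocks between the start of the first and the end of the last
// data block accessed in one iteration. Assumes dataBlockStride >= 1.
constexpr std::size_t IterationSpanBlocks(std::size_t dataBlockStride) {
    return (kBlocksInOneIteration - 1) * dataBlockStride + 1;
}

// Number of data blocks skipped between two adjacent accesses. Assumes dataBlockStride >= 1.
constexpr std::size_t SkippedBlocks(std::size_t dataBlockStride) {
    return dataBlockStride - 1;
}
```

For dataBlockStride = 1 the span is 8 data blocks with nothing skipped. For dataBlockStride = 2 the span grows to 15 data blocks, with one data block skipped between adjacent accesses.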

repeatStride

repeatStride refers to the address stride of the same data block between adjacent iterations.

- Contiguous computation: Assume that the destination operand and source operand are defined on the same tensor (their addresses overlap) and that repeatStride is 8. In this case, the Vector Unit reads eight consecutive data blocks in the first iteration, reads the next eight consecutive data blocks in the second iteration, and repeats the process until all input data is computed.
- Non-contiguous computation: When repeatStride is greater than 8, the data read by the Vector Unit in adjacent iterations is not contiguous in address. For example, when repeatStride is 10, there is a gap of two data blocks between iterations.
- Repeated computation: When repeatStride is 0, the Vector Unit repeatedly reads and computes the first eight consecutive data blocks.
- Partially repeated computation: When repeatStride is greater than 0 and less than 8, some data is read and computed by the Vector Unit in both of two adjacent iterations. This scenario rarely occurs.
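
The four cases can be summarized as a small classifier. This is a host-side sketch that assumes dataBlockStride is 1, so each iteration covers exactly eight consecutive data blocks. RepeatPattern, ClassifyRepeatStride, and GapBlocks are hypothetical names, not part of any Ascend C API.

```cpp
#include <cstddef>

enum class RepeatPattern { Repeated, PartiallyRepeated, Contiguous, NonContiguous };

// Number of consecutive data blocks covered by one iteration when dataBlockStride == 1.
constexpr std::size_t kBlocksCoveredPerIteration = 8;

// Classifies repeatStride according to the four cases described above.
inline RepeatPattern ClassifyRepeatStride(std::size_t repeatStride) {
    if (repeatStride == 0) {
        return RepeatPattern::Repeated;
    }
    if (repeatStride < kBlocksCoveredPerIteration) {
        return RepeatPattern::PartiallyRepeated;
    }
    if (repeatStride == kBlocksCoveredPerIteration) {
        return RepeatPattern::Contiguous;
    }
    return RepeatPattern::NonContiguous;
}

// Number of untouched data blocks between adjacent iterations (0 when there is no gap).
inline std::size_t GapBlocks(std::size_t repeatStride) {
    return repeatStride > kBlocksCoveredPerIteration
               ? repeatStride - kBlocksCoveredPerIteration
               : 0;
}
```

For repeatStride = 10, the classifier reports NonContiguous and GapBlocks returns 2, which is the two-data-block interval mentioned above.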

mask

mask is used to control the elements that participate in computation in each iteration. You can set this parameter in contiguous mode or bitwise mode.

Contiguous mode: indicates the number of contiguous elements that participate in computation. The data type is uint64_t. The value range depends on the operand data type, because the maximum number of elements that can be processed in a single iteration is 256/sizeof(data type). When the operand data type occupies 16 bits (for example, half/uint16_t), mask ∈ [1, 128]. When the operand data type occupies 32 bits (for example, float/int32_t), mask ∈ [1, 64].

Example:

// For int16_t, a single iteration can process at most 256/sizeof(int16_t) = 128 elements, so mask ∈ [1, 128]. mask = 64 is therefore valid.
// repeatTimes = 1. There are 128 elements in total, which fit in a single iteration. With mask = 64, only the first 64 elements are computed.
// dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is contiguously read and written in a single iteration.
// dstRepStride, src0RepStride, src1RepStride = 8. Data is contiguously read and written between iterations.
uint64_t mask = 64;
AscendC::Add(dstLocal, src0Local, src1Local, mask, 1, { 1, 1, 1, 8, 8, 8 });

Result example:

Input (src0Local): [1 2 3 ... 64 ... 128]
Input (src1Local): [1 2 3 ... 64 ... 128]
Output (dstLocal): [2 4 6 ... 128 undefined ... undefined]

// For int32_t, a single iteration can process at most 256/sizeof(int32_t) = 64 elements, so mask ∈ [1, 64]. mask = 64 is therefore valid.
// repeatTimes = 1. There are 64 elements in total, which fit in a single iteration.
// dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is contiguously read and written in a single iteration.
// dstRepStride, src0RepStride, src1RepStride = 8. Data is contiguously read and written between iterations.
uint64_t mask = 64;
AscendC::Add(dstLocal, src0Local, src1Local, mask, 1, { 1, 1, 1, 8, 8, 8 });

Result example:

Input (src0Local): [1 2 3... 64]
Input (src1Local): [1 2 3... 64]
Output (dstLocal): [2 4 6... 128]
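
The value ranges used in the two examples above follow from the 256/sizeof(data type) rule. The following host-side sketch checks a contiguous-mode mask against that rule. MaxContiguousMask and IsValidContiguousMask are hypothetical helpers written for illustration. They are not Ascend C APIs, and they model only the range rule stated in this section.

```cpp
#include <cstddef>
#include <cstdint>

// Maximum number of elements of type T processed in a single iteration:
// 8 data blocks x 32 bytes = 256 bytes.
template <typename T>
constexpr std::uint64_t MaxContiguousMask() {
    return 256 / sizeof(T);
}

// True if `mask` lies in [1, 256/sizeof(T)], the contiguous-mode range given in the text.
template <typename T>
constexpr bool IsValidContiguousMask(std::uint64_t mask) {
    return mask >= 1 && mask <= MaxContiguousMask<T>();
}
```

For example, mask = 64 is accepted for both int16_t and int32_t, whereas mask = 65 is accepted only for int16_t.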

Bitwise mode: controls the elements that participate in computation by bit. If a bit is set to 1, the corresponding element participates in the computation. If a bit is set to 0, the corresponding element is masked in the computation. The parameter type is a uint64_t array whose length is 2.

The value range depends on the operand data type, because the maximum number of elements that can be processed in each iteration varies with the data type. When the operand data type occupies 16 bits, mask[0] and mask[1] ∈ [0, 2⁶⁴ – 1], and mask[0] and mask[1] cannot both be 0. When the operand data type occupies 32 bits, mask[1] must be 0 and mask[0] ∈ (0, 2⁶⁴ – 1].

Example:

// The data type is int16_t.
uint64_t mask[2] = {6148914691236517205, 6148914691236517205};
// repeatTimes = 1. There are 128 elements in total. A single iteration can process 128 elements. Therefore, repeatTimes = 1.
// dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is contiguously read and written in a single iteration.
// dstRepStride, src0RepStride, src1RepStride = 8. Data is contiguously read and written between iterations.
AscendC::Add(dstLocal, src0Local, src1Local, mask, 1, { 1, 1, 1, 8, 8, 8 });

Result example:

Input (src0Local): [1 2 3... 64... 127 128]
Input (src1Local): [1 2 3... 64... 127 128]
Output (dstLocal): [2 undefined 6 ... undefined ... 254 undefined]

mask process:

mask = {6148914691236517205, 6148914691236517205} (Note: 6148914691236517205 is the 64-bit binary number 0b0101...01, that is, 0x5555555555555555. The mask bits are arranged from the least significant bit to the most significant bit, so the least significant bit of mask[0] controls the first element.)
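
To see which elements such a mask selects, the bits can be decoded on the host. The sketch below assumes the layout implied by the example above: element 0 maps to bit 0 of mask[0], and element 64 maps to bit 0 of mask[1]. IsElementActive and CountActive are hypothetical helper names, not Ascend C APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if element `index` (0-based within one iteration, index < 128) participates.
// Assumed layout: element i is controlled by bit (i % 64) of mask[i / 64].
inline bool IsElementActive(const std::uint64_t mask[2], std::size_t index) {
    assert(index < 128);
    return ((mask[index / 64] >> (index % 64)) & 1ULL) != 0;
}

// Number of participating elements among the first `elementCount` elements (at most 128).
inline std::size_t CountActive(const std::uint64_t mask[2], std::size_t elementCount) {
    std::size_t active = 0;
    for (std::size_t i = 0; i < elementCount; ++i) {
        if (IsElementActive(mask, i)) {
            ++active;
        }
    }
    return active;
}
```

With both words set to 6148914691236517205, elements at even 0-based positions (the 1st, 3rd, ..., 127th elements) are active and the rest are masked, so 64 of the 128 int16_t elements are computed. This matches the output pattern shown above.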

// The data type is int32_t.
uint64_t mask[2] = {6148914691236517205, 0};
// repeatTimes = 1. There are 64 elements in total. A single iteration can process 64 elements. Therefore, repeatTimes = 1.
// dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is contiguously read and written in a single iteration.
// dstRepStride, src0RepStride, src1RepStride = 8. Data is contiguously read and written between iterations.
AscendC::Add(dstLocal, src0Local, src1Local, mask, 1, { 1, 1, 1, 8, 8, 8 });

Result example:

Input (src0Local): [1 2 3... 63 64]
Input (src1Local): [1 2 3... 63 64]
Output (dstLocal): [2 undefined 6 ... 126 undefined]

mask process:

mask = {6148914691236517205, 0} (Note: 6148914691236517205 is the 64-bit binary number 0b0101...01. mask[1] is 0 because a single iteration processes at most 64 int32_t elements.)
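
The bitwise-mode value ranges can also be expressed as a check. This host-side sketch covers only the 16-bit and 32-bit cases described in this section and conservatively rejects any other element width. IsValidBitwiseMask is a hypothetical helper, not an Ascend C API.

```cpp
#include <cstddef>
#include <cstdint>

// Validates a bitwise-mode mask for element type T according to the rules in the text:
//   16-bit operands: mask[0] and mask[1] must not both be 0.
//   32-bit operands: mask[1] must be 0 and mask[0] must be nonzero.
// Other element widths are not covered by this sketch and are reported as invalid.
template <typename T>
constexpr bool IsValidBitwiseMask(std::uint64_t mask0, std::uint64_t mask1) {
    return sizeof(T) == 2 ? (mask0 != 0 || mask1 != 0)
         : sizeof(T) == 4 ? (mask0 != 0 && mask1 == 0)
         : false;
}
```

Both masks used in the examples above pass this check for their respective data types, while {6148914691236517205, 6148914691236517205} would be rejected for int32_t because mask[1] is nonzero.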
