High-dimensional Sharding APIs

This section describes the high-dimensional tensor sharding compute APIs in the basic vector compute APIs. If you do not need to use such APIs, skip this section.
The names repeatTime, dataBlockStride, repeatStride, and mask used below are generic terms; the actual parameter names may differ from one API to another.
For example, the data block stride (address stride between adjacent data blocks in a single iteration) refers to dstBlkStride and srcBlkStride in one-operand APIs, and dstBlkStride, src0BlkStride, and src1BlkStride in two-operand APIs.

You can find the meaning of each parameter in the parameter description of a specific API.

Using high-dimensional tensor sharding compute APIs can make full use of hardware advantages and enable you to flexibly control the iteration execution of instructions and the address stride of operands.

Vector compute is implemented by the Vector Unit. The source and destination operands of vector compute are stored in the UB. In each iteration, the Vector Unit fetches eight data blocks (each data block has consecutive internal addresses, with a length of 32 bytes) from the UB for computation, and writes the computation results into the corresponding eight data blocks. The following figure shows the Exp computation of eight data blocks in a single iteration.

Figure 1 Exp computation of eight data blocks in a single iteration

Vector compute APIs allow you to configure the number of iterations by setting repeatTime to control the execution of multiple iterations. If repeatTime is set to 2, the Vector Unit performs two iterations and computes the result of 2 × 8 (eight data blocks in each iteration) × 32 bytes (32 bytes in each data block) = 512 bytes. If the data type is half, 256 elements are computed. The following figure shows the Exp computation in two iterations. Due to hardware restrictions, the value of repeatTime cannot exceed 255.
Figure 2 Exp computation in two iterations
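
The arithmetic above can be sketched on the host as follows. This is only an illustrative model, not part of the Ascend C API: the helper names BytesProcessed, ElementsProcessed, and RepeatsNeeded are hypothetical, and the constants are the values stated in the text (eight data blocks per iteration, 32 bytes per data block, repeatTime at most 255).

```cpp
#include <cassert>
#include <cstddef>

// Values taken from the text above.
constexpr std::size_t kBlockBytes = 32;       // bytes per data block
constexpr std::size_t kBlocksPerRepeat = 8;   // data blocks per iteration
constexpr std::size_t kMaxRepeatTime = 255;   // stated hardware limit

// Bytes covered by repeatTime iterations (hypothetical helper).
inline std::size_t BytesProcessed(std::size_t repeatTime) {
    assert(repeatTime <= kMaxRepeatTime);
    return repeatTime * kBlocksPerRepeat * kBlockBytes;
}

// Elements covered for a given element size in bytes (2 for half, 4 for float).
inline std::size_t ElementsProcessed(std::size_t repeatTime, std::size_t elemBytes) {
    return BytesProcessed(repeatTime) / elemBytes;
}

// Iterations needed to cover elemCount contiguous elements, rounded up.
inline std::size_t RepeatsNeeded(std::size_t elemCount, std::size_t elemBytes) {
    const std::size_t perRepeat = kBlocksPerRepeat * kBlockBytes / elemBytes;
    return (elemCount + perRepeat - 1) / perRepeat;
}
```

For example, BytesProcessed(2) gives 512 and ElementsProcessed(2, 2) gives 256, matching the half-precision case described above.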

Within a single iteration, the mask parameter controls which elements participate in the computation. The following figure shows how the bitwise mask mode selects the elements involved in the Abs computation. A bit value of 1 indicates that the element participates in the computation, and 0 indicates that it does not.

Figure 3 Mask operation using the mask parameter (float type)

The Vector Unit also supports vector computation with stride, which is configured by using dataBlockStride (address stride between adjacent data blocks in a single iteration) and repeatStride (address stride of the same data block in adjacent iterations).
- dataBlockStride
  To control the data processing stride in a single iteration, you can set dataBlockStride. The following figure shows the non-contiguous scenario in a single iteration. In the example, dataBlockStride of the source operand is set to 2, indicating that the address stride between adjacent data blocks in a single iteration is two data blocks.
  Figure 4 Non-contiguous scenario in a single iteration
- repeatStride
  When repeatTime is set to a number greater than 1, multiple iterations are required to complete vector computation. In this case, you can set repeatStride as required.
  
  The following figure shows the scenario where multiple iterations are non-contiguous. In the example, repeatStride is set to 9 for both the source operand and destination operand, indicating that the interval between the start addresses of the same data block between adjacent iterations is nine data blocks. The same data block indicates that the positions of the data block in iterations are the same. For example, src1 and src9 in the following figure are in the adjacent iterations, and they are both the first data block in their iterations. The interval is the value of repeatStride.
  
  Figure 5 Non-contiguous scenario between multiple iterations
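
The two strides can be combined into one address rule. The sketch below is a host-side model that follows from the definitions above, not an Ascend C API: BlockOffset and BlocksOfIteration are hypothetical helper names, indices are 0-based, and offsets are expressed in data blocks (multiply by 32 to obtain bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlocksPerRepeat = 8;  // data blocks per iteration

// Start offset, in data blocks, of block `blk` (0..7) in iteration `rep` (0-based).
// repeatStride moves between iterations, dataBlockStride moves within one iteration.
inline std::size_t BlockOffset(std::size_t rep, std::size_t blk,
                               std::size_t dataBlockStride, std::size_t repeatStride) {
    assert(blk < kBlocksPerRepeat);
    return rep * repeatStride + blk * dataBlockStride;
}

// Offsets of all eight data blocks touched by iteration `rep`.
inline std::vector<std::size_t> BlocksOfIteration(std::size_t rep,
                                                  std::size_t dataBlockStride,
                                                  std::size_t repeatStride) {
    std::vector<std::size_t> offsets;
    for (std::size_t blk = 0; blk < kBlocksPerRepeat; ++blk) {
        offsets.push_back(BlockOffset(rep, blk, dataBlockStride, repeatStride));
    }
    return offsets;
}
```

With dataBlockStride = 2, the first iteration touches data blocks 0, 2, 4, ..., 14. With repeatStride = 9 and dataBlockStride = 1, the second iteration starts at data block 9, which leaves data block 8 unused between the two iterations.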

The following describes the configurations of dataBlockStride, repeatStride, and mask and provides examples.

dataBlockStride

dataBlockStride refers to the address stride between adjacent data blocks in the same iteration.

For contiguous computation, dataBlockStride is set to 1, and eight data blocks in the same iteration are processed contiguously.
For non-contiguous computation, dataBlockStride is greater than 1. For example, when it is set to 2, a gap of one data block lies between adjacent data blocks read in the same iteration, as shown in the following figure.
Figure 6 Examples of different values of dataBlockStride

repeatStride

repeatStride refers to the address stride of the same data block between adjacent iterations.

Contiguous computation: Assume that a tensor is defined for both the destination operand and source operand (with overlapped address), and repeatStride is 8. In this case, the Vector Unit reads eight consecutive data blocks in the first iteration, reads the next eight consecutive data blocks in the second iteration, and repeats the process until all input data is computed.
Discontiguous computation: When repeatStride is greater than 8 (for example, 10), the data read by the Vector Unit in adjacent iterations is discontiguous in terms of address, and there is an interval of two data blocks.
Repeated computation: When repeatStride is 0, the Vector Unit repeatedly reads and computes the first eight consecutive data blocks.
Partially repeated computation: When repeatStride is greater than 0 and less than 8, some data of adjacent iterations is repeatedly read and computed by the Vector Unit. However, this scenario rarely occurs.
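
The four cases above can be summarized in a small host-side classifier. This is a sketch under the assumption that dataBlockStride is 1, so that one iteration spans exactly eight data blocks. RepeatPattern, ClassifyRepeatStride, and GapBlocks are hypothetical names introduced for illustration and are not part of the Ascend C API.

```cpp
#include <cassert>
#include <cstddef>

// Access pattern between adjacent iterations, assuming dataBlockStride == 1.
enum class RepeatPattern { Repeated, PartiallyRepeated, Contiguous, Discontiguous };

constexpr std::size_t kBlocksPerRepeat = 8;  // data blocks per iteration

inline RepeatPattern ClassifyRepeatStride(std::size_t repeatStride) {
    if (repeatStride == 0) return RepeatPattern::Repeated;                  // same 8 blocks every time
    if (repeatStride < kBlocksPerRepeat) return RepeatPattern::PartiallyRepeated;  // iterations overlap
    if (repeatStride == kBlocksPerRepeat) return RepeatPattern::Contiguous; // back-to-back blocks
    return RepeatPattern::Discontiguous;                                    // gap between iterations
}

// Number of data blocks skipped between adjacent iterations (0 if none).
inline std::size_t GapBlocks(std::size_t repeatStride) {
    return repeatStride > kBlocksPerRepeat ? repeatStride - kBlocksPerRepeat : 0;
}
```

For repeatStride = 10, the classifier reports a discontiguous pattern and GapBlocks returns 2, which corresponds to the two-data-block interval mentioned above.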

mask

The mask parameter controls the elements that participate in computation in each iteration. You can set this parameter in contiguous mode or bitwise mode.

Contiguous mode: indicates the number of contiguous elements that participate in computation. The data type is uint64_t. The value range is related to the data type of the source operand. The maximum number of elements that can be processed in each iteration varies with the data type. For the current data type, the maximum number of elements that can be processed in a single iteration is 256/sizeof(data type). When the operand data type occupies 16 bits (for example, half/uint16_t), mask ∈ [1, 128]. When the operand data type occupies 32 bits (for example, float/int32_t), mask ∈ [1, 64].

Example:

// The maximum number of elements that can be processed in a single iteration of the int16_t type is 256/sizeof(int16_t) = 128, mask = 64, and mask ∈ [1, 128]. Therefore, the input is valid.
// repeatTime = 1. There are 128 elements in total. A single iteration can process 128 elements. Therefore, repeatTime = 1.
// dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is contiguously read and written in a single iteration.
// dstRepStride, src0RepStride, src1RepStride = 8. Data is contiguously read and written between iterations.
uint64_t mask = 64;
AscendC::Add(dstLocal, src0Local, src1Local, mask, 1, { 1, 1, 1, 8, 8, 8 });

Result example:

Input (src0Local): [1 2 3... 64... 128]
Input (src1Local): [1 2 3... 64... 128]
Output (dstLocal): [2 4 6... 128 undefined... undefined]

// The maximum number of elements that can be processed in a single iteration of the int32_t type is 256/sizeof(int32_t) = 64, mask = 64, and mask ∈ [1, 64]. Therefore, the input is valid.
// repeatTime = 1. There are 64 elements in total. A single iteration can process 64 elements. Therefore, repeatTime = 1.
// dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is contiguously read and written in a single iteration.
// dstRepStride, src0RepStride, src1RepStride = 8. Data is contiguously read and written between iterations.
uint64_t mask = 64;
AscendC::Add(dstLocal, src0Local, src1Local, mask, 1, { 1, 1, 1, 8, 8, 8 });

Result example:

Input (src0Local): [1 2 3... 64]
Input (src1Local): [1 2 3... 64]
Output (dstLocal): [2 4 6... 128]
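
To make the contiguous-mode results easier to check, the following host-side model reproduces them in plain C++. It is a sketch only and does not call the Ascend C API: MaxContiguousMask, Ramp, and AddContiguousModel are hypothetical helpers, and the `untouched` marker merely stands in for the elements shown as undefined in the result examples.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Largest valid contiguous-mode mask: 256 bytes per iteration / element size.
template <typename T>
constexpr std::uint64_t MaxContiguousMask() {
    return 256 / sizeof(T);
}

// Returns {1, 2, ..., n}, matching the inputs in the result examples.
template <typename T>
std::vector<T> Ramp(std::size_t n) {
    std::vector<T> v(n);
    std::iota(v.begin(), v.end(), static_cast<T>(1));
    return v;
}

// Host-side model of a single Add iteration in contiguous mode.
// Only the first `mask` elements are computed; the rest are set to `untouched`
// to mark positions that the result examples list as undefined.
template <typename T>
std::vector<T> AddContiguousModel(const std::vector<T>& src0, const std::vector<T>& src1,
                                  std::uint64_t mask, T untouched) {
    assert(mask >= 1 && mask <= MaxContiguousMask<T>());
    assert(src0.size() == src1.size() && mask <= src0.size());
    std::vector<T> dst(src0.size(), untouched);
    for (std::uint64_t i = 0; i < mask; ++i) {
        dst[i] = static_cast<T>(src0[i] + src1[i]);
    }
    return dst;
}
```

Calling AddContiguousModel on two int16_t ramps of 128 elements with mask = 64 yields 2, 4, ..., 128 in the first 64 positions and the marker value in the remaining 64, in line with the first result example.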

Bitwise mode: controls the elements that participate in computation by bit. If a bit is set to 1, the corresponding element participates in the computation. If a bit is set to 0, the corresponding element is masked in the computation.

The mask value is an array. The array length and the value range of the array elements depend on the operand data type. When the operand is 16-bit, the array length is 2; mask[0] and mask[1] ∈ [0, 2⁶⁴ - 1] and cannot both be 0. When the operand is 32-bit, the array length is 1 and mask[0] ∈ (0, 2⁶⁴ - 1]. When the operand is 64-bit, the array length is 1 and mask[0] ∈ (0, 2³² - 1].

Example:

// The data type is int16_t.
uint64_t mask[2] = {6148914691236517205, 6148914691236517205};
// repeatTime = 1. There are 128 elements in total. A single iteration can process 128 elements. Therefore, repeatTime = 1.
// dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is contiguously read and written in a single iteration.
// dstRepStride, src0RepStride, src1RepStride = 8. Data is contiguously read and written between iterations.
AscendC::Add(dstLocal, src0Local, src1Local, mask, 1, { 1, 1, 1, 8, 8, 8 });

Result example:

Input (src0Local): [1 2 3 ... 64 ... 127 128]
Input (src1Local): [1 2 3 ... 64 ... 127 128]
Output (dstLocal): [2 undefined 6 ... undefined ... 254 undefined]

mask process:

mask = {6148914691236517205, 6148914691236517205} (Note: 6148914691236517205 indicates the 64-bit binary number 0b010101....01. mask is arranged from the least significant bit to the most significant bit.)
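
The bit pattern of this constant can be checked with a short host-side sketch. It is not part of the Ascend C API: BitEnabled, ElementEnabled, and EnabledCount are hypothetical helpers, and the sketch assumes that mask[0] covers elements 0 to 63 and mask[1] covers elements 64 to 127, with bit 0 of each word mapped to the lowest-numbered element.

```cpp
#include <cassert>
#include <cstdint>

// 6148914691236517205 is 0x5555555555555555: every even-numbered bit is set.
constexpr std::uint64_t kEvenBits = 0x5555555555555555ULL;
static_assert(kEvenBits == 6148914691236517205ULL, "same constant as in the example");

// True if bit `bit` (0 = least significant) of `word` is set.
constexpr bool BitEnabled(std::uint64_t word, unsigned bit) {
    return ((word >> bit) & 1ULL) != 0;
}

// True if element `idx` (0..127) participates, for a 16-bit operand.
// Assumption: mask0 covers elements 0..63 and mask1 covers elements 64..127.
inline bool ElementEnabled(std::uint64_t mask0, std::uint64_t mask1, unsigned idx) {
    assert(idx < 128);
    return idx < 64 ? BitEnabled(mask0, idx) : BitEnabled(mask1, idx - 64);
}

// Number of elements enabled by one mask word.
inline unsigned EnabledCount(std::uint64_t word) {
    unsigned count = 0;
    for (unsigned bit = 0; bit < 64; ++bit) {
        if (BitEnabled(word, bit)) ++count;
    }
    return count;
}
```

Under these assumptions, elements 0, 2, ..., 126 are enabled and elements 1, 3, ..., 127 are masked, which agrees with the output pattern shown in the result example.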

// The data type is int32_t.
uint64_t mask[1] = {6148914691236517205};
// repeatTime = 1. There are 64 elements in total. A single iteration can process 64 elements. Therefore, repeatTime = 1.
// dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is contiguously read and written in a single iteration.
// dstRepStride, src0RepStride, src1RepStride = 8. Data is contiguously read and written between iterations.
AscendC::Add(dstLocal, src0Local, src1Local, mask, 1, { 1, 1, 1, 8, 8, 8 });

Result example:

Input (src0Local): [1 2 3 ... 63 64]
Input (src1Local): [1 2 3 ... 63 64]
Output (dstLocal): [2 undefined 6 ... 126 undefined]

mask process:

mask = {6148914691236517205} (Note: 6148914691236517205 indicates the 64-bit binary number 0b010101....01. mask is arranged from the least significant bit to the most significant bit.)
