Non-Alignment Scenario
Alignment Requirements for Data Movement and Vector Computation
During data movement and Vector computation, the length of the moved data and the start address of the operand must meet the following alignment requirements:
- When the DataCopy API is used to move data, the length of the moved data and the start address of the operand (on the UB) must be 32-byte aligned.
- During Vector computation, the start address of the operand must be 32-byte aligned.
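The required lengths can be computed with a simple round-up. The helper below is a host-side C++ sketch for illustration (`AlignUp32` and `AlignedElemCount` are illustrative names, not Ascend C APIs): for example, 11 half values occupy 22 bytes, which rounds up to 32 bytes, that is, 16 half values.

```cpp
#include <cstddef>

// Round a byte length up to the next multiple of 32 (the alignment unit).
constexpr std::size_t AlignUp32(std::size_t bytes) {
    return (bytes + 31) / 32 * 32;
}

// Number of elements of a given byte size that must be copied so that the
// total length is 32-byte aligned, e.g. 11 half (2-byte) values -> 16.
constexpr std::size_t AlignedElemCount(std::size_t elemCount, std::size_t elemBytes) {
    return AlignUp32(elemCount * elemBytes) / elemBytes;
}
```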
In the following description, Global refers to the tensor on the global memory (GlobalTensor), and Local refers to the tensor on the local memory (LocalTensor).
The following are some examples of non-aligned movement and computation.
- Non-aligned move-in
When 11 half values need to be copied from Global to Local, to ensure that the length of the moved data is 32-byte aligned, the DataCopy API is used to copy 16 half values (32 bytes) to Local. In this case, Local[11] to Local[15] are written with invalid data.
Figure 1 Non-aligned move-in
- Non-aligned move-out
When 11 half values need to be copied from Local to Global, to ensure that the length of the moved data is 32-byte aligned, the DataCopy API is used to copy 16 half values (32 bytes) to Global. In this case, Global[11] to Global[15] are written with invalid data.
Figure 2 Non-aligned move-out
- Incorrect example of the start address for Vector computation that is not 32-byte aligned
During Vector computation, ensure that the start address is 32-byte aligned. In the following incorrect example, the computation starts from Local1[7], the eighth value of the LocalTensor, whose address is not 32-byte aligned.
Figure 3 Incorrect example of the start address for Vector computation that is not 32-byte aligned
Non-Alignment Scheme
The DataCopyPad API provides non-aligned movement. If the operator is developed for a product that supports this API (see Availability), DataCopyPad can be used directly to solve the movement problem in the non-alignment scenario. For details about how to use DataCopyPad, see the DataCopyPad sample (project-based operator development) and the DataCopyPad sample (kernel launch).
Some products do not support the DataCopyPad API. In this case, refer to the following schemes.
Because the length of the moved data must be 32-byte aligned, when tensor data is moved from Global to Local row by row in the non-alignment case, each row in Local carries redundant data. The following schemes handle this redundant data:
- Involve redundant data in compute. This is generally used in elewise computing scenarios.
- Mask redundant data. This is generally used in scenarios such as axis reduction.
- Clear data row by row through Duplicate. Before computing, the basic API Duplicate is called for each row to pad 0s in the positions with redundant data.
- Clear data at once through Pad. Before computing, the high-level API Pad is called for multiple rows of data to clear redundant data at once.
- Remove redundant data through UnPad before moving out data. If the total length of the valid data to be moved is 32-byte aligned, the high-level API UnPad can be used to remove redundant data and move out the data completely.
- Gather valid data by GatherMask before moving out data. If the total length of the valid data to be moved is greater than or equal to 32 bytes, GatherMask can be used to gather the valid data again to ensure that the start address and length of the valid data to be moved are 32-byte aligned.
- Move out data with redundant data. Enable atomic addition during multi-core processing to prevent data corruption.
- Involve redundant data in compute.
As shown in the following figure, only the first 11 half values are needed for the Abs computation. The redundant data can take part in the computation without affecting the final result. The procedure is as follows:
- Use DataCopy to move 16 half values from Global to Local1, including the redundant data -11 to -15.
- Use Abs to compute the entire block, excluding the tail block but including redundant data.
Figure 5 Involving redundant data in compute
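The behavior in Figure 5 can be modeled on the host: the elementwise operation runs over the whole aligned block, redundant tail included, and only the valid leading results are consumed afterwards. A plain C++ sketch (`AbsWholeBlock` is an illustrative name, not an Ascend C API):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Apply Abs to the entire 32-byte-aligned block (e.g. 16 half values),
// including the redundant tail; only the first 11 results are used later.
std::vector<float> AbsWholeBlock(const std::vector<float>& local) {
    std::vector<float> out(local.size());
    for (std::size_t i = 0; i < local.size(); ++i) {
        out[i] = std::fabs(local[i]); // redundant positions are computed too
    }
    return out;
}
```

Because Abs is elementwise, computing the redundant positions wastes a few cycles but never pollutes the valid results.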
- Mask redundant data.
As shown in the following figure, assume that the shape of the input data is 16 x 4. After the input data is moved to the UB, the first four half values of each row are valid, and the rest are redundant data. To perform ReduceMin calculation on only the first four half values, you can set the mask parameter to mask the redundant data. The procedure for processing each row of data is as follows:
- Use DataCopy to move 16 half values from Global to Local1.
- Clear the destination operand Local2 (for example, by using Duplicate) in preparation for the reduction operation.
- Perform the reduction operation and set the mask mode of ReduceMin, so that the first four values are valid and redundant data is masked.
Figure 6 Masking redundant data
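A host-side model of the mask-mode reduction: only elements whose mask bit is set contribute to the result, so the redundant tail cannot affect it. A plain C++ sketch (`MaskedMin` is an illustrative helper, not the Ascend C ReduceMin API):

```cpp
#include <cstdint>
#include <cstddef>
#include <limits>
#include <vector>

// Minimum over the elements of one row whose mask bit is 1;
// masked-out (redundant) positions are skipped entirely.
float MaskedMin(const std::vector<float>& row, std::uint64_t mask) {
    float result = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < row.size() && i < 64; ++i) {
        if (mask & (std::uint64_t{1} << i)) {
            result = (row[i] < result) ? row[i] : result;
        }
    }
    return result;
}
```

With mask = 0xF, only the first four values of a 16-element row take part, matching the ReduceMin setup described above.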
- Clear data row by row through Duplicate.
As shown in the following figure, call Duplicate to clear non-aligned data row by row after move-in. The following lists the procedure:
- Use DataCopy to move 16 half values from Global to Local.
- Use the basic API Duplicate with the following mask so that only the last five element positions (the redundant data) are selected and padded with 0s.
uint64_t mask0 = ((uint64_t)1 << 16) - ((uint64_t)1 << 11);
uint64_t mask[2] = {mask0, 0};
Figure 7 Clearing data row by row through Duplicate
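The mask arithmetic can be verified on the host: (1 << 16) - (1 << 11) equals 0xF800, which sets exactly bits 11 through 15, the five redundant positions. A small C++ sketch (`TailMask` is an illustrative helper, not an Ascend C API):

```cpp
#include <cstdint>

// Build a mask that selects positions [validLen, blockLen) of one block:
// subtracting 2^validLen from 2^blockLen leaves exactly those bits set.
constexpr std::uint64_t TailMask(unsigned blockLen, unsigned validLen) {
    return (std::uint64_t{1} << blockLen) - (std::uint64_t{1} << validLen);
}
```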
- Clear data at once through Pad.
As shown in the following figure, assume that the shape of the input data is 16 x 6. The size of the data after being moved to Local is 16 x 16, and each row contains redundant data. Clearing data row by row may perform poorly in this case; instead, you can use Pad to clear the data at once. The procedure is as follows:
- After the 16 x 6 data is moved from global memory to UB row by row, each row contains six valid values.
- Use the Pad API to pad the redundant data positions with 0s. (The Pad API is used when the tensor width is 32-byte aligned but there is some redundant data.)
Figure 8 Clearing data at once through Pad
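What Pad achieves can be modeled on the host as zeroing the redundant tail columns of every row in one pass. A plain C++ sketch (`ZeroRedundantColumns` is an illustrative name, not the Ascend C Pad API):

```cpp
#include <cstddef>
#include <vector>

// Zero the redundant tail columns of every row in a [rows x rowCeil] buffer
// where only the first rowValid columns of each row hold real data.
void ZeroRedundantColumns(std::vector<float>& buf, std::size_t rows,
                          std::size_t rowCeil, std::size_t rowValid) {
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = rowValid; c < rowCeil; ++c) {
            buf[r * rowCeil + c] = 0.0f; // pad position
        }
    }
}
```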
- Remove redundant data through UnPad before moving out data.
As shown in the following figure, the local memory size is 16 x 16. Only the first six values in each row are valid, and the valid data to be moved out (16 x 6) is 32-byte aligned. In this case, you can use the UnPad API to remove redundant data and move out data completely. The procedure is as follows:
- Use the high-level API of UnPad to remove redundant values.
- Use DataCopy to move consecutive 16 x 6 half values to Global.
Figure 9 Removing redundant data through UnPad before moving out data
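A host-side model of the UnPad step: dropping the redundant tail columns makes the valid data contiguous, so it can then be moved out in a single aligned copy. A plain C++ sketch (`RemoveRedundantColumns` is an illustrative name, not the Ascend C UnPad API):

```cpp
#include <cstddef>
#include <vector>

// Compact a [rows x rowCeil] buffer into contiguous [rows x rowValid] data
// by dropping the redundant tail columns of each row.
std::vector<float> RemoveRedundantColumns(const std::vector<float>& buf,
                                          std::size_t rows, std::size_t rowCeil,
                                          std::size_t rowValid) {
    std::vector<float> out;
    out.reserve(rows * rowValid);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < rowValid; ++c) {
            out.push_back(buf[r * rowCeil + c]);
        }
    }
    return out;
}
```

For the 16 x 16 buffer above with six valid columns, the result is a contiguous 16 x 6 block whose total length (96 half values, 192 bytes) is 32-byte aligned.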
- Gather valid data by GatherMask before moving out data.
As shown in the following figure, 19 half values need to be moved to Global, and values 16 to 18 cannot be moved in an aligned manner. In this case, use GatherMask to gather the valid data again and move out values 3 to 18. The procedure is as follows:
- Copy the first 16 half values (32 bytes) to Global.
- Use the GatherMask API to gather the data from Local1[3] to Local1[18] into Local2, which starts from an aligned address.
- Move the gathered data (integer multiple of 32 bytes) from Local2 to Global.
Figure 10 Gathering valid data by GatherMask and moving out data
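The gather step can be modeled on the host with a bit pattern: every element whose pattern bit is 1 is packed densely into the destination, which starts at an aligned address. A plain C++ sketch (`GatherByPattern` is an illustrative name, not the Ascend C GatherMask API):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Pack the elements of src whose pattern bit is 1 into a dense destination.
// pattern is a sequence of 16-bit words; bit (i % 16) of word (i / 16)
// controls element i, mirroring a 16-bit gather pattern.
std::vector<float> GatherByPattern(const std::vector<float>& src,
                                   const std::vector<std::uint16_t>& pattern) {
    std::vector<float> dst;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (pattern[i / 16] & (std::uint16_t{1} << (i % 16))) {
            dst.push_back(src[i]);
        }
    }
    return dst;
}
```

For the 19-value example, a pattern of {0xFFF8, 0x0007} selects values 3 to 18: 13 elements from the first 16 positions and 3 from the next 16, 16 elements (32 bytes) in total.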
- Move out data with redundant data.
As shown in the following figure, four cores are involved in the computation, and four values are copied on each core. The length of the data copied on each core is not 32-byte aligned, so the redundant data is moved out together. The procedure is as follows:
- Clear the destination Global completely, either on the host or in the kernel (for example, by overwriting it from the UB).
- Clear the redundant part of the Local data on the current core (using Duplicate), keeping only the four valid values to be moved out.
- Copy the data to Global in atomic addition mode. Together with the cleared redundant data, this ensures that no data corruption occurs.
Figure 11 Moving out data with redundant data
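The move-out with redundant data can be modeled on the host: each core adds its aligned block into a zero-initialized Global, so the overlapping tail positions (which hold 0 after Duplicate) leave neighboring cores' data intact. A plain C++ sketch (`AtomicAddCopyOut` is an illustrative model; on the device this corresponds to DataCopy with atomic add enabled):

```cpp
#include <cstddef>
#include <vector>

// One core's aligned move-out in atomic-add mode: the block (valid values
// followed by zeroed redundant positions) is accumulated into Global.
void AtomicAddCopyOut(std::vector<float>& global, std::size_t dstOffset,
                      const std::vector<float>& local) {
    for (std::size_t i = 0; i < local.size(); ++i) {
        global[dstOffset + i] += local[i]; // atomic add on device; plain add here
    }
}
```

Because every core's redundant tail is 0 and the destination starts cleared, overlapping writes add 0 and cause no corruption.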
Examples
- Example 1: involving redundant data in compute and gathering valid data by GatherMask before moving out data
This example shows how to implement the Abs operator on a tensor with the shape of 128 x 18. The solution for processing each row of data is as follows:
After the data is moved in, the last 14 values of each row are redundant data. The input parameter BLOCKLEN_CEIL of the Abs API is set to 32, the 32-byte-aligned length of 18 values, so the 14 redundant values are involved in the computation.
AscendC::Abs(outputLocal, inputLocal, BLOCKLEN_CEIL); // main calculation
After the computation is complete, the bufPattern parameter of GatherMask is used to gather the last 16 of the 18 values in each row.
uint16_t tmpValue = 0;
AscendC::Duplicate<uint16_t>(bufPattern, tmpValue, 16);
bufPattern.SetValue(0, 0b1111111111111100); // select the last 14 elements of the first 16 positions
bufPattern.SetValue(1, 0b0000000000000011); // select the first 2 elements of the next 16 positions
uint32_t mask = 32;
uint64_t rsvdCnt = 0;
AscendC::LocalTensor<half> tailLocal = outQueueTail.AllocTensor<half>();
AscendC::GatherMask(tailLocal, outputLocal, bufPattern, true, mask, {1, 1, 8, 8}, rsvdCnt);
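The two pattern words can be sanity-checked on the host: they select 14 + 2 = 16 elements per 32-element group, namely values 2 to 17 of each row. A small C++ sketch (`PopCount16` is an illustrative helper):

```cpp
#include <cstdint>

// Count the set bits of one 16-bit pattern word, i.e. how many elements
// of its 16-element group the word selects.
unsigned PopCount16(std::uint16_t v) {
    unsigned n = 0;
    for (unsigned i = 0; i < 16; ++i) {
        n += (v >> i) & 1u;
    }
    return n;
}
```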
DataCopy is used to move the first 16 values and then the last 16 values, so the middle 14 values are written twice. Note: Because the destination addresses of the two DataCopy calls overlap, you need to use PipeBarrier to add pipeline synchronization.
uint32_t copyLenMain = TILE_LENGTH * sizeof(half) / 32 * 32 / sizeof(half);
uint32_t offsetMain = progress * TILE_LENGTH;
AscendC::DataCopy(dstGlobal[offsetMain], outputLocal, copyLenMain);
AscendC::PipeBarrier<PIPE_MTE3>();
uint32_t tailLen = 32 / sizeof(half);
uint32_t offsetTail = offsetMain + (TILE_LENGTH - tailLen);
AscendC::DataCopy(dstGlobal[offsetTail], tailLocal, tailLen);
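The copy lengths and offsets can be checked on the host. Assuming TILE_LENGTH = 18 (matching the 128 x 18 shape, an assumption for this sketch), the main copy writes elements [0, 16) of each row and the tail copy writes [2, 18), so elements [2, 16) are written twice:

```cpp
#include <cstdint>

constexpr std::uint32_t kHalfBytes = 2;   // sizeof(half)
constexpr std::uint32_t kTileLength = 18; // assumed row length in half values
// Main copy length: the row length rounded DOWN to a 32-byte multiple, in elements.
constexpr std::uint32_t kCopyLenMain =
    kTileLength * kHalfBytes / 32 * 32 / kHalfBytes;
// Tail copy: the last 32 bytes of the row.
constexpr std::uint32_t kTailLen = 32 / kHalfBytes;
constexpr std::uint32_t kOffsetTail = kTileLength - kTailLen;
```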
32-byte alignment must be ensured during the data movement. Therefore, the last row of the input must be padded to 32-byte alignment to prevent access to invalid data. In main.cpp, the input length on the global memory is defined as follows:
size_t inputByteSize = 2318 * sizeof(int16_t); // 2318 = 2304 + 32 - 18
size_t outputByteSize = 2304 * sizeof(int16_t);
- Example 2: clearing data row by row through Duplicate and moving out data with redundant data
This example shows how to implement the Abs operator on a tensor with the shape of 64 x 11. A total of four cores are used, and each core processes 16 x 11 values.
After the data is moved in, the last 5 values in each row are redundant data. The Duplicate API is used to clear the last 5 values in each row.
// mask mode controls only the last 5 elements doing Duplicate
uint64_t mask0 = (1ul << 16) - (1ul << BLOCK_ELEMENT_NUM);
uint64_t mask[2] = {mask0, 0};
for (int32_t i = 0; i < BLOCK_GROUP_NUM; i++) {
    AscendC::Duplicate<half>(inputLocal[i * BLOCKLEN_CEIL], 0, mask, 1, 1, 1); // clear dummy data on inputLocal
}
AscendC::Abs(outputLocal, inputLocal, BLOCKLEN_CEIL * BLOCK_GROUP_NUM);
The data is moved out together with the redundant data in atomic addition mode; the copy length BLOCKLEN_CEIL includes the redundant values.
AscendC::SetAtomicAdd<half>();
for (int32_t i = 0; i < BLOCK_GROUP_NUM; i++) {
    AscendC::DataCopy<half>(dstGlobal[i * BLOCK_ELEMENT_NUM], outputLocal[i * BLOCKLEN_CEIL], BLOCKLEN_CEIL);
}
AscendC::SetAtomicNone();
Therefore, the global memory data needs to be cleared during initialization. In the following clearing code, multiple cores call the InitGlobalMemory API to clear the global memory data, and SyncAll is called for inter-core synchronization.
AscendC::InitGlobalMemory<half>(dstGlobal, blockLength, 0);
pipe.InitBuffer(inQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
pipe.InitBuffer(outQueue, BUFFER_NUM, BLOCK_GROUP_NUM * BLOCKLEN_CEIL * sizeof(half));
pipe.InitBuffer(syncLocalTbuf, USE_CORE_NUM * DEFAULT_SYNCALL_NEED_SIZE * sizeof(int32_t));
AscendC::LocalTensor<int32_t> SyncLocal = syncLocalTbuf.Get<int32_t>();
AscendC::SyncAll(syncGlobal, SyncLocal, USE_CORE_NUM);
When the data is moved in, ensure that the last row of the input data is 32-byte aligned to prevent access to invalid data. When the data is moved out with redundant data, the last row of the output data must also be 32-byte aligned. In main.cpp, the input and output lengths on the global memory are defined as follows:
// copy in borrow the next (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) elements of srcGM
size_t inputByteSize = 709 * sizeof(int16_t);
// copy out atomic add extra (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) zeros to dstGM
size_t outputByteSize = 709 * sizeof(int16_t);
- Example 3: involving redundant data in compute and removing redundant data through UnPad before moving out data
This example shows how to implement the Abs operator on a tensor with the shape of 2048 x 14. A total of eight cores are used, and each core processes 256 x 14 values.
After the data is moved in, the last 2 values in each row are redundant data. The input parameter BLOCK_GROUP_NUM * BLOCKLEN_CEIL of the Abs API contains 16 consecutive rows of data. Each row contains 16 values. The redundant data in each row is involved in the computation.
AscendC::Abs(inputLocal, inputLocal, BLOCK_GROUP_NUM * BLOCKLEN_CEIL); // main calculation
After the computation, the UnPad API is used to remove redundant data and move out data. The unPadParams.rightPad parameter is used to remove the last 2 redundant values in each row.
unPadParams.rightPad = BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM; // delete 2 dummy half each row
AscendC::UnPad<half>(outputLocal, inputLocal, unPadParams, this->tiling);
Note: The tiling parameters need to be passed to the UnPad API. The key computation process in abs_unpad_tiling.cpp is as follows:
AscendC::GetUnPadMaxMinTmpSize(*ascendcPlatform, srcShape, sizeof(int16_t), tmpMaxSize, tmpMinSize);
optiling::UnPadTiling tilingData;
AscendC::UnPadTilingFunc(srcShape, tmpMaxSize, sizeof(int16_t), tilingData);
The tiling parameters in main.cpp need to be passed to the kernel through the input parameters of the kernel function so that the tiling parameters can be used by the UnPad high-level API.
ACLRT_LAUNCH_KERNEL(abs_unpad_custom)(blockDim, stream, xDevice, zDevice, workspaceDevice, tilingDevice);
32-byte alignment must be ensured during the data movement. Therefore, the last row of the input must be padded to 32-byte alignment to prevent access to invalid data. In main.cpp, the input length on the global memory is defined as follows:
// 28674 is TOTAL_LENGTH + (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM)
// 28672 is TOTAL_LENGTH
// copy in borrow the next (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) elements of srcGM
uint32_t oriLength = 28672;
uint32_t colNum = 14;
uint32_t maxColNum = 32 / sizeof(uint16_t);
uint32_t padLength = oriLength + maxColNum - colNum;
size_t inputByteSize = padLength * sizeof(int16_t);
size_t outputByteSize = oriLength * sizeof(int16_t);
- Example 4: clearing data at once through Pad and moving out data with redundant data
This example shows how to implement the Abs operator on a tensor with the shape of 2048 x 7. A total of eight cores are used, and each core processes 256 x 7 values.
After the data is moved in, the last 9 values in each row are redundant data. The Pad API is used to clear the 256 x 9 redundant values on each core and Abs calculation is performed.
AscendC::PadParams padParams = {0, BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM, 0};
AscendC::Pad(outputLocal, inputLocal, padParams, this->tiling);
AscendC::Abs(outputLocal, outputLocal, BLOCK_GROUP_NUM * BLOCKLEN_CEIL); // main calculation
The code of moving out data with redundant data after computation is the same as that in Example 2.
Note: The tiling parameters need to be passed to the Pad API. The key computation process in abs_pad_tiling.cpp is as follows:
AscendC::GetPadMaxMinTmpSize(srcShape, sizeof(int16_t), tmpMaxSize, tmpMinSize);
optiling::PadTiling tilingData;
AscendC::PadTilingFunc(srcShape, oriSrcShape, tmpMaxSize, sizeof(int16_t), tilingData);
The tiling parameters in main.cpp need to be passed to the kernel through the input parameters of the kernel function so that the tiling parameters can be used by the Pad high-level API.
ACLRT_LAUNCH_KERNEL(abs_pad_custom)(blockDim, stream, xDevice, zDevice, workspaceDevice, tilingDevice);
When the data is moved in, ensure that the last row of the input data is 32-byte aligned to prevent access to invalid data. When the data is moved out with redundant data, the last row of the output data must also be 32-byte aligned. In main.cpp, the input and output lengths on the global memory are defined as follows:
// 14336 is the length of input data
uint32_t oriLength = 14336;
// we must allocate more space to prevent invalid address access
uint32_t padLength = oriLength + shapePad[1] - shapeUsed[1];
size_t inputByteSize = padLength * sizeof(int16_t);
size_t outputByteSize = padLength * sizeof(int16_t);
// however, original length must be used when output to file
size_t outputFileSize = oriLength * sizeof(int16_t);
- Example 5: masking redundant data and moving out data with redundant data
This example shows how to implement the ReduceMin operator for each row of a 16 x 4 tensor. A total of four cores are used, and each core processes 4 x 4 values.
After the data is moved in, the last 12 values in each row are redundant data. The input parameter Mask of ReduceMin is used to control that only the first four values are involved in the computation.
uint64_t Mask0 = ((uint64_t)1 << BLOCK_ELEMENT_NUM) - 1; // mask mode controls only the first 4 elements do ReduceMin calculation
uint64_t Mask[2] = {Mask0, 0};
// main calculation
for (int i = 0; i < BLOCK_GROUP_NUM; i++) {
    AscendC::ReduceMin<half>(outputLocal[i * BLOCKLEN_CEIL], inputLocal[i * BLOCKLEN_CEIL], workLocal, Mask, 1, 8, false);
}
outQueue.EnQue<half>(outputLocal);
inQueue.FreeTensor(inputLocal);
The code of moving out data with redundant data after computation is the same as that in Example 2.
When the data is moved in, ensure that the last row of the input data is 32-byte aligned to prevent access to invalid data. When the data is moved out with redundant data, the last row of the output data must also be 32-byte aligned. In main.cpp, the input and output lengths on the global memory are defined as follows:
// copy in borrow the next (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) elements of srcGM
size_t inputByteSize = 76 * sizeof(int16_t);
// copy out atomic add extra (BLOCKLEN_CEIL - BLOCK_ELEMENT_NUM) zeros to dstGM
size_t outputByteSize = 76 * sizeof(int16_t);