Broadcast Scenario
In some scenarios, the two inputs have different shapes, but the Add API only operates on inputs with the same shape. The input shape must therefore be changed before the Add computation. This section describes how to broadcast inputs that meet the broadcast condition in the operator implementation. Other scenarios can be handled by adapting the approach described in this section.
The broadcasting mechanism extends the data of the lower-dimensional input so that inputs with different shapes can be computed together. This eliminates the need for explicit copy operations and improves computing efficiency. To be broadcast, two inputs must have the same number of dimensions and differ in length along exactly one dimension, where one of the inputs has a length of 1. For example, two inputs with shapes (32, 8) and (32, 1) can be broadcast: both are two-dimensional, they have equal sizes in the first dimension, and the second input has a length of 1 in the dimension where the sizes differ.
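The broadcast condition can be checked with a few lines of host-side code. The following is a minimal sketch in plain C++; the function name FindBroadcastAxis is hypothetical and is not part of Ascend C.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Ascend C API): returns the index of the axis to
// broadcast if the two shapes meet the broadcast condition, or -1 otherwise.
int FindBroadcastAxis(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b)
{
    if (a.size() != b.size()) {
        return -1;  // The number of dimensions must be the same.
    }
    int axis = -1;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == b[i]) {
            continue;
        }
        // Only one axis may differ, and one input must have length 1 on it.
        if (axis != -1 || (a[i] != 1 && b[i] != 1)) {
            return -1;
        }
        axis = static_cast<int>(i);
    }
    return axis;  // -1 also covers identical shapes, where nothing is broadcast.
}
```

For the shapes (32, 8) and (32, 1) from the example above, the function returns 1, which is the axis broadcast throughout this section.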
This section uses the Broadcast API, so the input must meet the constraints of this API. In addition, due to hardware restrictions, the input address of this API must be 32-byte aligned. This section uses two-dimensional inputs as an example, where the second axis (axis = 1) needs to be broadcast. For the complete sample code, see the Add operator sample with input broadcast.
Tiling Implementation
Compared with the scenario where the input shapes are the same, the tiling structure gains member variables that indicate whether an input is broadcast, which dimension is broadcast, and the factor by which the broadcast axis is expanded. The following four tiling structure members are added:
- xLen and yLen: data lengths of two inputs.
- axis: dimension of the input to be broadcast.
- coef: dimension expansion coefficient for the broadcast input. For example, if the shape of x is (m, 1) and the shape of y is (m, n), coef is n. As shown in the following figure, the data blocks in the same color are in a single computation.
The code for defining the tiling structure is as follows:
    struct AddCustomTilingData {
        uint32_t xLen;
        uint32_t yLen;
        uint32_t coef;
        uint32_t axis;
        ...
    };
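As an illustration of how these four members relate to the input shapes, the sketch below fills them for two-dimensional inputs whose second axis is broadcast. The struct BroadcastTilingFields and the function MakeBroadcastTilingFields are hypothetical names used only for this example, and taking coef as the larger column count is an assumption that follows from the (m, 1) and (m, n) example above.

```cpp
#include <cstdint>

// Hypothetical host-side mirror of the four broadcast-related tiling members.
struct BroadcastTilingFields {
    uint32_t xLen;
    uint32_t yLen;
    uint32_t coef;
    uint32_t axis;
};

// Hypothetical helper: x has shape (xRows, xCols), y has shape (yRows, yCols),
// and the second axis (axis = 1) is the one being broadcast.
BroadcastTilingFields MakeBroadcastTilingFields(uint32_t xRows, uint32_t xCols,
                                                uint32_t yRows, uint32_t yCols)
{
    BroadcastTilingFields t{};
    t.xLen = xRows * xCols;  // Total number of elements in x.
    t.yLen = yRows * yCols;  // Total number of elements in y.
    t.axis = 1;
    // Assumption: one input has length 1 on axis 1, so the expansion
    // coefficient is the length of the other input on that axis.
    t.coef = (xCols > yCols) ? xCols : yCols;
    return t;
}
```

With x of shape (32, 1) and y of shape (32, 8), this gives xLen = 32, yLen = 256, axis = 1, and coef = 8.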
Assume that the length of the input that needs to be broadcast is shorterAxisLen, and the length of the input that does not need to be broadcast is totalLength.
    constexpr uint32_t BLOCK_SIZE = 32;
    ...
    // Read data.
    uint32_t totalLength = (xLen > yLen)? xLen : yLen;
    uint32_t shorterAxisLen = (xLen < yLen)? xLen : yLen;
    constexpr uint32_t BLOCK_SIZE = 32;
    if (shorterAxisLen % (BLOCK_DIM * BUFFER_NUM) == 0) {
        uint32_t blockLength = shorterAxisLen / BLOCK_DIM * coef;
        ...
    } else {
        uint32_t formerNum = (shorterAxisLen / BUFFER_NUM) % BLOCK_DIM;
        uint32_t tailNum = BLOCK_DIM - formerNum;
        uint32_t formerLength = ((shorterAxisLen / BUFFER_NUM) / BLOCK_DIM + 1) * BUFFER_NUM * coef;
        uint32_t tailLength = ((shorterAxisLen / BUFFER_NUM) / BLOCK_DIM) * BUFFER_NUM * coef;
        ...
    }
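To see what this per-core split produces, the same arithmetic can be run on the host with concrete numbers. The sketch below is plain C++; the struct CoreSplit and the function SplitAcrossCores are hypothetical names, and blockDim and bufferNum stand in for the BLOCK_DIM and BUFFER_NUM constants, whose values are not given in this section.

```cpp
#include <cstdint>

// Hypothetical container for the result of the per-core split.
struct CoreSplit {
    bool isEvenCore;        // True if every core gets the same length.
    uint32_t blockLength;   // Per-core length when isEvenCore is true.
    uint32_t formerNum;     // Number of cores that take the longer share.
    uint32_t tailNum;       // Number of cores that take the shorter share.
    uint32_t formerLength;  // Length handled by each former core.
    uint32_t tailLength;    // Length handled by each tail core.
};

// Mirrors the branch shown above. All lengths are element counts of the
// input that is not broadcast, so they are multiples of coef.
CoreSplit SplitAcrossCores(uint32_t shorterAxisLen, uint32_t coef,
                           uint32_t blockDim, uint32_t bufferNum)
{
    CoreSplit s{};
    if (shorterAxisLen % (blockDim * bufferNum) == 0) {
        s.isEvenCore = true;
        s.blockLength = shorterAxisLen / blockDim * coef;
    } else {
        const uint32_t groups = shorterAxisLen / bufferNum;
        s.isEvenCore = false;
        s.formerNum = groups % blockDim;
        s.tailNum = blockDim - s.formerNum;
        s.formerLength = (groups / blockDim + 1) * bufferNum * coef;
        s.tailLength = (groups / blockDim) * bufferNum * coef;
    }
    return s;
}
```

For example, with shorterAxisLen = 20, coef = 8, blockDim = 8, and bufferNum = 2, two former cores each process 32 elements and six tail cores each process 16 elements, which adds up to the 160 elements of the longer input. Note that the shares only add up to the full length when shorterAxisLen is a multiple of bufferNum, because of the integer division.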
    ubBlockAligned =
        (UB_BLOCK_NUM * alignNum / (coef * BUFFER_NUM) * (coef * BUFFER_NUM) == 0)?
        UB_BLOCK_NUM : UB_BLOCK_NUM * alignNum / (coef * BUFFER_NUM) * (coef * BUFFER_NUM);
    ...
    tileNum = length / ubBlockAligned;
    if (length % ubBlockAligned == 0 || tileNum == 0) {
        if (tileNum == 0) {
            tileNum = 1;
        }
        if (length < ubBlockAligned) {
            tileLength = length;
            lastTileLength = tileLength;
        } else {
            tileLength = ubBlockAligned;
            lastTileLength = tileLength;
        }
    } else {
        tileNum = tileNum + 1;
        tileLength = ubBlockAligned;
        lastTileLength = (length - (tileNum - 1) * tileLength);
    }
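The tile split within one core can be checked the same way. The following sketch is plain C++ with hypothetical names (TileSplit, SplitIntoTiles). It takes ubBlockAligned as an input rather than deriving it, because UB_BLOCK_NUM and alignNum are not defined in this section, and it assumes that a full tile has length ubBlockAligned in both branches.

```cpp
#include <cstdint>

// Hypothetical container for the per-core tile split.
struct TileSplit {
    uint32_t tileNum;         // Number of tiles processed by the core.
    uint32_t tileLength;      // Length of every tile except possibly the last.
    uint32_t lastTileLength;  // Length of the last tile.
};

// Mirrors the tile computation shown above for one core that processes
// `length` elements, with at most `ubBlockAligned` elements per tile.
TileSplit SplitIntoTiles(uint32_t length, uint32_t ubBlockAligned)
{
    TileSplit t{};
    t.tileNum = length / ubBlockAligned;
    if (length % ubBlockAligned == 0 || t.tileNum == 0) {
        if (t.tileNum == 0) {
            t.tileNum = 1;  // The data fits in a single, shorter tile.
        }
        t.tileLength = (length < ubBlockAligned) ? length : ubBlockAligned;
        t.lastTileLength = t.tileLength;
    } else {
        t.tileNum += 1;  // One extra tile holds the remainder.
        t.tileLength = ubBlockAligned;
        t.lastTileLength = length - (t.tileNum - 1) * t.tileLength;
    }
    return t;
}
```

With ubBlockAligned = 32, a length of 100 yields four tiles of 32, 32, 32, and 4 elements, a length of 64 yields two full tiles, and a length of 10 yields a single tile of 10 elements.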
Operator Class Implementation
During kernel function initialization, the input to be broadcast is determined based on the parameters passed by the tiling structure. Because the second axis (axis = 1) of the input is broadcast, the length of the data to be moved into each core is blockLength/coef for the input to be broadcast.
The initialization function code is as follows:
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
    {
        GM_ADDR longerInputPtr;
        GM_ADDR shorterInputPtr;
        if (tiling.xLen > tiling.yLen) {
            longerInputPtr = x;
            shorterInputPtr = y;
        } else {
            longerInputPtr = y;
            shorterInputPtr = x;
        }
        this->coef = tiling.coef;
        if (tiling.isEvenCore) {
            this->tileNum = tiling.tileNum;
            this->tileLength = tiling.tileLength / BUFFER_NUM;
            this->lastTileLength = tiling.lastTileLength;
            xGm.SetGlobalBuffer((__gm__ dataType *)longerInputPtr + tiling.blockLength * AscendC::GetBlockIdx(),
                                tiling.blockLength);
            yGm.SetGlobalBuffer((__gm__ dataType *)shorterInputPtr + tiling.blockLength * AscendC::GetBlockIdx() / this->coef,
                                tiling.blockLength / this->coef);
            zGm.SetGlobalBuffer((__gm__ dataType *)z + tiling.blockLength * AscendC::GetBlockIdx(),
                                tiling.blockLength);
        } else {
            if (AscendC::GetBlockIdx() < tiling.formerNum) {
                this->tileNum = tiling.formerTileNum;
                this->tileLength = tiling.formerTileLength / BUFFER_NUM;
                this->lastTileLength = tiling.formerLastTileLength;
                xGm.SetGlobalBuffer((__gm__ dataType *)longerInputPtr + tiling.formerLength * AscendC::GetBlockIdx(),
                                    tiling.formerLength);
                yGm.SetGlobalBuffer((__gm__ dataType *)shorterInputPtr + tiling.formerLength * AscendC::GetBlockIdx() / this->coef,
                                    tiling.formerLength / this->coef);
                zGm.SetGlobalBuffer((__gm__ dataType *)z + tiling.formerLength * AscendC::GetBlockIdx(),
                                    tiling.formerLength);
            } else {
                this->tileNum = tiling.tailTileNum;
                this->tileLength = tiling.tailTileLength / BUFFER_NUM;
                this->lastTileLength = tiling.tailLastTileLength;
                xGm.SetGlobalBuffer((__gm__ dataType *)longerInputPtr + tiling.formerLength * tiling.formerNum +
                                        tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum),
                                    tiling.tailLength);
                yGm.SetGlobalBuffer((__gm__ dataType *)shorterInputPtr + tiling.formerLength * tiling.formerNum / this->coef +
                                        tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum) / this->coef,
                                    tiling.tailLength / this->coef);
                zGm.SetGlobalBuffer((__gm__ dataType *)z + tiling.formerLength * tiling.formerNum +
                                        tiling.tailLength * (AscendC::GetBlockIdx() - tiling.formerNum),
                                    tiling.tailLength);
            }
        }
        pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(dataType));
        pipe.InitBuffer(inQueueY, BUFFER_NUM, this->coef * sizeof(dataType));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(dataType));
        pipe.InitBuffer(tmpBuf2, this->tileLength * sizeof(dataType));
    }
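The offset arithmetic in the uneven-core branch can be verified on the host. The sketch below is plain C++ and reproduces only the index calculations, not the SetGlobalBuffer calls. The struct CoreOffsets and the function OffsetsForCore are hypothetical names, and the sketch relies on formerLength and tailLength being multiples of coef, as they are in the tiling computation.

```cpp
#include <cstdint>

// Hypothetical container: where one core starts reading and how much it reads.
struct CoreOffsets {
    uint32_t longOffset;   // Start offset in the input that is not broadcast.
    uint32_t longLength;   // Number of elements of that input for this core.
    uint32_t shortOffset;  // Start offset in the input to be broadcast.
    uint32_t shortLength;  // Number of elements of the broadcast input.
};

// Mirrors the offsets used in the uneven-core branch of Init.
CoreOffsets OffsetsForCore(uint32_t blockIdx, uint32_t formerNum,
                           uint32_t formerLength, uint32_t tailLength, uint32_t coef)
{
    CoreOffsets o{};
    if (blockIdx < formerNum) {
        o.longOffset = formerLength * blockIdx;
        o.longLength = formerLength;
    } else {
        o.longOffset = formerLength * formerNum + tailLength * (blockIdx - formerNum);
        o.longLength = tailLength;
    }
    // The broadcast input holds one element per coef elements of the other
    // input, so both its offset and its length shrink by a factor of coef.
    o.shortOffset = o.longOffset / coef;
    o.shortLength = o.longLength / coef;
    return o;
}
```

With formerNum = 2, formerLength = 32, tailLength = 16, and coef = 8, core 2 (the first tail core) starts at element 64 of the longer input and element 8 of the shorter input, and reads 16 and 2 elements respectively.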
Because the data is aligned to coef, addresses may not be 32-byte aligned during data copy. Therefore, DataCopyPad is used for data copy in the CopyIn and CopyOut functions.
The implementation code of the CopyIn function is as follows:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        AscendC::LocalTensor<dataType> xLocal = inQueueX.AllocTensor<dataType>();
        AscendC::LocalTensor<dataType> yLocal = inQueueY.AllocTensor<dataType>();
        AscendC::DataCopyExtParams copyXParams = {1, (uint32_t)(this->tileLength * sizeof(dataType)), 0, 0, 0};
        AscendC::DataCopyExtParams copyYParams = {1, (uint32_t)(this->tileLength * sizeof(dataType) / this->coef), 0, 0, 0};
        AscendC::DataCopyPadExtParams<dataType> padParams = {false, 0, 0, 0};
        if (progress == (this->tileNum * BUFFER_NUM - 1)) {
            AscendC::DataCopyPad<dataType>(xLocal, xGm[(progress - LAST_TWO_TILE) * this->tileLength + this->lastTileLength],
                                           copyXParams, padParams);
            AscendC::DataCopyPad<dataType>(yLocal, yGm[((progress - LAST_TWO_TILE) * this->tileLength + this->lastTileLength) / this->coef],
                                           copyYParams, padParams);
        } else {
            AscendC::DataCopyPad<dataType>(xLocal, xGm[progress * this->tileLength], copyXParams, padParams);
            AscendC::DataCopyPad<dataType>(yLocal, yGm[progress * this->tileLength / this->coef], copyYParams, padParams);
        }
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
The implementation code of the CopyOut function is as follows:
    __aicore__ inline void CopyOut(int32_t progress)
    {
        AscendC::LocalTensor<dataType> zLocal = outQueueZ.DeQue<dataType>();
        AscendC::DataCopyExtParams copyParams = {1, (uint32_t)(this->tileLength * sizeof(dataType)), 0, 0, 0};
        if (progress == (this->tileNum * BUFFER_NUM - 1)) {
            AscendC::DataCopyPad<dataType>(zGm[(progress - LAST_TWO_TILE) * this->tileLength + this->lastTileLength],
                                           zLocal, copyParams);
        } else {
            AscendC::DataCopyPad<dataType>(zGm[progress * this->tileLength], zLocal, copyParams);
        }
        outQueueZ.FreeTensor(zLocal);
    }
In the Compute function, the input needs to be broadcast before the Add API is called, which requires the shapes before and after broadcasting. Based on the preceding data relationship, the shapes before and after broadcasting are {tileLength / coef, 1} and {tileLength / coef, coef}, respectively. The input is then broadcast, the broadcast result is stored in temporary space, and the Add computation is performed. An implementation code example is as follows:
    __aicore__ inline void Compute(int32_t progress)
    {
        AscendC::LocalTensor<dataType> xLocal = inQueueX.DeQue<dataType>();
        AscendC::LocalTensor<dataType> yLocal = inQueueY.DeQue<dataType>();
        AscendC::LocalTensor<dataType> zLocal = outQueueZ.AllocTensor<dataType>();
        AscendC::LocalTensor<dataType> broadcastTmpTensor = broadcastTmpBuf.Get<dataType>();
        uint32_t dstShape[] = {this->tileLength / this->coef, this->coef};
        uint32_t srcShape[] = {this->tileLength / this->coef, 1};
        AscendC::BroadCast<dataType, 2, 1>(broadcastTmpTensor, yLocal, dstShape, srcShape);
        ...
    }
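When validating the kernel output, a host-side reference of the same two steps can be useful. The sketch below is plain C++ and does not use any Ascend C API. The function name BroadcastAddReference is hypothetical, and it assumes row-major data where x has shape (rows, coef) and y has shape (rows, 1).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference: broadcasts y from (rows, 1) to (rows, coef)
// into a temporary buffer, then adds it to x element by element.
std::vector<float> BroadcastAddReference(const std::vector<float>& x,
                                         const std::vector<float>& y, uint32_t coef)
{
    const std::size_t rows = y.size();
    assert(x.size() == rows * coef);  // x must hold coef elements per row of y.

    // Step 1: broadcast along axis 1 (each y value is repeated coef times).
    std::vector<float> broadcastTmp(x.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < coef; ++c) {
            broadcastTmp[r * coef + c] = y[r];
        }
    }

    // Step 2: element-wise add on two buffers of the same shape.
    std::vector<float> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        z[i] = x[i] + broadcastTmp[i];
    }
    return z;
}
```

For x = {1, 2, 3, 4, 5, 6}, y = {10, 20}, and coef = 3, the result is {11, 12, 13, 24, 25, 26}. The explicit temporary buffer mirrors the role of the temporary space in the Compute function, although a host-only implementation could skip it.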