Using the Data Transfer API Efficiently
[Priority] High
[Description] When using the data transfer API, use parameters such as srcStride, dstStride, blockLen, and blockCount to perform a contiguous transfer or a fixed-interval transfer in a single call, instead of issuing one call per segment in a for loop. The two approaches differ greatly in efficiency. As shown in the following figure, each row is 16 KB, and only the first 2 KB of each row needs to be transferred. With srcStride, dstStride, blockLen, and blockCount, one call covers all 16 rows and transfers 32 KB (16 rows of 2 KB each). With a for loop over the rows, each call transfers only 2 KB. For the relationship between the amount of data transferred per call and the actual bandwidth, see Transferring a Large Data Block at a Time. Use the DataCopy API that takes srcStride, dstStride, blockLen, and blockCount to complete the transfer in one call.
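The unit conversion behind these parameters can be sketched in plain host-side C++. This is an illustration only, assuming that blockLen and the strides are expressed in 32-byte data blocks and that a stride is the gap between the end of one segment and the start of the next. StridedCopyPlan and MakeStridedCopyPlan are hypothetical names, not part of the data transfer API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side helper for illustration; not part of the data transfer API.
struct StridedCopyPlan {
    uint32_t blockCount;  // number of contiguous segments (one per row)
    uint32_t blockLen;    // length of each segment, in 32-byte data blocks
    uint32_t srcStride;   // gap between consecutive source segments, in data blocks
    uint32_t dstStride;   // gap between consecutive destination segments, in data blocks
    uint32_t totalBytes;  // payload moved by the single call
};

constexpr uint32_t kDataBlockBytes = 32;

// Converts a "first copyBytes of every rowBytes-wide row" pattern into block units.
inline StridedCopyPlan MakeStridedCopyPlan(uint32_t rowBytes, uint32_t copyBytes, uint32_t rows) {
    assert(copyBytes <= rowBytes);
    // Assumption: both sizes must be whole multiples of one data block.
    assert(rowBytes % kDataBlockBytes == 0 && copyBytes % kDataBlockBytes == 0);
    StridedCopyPlan plan{};
    plan.blockCount = rows;
    plan.blockLen = copyBytes / kDataBlockBytes;
    plan.srcStride = (rowBytes - copyBytes) / kDataBlockBytes;  // skipped part of each row
    plan.dstStride = 0;                                         // destination is written contiguously
    plan.totalBytes = rows * copyBytes;
    return plan;
}
```

For the scenario described here, MakeStridedCopyPlan(16 * 1024, 2 * 1024, 16) yields blockCount = 16, blockLen = 64, srcStride = 448, dstStride = 0, and a total payload of 32 KB.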
[Negative Example]
// The data to transfer is not contiguous: only the first 2 KB of each 16 KB row on the GM is transferred, and there are 16 rows in total.
LocalTensor<float> tensorIn;
GlobalTensor<float> tensorGM;
...
constexpr int32_t copyWidth = 2 * 1024 / sizeof(float);
constexpr int32_t imgWidth = 16 * 1024 / sizeof(float);
constexpr int32_t imgHeight = 16;
// Use a for loop. Each call transfers only 2 KB, so the call is repeated 16 times.
for (int i = 0; i < imgHeight; i++) {
    DataCopy(tensorIn[i * copyWidth], tensorGM[i * imgWidth], copyWidth);
}
[Positive Example]
LocalTensor<float> tensorIn;
GlobalTensor<float> tensorGM;
...
constexpr int32_t copyWidth = 2 * 1024 / sizeof(float);
constexpr int32_t imgWidth = 16 * 1024 / sizeof(float);
constexpr int32_t imgHeight = 16;
// Use the DataCopy API that takes srcStride/dstStride/blockLen/blockCount to complete the transfer in one call.
DataCopyParams copyParams;
copyParams.blockCount = imgHeight;
copyParams.blockLen = copyWidth / 8; // The unit is one data block (32 bytes). Each data block holds eight floats.
copyParams.srcStride = (imgWidth - copyWidth) / 8; // Gap between two consecutive src segments. The unit is data block.
copyParams.dstStride = 0; // Contiguous write: the gap between two consecutive dst segments is 0. The unit is data block.
// The destination comes first and the source second, as in the for loop version.
DataCopy(tensorIn, tensorGM, copyParams);
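To check that a single strided call touches the same elements as the row-by-row loop, the two access patterns can be modeled on host memory. The following is a minimal sketch in plain C++, not the DataCopy implementation. It assumes a 4-byte float, 32-byte data blocks, and strides measured from the end of one segment to the start of the next. EmulateStridedCopy and EmulateRowLoop are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kFloatsPerBlock = 8;  // one 32-byte data block holds eight 4-byte floats

// Hypothetical host-side model of one strided transfer.
// blockLen, srcStride, and dstStride are given in data blocks.
inline void EmulateStridedCopy(std::vector<float>& dst, const std::vector<float>& src,
                               std::size_t blockCount, std::size_t blockLen,
                               std::size_t srcStride, std::size_t dstStride) {
    const std::size_t segFloats = blockLen * kFloatsPerBlock;
    const std::size_t srcStep = (blockLen + srcStride) * kFloatsPerBlock;
    const std::size_t dstStep = (blockLen + dstStride) * kFloatsPerBlock;
    if (blockCount == 0) {
        return;
    }
    // Bounds checks so the model never reads or writes past either buffer.
    assert((blockCount - 1) * srcStep + segFloats <= src.size());
    assert((blockCount - 1) * dstStep + segFloats <= dst.size());
    for (std::size_t b = 0; b < blockCount; ++b) {
        for (std::size_t k = 0; k < segFloats; ++k) {
            dst[b * dstStep + k] = src[b * srcStep + k];
        }
    }
}

// Reference model of the for loop variant: one short copy per row.
inline void EmulateRowLoop(std::vector<float>& dst, const std::vector<float>& src,
                           std::size_t rows, std::size_t rowFloats, std::size_t copyFloats) {
    assert(rows * rowFloats <= src.size());
    assert(rows * copyFloats <= dst.size());
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t k = 0; k < copyFloats; ++k) {
            dst[i * copyFloats + k] = src[i * rowFloats + k];
        }
    }
}
```

Called with blockCount = imgHeight, blockLen = copyWidth / 8, srcStride = (imgWidth - copyWidth) / 8, and dstStride = 0, the strided model fills the destination with the same values as the row loop. The model only demonstrates that the two addressing patterns are equivalent; it says nothing about bandwidth, which is the reason to prefer the single call on the device.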