Transpose

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	√

Function

Performs transpose on a 16 x 16 2D matrix data block, or between [N,C,H,W] and [N,H,W,C].

Prototype

Perform common transpose on a 16 x 16 2D matrix data block.

template <typename T>
__aicore__ inline void Transpose(const LocalTensor<T>& dst, const LocalTensor<T>& src)

Perform enhanced transpose on a 16 x 16 2D matrix data block, or between [N,C,H,W] and [N,H,W,C].

template <typename T>
__aicore__ inline void Transpose(const LocalTensor<T>& dst, const LocalTensor<T> &src, const LocalTensor<uint8_t> &sharedTmpBuffer, const TransposeParamsExt &transposeParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the operand. Standard transpose: For Atlas A3 training products/Atlas A3 inference products, the supported data types are uint16_t, int16_t, and half. For Atlas A2 training products/Atlas A2 inference products, the supported data types are uint16_t, int16_t, and half. For the Atlas 200I/500 A2 inference products, the supported data types are uint16_t, int16_t, and half. For the Atlas inference product's AI Core, the supported data types are uint16_t, int16_t, and half. For the Atlas training products, the supported data types are uint16_t, int16_t, and half. Enhanced transpose: When transposeType is TRANSPOSE_ND2ND_B16: For Atlas A3 training products/Atlas A3 inference products, the supported data type is uint16_t. For the Atlas A2 training products/Atlas A2 inference products, the supported data type is uint16_t. For the Atlas 200I/500 A2 inference products, the supported data type is uint16_t. For the Atlas inference product's AI Core, the supported data type is uint16_t. When transposeType is TRANSPOSE_NCHW2NHWC or TRANSPOSE_NHWC2NCHW: For Atlas A3 training products/Atlas A3 inference products, the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float. For Atlas A2 training products/Atlas A2 inference products, the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float. For the Atlas inference product's AI Core, the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

src

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

Its data type must be the same as that of dst.

sharedTmpBuffer

Input

Shared temporary buffer. For details about the size of sharedTmpBuffer, see Table 4.

transposeParams

Input

Data structure for controlling Transpose. The structure contains the input shape information and the transposeType parameter. For details about the definition of this data structure, see Table 3.

struct TransposeParamsExt {
    __aicore__ TransposeParamsExt() {}
    __aicore__ TransposeParamsExt(const uint16_t nSizeIn, const uint16_t cSizeIn, const uint16_t hSizeIn,
        const uint16_t wSizeIn, const TransposeType transposeTypeIn)
        : nSize(nSizeIn),
          cSize(cSizeIn),
          hSize(hSizeIn),
          wSize(wSizeIn),
          transposeType(transposeTypeIn)
    {}
    uint16_t nSize = 0;
    uint16_t cSize = 0;
    uint16_t hSize = 0;
    uint16_t wSize = 0;
    TransposeType transposeType = TransposeType::TRANSPOSE_ND2ND_B16;
};

Table 3 Parameters in the TransposeParamsExt structure

Parameter

Meaning

nSize

Length of the n axis. Defaults to 0.

In the transpose of the 2D matrix data block, this parameter does not need to be passed in. The input value is invalid.
In the transpose between [N,C,H,W] and [N,H,W,C], the value range is [0, 65535].

cSize

Length of the c axis. Defaults to 0.

In the transpose of the 2D matrix data block, this parameter does not need to be passed in. The input value is invalid.

In the transpose between [N,C,H,W] and [N,H,W,C], the value range is [0, 4095].

hSize

Length of the h axis. Defaults to 0.

In the transpose of the 2D matrix data block, the value is fixed at 16.

In the transpose between [N,C,H,W] and [N,H,W,C], the value range of hSize × wSize is [0, 4095]. The product of hSize × wSize × sizeof(T) must be 32-byte aligned.

wSize

Length of the w axis. Defaults to 0.

In the transpose of the 2D matrix data block, the value is fixed at 16.

In the transpose between [N,C,H,W] and [N,H,W,C], the value range of hSize × wSize is [0, 4095]. The product of hSize × wSize × sizeof(T) must be 32-byte aligned.

transposeType

Data layout and reshape type. The type is the TransposeType enumeration type. The default value is TRANSPOSE_ND2ND_B16.

enum class TransposeType : uint8_t {
    TRANSPOSE_TYPE_NONE,           // No transpose operation is performed.
    TRANSPOSE_NZ2ND_0213,          // Not supported currently.
    TRANSPOSE_NZ2NZ_0213,          // Not supported currently.
    TRANSPOSE_NZ2NZ_012_WITH_N,    // Not supported currently.
    TRANSPOSE_NZ2ND_012_WITH_N,    // Not supported currently.
    TRANSPOSE_NZ2ND_012_WITHOUT_N, // Not supported currently.
    TRANSPOSE_NZ2NZ_012_WITHOUT_N, // Not supported currently.
    TRANSPOSE_ND2ND_ONLY,          // Not supported currently.
    TRANSPOSE_ND_UB_GM,            // Not supported currently.
    TRANSPOSE_GRAD_ND_UB_GM,       // Not supported currently.
    TRANSPOSE_ND2ND_B16,           // Transpose the [16,16] two-dimensional matrix.
    TRANSPOSE_NCHW2NHWC,           // [N,C,H,W]->[N,H,W,C]
    TRANSPOSE_NHWC2NCHW            // [N,H,W,C]->[N,C,H,W]
};

Table 4 Size required by the enhanced transpose API sharedTmpBuffer

transposeType

Size Required for sharedTmpBuffer

TRANSPOSE_ND2ND_B16

No temporary buffer is required.

TRANSPOSE_NCHW2NHWC

For the following models:

Atlas inference product's AI Core

No temporary buffer is required.

For the following models:

Atlas A2 training products/Atlas A2 inference products
Atlas A3 training products/Atlas A3 inference products

The size of the temporary buffer is computed according to the following rule (pseudocode):

auto h0 = 16; // When the bit width of the data type is 8, h0 = 32. In other cases, h0 = 16.
auto w0 = 32 / sizeof(type);  // type indicates the data type.
auto tmpBufferSize = (cSize + 2)  * h0 * w0 * sizeof(type);

TRANSPOSE_NHWC2NCHW

For the following models:

Atlas inference product's AI Core

No temporary buffer is required.

For the following models:

Atlas A2 training products/Atlas A2 inference products
Atlas A3 training products/Atlas A3 inference products

The size of the temporary buffer is computed according to the following rule (pseudocode):

auto h0 = 16; // When the bit width of the data type is 8, h0 = 32. In other cases, h0 = 16.
auto w0 = 32 / sizeof(type);  // type indicates the data type.
auto tmpBufferSize = (cSize  * 2 + 1)  * h0 * w0 * sizeof(type);

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
The standard transpose API supports reuse of src and dst.
For the enhanced transpose API, when transposeType is TRANSPOSE_ND2ND_B16, reuse of src and dst is supported; when transposeType is TRANSPOSE_NCHW2NHWC or TRANSPOSE_NHWC2NCHW, reuse of src and dst is not supported.

Example

The following example transposes the [16,16] matrix of the half type by using a common API.

#include "kernel_operator.h"

class KernelTranspose {
public:
    __aicore__ inline KernelTranspose() {}
    __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half*)src);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);

        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();

        AscendC::Transpose<half>(dstLocal, srcLocal);

        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;

    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
    int srcDataSize = 256;
    int dstDataSize = 256;
};

extern "C" __global__ __aicore__ void transpose_kernel(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
{
    KernelTranspose op;
    op.Init(src, dstGm);
    op.Process();
}

Input (src_gm):
[[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
   14.  15.]
 [ 16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.
   30.  31.]
 [ 32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.
   46.  47.]
 [ 48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.  60.  61.
   62.  63.]
 [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
   78.  79.]
 [ 80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.
   94.  95.]
 [ 96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.
  110. 111.]
 [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
  126. 127.]
 [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
  142. 143.]
 [144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157.
  158. 159.]
 [160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173.
  174. 175.]
 [176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189.
  190. 191.]
 [192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205.
  206. 207.]
 [208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221.
  222. 223.]
 [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
  238. 239.]
 [240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253.
  254. 255.]]

Output (dst_gm):
[[  0.  16.  32.  48.  64.  80.  96. 112. 128. 144. 160. 176. 192. 208.
  224. 240.]
 [  1.  17.  33.  49.  65.  81.  97. 113. 129. 145. 161. 177. 193. 209.
  225. 241.]
 [  2.  18.  34.  50.  66.  82.  98. 114. 130. 146. 162. 178. 194. 210.
  226. 242.]
 [  3.  19.  35.  51.  67.  83.  99. 115. 131. 147. 163. 179. 195. 211.
  227. 243.]
 [  4.  20.  36.  52.  68.  84. 100. 116. 132. 148. 164. 180. 196. 212.
  228. 244.]
 [  5.  21.  37.  53.  69.  85. 101. 117. 133. 149. 165. 181. 197. 213.
  229. 245.]
 [  6.  22.  38.  54.  70.  86. 102. 118. 134. 150. 166. 182. 198. 214.
  230. 246.]
 [  7.  23.  39.  55.  71.  87. 103. 119. 135. 151. 167. 183. 199. 215.
  231. 247.]
 [  8.  24.  40.  56.  72.  88. 104. 120. 136. 152. 168. 184. 200. 216.
  232. 248.]
 [  9.  25.  41.  57.  73.  89. 105. 121. 137. 153. 169. 185. 201. 217.
  233. 249.]
 [ 10.  26.  42.  58.  74.  90. 106. 122. 138. 154. 170. 186. 202. 218.
  234. 250.]
 [ 11.  27.  43.  59.  75.  91. 107. 123. 139. 155. 171. 187. 203. 219.
  235. 251.]
 [ 12.  28.  44.  60.  76.  92. 108. 124. 140. 156. 172. 188. 204. 220.
  236. 252.]
 [ 13.  29.  45.  61.  77.  93. 109. 125. 141. 157. 173. 189. 205. 221.
  237. 253.]
 [ 14.  30.  46.  62.  78.  94. 110. 126. 142. 158. 174. 190. 206. 222.
  238. 254.]
 [ 15.  31.  47.  63.  79.  95. 111. 127. 143. 159. 175. 191. 207. 223.
  239. 255.]]

The following example transposes [N,C,H,W] to [N,H,W,C] of the half type by using an enhanced API.

#include "kernel_operator.h"

template <typename T>
class Kernel4dTrans {
public:
    __aicore__ inline Kernel4dTrans() {}
    __aicore__ inline void Init(__gm__ uint8_t *srcGm, __gm__ uint8_t *dstGm)
    {
        inputSize = N * C * H * W;
        tmpBufferSize = (C + 2) * 16 * 16;
        srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        pipe.InitBuffer(inQueueSrcVecIn, 1, inputSize*sizeof(T));
        pipe.InitBuffer(inQueueSrcVecOut, 1, inputSize*sizeof(T));
        pipe.InitBuffer(tmpQueue, 1, tmpBufferSize * sizeof(T));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, inputSize);
        inQueueSrcVecIn.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.DeQue<T>();
        AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.AllocTensor<T>();
        AscendC::LocalTensor<uint8_t> stackBuffer = tmpQueue.AllocTensor<uint8_t>();

        AscendC::TransposeParamsExt transposeParams;
        transposeParams.nSize = N;
        transposeParams.cSize = C;
        transposeParams.hSize = H;
        transposeParams.wSize = W;
        transposeParams.transposeType = transposeType;
        AscendC::Transpose(dstLocal, srcLocal, stackBuffer, transposeParams);
        inQueueSrcVecOut.EnQue<T>(dstLocal);
        inQueueSrcVecIn.FreeTensor(srcLocal);
        tmpQueue.FreeTensor(stackBuffer);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, inputSize);
        inQueueSrcVecOut.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut;
    AscendC::TQue<AscendC::TPosition::VECCALC, 1> tmpQueue;

    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;
    uint32_t N = 3;
    uint32_t C = 3;
    uint32_t H = 2;
    uint32_t W = 8;
    uint32_t inputSize, tmpBufferSize;
    AscendC::TransposeType transposeType = AscendC::TransposeType::TRANSPOSE_NCHW2NHWC;
};

extern "C" __global__ __aicore__ void transpose_kernel(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
{
    Kernel4dTrans<half>op;
    op.Init(srcGm, dstGm);
    op.Process();
}

Parent topic: Data Conversion