Cast

Function Usage

Converts precision based on the data types of the source and destination operand tensors.

Before understanding the precision conversion rules, you need to know the basics about the representation modes of floating point numbers and the rounding rules of binary numbers.

  • Representation modes of floating point numbers
    • The half type has 16 bits, including 1 sign bit (S), 5 exponent bits (E), and 10 mantissa bits (M).

      When the exponent bits (E) are not all 0s or all 1s, the value is:

      (-1)S * 2E - 15 * (1 + M)

      When the exponent bits (E) are all 0s, the value is:

      (-1)S * 2-14 * M

      When the exponent bits (E) are all 1s and the mantissa bits (M) are all 0s, the value is ±inf (depending on the sign bit). When the exponent bits (E) are all 1s and the mantissa bits (M) are not all 0s, the value is Not-a-Number (NaN).

      The preceding figure represents the value 1.75, as S = 0, E = 15, and M = 2-1 + 2-2.

    • The float type has 32 bits, including 1 sign bit (S), 8 exponent bits (E), and 23 mantissa bits (M).

      When the exponent bits (E) are not all 0s or all 1s, the value is:

      (-1)S * 2E - 127 * (1 + M)

      When the exponent bits (E) are all 0s, the value is:

      (-1)S * 2-126 * M

      When the exponent bits (E) are all 1s and the mantissa bits (M) are all 0s, the value is ±inf (depending on the sign bit). When the exponent bits (E) are all 1s and the mantissa bits (M) are not all 0s, the value is Not-a-Number (NaN).

      The preceding figure represents the value 1.75, as S = 0, E = 127, and M = 2-1 + 2-2.

    • The bfloat16_t type has 16 bits, including 1 sign bit (S), 8 exponent bits (E), and 7 mantissa bits (M).

      When the exponent bits (E) are not all 0s or all 1s, the value is:

      (-1)S * 2E - 127 * (1 + M)

      When the exponent bits (E) are all 0s, the value is:

      (-1)S * 2-126 * M

      When the exponent bits (E) are all 1s and the mantissa bits (M) are all 0s, the value is ±inf (depending on the sign bit). When the exponent bits (E) are all 1s and the mantissa bits (M) are not all 0s, the value is Not-a-Number (NaN).

      The preceding figure represents the value 1.75, as S = 0, E = 127, and M = 2-1 + 2-2.

  • Rounding in binary mode is similar to that in decimal mode.

    • In CAST_RINT mode, if the first bit to be rounded is 0, no carry is performed. If the first bit to be rounded is 1 and the subsequent bits are not all 0s, carry is performed. If the first bit is 1 and all subsequent bits are 0, no carry is performed when the last M bit is 0, and carry is performed when the last M bit is 1.
    • In CAST_FLOOR mode, if bit S is 0, no carry is performed. If bit S is 1 and the bits to be rounded are all 0s, carry is not performed; in other cases, carry is performed.
    • In CAST_CEIL mode, if bit S is 1, no carry is performed. If bit S is 0 and the bits to be rounded are all 0s, carry is not performed; in other cases, carry is performed.
    • In CAST_ROUND mode, if the first bit to be rounded is 0, no carry is performed; in other cases, carry is performed.
    • In CAST_TRUNC mode, no carry is performed.
    • In CAST_ODD mode, if the bits to be rounded are all 0s, no carry is performed. If the bits to be rounded are not all 0s, no carry is performed when the last M bit is 1, and carry is performed when the last M bit is 0.

The following table describes the precision conversion rules.

Table 1 Precision conversion rules

src Data Type

dst Data Type

Description

float

float

Rounds src according to round_mode (a precision conversion mode, see the round_mode parameter in Parameters) and saves it to dst in float type.

For example, in the case of input 0.5:

The output is 0.0 in CAST_RINT mode, 0.0 in CAST_FLOOR mode, 1.0 in CAST_CEIL mode, 1.0 in CAST_ROUND mode, and 0.0 in CAST_TRUNC mode.

half

Rounds src according to round_mode and writes the result in half type to dst (the overflow part is saturated).

For example, for input 0.5 + 2-12, it is represented as 2-1 * (1 + 2-11) in float type, meaning that E = –1 + 127 = 126, and M = 2-11.

The exponent bits of the half type can represent 2-1, meaning E = –1 + 15 = 14. However, the half type has only 10 mantissa bits. Therefore, the gray part needs to be rounded.

In CAST_RINT mode, the result mantissa is 0000000000, E = 14, and M = 0. The final result is 0.5.

In CAST_FLOOR mode, the result mantissa is 0000000000, E = 14, and M = 0. The final result is 0.5.

In CAST_CEIL mode, the result mantissa is 0000000001, E = 14, and M = 2-10. The final result is 0.5 + 2-11.

In CAST_ROUND mode, the result mantissa is 0000000001, E = 14, and M = 2-10. The final result is 0.5 + 2-11.

In CAST_TRUNC mode, the result mantissa is 0000000000, E = 14, and M = 0. The final result is 0.5.

In CAST_ODD mode, the result mantissa is 0000000001, E = 14, and M = 2-10. The final result is 0.5 + 2-11.

int64_t

Rounds src according to round_mode and writes the result in int64_t format to dst (the overflow part is saturated).

For example, in the case of input 222 + 0.5:

The output is 222 in CAST_RINT mode, 222 in CAST_FLOOR mode, 222 + 1 in CAST_CEIL mode, 222 + 1 in CAST_ROUND mode, and 222 in CAST_TRUNC mode.

int32_t

Rounds src according to round_mode and writes the result in int32_t format to dst (the overflow part is saturated).

For example, in the case of input 222 + 0.5:

The output is 222 in CAST_RINT mode, 222 in CAST_FLOOR mode, 222 + 1 in CAST_CEIL mode, 222 + 1 in CAST_ROUND mode, and 222 in CAST_TRUNC mode.

int16_t

Rounds src according to round_mode and writes the result in int16_t format to dst (the overflow part is saturated).

For example, in the case of input 222 + 0.5:

The output is 215 – 1 in CAST_RINT mode (overflow processed), 215 – 1 in CAST_FLOOR mode (overflow processed), 215 – 1 in CAST_CEIL mode (overflow processed), 215 – 1 in CAST_ROUND mode (overflow processed), and 215 – 1 in CAST_TRUNC mode (overflow processed).

bfloat16_t

Rounds src according to round_mode and writes the result in bfloat16_t type to dst (the overflow part is saturated).

For example, in the case of input 0.5 + 2:-9 + 2-11, expressed in float type: 2-1 * (1 + 2-8 + 2-10). Therefore, E = -1 + 127 = 126 and M = 2-8 + 2-10.

The number of exponent bits of bfloat16_t is the same as that of float, that is, E = 126. However, bfloat16_t has only 7 mantissa bits. Therefore, the gray part needs to be rounded.

In CAST_RINT mode, the result mantissa is 0000001, E = 126, and M = 2-7. The final result is 0.5 + 2-8.

In CAST_FLOOR mode, the result mantissa is 0000000, E = 126, and M = 0. The final result is 0.5.

In CAST_CEIL mode, the result mantissa is 0000001, E = 126, and M = 2-7. The final result is 0.5 + 2-8.

In CAST_ROUND mode, the result mantissa is 0000001, E = 126, and M = 2-7. The final result is 0.5 + 2-8.

In CAST_TRUNC mode, the result mantissa is 0000000, E = 126, and M = 0. The final result is 0.5.

half

float

Writes src to dst in float format — precision conversion is not involved.

For example, in the case of input 1.5 – 2-10 and output 1.5 – 2-10:

int32_t

Rounds src according to round_mode and writes the result in int32_t format to dst.

For example, in the case of input –1.5:

The output is –2 in CAST_RINT mode, –2 in CAST_FLOOR mode, –1 in CAST_CEIL mode, –2 in CAST_ROUND mode, and –1 in CAST_TRUNC mode.

int16_t

Rounds src according to round_mode and writes the result in int16_t format to dst (the overflow part is saturated).

For example, in the case of input 27 – 0.5:

The output is 27 in CAST_RINT mode, 27 – 1 in CAST_FLOOR mode, 27 in CAST_CEIL mode, 27 in CAST_ROUND mode, and 27 – 1 in CAST_TRUNC mode.

int8_t

Rounds src according to round_mode and writes the result in int8_t format to dst (the overflow part is saturated).

For example, in the case of input 27 – 0.5:

The output is 27 – 1 in CAST_RINT mode (overflow processed), 27 – 1 in CAST_FLOOR mode, 27 – 1 in CAST_CEIL mode (overflow processed), 27 – 1 in CAST_ROUND mode (overflow processed), and 27 – 1 in CAST_TRUNC mode.

uint8_t

Rounds src according to round_mode and writes the result in uint8_t format to dst (the overflow part is saturated).

For example, in the case of input 1.75:

The output is 2 in CAST_RINT mode, 1 in CAST_FLOOR mode, 2 in CAST_CEIL mode, 2 in CAST_ROUND mode, and 1 in CAST_TRUNC mode.

int4b_t

Rounds src according to round_mode and writes the result in int4b_t format to dst (the overflow part is saturated).

For example, in the case of input 1.5:

The output is 2 in CAST_RINT mode, 1 in CAST_FLOOR mode, 2 in CAST_CEIL mode, 2 in CAST_ROUND mode, and 1 in CAST_TRUNC mode.

bfloat16_t

float

Writes src to dst in float format — precision conversion is not involved.

For example, in the case of input 1.5 – 2-6, the output is 1.5 – 2-6.

int32_t

Rounds src according to round_mode and writes the result in int32_t format to dst (the overflow part is saturated).

For example, in the case of input 26 + 0.5:

The output is 26 in CAST_RINT mode, 26 in CAST_FLOOR mode, 26 + 1 in CAST_CEIL mode, 26 + 1 in CAST_ROUND mode, and 26 in CAST_TRUNC mode.

int4b_t

half

Writes src to dst in half format — precision conversion is not involved.

For example, input 1, output 1.0.

uint8_t

half

Writes src to dst in half format — precision conversion is not involved.

For example, input 1, output 1.0.

int8_t

half

Writes src to dst in half format — precision conversion is not involved.

For example, input –1, output –1.0.

int16_t

half

Rounds src according to round_mode and writes the result in half format to dst.

For example, for input 212 + 2, it is represented as 212 * (1 + 2-11) in half type, meaning that E = 12 + 15 = 27, and M = 2-11.

However, half has only 10 mantissa bits. Therefore, the gray part needs to be rounded.

In CAST_RINT mode, the result mantissa is 0000000000, E = 27, and M = 0. The final result is 212.

In CAST_FLOOR mode, the result mantissa is 0000000000, E = 27, and M = 0. The final result is 212.

In CAST_CEIL mode, the result mantissa is 0000000001, E = 27, and M = 2-10. The final result is 212 + 4.

In CAST_ROUND mode, the result mantissa is 0000000001, E = 27, and M = 2-10. The final result is 212 + 4.

In CAST_TRUNC mode, the result mantissa is 0000000000, E = 27, and M = 0. The final result is 212.

float

Writes src to dst in float format — precision conversion is not involved.

Example: input 215 – 1 and output 215 – 1.

int32_t

float

Rounds src according to round_mode and writes the result in float format to dst.

For example, for input 225 + 3, it is represented as 225 * (1 + 2-24 + 2-25) in float type, and requires that E = 25 + 127 = 152 and M = 2-24 + 2-25.

However, float has only 23 mantissa bits. Therefore, the gray part needs to be rounded.

In CAST_RINT mode, the result mantissa is 00000000000000000000001, E = 152, and M = 2-23. The final result is 225 + 4.

In CAST_FLOOR mode, the result mantissa is 00000000000000000000000, E = 152, and M = 0. The final result is 225.

In CAST_CEIL mode, the result mantissa is 00000000000000000000001, E = 152, and M = 2-23. The final result is 225 + 4.

In CAST_ROUND mode, the result mantissa is 00000000000000000000001, E = 152, and M = 2-23. The final result is 225 + 4.

In CAST_TRUNC mode, the result mantissa is 00000000000000000000000, E = 152, and M = 0. The final result is 225.

int64_t

Writes src to dst in int64_t format – precision conversion is not involved.

Example: 231 – 1 for input and 231 – 1 for output

int16_t

Writes src to dst in int16_t format (the overflow part is saturated by default) – precision conversion is not involved.

Example: 231 – 1 for input and 215 – 1 for output

half

Used together with SetDeqScale(half scale). The output is src/217 x scale x 217. Use banker's rounding (CAST_RINT).

int64_t

int32_t

Writes src to dst in int32_t format (the overflow part is saturated by default) – precision conversion is not involved.

Example: 231 for input and 231 – 1 for output

float

Rounds src according to round_mode and writes the result in float format to dst.

For example, for input 235 + 212 + 211, it is represented as 235 x (1 + 2-23 + 2-24) in float type, and requires that E = 35 + 127 = 162 and M = 2-23 + 2-24.

However, float has only 23 mantissa bits. Therefore, the gray part needs to be rounded.

In CAST_RINT mode, the result mantissa is 00000000000000000000010, E = 162, and M = 2-22. The final result is 235 + 213.

In CAST_FLOOR mode, the result mantissa is 00000000000000000000001, E = 162, and M = 2-23. The final result is 225 + 212.

In CAST_CEIL mode, the result mantissa is 00000000000000000000010, E = 162, and M = 2-22. The final result is 225 + 213.

In CAST_ROUND mode, the result mantissa is 00000000000000000000010, E = 162, and M = 2-22. The final result is 225 + 213.

In CAST_TRUNC mode, the result mantissa is 00000000000000000000001, E = 162, and M = 2-23. The final result is 225 + 212.

Prototype

  • Computation of the first n data elements of a tensor
    1
    2
    template <typename T1, typename T2>
    __aicore__ inline void Cast(const LocalTensor<T1>& dstLocal, const LocalTensor<T2>& srcLocal, const RoundMode& round_mode, const uint32_t calCount)
    
  • High-dimensional tensor sharding computation
    • Bitwise mask mode
      1
      2
      template <typename T1, typename T2, bool isSetMask = true>
      __aicore__ inline void Cast(const LocalTensor<T1>& dstLocal, const LocalTensor<T2>& srcLocal, const RoundMode& round_mode, const uint64_t mask[], const uint8_t repeatTimes, const UnaryRepeatParams& repeatParams)
      
    • Contiguous mask mode
      1
      2
      template <typename T1, typename T2, bool isSetMask = true>
      __aicore__ inline void Cast(const LocalTensor<T1>& dstLocal, const LocalTensor<T2>& srcLocal, const RoundMode& round_mode, const uint64_t mask, const uint8_t repeatTimes, const UnaryRepeatParams& repeatParams)
      

Parameters

Table 2 Parameters in the template

Parameter

Description

T1

Data type of the destination operand.

Atlas Training Series Product : For details about the supported data types, see Table 4.

T2

Data type of the source operand.

Atlas Training Series Product : For details about the supported data types, see Table 4.

isSetMask

Indicates whether to set mask inside the API.

  • true: sets mask inside the API.
  • false: sets mask outside the API. Developers need to use the SetVectorMask API to set the mask value. In this mode, the mask value in the input parameter of this API must be set to MASK_PLACEHOLDER.
Table 3 Parameters

Parameter

Input/Output

Description

dstLocal

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

srcLocal

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

round_mode

Input

Precision conversion mode. The type is RoundMode.

RoundMode is of the enumeration type and is used to control the precision conversion mode. The definition is as follows:

enum class RoundMode {
    CAST_NONE = 0,  // If accuracy drop is involved during conversion, the CAST_RINT mode is used. If accuracy drop is not involved, the value is not rounded.
    CAST_RINT,      // rint, banker's rounding
    CAST_FLOOR,     // floor, rounding towards negative infinity
    CAST_CEIL,      // ceil, rounding towards positive infinity
    CAST_ROUND,     // round, rounding off
    CAST_TRUNC,     // trunc, rounding towards zero
    CAST_ODD,       // Von Neumann rounding, rounding to the nearest odd number
};

For the Atlas Training Series Product , CAST_ROUND rounds away from 0. For a positive number, x.y is rounded to (x + 1). For a negative number, –x.y is rounded to –(x + 1).

calCount

Input

Number of elements of the input data.

mask

Input

mask is used to control the elements that participate in computation in each iteration.

  • Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each iteration varies according to the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. When the operand is 64-bit, mask ∈ [1, 32].
  • Bitwise mode: controls the elements that participate in computation by bit. If a bit is set to 1, the corresponding element participates in the computation. If a bit is set to 0, the corresponding element is masked in the computation. The parameter type is a uint64_t array whose length is 2.

    For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation.

    The parameter value range is related to the operand data type. The maximum number of elements that can be processed in each iteration varies according to the data type. When the operand is 16-bit, mask[0] and mask[1] ∈ [0, 264 -1] and cannot be 0 at the same time. When the operand is 32-bit, mask[1] is 0 and mask[0] ∈ (0, 264 – 1]. When the operand is 64-bit, mask[1] is 0 and mask[0] ∈ (0, 232 – 1].

repeatTimes

Input

Number of iteration repeats. The Vector Unit reads 256 bytes of contiguous data for computation each time. To read the complete data for processing, the unit needs to read the input data in multiple repeats. repeatTimes indicates the number of iterations. repeatTimes ∈ [0,255].

For details about this parameter, see Common Parameters.

repeatParams

Input

Parameter for controlling the operand address stride, which is of the UnaryRepeatParams type. This parameter includes the address stride between adjacent iterations of the operand and the address stride of the data block in the same iteration of the operand. dstRepStride/srcRepStride ∈ [0, 255].

For details about the address stride of the operand between adjacent iterations, see repeatStride. For details about the address stride of the operand between different data blocks in a single iteration, see dataBlockStride.

Table 4 Parameters of the Cast instruction for the Atlas Training Series Product

src Data Type

dst Data Type

Supported roundMode

half

float

CAST_NONE

int32_t

CAST_RINT/CAST_FLOOR/CAST_CEIL/CAST_ROUND/CAST_TRUNC

int8_t

CAST_FLOOR/CAST_CEIL/CAST_ROUND/CAST_TRUNC/CAST_NONE

uint8_t

CAST_FLOOR/CAST_CEIL/CAST_ROUND/CAST_TRUNC/CAST_NONE

float

half

CAST_NONE/CAST_ODD

int32_t

CAST_RINT/CAST_FLOOR/CAST_CEIL/CAST_ROUND/CAST_TRUNC

uint8_t

half

CAST_NONE

int8_t

half

CAST_NONE

int32_t

float

CAST_NONE

half

CAST_NONE

Returns

None

Availability

Atlas Training Series Product

Precautions

  • To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
    • For a single repeat (repeatTimes = 1), the source operand must completely overlap the destination operand.
    • For multiple repeats (repeatTimes > 1), if there is a dependency between the source operand and the destination operand, that is, the destination operand of the Nth iteration is the source operand of the (N + 1)th iteration, address overlapping is not allowed.
    • In particular, when a data type with a smaller length is converted to a data type with a larger length, address overlapping may cause a result error.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • The amount of data that can be processed in each repeat depends on the data precision and AI processor model. For example, 64 source or destination elements are operated in each repeat during float to half conversion.
  • When the number of bits of the source operand is different from that of the destination operand, the data type with more bytes is used for the computation input. For example, if the source operand is of the half type and the destination operand is of the int32_t type, dstRepStride should be set to 8 and srcRepStride should be set to 4 to ensure that the output and input are continuous.
  • dst and src must be different tensors or the same element of the same tensor, rather than different elements of the same tensor.
  • When both of src and dst are of the float type, the source operand is rounded to an integer of the float type. In other cases, the source operand is rounded to a number that can be represented by the data type of dst.
  • When dst or src is of the int4b_t type which occupies only half a byte, only half data_size of int8_t needs to be allocated for tensor space. Currently, the host does not support int4b_t. Therefore, before allocating a tensor of the int4b_t type, allocate a tensor of the int8_t type, use Reinterpretcast to convert the tensor into the int4b_t type, and call the Cast instruction. For details, see the calling example.
  • When dst or src is of the int4b_t type, mask of the high-dimensional tensor sharding computation API in continuous mode and calCount of the API for computing the first n pieces of tensor data must be even numbers. For the bitwise mode of the high-dimensional tensor sharding computation API, the values of two adjacent bits corresponding to the same byte must be the same. That is, the values of bits 0 and 1, bits 2 and 3, bits 4 and 5, and so on.
  • For the Atlas Training Series Product , the input can be int32_t and the output can be of the half type. Used together with SetDeqScale(half scale). The output is src/217 x scale x 217. The rounding mode is CAST_RINT. The following is an example:
    SetDeqScale(static_cast<half>(2.0));
    Cast(dstLocal, srcLocal, RoundMode::CAST_RINT, 512);

Example

This example shows only part of the code used in the computation process. In this example, srcLocal is of the half type, and dstLocal is of the int32_t type. mask is computed based on int32_t.

To run the sample code, copy the code snippet and replace some code of the Compute function in Template Sample.

  • Example of high-dimensional tensor sharding computation (contiguous mask mode)
    uint64_t mask = 256 / sizeof(int32_t);
    // repeatTimes = 8, 64 elements one repeat, 512 elements total
    // dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
    // dstRepStride, srcRepStride = 8, no gap between repeats
    AscendC::Cast(dstLocal, srcLocal, AscendC::RoundMode::CAST_CEIL, mask, 8, { 1, 1, 8, 4 });
  • Example of high-dimensional tensor sharding computation (bitwise mask mode)
    uint64_t mask[2] = { 0, UINT64_MAX };
    // repeatTimes = 8, 64 elements one repeat, 512 elements total
    // dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
    // dstRepStride, srcRepStride = 8, no gap between repeats
    AscendC::Cast(dstLocal, srcLocal, AscendC::RoundMode::CAST_CEIL, mask, 8, { 1, 1, 8, 4 });
  • Example of computing the first n data elements of a tensor
    AscendC::Cast(dstLocal, srcLocal, AscendC::RoundMode::CAST_CEIL, 512);

Result example:

Input (srcLocal):
[29.5    83.6    16.75   45.1    40.62   69.06   47.6    96.5    72.7
 57.56   61.25   69.7    29.27   91.2    70.1    14.484   9.625  21.58
  9.336   3.125  63.72    9.9    17.28   73.2    75.7    29.81   98.8
 99.06   72.94    3.785  24.94   25.56   39.1    58.94   39.6    78.4
  5.43   25.48    9.58   60.8    77.56   29.7    70.3     6.312   4.047
 87.1    81.6    76.56   59.28   55.66   81.75   73.56   76.9    54.38
  7.254  37.84   11.08   77.6    83.6    89.2    93.06    2.96   76.56
 62.16   76.25   95.44   86.6    86.75   29.83   82.2    55.03   64.9
 56.44   12.89   87.06   39.34   72.25   43.06   63.4    51.72   63.9
  0.703  47.84   27.73   99.     89.     97.3     1.277  58.44   14.05
 78.9    98.5    28.55   44.8    41.03   40.75   74.2    74.06   10.51
 69.2    25.83   35.8    85.5    25.12   82.25   95.3    36.75   55.88
 90.9    57.47    7.13   18.1    40.97   31.     99.3    69.4    72.94
 62.44   63.7    80.     37.94   11.11   37.     39.72   87.94   31.72
 25.7    54.7    32.8    21.64   14.53   55.1     3.607  40.16   77.7
 15.15   77.44   43.25   85.75   67.3    30.33   67.56   60.72   58.16
 19.84   89.2    18.75   55.56   31.61    9.445   6.5    27.95   48.5
 37.16    7.805  37.72   69.6    36.2    92.56   24.72   41.56   48.44
 19.27   25.94   25.      8.836  55.75   77.8    25.84   46.16   71.7
 63.62   33.28    3.719  55.22   45.97   35.8    27.86   42.22    3.078
 92.06    0.805  51.97   76.4    32.03   74.56   28.1    91.2    35.38
  0.2009 74.25   87.5    92.75   76.25   51.28   22.9    34.4    28.23
 87.5    78.75   63.1    61.56   79.94    6.766  95.1    55.     56.75
 39.66   94.75   24.19   29.83   72.6    99.9    12.43   46.56   51.9
 92.3    42.66   91.8    95.8    35.2    13.08   60.7    22.22    6.055
  2.23   13.875  71.3    99.56   91.94   92.     96.06   97.5    68.75
  8.61    1.157  68.2    20.73   63.44   90.     38.78   64.4    88.9
 20.75   14.03   97.06   66.8    57.9    86.94   28.5     0.2279 51.8
 84.56   39.53   93.     15.66   15.23   71.75   11.44   45.28   57.38
 82.5    88.7     9.74   90.4    61.56   68.56   11.22   69.3    40.28
 24.78   84.44   23.92    8.4    20.88   48.2    17.42   59.84   93.2
  2.191  95.94   93.06   54.53   76.5    37.     41.7    82.7    69.5
 92.6     5.8    32.78   84.56   26.5    96.56    0.858  96.44   52.8
 90.9    30.52    2.656  32.03   35.72    8.125  21.94   84.5    66.7
 96.75   46.8     1.42   58.3    28.75   44.94   66.2    28.67   11.695
 41.75   67.25   26.75   17.72   35.9     5.72   55.88   94.7    80.8
 71.     86.06   36.78   81.06   56.8    61.34   11.42   74.     32.16
 14.695  78.6    56.1    64.4    61.75   50.88   39.6    79.94   71.25
 40.7     5.99   67.4    62.28   89.25   12.02   63.12   33.1    59.06
 28.2    19.22   59.66   51.6    53.28   97.8    42.25   82.     39.7
 50.6    95.06   20.64   26.62   54.9    55.     28.44   26.25   46.56
 87.06   98.44   49.34   37.2    97.4    34.3    83.4    57.4    94.
 29.31   79.44   19.72   54.9    50.25   58.75   92.5    17.3    17.88
 44.7     6.047  50.78   75.3    21.66   71.5    97.75   35.8    93.6
  4.367  31.02   66.5    48.25   34.     92.7    36.97   86.5    10.37
 82.     29.39   10.63   40.72   72.5    31.56   96.5    70.44    6.074
 37.34    7.58   21.72   44.97   77.6    14.22   18.62   47.97   54.6
 99.56   81.7    35.75   44.22   28.64   91.56    1.005  44.      8.125
 11.7    93.6    70.25   63.94   11.05   50.97   56.47   39.4    35.53
 84.     10.21   42.66   62.12   87.7    71.25   87.75   56.03   60.88
 31.81   68.1    91.1    67.3    53.6    96.06   43.75   27.86   46.6
 87.7    29.47    2.174  88.4    49.53   63.53   84.9    91.75   48.53
 91.94   88.44   58.3    88.44   23.11   91.56   71.4    59.66   93.44
 28.56   93.3    59.94   90.     18.95   52.8    70.3    58.     21.47
 93.7    45.03   84.25   34.06   23.86   38.4     5.566  41.5    35.1
 34.8    32.8    81.44   74.75   95.9    23.56    3.562  48.72   92.7
 43.88   83.75   69.06   85.8    22.84   63.78   90.94   52.78  ]
Output (dstLocal):
[ 30  84  17  46  41  70  48  97  73  58  62  70  30  92  71  15  10  22
  10   4  64  10  18  74  76  30  99 100  73   4  25  26  40  59  40  79
   6  26  10  61  78  30  71   7   5  88  82  77  60  56  82  74  77  55
   8  38  12  78  84  90  94   3  77  63  77  96  87  87  30  83  56  65
  57  13  88  40  73  44  64  52  64   1  48  28  99  89  98   2  59  15
  79  99  29  45  42  41  75  75  11  70  26  36  86  26  83  96  37  56
  91  58   8  19  41  31 100  70  73  63  64  80  38  12  37  40  88  32
  26  55  33  22  15  56   4  41  78  16  78  44  86  68  31  68  61  59
  20  90  19  56  32  10   7  28  49  38   8  38  70  37  93  25  42  49
  20  26  25   9  56  78  26  47  72  64  34   4  56  46  36  28  43   4
  93   1  52  77  33  75  29  92  36   1  75  88  93  77  52  23  35  29
  88  79  64  62  80   7  96  55  57  40  95  25  30  73 100  13  47  52
  93  43  92  96  36  14  61  23   7   3  14  72 100  92  92  97  98  69
   9   2  69  21  64  90  39  65  89  21  15  98  67  58  87  29   1  52
  85  40  93  16  16  72  12  46  58  83  89  10  91  62  69  12  70  41
  25  85  24   9  21  49  18  60  94   3  96  94  55  77  37  42  83  70
  93   6  33  85  27  97   1  97  53  91  31   3  33  36   9  22  85  67
  97  47   2  59  29  45  67  29  12  42  68  27  18  36   6  56  95  81
  71  87  37  82  57  62  12  74  33  15  79  57  65  62  51  40  80  72
  41   6  68  63  90  13  64  34  60  29  20  60  52  54  98  43  82  40
  51  96  21  27  55  55  29  27  47  88  99  50  38  98  35  84  58  94
  30  80  20  55  51  59  93  18  18  45   7  51  76  22  72  98  36  94
   5  32  67  49  34  93  37  87  11  82  30  11  41  73  32  97  71   7
  38   8  22  45  78  15  19  48  55 100  82  36  45  29  92   2  44   9
  12  94  71  64  12  51  57  40  36  84  11  43  63  88  72  88  57  61
  32  69  92  68  54  97  44  28  47  88  30   3  89  50  64  85  92  49
  92  89  59  89  24  92  72  60  94  29  94  60  90  19  53  71  58  22
  94  46  85  35  24  39   6  42  36  35  33  82  75  96  24   4  49  93
  44  84  70  86  23  64  91  53]
  • When int4b_t is involved in Cast, the calling example is as follows:

    dstLocal (int8_t), srcLocal (half)

    inBufferSize_ = srcSize;  // src buffer size
    outBufferSize_ = srcSize / 2;   //dst buffer size
    uint64_t mask = 128;
    AscendC::LocalTensor<half> srcLocal;
    srcLocal.SetSize(inBufferSize_);
    AscendC::LocalTensor<int8_t> dstLocal;
    dstLocal.SetSize(outBufferSize_);
    AscendC::LocalTensor<AscendC::int4b_t> dstLocalTmp = dstLocal.ReinterpretCast<AscendC::int4b_t>();
    // repeatTimes = 1, 128 elements one repeat, 128 elements total
    // dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
    // dstRepStride = 4, srcRepStride = 8, no gap between repeats
    AscendC::Cast<AscendC::int4b_t, half>(dstLocalTmp, srcLocal, AscendC::RoundMode::CAST_CEIL, mask, 1, {1, 1, 4, 8});

Template Sample

This section provides a template sample to help you quickly run reference instruction samples.

You can use the following template sample as the code framework and only need to copy the sample snippet in specific instructions to replace the content in bold.
#include "kernel_operator.h"
class KernelCast {
public:
    __aicore__ inline KernelCast() {}
    __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half*)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ int32_t*)dstGm);
        pipe.InitBuffer(inQueueSrc, 1, 512 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(int32_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, 512);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>();

        AscendC::Cast(dstLocal, srcLocal, AscendC::RoundMode::CAST_CEIL, 512);

        outQueueDst.EnQue<int32_t>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, 512);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal;
    AscendC::GlobalTensor<int32_t> dstGlobal;
};
extern "C" __global__ __aicore__ void cast_simple_kernel(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
{
    KernelCast op;
    op.Init(srcGm, dstGm);
    op.Process();
}

More Samples

You can refer to the following examples to learn how to use the high-dimensional tensor sharding computation API of the Cast instruction to perform more operations and implement more advanced functions.

To run the sample code, copy the code snippet and replace the code in bold of the Compute function in the single-operand instruction template provided in Template Sample.

  • Use the contiguous mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data calculation.
    uint64_t mask = 32; // Only the first 32 bits are computed in each iteration.
    AscendC::Cast(dstLocal, srcLocal, AscendC::RoundMode::CAST_CEIL, mask, 8, { 1, 1, 8, 4 });

    Result example:

    Input (srcLocal):
    [37.4     7.11   53.5    19.44   22.66   43.     43.16    5.316  74.2
     15.7    87.75   86.94   92.56   25.45   36.06   94.6    73.6    30.48
     48.16   12.55   27.81   14.67    6.58   48.38   67.5    57.5    63.3
     85.2     3.654  68.7    52.53   16.38   13.945  63.84   87.2    82.5
     85.7    27.78   15.41   41.66   31.38   14.65   88.25    0.0332 43.06
     46.88   15.57   87.1    53.16   33.5    91.06   36.5    55.34   60.53
      3.238  23.92   97.5    91.1    78.44   54.47   82.     53.8    72.1
     25.06   32.12   15.88   33.38   36.7    33.3    84.4    19.25    1.743
     46.16   22.06    4.582  71.1    15.94   22.23   53.47   17.05   48.56
     94.44   77.4    90.2    46.56   92.4     9.45   68.44   35.7    31.62
     68.1    63.7    77.     92.06   20.45   27.67   93.4    22.39   17.22
     73.06    7.12   25.34   36.34   13.54   38.12   24.56   86.56   69.7
     68.3    30.38   68.4    86.1    54.44   70.     55.3    48.6    59.03
     64.44   15.45   66.5    92.7    60.7    52.22   47.     99.75   41.94
     43.06   89.5    36.9    62.5     1.306  48.06    9.37   62.25   20.61
     43.8    69.25   27.22   71.44   52.75   11.82   80.6    63.44   53.22
     85.44   25.25    2.309  26.88   84.5    29.83    9.93   81.9    97.75
     75.75   97.7    72.     19.86   26.62   88.7    74.06    9.24   42.5
     14.     39.44   98.56   66.94   89.     57.12   39.     11.57   19.05
     86.56   32.66   19.25   99.3    95.6    58.7    79.6    37.38   65.
     75.7     8.586  77.7     2.68   75.7    77.56   39.1    39.72   64.06
     98.44   30.27   31.9    94.4    85.94    4.965   2.758  92.4    49.53
     50.75    5.7    19.69   87.6    20.08   88.8    87.4    63.6    68.3
     78.9    45.66   10.01   35.25   71.9    37.38   39.7    43.47   11.67
     64.3    35.62   74.3    59.3    28.69   29.56   23.14   36.22    4.88
     70.5    25.05   72.6    71.6    32.28   34.66   80.     96.1    98.7
     12.91   95.4    61.97   87.94   19.1    40.47   89.6    84.     29.72
     17.8    81.44   23.25   33.03   18.67   78.     49.62   63.1    72.75
     77.25    3.74   38.9    17.92   76.     25.62   34.53   84.     32.03
     57.3     9.21    6.836  68.9    35.78   96.75   56.3    96.1    23.45
     78.75   94.25   12.44   56.7    24.55   25.11   90.7    50.94   78.4
      3.576  21.81   53.28   26.2    43.1     7.742  13.4    86.44   86.9
     13.93   16.48   91.06   42.3    95.5    66.8    40.6    98.06   71.9
     67.6    55.9    82.44   93.75   41.53   23.62   40.12   40.53   80.7
     80.25   96.3    51.38   93.6    91.3    32.84   88.     69.7    63.16
     41.75   43.22   43.22   31.73   84.9    91.6    80.     53.34   27.12
     76.6    97.25   44.5    30.28   74.3    76.06   40.     41.28   37.72
     99.56   18.73   16.45   92.75   79.1    40.3    68.     23.98   88.7
     86.6    24.97   59.6    28.25   82.94   46.12   60.12   34.53   79.7
     11.086  20.25   44.88   39.97   42.12   62.7    30.66   42.56   16.69
     85.2    90.8    78.75   26.16   18.14   94.06   40.3    20.16   38.
     12.99   95.44   76.25   26.03   76.     30.06   27.25   84.56   30.45
     66.1    83.25    3.732  39.1    54.22   82.8    43.22   53.03   11.66
     88.1     6.83   66.8    44.4     7.5    24.77   74.4    35.9    79.75
     41.62   37.06   60.12   57.9    96.94   84.25   39.88   22.55   72.7
     58.9    44.75   90.4    46.34   71.3    16.4    26.12   21.45   10.27
     91.     41.53   39.03   80.25    2.11    7.88   72.2    27.83   88.1
     67.56   10.72   52.84   91.2    97.6    51.44   74.7     3.527  79.25
     11.3    19.16   39.53    3.469  98.7    45.72   40.16   47.1    71.8
     11.81   52.97   71.44   37.7    26.81   46.22   26.94    4.805  12.18
     70.4    51.4    24.2    83.9     9.62   12.445  57.6    85.8    55.12
     88.25   32.38   62.88    1.903  47.72   35.9    48.94   86.06   32.44
      1.219  35.56   49.78   49.97   24.45   94.5    99.94   44.72    3.404
     83.6    23.14   76.7    91.7    24.33   20.62   24.72    4.55   88.94
     87.44   95.75   41.56   13.77   34.6    95.94   77.1    24.28   70.06
     10.06   11.38   88.8    57.22   94.56   35.     79.8    58.22   44.06
     26.9    16.25   99.94   51.1    42.38   84.25    0.9604 48.1   ]
    
    Output (dstLocal):
    [        38          8         54         20         23         43
             44          6         75         16         88         87
             93         26         37         95         74         31
             49         13         28         15          7         49
             68         58         64         86          4         69
             53         17 1879993057 1827499998 1823960025 1570990114
     1828150463 1811639312 1794470101 1754296176 1888841335 1715628997
     1839753994 1850888497 1889364175 1891068936 1823369913 1769105534
     1815638091 1808559970 1601662785 1739089473 1863146361 1694785989
     1597138938 1836478181 1888774249 1637707434 1877372650 1796304934
     1887530885 1839295471 1707240971 1873242695         33         16
             34         37         34         85         20          2
             47         23          5         72         16         23
             54         18         49         95         78         91
             47         93         10         69         36         32
             69         64         77         93         21         28
     1753837732 1488743807 1711632378 1799581711 1818783215 1891790695
     1837723802 1752132873 1727950918 1760390205 1866887130 1824876865
     1807839436 1890544910 1889755550 1787129270 1502702106 1841065201
     1820156583 1779396288 1760521448 1844604520 1831039103 1843491014
     1891199259 1839493317 1801349958 1577807434 1811377215 1879404734
     1826057367 1837853054         37         63          2         49
             10         63         21         44         70         28
             72         53         12         81         64         54
             86         26          3         27         85         30
             10         82         98         76         98         72
             20         27         89         75 1890086927 1826517134
     1814783944 1824156809 1875733079 1842114682 1845456975 1830120794
     1787980861 1807380585 1535469972 1883860884 1889167601 1747872128
     1888317235 1720937006 1836806331 1654152236 1695309475 1892773593
     1840737395 1868392748 1833724316 1600153936 1869310159 1883467778
     1892641857 1776248953 1833201514 1886743848 1745972258 1860657622
             95         86          5          3         93         50
             51          6         20         88         21         89
             88         64         69         79         46         11
             36         72         38         40         44         12
             65         36         75         60         29         30
             24         37 1733652589 1756325317 1685744372 1780772214
     1660252348 1629973784 1847815925 1828941229 1683778661 1519480967
     1762160488 1844801381 1832742021 1891724641 1761701480 1695312651
     1429433841 1774275423 1828349211 1779786303 1835953259 1784896595
     1858432988 1413442268 1893363867 1886679050 1872588913 1473866635
     1793158916 1762946052 1719627087 1893231666         76         26
             35         84         33         58         10          7
             69         36         97         57         97         24
             79         95         13         57         25         26
             91         51         79          4         22         54
             27         44          8         14         87         87
     1208960410 1208567888 1215973275 1214859418 1210992732 1208305714
     1165379704 1215252623 1197033625 1212172318 1217415148 1211058280
     1206798479 1215645776 1209223143 1217611809 1212500082 1215383615
     1208567769 1200179260 1216694386 1218070398 1195526187 1211516213
     1213089850 1213941743 1216825148 1212565573 1216694248 1217546087
     1200178778 1215973524         92         80         54         28
             77         98         45         31         75         77
             40         42         38        100         19         17
             93         80         41         68         24         89
             87         25         60         29         83         47
             61         35         80         12 1189759014 1218070662
     1211057953 1216825400 1215121414 1214596952 1216169420 1210075102
     1209157633 1213941279 1195984973 1211648118 1201686666 1212041360
     1216103688 1212500024 1173112887 1194608648 1216825427 1209747582
     1207191587 1214859224 1203980407 1215711379 1213155353 1203259396
     1214859350 1211779156 1217218713 1202473003 1216628529 1196771437
             44         54         12         89          7         67
             45          8         25         75         36         80
             42         38         61         58         97         85
             40         23         73         59         45         91
             47         72         17         27         22         11
             91         42 1211516990 1198213215 1203848735 1217349746
     1212303398 1217808150 1215514752 1209878647 1214138433 1215711277
     1212041273 1215383541 1214728294 1197754500 1169574019 1208371214
     1214269569 1216301092 1216563283 1213548677 1217873970 1203128459
     1209812695 1218136029 1194805359 1204439186 1218005120 1213941626
     1217153046 1208109091 1215055928 1215318166          5         13
             71         52         25         84         10         13
             58         86         56         89         33         63
              2         48         36         49         87         33
              2         36         50         50         25         95
            100         45          4         84         24         77
     1202210969 1213679767 1209288847 1217480263 1184319410 1214072674
     1188382819 1217283870 1200769107 1217939416 1199327294 1213351841
     1206667407 1217153163 1215580283 1214138474 1206798167 1194477696
     1193690499 1214072706 1216825421 1216693888 1217611496 1198540949
     1199654414 1206405188 1214203847 1165183076 1213745302 1208830102
     1209944118 1215121459]