NPU Custom Operators

Table 1 NPU custom operators

  No.  Operator name
  1    torch_npu._npu_dropout
  2    torch_npu._npu_dropout_inplace
  3    torch_npu.copy_memory_
  4    torch_npu.empty_with_format
  5    torch_npu.fast_gelu
  6    torch_npu.npu_alloc_float_status
  7    torch_npu.npu_anchor_response_flags
  8    torch_npu.npu_apply_adam
  9    torch_npu.npu_batch_nms
  10   torch_npu.npu_bert_apply_adam
  11   torch_npu.npu_bmmV2
  12   torch_npu.npu_bounding_box_decode
  13   torch_npu.npu_bounding_box_encode
  14   torch_npu.npu_broadcast
  15   torch_npu.npu_ciou
  16   torch_npu.npu_clear_float_status
  17   torch_npu.npu_confusion_transpose
  18   torch_npu.npu_conv_transpose2d
  19   torch_npu.npu_conv2d
  20   torch_npu.npu_conv3d
  21   torch_npu.npu_convolution
  22   torch_npu.npu_convolution_transpose
  23   torch_npu.npu_deformable_conv2d
  24   torch_npu.npu_diou
  25   torch_npu.npu_dropoutV2
  26   torch_npu.npu_dtype_cast
  27   torch_npu.npu_format_cast
  28   torch_npu.npu_format_cast_
  29   torch_npu.npu_get_float_status
  30   torch_npu.npu_giou
  31   torch_npu.npu_grid_assign_positive
  32   torch_npu.npu_gru
  33   torch_npu.npu_ifmr
  34   torch_npu.npu_indexing
  35   torch_npu.npu_iou
  36   torch_npu.npu_layer_norm_eval
  37   torch_npu.npu_linear
  38   torch_npu.npu_lstm
  39   torch_npu.npu_masked_fill_range
  40   torch_npu.npu_max
  41   torch_npu.npu_mish
  42   torch_npu.npu_nms_v4
  43   torch_npu.npu_nms_with_mask
  44   torch_npu.npu_normalize_batch
  45   torch_npu.npu_one_hot
  46   torch_npu.npu_pad
  47   torch_npu.npu_ps_roi_pooling
  48   torch_npu.npu_ptiou
  49   torch_npu.npu_random_choice_with_mask
  50   torch_npu.npu_roi_align
  51   torch_npu.npu_scatter
  52   torch_npu.npu_sign_bits_pack
  53   torch_npu.npu_sign_bits_unpack
  54   torch_npu.npu_slice
  55   torch_npu.npu_softmax_cross_entropy_with_logits
  56   torch_npu.npu_sort_v2
  57   torch_npu.npu_stride_add
  58   torch_npu.npu_transpose
  59   torch_npu.npu_yolo_boxes_encode
  60   torch_npu.one_

Mapping Relationships

Some parameters of the NPU custom operators use mapped values; refer to the table below.

Table 2 Mapping table

  Parameter                 Mapped value    Description
  ACL_FORMAT_UNDEFINED      -1              Format parameter mapping values.
  ACL_FORMAT_NCHW           0
  ACL_FORMAT_NHWC           1
  ACL_FORMAT_ND             2
  ACL_FORMAT_NC1HWC0        3
  ACL_FORMAT_FRACTAL_Z      4
  ACL_FORMAT_NC1HWC0_C04    12
  ACL_FORMAT_HWCN           16
  ACL_FORMAT_NDHWC          27
  ACL_FORMAT_FRACTAL_NZ     29
  ACL_FORMAT_NCDHW          30
  ACL_FORMAT_NDC1HWC0       32
  ACL_FRACTAL_Z_3D          33
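
A minimal sketch of how the mapped integer values above are passed to operators that take an acl_format argument (torch_npu.empty_with_format is documented later in this section; passing acl_format by keyword and the shape/dtype are illustrative assumptions):
    >>> import torch, torch_npu
    >>> # 2 maps to ACL_FORMAT_ND, the documented default of empty_with_format
    >>> x = torch_npu.empty_with_format((2, 3), dtype=torch.float32, device="npu", acl_format=2)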

Detailed Operator Interface Descriptions

torch_npu.npu_apply_adam(beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad, use_locking, use_nesterov, out = (var, m, v))

Compute the Adam optimizer result.
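
A minimal usage sketch based on the signature above; the tensor shapes and all hyperparameter values are illustrative assumptions, not taken from the source:
    >>> import torch, torch_npu
    >>> var = torch.rand(2, 2).npu()
    >>> m = torch.zeros(2, 2).npu()
    >>> v = torch.zeros(2, 2).npu()
    >>> grad = torch.rand(2, 2).npu()
    >>> var, m, v = torch_npu.npu_apply_adam(0.9, 0.99, 1e-3, 0.9, 0.99, 1e-8, grad, False, False, out=(var, m, v))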

torch_npu.npu_convolution_transpose(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor

Apply a 2D or 3D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.

torch_npu.npu_conv_transpose2d(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor

Apply a 2D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.

torch_npu.npu_convolution(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Apply a 2D or 3D convolution over an input image composed of several input planes.

torch_npu.npu_conv2d(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Apply a 2D convolution over an input image composed of several input planes.

torch_npu.npu_conv3d(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Apply a 3D convolution over an input image composed of several input planes.

torch_npu.one_(self) -> Tensor

Fill the self tensor with the value 1.
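
A minimal sketch of the in-place fill described above (the shape is illustrative):
    >>> import torch, torch_npu
    >>> x = torch.zeros(2, 3).npu()
    >>> torch_npu.one_(x)
    tensor([[1., 1., 1.],
            [1., 1., 1.]], device='npu:0')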

torch_npu.npu_sort_v2(self, dim=-1, descending=False, out=None) -> Tensor

Sort the elements of the input tensor along a given dimension in ascending order by values without indices. If dim is not given, the last dimension of the input is chosen. If descending is True then the elements are sorted in descending order by value.
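
A minimal sketch using the default ascending sort along the last dimension (the input values are illustrative):
    >>> import torch, torch_npu
    >>> x = torch.tensor([[3., 1., 2.], [6., 5., 4.]]).npu()
    >>> torch_npu.npu_sort_v2(x)
    tensor([[1., 2., 3.],
            [4., 5., 6.]], device='npu:0')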

torch_npu.npu_format_cast(self, acl_format) -> Tensor

Change the format of an npu tensor.
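
A minimal sketch using a mapped value from Table 2; the choice of ACL_FORMAT_NC1HWC0 (value 3) and the tensor shape are illustrative assumptions:
    >>> import torch, torch_npu
    >>> x = torch.rand(2, 3, 4, 4).npu()
    >>> y = torch_npu.npu_format_cast(x, 3)   # 3 maps to ACL_FORMAT_NC1HWC0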

torch_npu.npu_format_cast_(self, src) -> Tensor

In-place change the format of self, with the same format as src.
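
A minimal sketch of the in-place variant; it assumes src already carries the desired format (obtained here with npu_format_cast, with shapes chosen for illustration):
    >>> import torch, torch_npu
    >>> dst = torch.rand(2, 3, 4, 4).npu()
    >>> src = torch_npu.npu_format_cast(torch.rand(2, 3, 4, 4).npu(), 3)
    >>> dst = torch_npu.npu_format_cast_(dst, src)   # dst now uses the same format as src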

torch_npu.npu_transpose(self, perm, require_contiguous) -> Tensor

Return a view of the original tensor with its dimensions permuted, and make the result contiguous.
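
A minimal sketch of the documented permute-and-make-contiguous behavior; the shape, permutation, and the value passed for require_contiguous are illustrative assumptions:
    >>> import torch, torch_npu
    >>> x = torch.rand(2, 3, 5).npu()
    >>> y = torch_npu.npu_transpose(x, (2, 0, 1), True)
    >>> y.shape
    torch.Size([5, 2, 3])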

torch_npu.npu_broadcast(self, size) -> Tensor

Return a new view of the self tensor with singleton dimensions expanded to a larger size, and make the result contiguous.

Tensor can be also expanded to a larger number of dimensions, and the new ones will be appended at the front.
  • Parameters:
    • self (Tensor) - the input tensor
    • size (ListInt) - the desired expanded size
  • Constraints:

    None

  • Examples:
    >>> x = torch.tensor([[1], [2], [3]]).npu()
    >>> x.shape
    torch.Size([3, 1])
    >>> x.npu_broadcast(3, 4)
    tensor([[1, 1, 1, 1],
            [2, 2, 2, 2],
            [3, 3, 3, 3]], device='npu:0')
torch_npu.npu_dtype_cast(input, dtype) -> Tensor
Perform tensor dtype conversion.
  • Parameters:
    • input (Tensor) - the input tensor
    • dtype (torch.dtype) - the desired data type of the returned tensor
  • Constraints:

    None

  • Examples:
    >>> torch_npu.npu_dtype_cast(torch.tensor([0, 0.5, -1.]).npu(), dtype=torch.int)
    tensor([ 0,  0, -1], device='npu:0', dtype=torch.int32)
torch_npu.empty_with_format(size, dtype, layout, device, pin_memory, acl_format) -> Tensor
Return a tensor filled with uninitialized data. The shape of the tensor is defined by the variable argument size. The format of the tensor is defined by the variable argument acl_format.
  • Parameters:
    • size (ListInt) – A sequence of integers defining the shape of the output tensor. Can be a variable number of arguments or a collection like a list or tuple.
    • dtype (torch.dtype, optional) – The desired data type of returned tensor. Default: None. If None, use a global default (see torch.set_default_tensor_type()).
    • layout (torch.layout, optional) – The desired layout of returned tensor. Default: torch.strided.
    • device (torch.device, optional) – The desired device of returned tensor. Default: None.
    • pin_memory (Bool, optional) – If set, the returned tensor will be allocated in the pinned memory. Default: False.
    • acl_format (Int) – The desired memory format of returned tensor. Default: 2.
  • Constraints:

    None

  • Examples:
    >>> torch_npu.empty_with_format((2, 3), dtype=torch.float32, device="npu")
    tensor([[1., 1., 1.],
            [1., 1., 1.]], device='npu:0')
torch_npu.copy_memory_(dst, src, non_blocking=False) -> Tensor
Copy the elements from src into self tensor and return self.
  • Parameters:
    • dst (Tensor) - the destination tensor to copy into
    • src (Tensor) - the source tensor to copy from
    • non_blocking (Bool) - If True and this copy is between CPU and NPU, the copy may occur asynchronously with respect to the host. In other cases, this argument has no effect.
  • Constraints:

    copy_memory_ only supports NPU tensors. Input tensors of copy_memory_ should have the same dtype and device index.

  • Examples:
    >>> a=torch.IntTensor([0,  0, -1]).npu()
    >>> b=torch.IntTensor([1, 1, 1]).npu()
    >>> a.copy_memory_(b)
    tensor([1, 1, 1], device='npu:0', dtype=torch.int32)
torch_npu.npu_one_hot(input, num_classes=-1, depth=1, on_value=1, off_value=0) -> Tensor
Return a one-hot tensor. The locations represented by index in "x" take the value of "on_value", while all other locations take the value of "off_value".
  • Parameters:
    • input (Tensor) - Class values of any shape.
    • num_classes (Int) - The axis to fill. Default: "-1".
    • depth (Int) - The depth of the one_hot dimension.
    • on_value (Scalar) - The value to fill in output when indices[j] == i.
    • off_value (Scalar) - The value to fill in output when indices[j] != i.
  • Constraints:

    None

  • Examples:
    >>> a=torch.IntTensor([5, 3, 2, 1]).npu()
    >>> b=torch_npu.npu_one_hot(a, depth=5)
    >>> b
    tensor([[0., 0., 0., 0., 0.],
            [0., 0., 0., 1., 0.],
            [0., 0., 1., 0., 0.],
            [0., 1., 0., 0., 0.]], device='npu:0')
torch_npu.npu_stride_add(x1, x2, offset1, offset2, c1_len) -> Tensor
Add the partial values of two tensors in the format NC1HWC0.
  • Parameters:
    • x1 (Tensor) - A tensor in 5HD.
    • x2 (Tensor) - A tensor of the same type as "x1", and the same shape as "x1", except for the C1 value.
    • offset1 (Scalar) - A required int. Offset value of C1 in "x1".
    • offset2 (Scalar) - A required int. Offset value of C1 in "x2".
    • c1_len (Scalar) - A required int. C1 len of "y". The value must be less than the difference between C1 and offset in "x1" and "x2".
  • Constraints:

    None

  • Examples:
    >>> a=torch.tensor([[[[[1.]]]]]).npu()
    >>> b=torch_npu.npu_stride_add(a, a, 0, 0, 1)
    >>> b
    tensor([[[[[2.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]],
            [[[0.]]]]], device='npu:0')
torch_npu.npu_softmax_cross_entropy_with_logits(features, labels) -> Tensor

Compute softmax cross entropy cost.
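
A minimal sketch; it assumes, by analogy with the TensorFlow operator of the same name, that labels are per-class probabilities (e.g. one-hot) with the same shape as features, which is not stated in the source:
    >>> import torch, torch_npu
    >>> features = torch.rand(2, 4).npu()
    >>> labels = torch.tensor([[0., 0., 1., 0.], [1., 0., 0., 0.]]).npu()
    >>> loss = torch_npu.npu_softmax_cross_entropy_with_logits(features, labels)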

torch_npu.npu_ps_roi_pooling(x, rois, spatial_scale, group_size, output_dim) -> Tensor
Perform Position Sensitive PS ROI Pooling.
  • Parameters:
    • x (Tensor) - An NC1HWC0 tensor, describing the feature map; dimension C1 must be equal to int((output_dim+15)/C0) * group_size.
    • rois (Tensor) - A tensor with shape [batch, 5, rois_num], describing the ROIs. Each ROI consists of five elements: "batch_id", "x1", "y1", "x2", and "y2", where "batch_id" indicates the index of the input feature map and "x1", "y1", "x2", and "y2" must be greater than or equal to "0.0".
    • spatial_scale (Float) - A required float32, scaling factor for mapping the input coordinates to the ROI coordinates.
    • group_size (Int) - A required int32, specifying the number of groups to encode position-sensitive score maps. Must be within the range (0, 128).
    • output_dim (Int) - A required int32, specifying the number of output channels. Must be greater than 0.
  • Constraints:

    None

  • Examples:
    >>> roi = torch.tensor([[[1], [2], [3], [4], [5]],
                            [[6], [7], [8], [9], [10]]], dtype = torch.float16).npu()
    >>> x = torch.tensor([[[[ 1]], [[ 2]], [[ 3]], [[ 4]],
                          [[ 5]], [[ 6]], [[ 7]], [[ 8]]],
                          [[[ 9]], [[10]], [[11]], [[12]],
                          [[13]], [[14]], [[15]], [[16]]]], dtype = torch.float16).npu()
    >>> out = torch_npu.npu_ps_roi_pooling(x, roi, 0.5, 2, 2)
    >>> out
    tensor([[[[0., 0.],
              [0., 0.]],
            [[0., 0.],
              [0., 0.]]],
            [[[0., 0.],
              [0., 0.]],
            [[0., 0.],
              [0., 0.]]]], device='npu:0', dtype=torch.float16)
torch_npu.npu_roi_align(features, rois, spatial_scale, pooled_height, pooled_width, sample_num, roi_end_mode) -> Tensor

Obtain the ROI feature matrix from the feature map. It is a customized FasterRcnn operator.

torch_npu.npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold, pad_to_max_output_size=False) -> (Tensor, Tensor)
Greedily select a subset of bounding boxes in descending order of score.
  • Parameters:
    • boxes (Tensor) - A 2D float tensor of shape [num_boxes, 4].
    • scores (Tensor) - An 1D float tensor of shape [num_boxes] representing a single score corresponding to each box (each row of boxes).
    • max_output_size (Scalar) - A scalar representing the maximum number of boxes to be selected by non max suppression.
    • iou_threshold (Tensor) - A 0D float tensor representing the threshold for deciding whether boxes overlap too much with respect to IOU.
    • scores_threshold (Tensor) - A 0D float tensor representing the threshold for deciding when to remove boxes based on score.
    • pad_to_max_output_size (Bool) - If True, the output selected_indices is padded to be of length max_output_size. Default: False.
  • Returns:
    • selected_indices - An 1D integer tensor of shape [M] representing the selected indices from the boxes tensor, where M <= max_output_size.
    • valid_outputs - A 0D integer tensor representing the number of valid elements in selected_indices, with the valid elements appearing first.
  • Constraints:

    None

  • Examples:
    >>> boxes=torch.randn(100,4).npu()
    >>> scores=torch.randn(100).npu()
    >>> boxes.uniform_(0,100)
    >>> scores.uniform_(0,1)
    >>> max_output_size = 20
    >>> iou_threshold = torch.tensor(0.5).npu()
    >>> scores_threshold = torch.tensor(0.3).npu()
    >>> npu_output = torch_npu.npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold)
    >>> npu_output
    (tensor([57, 65, 25, 45, 43, 12, 52, 91, 23, 78, 53, 11, 24, 62, 22, 67,  9, 94,
            54, 92], device='npu:0', dtype=torch.int32), tensor(20, device='npu:0', dtype=torch.int32))
torch_npu.npu_nms_rotated(dets, scores, iou_threshold, scores_threshold=0, max_output_size=-1, mode=0) -> (Tensor, Tensor)
Greedily select a subset of the rotated bounding boxes in descending order of score.
  • Parameters:
    • dets (Tensor) - A 2D float tensor of shape [num_boxes, 5].
    • scores (Tensor) - An 1D float tensor of shape [num_boxes] representing a single score corresponding to each box (each row of boxes).
    • iou_threshold (Float) - A scalar representing the threshold for deciding whether boxes overlap too much with respect to IOU.
    • scores_threshold (Float) - A scalar representing the threshold for deciding when to remove boxes based on score. Default: 0.
    • max_output_size (Int) - A scalar integer tensor representing the maximum number of boxes to be selected by non max suppression. Default: -1, that is, no constraint is imposed.
    • mode (Int) - This parameter specifies the layout type of the dets. If mode is set to 0, the input values of dets are x, y, w, h, and angle. If mode is set to 1, the input values of dets are x1, y1, x2, y2, and angle. Default: 0.
  • Returns:
    • selected_index - An 1D integer tensor of shape [M] representing the selected indices from the dets tensor, where M <= max_output_size.
    • selected_num - A 0D integer tensor representing the number of valid elements in selected_indices.
  • Constraints:

    None

  • Examples:
    >>> dets=torch.randn(100,5).npu()
    >>> scores=torch.randn(100).npu()
    >>> dets.uniform_(0,100)
    >>> scores.uniform_(0,1)
    >>> output1, output2 = torch_npu.npu_nms_rotated(dets, scores, 0.2, 0, -1, 1)
    >>> output1
    tensor([76, 48, 15, 65, 91, 82, 21, 96, 62, 90, 13, 59,  0, 18, 47, 23,  8, 56,
            55, 63, 72, 39, 97, 81, 16, 38, 17, 25, 74, 33, 79, 44, 36, 88, 83, 37,
            64, 45, 54, 41, 22, 28, 98, 40, 30, 20,  1, 86, 69, 57, 43,  9, 42, 27,
            71, 46, 19, 26, 78, 66,  3, 52], device='npu:0', dtype=torch.int32)
    >>> output2
    tensor([62], device='npu:0', dtype=torch.int32)
torch_npu.npu_lstm(x, weight, bias, seqMask, h, c, has_biases, num_layers, dropout, train, bidirectional, batch_first, flag_seq, direction)

DynamicRNN calculation.

torch_npu.npu_iou(bboxes, gtboxes, mode=0) -> Tensor 
torch_npu.npu_ptiou(bboxes, gtboxes, mode=0) -> Tensor
Compute the intersection over union (iou) or the intersection over foreground (iof) based on the ground-truth and predicted regions.
  • Parameters:
    • bboxes (Tensor) - the input tensor
    • gtboxes (Tensor) - the input tensor
    • mode (Int) - 0 for the iou mode, 1 for the iof mode.
  • Constraints:

    None

  • Examples:
    >>> bboxes = torch.tensor([[0, 0, 10, 10],
                               [10, 10, 20, 20],
                               [32, 32, 38, 42]], dtype=torch.float16).to("npu")
    >>> gtboxes = torch.tensor([[0, 0, 10, 20],
                                [0, 10, 10, 10],
                                [10, 10, 20, 20]], dtype=torch.float16).to("npu")
    >>> output_iou = torch_npu.npu_iou(bboxes, gtboxes, 0)
    >>> output_iou
    tensor([[0.4985, 0.0000, 0.0000],
            [0.0000, 0.0000, 0.0000], 
            [0.0000, 0.9961, 0.0000]], device='npu:0', dtype=torch.float16)
torch_npu.npu_pad(input, paddings) -> Tensor
Pad a tensor.
  • Parameters:
    • input (Tensor) - the input tensor
    • paddings (ListInt) - type int32 or int64
  • Constraints:

    None

  • Examples:
    >>> input = torch.tensor([[20, 20, 10, 10]], dtype=torch.float16).to("npu")
    >>> paddings = [1, 1, 1, 1]
    >>> output = torch_npu.npu_pad(input, paddings)
    >>> output
    tensor([[ 0.,  0.,  0.,  0.,  0.,  0.],
            [ 0., 20., 20., 10., 10.,  0.],
            [ 0.,  0.,  0.,  0.,  0.,  0.]], device='npu:0', dtype=torch.float16)
torch_npu.npu_nms_with_mask(input, iou_threshold) -> (Tensor, Tensor, Tensor)
Generate a 0/1 valid bit for the nms operator to determine whether each box is valid.
  • Parameters:
    • input (Tensor) - the input tensor
    • iou_threshold (Scalar) - Threshold. If the value exceeds this threshold, the value is 1. Otherwise, the value is 0.
  • Returns:
    • selected_boxes - 2D tensor with shape of [N,5], representing filtered boxes including proposal boxes and corresponding confidence scores.
    • selected_idx - 1D tensor with shape of [N], representing the index of input proposal boxes.
    • selected_mask - 1D tensor with shape of [N], a flag indicating whether each output proposal box is valid.
  • Constraints:

    The 2nd-dim of input box_scores must be equal to 8.

  • Examples:
    >>> input = torch.tensor([[0.0, 1.0, 2.0, 3.0, 0.6], [6.0, 7.0, 8.0, 9.0, 0.4]], dtype=torch.float16).to("npu")
    >>> iou_threshold = 0.5
    >>> output1, output2, output3 = torch_npu.npu_nms_with_mask(input, iou_threshold)
    >>> output1
    tensor([[0.0000, 1.0000, 2.0000, 3.0000, 0.6001],
            [6.0000, 7.0000, 8.0000, 9.0000, 0.3999]], device='npu:0', dtype=torch.float16)
    >>> output2
    tensor([0, 1], device='npu:0', dtype=torch.int32)
    >>> output3
    tensor([1, 1], device='npu:0', dtype=torch.uint8)
torch_npu.npu_bounding_box_encode(anchor_box, ground_truth_box, means0, means1, means2, means3, stds0, stds1, stds2, stds3) -> Tensor
Compute the coordinate variations between bboxes and ground truth boxes. It is a customized FasterRcnn operator.
  • Parameters:
    • anchor_box (Tensor) - The input tensor. Anchor boxes. A 2D Tensor of float32 with shape (N, 4). "N" indicates the number of bounding boxes, and the value "4" refers to "x0", "x1", "y0", and "y1".
    • ground_truth_box (Tensor) - The input tensor. Ground truth boxes. A 2D Tensor of float32 with shape (N, 4). "N" indicates the number of bounding boxes, and the value "4" refers to "x0", "x1", "y0", and "y1".
    • means0 (Float) - An index of type float
    • means1 (Float) - An index of type float
    • means2 (Float) - An index of type float
    • means3 (Float) - An index of type float. Default: [0,0,0,0]. "deltas" = "deltas" x "stds" + "means".
    • stds0 (Float) - An index of type float
    • stds1 (Float) - An index of type float
    • stds2 (Float) - An index of type float
    • stds3 (Float) - An index of type float. Default: [1.0,1.0,1.0,1.0]. "deltas" = "deltas" x "stds" + "means" .
  • Constraints:

    None

  • Examples:
    >>> anchor_box = torch.tensor([[1., 2., 3., 4.], [3.,4., 5., 6.]], dtype = torch.float32).to("npu")
    >>> ground_truth_box = torch.tensor([[5., 6., 7., 8.], [7.,8., 9., 6.]], dtype = torch.float32).to("npu")
    >>> output = torch_npu.npu_bounding_box_encode(anchor_box, ground_truth_box, 0, 0, 0, 0, 0.1, 0.1, 0.2, 0.2)
    >>> output
    tensor([[13.3281, 13.3281,  0.0000,  0.0000],
            [13.3281,  6.6641,  0.0000, -5.4922]], device='npu:0')
torch_npu.npu_bounding_box_decode(rois, deltas, means0, means1, means2, means3, stds0, stds1, stds2, stds3, max_shape, wh_ratio_clip) -> Tensor
Generate bounding boxes based on "rois" and "deltas". It is a customized FasterRcnn operator.
  • Parameters:
    • rois (Tensor) - Region of interests (ROIs) generated by the region proposal network (RPN). A 2D Tensor of type float32 or float16 with shape (N, 4). "N" indicates the number of ROIs, and the value "4" refers to "x0", "x1", "y0", and "y1".
    • deltas (Tensor) - Absolute variation between the ROIs generated by the RPN and ground truth boxes. A 2D Tensor of type float32 or float16 with shape (N, 4). "N" indicates the number of errors, and 4 indicates "dx", "dy", "dw", and "dh" .
    • means0 (Float) - An index of type float
    • means1 (Float) - An index of type float
    • means2 (Float) - An index of type float
    • means3 (Float) - An index of type float. Default: [0,0,0,0]. "deltas" = "deltas" x "stds" + "means".
    • stds0 (Float) - An index of type float
    • stds1 (Float) - An index of type float
    • stds2 (Float) - An index of type float
    • stds3 (Float) - An index of type float. Default: [1.0,1.0,1.0,1.0]. "deltas" = "deltas" x "stds" + "means" .
    • max_shape (ListInt of length 2) - Shape [h, w], specifying the size of the image transferred to the network. It is used to ensure that the bbox shape after conversion does not exceed "max_shape".
    • wh_ratio_clip (Float) - The values of "dw" and "dh" fall within (-wh_ratio_clip, wh_ratio_clip) .
  • Constraints:

    None

  • Examples:
    >>> rois = torch.tensor([[1., 2., 3., 4.], [3.,4., 5., 6.]], dtype = torch.float32).to("npu")
    >>> deltas = torch.tensor([[5., 6., 7., 8.], [7.,8., 9., 6.]], dtype = torch.float32).to("npu")
    >>> output = torch_npu.npu_bounding_box_decode(rois, deltas, 0, 0, 0, 0, 1, 1, 1, 1, (10, 10), 0.1)
    >>> output
    tensor([[2.5000, 6.5000, 9.0000, 9.0000],
            [9.0000, 9.0000, 9.0000, 9.0000]], device='npu:0')
torch_npu.npu_gru(input, hx, weight_input, weight_hidden, bias_input, bias_hidden, seq_length, has_biases, num_layers, dropout, train, bidirectional, batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor)

DynamicGRUV2 calculation.

torch_npu.npu_random_choice_with_mask(x, count=256, seed=0, seed2=0) -> (Tensor, Tensor)
Shuffle the indices of non-zero elements.
  • Parameters:
    • x (Tensor) - the input tensor.
    • count (Int) - The number of output elements. If 0, all non-zero elements are output.
    • seed (Int) - type int32 or int64
    • seed2 (Int) - type int32 or int64
  • Returns:
    • y - 2D tensor, non-zero element index.
    • mask - 1D tensor, whether the corresponding index is valid.
  • Constraints:

    None

  • Examples:
    >>> x = torch.tensor([1, 0, 1, 0], dtype=torch.bool).to("npu")
    >>> result, mask = torch_npu.npu_random_choice_with_mask(x, 2, 1, 0)
    >>> result
    tensor([[0],
            [2]], device='npu:0', dtype=torch.int32)
    >>> mask
    tensor([True, True], device='npu:0')
torch_npu.npu_batch_nms(self, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size, change_coordinate_frame=False, transpose_box=False) -> (Tensor, Tensor, Tensor, Tensor)
Compute nms for input boxes and score, support multiple batch and classes. It will do clip to window, score filter, top_k, and nms.
  • Parameters:
    • self (Tensor) - the input tensor
    • scores (Tensor) - the input tensor
    • score_threshold (Float) - A required attribute of type float32, specifying the score threshold used to filter boxes.
    • iou_threshold (Float) - A required attribute of type float32, specifying the IoU threshold used by nms.
    • max_size_per_class (Int) - A required attribute of type int, specifying the nms output num per class.
    • max_total_size (Int) - A required attribute of type int, specifying the nms output num per batch.
    • change_coordinate_frame (Bool) - A required attribute of type bool, whether to normalize coordinates after clipping.
    • transpose_box (Bool) - A required attribute of type bool, whether insert transpose before this op. Must be "False".
  • Returns:
    • nmsed_boxes (Tensor) - A 3D tensor of type float16 with shape (batch, max_total_size, 4), specifying the output nms boxes per batch.
    • nmsed_scores (Tensor) - A 2D tensor of type float16 with shape (batch, max_total_size), specifying the output nms score per batch.
    • nmsed_classes (Tensor) - A 2D tensor of type float16 with shape (batch, max_total_size), specifying the output nms class per batch.
    • nmsed_num (Tensor) - A 1D tensor of type int32 with shape (batch), specifying the valid num of nmsed_boxes.
  • Constraints:

    None

  • Examples:
    >>> boxes = torch.randn(8, 2, 4, 4, dtype = torch.float32).to("npu")
    >>> scores = torch.randn(8, 2, 4, dtype = torch.float32).to("npu")
    >>> nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = torch_npu.npu_batch_nms(boxes, scores, 0.3, 0.5, 3, 4)
    >>> nmsed_boxes
    >>> nmsed_scores
    >>> nmsed_classes
    >>> nmsed_num
torch_npu.npu_slice(self, offsets, size) -> Tensor

Extract a slice from a tensor.
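
A minimal sketch assuming offsets gives the starting index and size the extent along each dimension (the values are illustrative):
    >>> import torch, torch_npu
    >>> x = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=torch.float16).npu()
    >>> torch_npu.npu_slice(x, [0, 0], [1, 2])
    tensor([[1., 2.]], device='npu:0', dtype=torch.float16)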

torch_npu.npu_dropoutV2(self, seed, p) -> (Tensor, Tensor, Tensor(a!))

Compute the dropout result with a seed.

torch_npu._npu_dropout(self, p) -> (Tensor, Tensor)

Compute the dropout result without a seed.
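
A minimal sketch based on the signature above (p=0.5 and the input values are illustrative; the output and mask are random, so they are not shown):
    >>> import torch, torch_npu
    >>> x = torch.tensor([1., 2., 3., 4.]).npu()
    >>> output, mask = torch_npu._npu_dropout(x, 0.5)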

torch_npu._npu_dropout_inplace(result, p) -> (Tensor(a!), Tensor)
Compute the dropout result in place.
  • Parameters: Similar to torch.dropout_; the implementation is optimized for the NPU device.
    • result (Tensor) - the tensor dropout inplace
    • p (Float) - dropout probability
  • Constraints:

    None

  • Examples:
    >>> input = torch.tensor([1.,2.,3.,4.]).npu()
    >>> input
    tensor([1., 2., 3., 4.], device='npu:0')
    >>> prob = 0.3
    >>> output, mask = torch_npu._npu_dropout_inplace(input, prob)
    >>> output
    tensor([0.0000, 2.8571, 0.0000, 0.0000], device='npu:0')
    >>> input
    tensor([0.0000, 2.8571, 4.2857, 5.7143], device='npu:0')
    >>> mask
    tensor([ 98, 255, 188, 186, 120, 157, 175, 159,  77, 223, 127,  79, 247, 151,
          253, 255], device='npu:0', dtype=torch.uint8)
torch_npu.npu_indexing(self, begin, end, strides, begin_mask=0, end_mask=0, ellipsis_mask=0, new_axis_mask=0, shrink_axis_mask=0) -> Tensor
Compute the indexing (strided slice) result specified by the begin, end, and strides arrays.
  • Parameters:
    • self (Tensor) - an input tensor
    • begin (ListInt) - the index of the first value to select
    • end (ListInt) - the index of the last value to select
    • strides (ListInt) - the index increment
    • begin_mask (Int) - A bitmask where a bit "i" being "1" means to ignore the begin value and instead use the largest interval possible.
    • end_mask (Int) - analogous to "begin_mask"
    • ellipsis_mask (Int) - A bitmask where bit "i" being "1" means the "i"th position is actually an ellipsis.
    • new_axis_mask (Int) - A bitmask where bit "i" being "1" means the "i"th specification creates a new shape 1 dimension.
    • shrink_axis_mask (Int) - A bitmask where bit "i" implies that the "i"th specification should shrink the dimensionality.
  • Constraints:

    None

  • Examples:
    >>> input = torch.tensor([[1, 2, 3, 4],[5, 6, 7, 8]], dtype=torch.int32).to("npu")
    >>> input
    tensor([[1, 2, 3, 4],
          [5, 6, 7, 8]], device='npu:0', dtype=torch.int32)
    >>> output = torch_npu.npu_indexing(input, [0, 0], [2, 2], [1, 1])
    >>> output
    tensor([[1, 2],
          [5, 6]], device='npu:0', dtype=torch.int32)
torch_npu.npu_ifmr(Tensor data, Tensor data_min, Tensor data_max, Tensor cumsum, float min_percentile, float max_percentile, float search_start, float search_end, float search_step, bool with_offset) -> (Tensor, Tensor)

Compute the IFMR (Input Feature Map Reconstruction) result.

torch_npu.npu_max(self, dim, keepdim=False) -> (Tensor, Tensor)

Compute the maximum values and their indices along the given dimension.
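
A minimal sketch mirroring the npu_min example below (the shape and reduction dimension are illustrative):
    >>> import torch, torch_npu
    >>> x = torch.randn(2, 2, 2).npu()
    >>> values, indices = torch_npu.npu_max(x, 2)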

torch_npu.npu_min(self, dim, keepdim=False) -> (Tensor values, Tensor indices)
Compute the minimum values and their indices along the given dimension.
  • Parameters: Similar to torch.min; the implementation is optimized for the NPU device.
    • self (Tensor) – the input tensor
    • dim (Int) – the dimension to reduce
    • keepdim (Bool) – whether the output tensor has dim retained or not
  • Returns:
    • values - min values in the input tensor
    • indices - index of min values in the input tensor
  • Constraints:

    None

  • Examples:
    >>> input = torch.randn(2, 2, 2, 2, dtype = torch.float32).npu()
    >>> input
    tensor([[[[-0.9909, -0.2369],
              [-0.9569, -0.6223]],
    
            [[ 0.1157, -0.3147],
              [-0.7761,  0.1344]]],
    
            [[[ 1.6292,  0.5953],
              [ 0.6940, -0.6367]],
    
            [[-1.2335,  0.2131],
              [ 1.0748, -0.7046]]]], device='npu:0')
    >>> outputs, indices = torch_npu.npu_min(input, 2)
    >>> outputs
    tensor([[[-0.9909, -0.6223],
            [-0.7761, -0.3147]],
    
            [[ 0.6940, -0.6367],
            [-1.2335, -0.7046]]], device='npu:0')
    >>> indices
    tensor([[[0, 1],
            [1, 0]],
    
            [[1, 1],
            [0, 1]]], device='npu:0', dtype=torch.int32)
torch_npu.npu_scatter(self, indices, updates, dim) -> Tensor

Compute the scatter result along the given dimension.

torch_npu.npu_layer_norm_eval(input, normalized_shape, weight=None, bias=None, eps=1e-05) -> Tensor
Compute the layer normalization result.
  • Parameters: The same as torch.nn.functional.layer_norm; the implementation is optimized for the NPU device.
    • input (Tensor) - the input tensor
    • normalized_shape (ListInt) – input shape from an expected input of size
    • weight (Tensor) - the gamma tensor
    • bias (Tensor) - the beta tensor
    • eps (Float) – The epsilon value added to the denominator for numerical stability. Default: 1e-5.
  • Constraints:

    None

  • Examples:
    >>> input = torch.rand((6, 4), dtype=torch.float32).npu()
    >>> input
    tensor([[0.1863, 0.3755, 0.1115, 0.7308],
            [0.6004, 0.6832, 0.8951, 0.2087],
            [0.8548, 0.0176, 0.8498, 0.3703],
            [0.5609, 0.0114, 0.5021, 0.1242],
            [0.3966, 0.3022, 0.2323, 0.3914],
            [0.1554, 0.0149, 0.1718, 0.4972]], device='npu:0')
    >>> normalized_shape = input.size()[1:]
    >>> normalized_shape
    torch.Size([4])
    >>> weight = torch.Tensor(*normalized_shape).npu()
    >>> weight
    tensor([        nan,  6.1223e-41, -8.3159e-20,  9.1834e-41], device='npu:0')
    >>> bias = torch.Tensor(*normalized_shape).npu()
    >>> bias
    tensor([5.6033e-39, 6.1224e-41, 6.1757e-39, 6.1224e-41], device='npu:0')
    >>> output = torch_npu.npu_layer_norm_eval(input, normalized_shape, weight, bias, 1e-5)
    >>> output
    tensor([[        nan,  6.7474e-41,  8.3182e-20,  2.0687e-40],
            [        nan,  8.2494e-41, -9.9784e-20, -8.2186e-41],
            [        nan, -2.6695e-41, -7.7173e-20,  2.1353e-41],
            [        nan, -1.3497e-41, -7.1281e-20, -6.9827e-42],
            [        nan,  3.5663e-41,  1.2002e-19,  1.4314e-40],
            [        nan, -6.2792e-42,  1.7902e-20,  2.1050e-40]], device='npu:0')
torch_npu.npu_alloc_float_status(self) -> Tensor

Produce eight numbers with a value of zero.

torch_npu.npu_get_float_status(self) -> Tensor

Execute the npu_get_float_status operator, which retrieves the floating-point status.

torch_npu.npu_clear_float_status(self) -> Tensor

Set the value of address 0x40000 to 0 in each core.

torch_npu.npu_confusion_transpose(self, perm, shape, transpose_first) -> Tensor

Fuse reshape and transpose into a single operation.
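
A minimal sketch; it assumes that with transpose_first=True the input is first permuted by perm and then reshaped to shape (the shapes below are chosen to be consistent with that assumption):
    >>> import torch, torch_npu
    >>> x = torch.rand(2, 3, 4, 6).npu()
    >>> y = torch_npu.npu_confusion_transpose(x, (0, 2, 1, 3), (2, 4, 18), True)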

torch_npu.npu_bmmV2(self, mat2, output_sizes) -> Tensor

Multiply matrix "a" by matrix "b", producing "a * b".
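
A minimal sketch of a batch matrix multiply; passing an empty list for output_sizes is an assumption used here to request the default output shape:
    >>> import torch, torch_npu
    >>> mat1 = torch.rand(2, 3, 4).npu()
    >>> mat2 = torch.rand(2, 4, 5).npu()
    >>> res = torch_npu.npu_bmmV2(mat1, mat2, [])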

torch_npu.fast_gelu(self) -> Tensor

Compute the fast_gelu activation of "x".
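
A minimal sketch, assuming the operator applies the fast_gelu activation element-wise to the input:
    >>> import torch, torch_npu
    >>> x = torch.randn(2, 3).npu()
    >>> y = torch_npu.fast_gelu(x)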

torch_npu.npu_deformable_conv2d(self, weight, offset, bias, kernel_size, stride, padding, dilation=[1,1,1,1], groups=1, deformable_groups=1, modulated=True) -> (Tensor, Tensor)
Compute the deformed convolution output with the expected input.
  • Parameters:
    • self (Tensor) - A 4D tensor of input image. With the format "NHWC", the data is stored in the order of: [batch, in_height, in_width, in_channels].
    • weight (Tensor) - A 4D tensor of learnable filters. Must have the same type as "x". With the format "HWCN" , the data is stored in the order of: [filter_height, filter_width, in_channels / groups, out_channels].
    • offset (Tensor) - A 4D tensor of x-y coordinates offset and mask. With the format "NHWC", the data is stored in the order of: [batch, out_height, out_width, deformable_groups * filter_height * filter_width * 3].
    • bias (Tensor) - An optional 1D tensor of additive biases to the filter outputs. The data is stored in the order of: [out_channels].
    • kernel_size (ListInt of length 2) - A tuple/list of 2 integers.kernel size.
    • stride (ListInt) - Required. A list of 4 integers. The stride of the sliding window for each dimension of input. The dimension order is interpreted according to the data format of "x". The N and C dimensions must be set to 1.
    • padding (ListInt) - Required. A list of 4 integers. The number of pixels to add to each (top, bottom, left, right) side of the input.
    • dilation (ListInt) - Optional. A list of 4 integers. The dilation factor for each dimension of input. The dimension order is interpreted according to the data format of "x". The N and C dimensions must be set to 1. Default: [1, 1, 1, 1].
    • groups (Int) - Optional. An integer of type int32. The number of blocked connections from input channels to output channels. In_channels and out_channels must both be divisible by "groups". Default: 1.
    • deformable_groups (Int) - Optional. An integer of type int32. The number of deformable group partitions. In_channels must be divisible by "deformable_groups". Defaults to 1.
    • modulated (Bool) - Optional. Specifies the version of DeformableConv2D: True means v2, False means v1. Currently only v2 is supported.
  • Constraints:

    None

  • Examples:
    >>> x = torch.rand(16, 32, 32, 32).npu()
    >>> weight = torch.rand(32, 32, 5, 5).npu()
    >>> offset = torch.rand(16, 75, 32, 32).npu()
    >>> output, _ = torch_npu.npu_deformable_conv2d(x, weight, offset, None, kernel_size=[5, 5], stride = [1, 1, 1, 1], padding = [2, 2, 2, 2])
    >>> output.shape
    torch.Size([16, 32, 32, 32])
torch_npu.npu_mish(self) -> Tensor

Compute the Mish activation of "x" element-wise (x * tanh(softplus(x))).
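
A minimal sketch, assuming element-wise application to the input tensor:
    >>> import torch, torch_npu
    >>> x = torch.randn(2, 3).npu()
    >>> y = torch_npu.npu_mish(x)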

torch_npu.npu_anchor_response_flags(self, featmap_size, stride, num_base_anchors) -> Tensor

Generate the responsible flags of anchor in a single feature map.

torch_npu.npu_yolo_boxes_encode(self, gt_bboxes, stride, performance_mode=False) -> Tensor

Generate bounding boxes based on yolo's "anchor" and "ground-truth" boxes. It is a customized mmdetection operator.

torch_npu.npu_grid_assign_positive(self, overlaps, box_responsible_flags, max_overlaps, argmax_overlaps, gt_max_overlaps, gt_argmax_overlaps, num_gts, pos_iou_thr, min_pos_iou, gt_max_assign_all) -> Tensor

Perform positive sample assignment for anchors based on overlaps and box responsible flags (grid assignment).

torch_npu.npu_normalize_batch(self, seq_len, normalize_type=0) -> Tensor

Perform batch normalization.

torch_npu.npu_masked_fill_range(self, start, end, value, axis=-1) -> Tensor

Masked-fill a tensor along one axis by range. It is a customized masked fill range operator.

torch_npu.npu_linear(input, weight, bias=None) -> Tensor

Multiply matrix "a" by matrix "b", producing "a * b", and add "bias" if it is provided.
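
A minimal sketch; it assumes the weight layout follows torch.nn.Linear, i.e. (out_features, in_features), which is not stated in the source:
    >>> import torch, torch_npu
    >>> x = torch.rand(3, 4).npu()
    >>> weight = torch.rand(5, 4).npu()
    >>> bias = torch.rand(5).npu()
    >>> y = torch_npu.npu_linear(x, weight, bias)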

torch_npu.npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, step_size=None, adam_mode=0, *, out=(var,m,v))

Compute the Adam optimizer result.

torch_npu.npu_giou(self, gtboxes, trans=False, is_cross=False, mode=0) -> Tensor

First calculate the minimum enclosing area of the two boxes and the IoU, then compute the proportion of the enclosing area that is not covered by either box, and finally subtract this proportion from the IoU to obtain the GIoU.

torch_npu.npu_silu(self) -> Tensor

Compute the Swish (SiLU) activation of "x".
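
A minimal sketch, assuming element-wise application to the input tensor:
    >>> import torch, torch_npu
    >>> x = torch.randn(2, 3).npu()
    >>> y = torch_npu.npu_silu(x)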

torch_npu.npu_reshape(self, shape, bool can_refresh=False) -> Tensor

Reshape a tensor. Only the tensor shape is changed and its data is not changed.
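
A minimal sketch with can_refresh left at its default (the shapes are illustrative):
    >>> import torch, torch_npu
    >>> x = torch.rand(2, 6).npu()
    >>> y = torch_npu.npu_reshape(x, [3, 4])
    >>> y.shape
    torch.Size([3, 4])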

torch_npu.npu_rotated_overlaps(self, query_boxes, trans=False) -> Tensor

Calculate the overlapping area of the rotated box.

torch_npu.npu_rotated_iou(self, query_boxes, trans=False, mode=0, is_cross=True) -> Tensor
Calculate the IOU of the rotated box.
  • Parameters:
    • self (Tensor) - Data of grad increment, a 3D Tensor of type float32 with shape (B, 5, N).
    • query_boxes (Tensor) - Bounding boxes, a 3D Tensor of type float32 with shape (B, 5, K).
    • trans (Bool) - An optional attr, True for 'xyxyt', False for 'xywht'.
    • is_cross (Bool) - Cross calculation when it is True, and one-to-one calculation when it is False.
    • mode (Int) - Computation mode with the value of 0 or 1. 0 means iou, 1 means iof.
  • Constraints:

    None

  • Examples:
    >>> a=np.random.uniform(0,1,(2,2,5)).astype(np.float16)
    >>> b=np.random.uniform(0,1,(2,3,5)).astype(np.float16)
    >>> box1=torch.from_numpy(a).to("npu")
    >>> box2=torch.from_numpy(b).to("npu")
    >>> output = torch_npu.npu_rotated_iou(box1, box2, trans=False, mode=0, is_cross=True)
    >>> output
    tensor([[[3.3325e-01, 1.0162e-01],
            [1.0162e-01, 1.0000e+00]],
    
            [[0.0000e+00, 0.0000e+00],
            [0.0000e+00, 5.9605e-08]]], device='npu:0', dtype=torch.float16)
torch_npu.npu_rotated_box_encode(anchor_box, gt_bboxes, weight) -> Tensor
Rotated Bounding Box Encoding.
  • Parameters:
    • anchor_box (Tensor) - A 3D Tensor with shape (B, 5, N). The input tensor. Anchor boxes. "B" indicates the batch size, "N" indicates the number of bounding boxes, and the value "5" refers to "x0", "x1", "y0", "y1" and "angle".
    • gt_bboxes (Tensor) - A 3D Tensor of float32 (float16) with shape (B, 5, N).
    • weight (Tensor) - A float list for "x0", "x1", "y0", "y1" and "angle". Default: [1.0, 1.0, 1.0, 1.0, 1.0].
  • Constraints:

    None

  • Examples:
    >>> anchor_boxes = torch.tensor([[[30.69], [32.6], [45.94], [59.88], [-44.53]]], dtype=torch.float16).to("npu")
    >>> gt_bboxes = torch.tensor([[[30.44], [18.72], [33.22], [45.56], [8.5]]], dtype=torch.float16).to("npu")
    >>> weight = torch.tensor([1., 1., 1., 1., 1.], dtype=torch.float16).npu()
    >>> out = torch_npu.npu_rotated_box_encode(anchor_boxes, gt_bboxes, weight)
    >>> out
    tensor([[[-0.4253],
            [-0.5166],
            [-1.7021],
            [-0.0162],
            [ 1.1328]]], device='npu:0', dtype=torch.float16)
torch_npu.npu_rotated_box_decode(anchor_boxes, deltas, weight) -> Tensor

Rotated Bounding Box Decoding.

torch_npu.npu_ciou(Tensor self, Tensor gtboxes, bool trans=False, bool is_cross=True, int mode=0, bool atan_sub_flag=False) -> Tensor

Apply an NPU-based CIoU operation.

CIoU adds a penalty term on the basis of DIoU.

torch_npu.npu_diou(Tensor self, Tensor gtboxes, bool trans=False, bool is_cross=False, int mode=0) -> Tensor

Apply an NPU-based DIoU operation.

DIoU takes the distance between targets, their overlap rate, and the scale into account, so regression for different targets or boundaries tends to be stable.

torch_npu.npu_sign_bits_pack(Tensor self, int size) -> Tensor

One-bit Adam: pack float values into uint8.

torch_npu.npu_sign_bits_unpack(x, dtype, size) -> Tensor

One-bit Adam: unpack uint8 values into float.