NPU Custom Operators

Table 1 NPU custom operators

No.   Operator name
1     _npu_dropout
2     _npu_dropout_inplace
3     copy_memory_
4     empty_with_format
5     fast_gelu
6     npu_alloc_float_status
7     npu_anchor_response_flags
8     npu_apply_adam
9     npu_batch_nms
10    npu_bert_apply_adam
11    npu_bmmV2
12    npu_bounding_box_decode
13    npu_bounding_box_encode
14    npu_broadcast
15    npu_clear_float_status
16    npu_confusion_transpose
17    npu_conv_transpose2d
18    npu_conv2d
19    npu_conv3d
20    npu_convolution
21    npu_convolution_transpose
22    npu_deformable_conv2d
23    npu_dropoutV2
24    npu_dtype_cast
25    npu_format_cast
26    npu_format_cast_
27    npu_get_float_status
28    npu_giou
29    npu_grid_assign_positive
30    npu_gru
31    npu_ifmr
32    npu_indexing
33    npu_iou
34    npu_layer_norm_eval
35    npu_linear
36    npu_lstm
37    npu_masked_fill_range
38    npu_max
39    npu_min
40    npu_mish
41    npu_nms_rotated
42    npu_nms_v4
43    npu_nms_with_mask
44    npu_normalize_batch
45    npu_one_hot
46    npu_pad
47    npu_ps_roi_pooling
48    npu_ptiou
49    npu_random_choice_with_mask
50    npu_reshape
51    npu_roi_align
52    npu_rotated_box_decode
53    npu_rotated_box_encode
54    npu_rotated_iou
55    npu_scatter
56    npu_silu
57    npu_slice
58    npu_softmax_cross_entropy_with_logits
59    npu_sort_v2
60    npu_stride_add
61    npu_transpose
62    npu_yolo_boxes_encode
63    one_

Mapping Relationships

Some parameters of the NPU custom operators take mapped integer values, as listed in the table below.

Table 2 Mapping table

All rows share the same description: mapped value of the Format parameter.

Parameter                 Mapped value
ACL_FORMAT_UNDEFINED      -1
ACL_FORMAT_NCHW           0
ACL_FORMAT_NHWC           1
ACL_FORMAT_ND             2
ACL_FORMAT_NC1HWC0        3
ACL_FORMAT_FRACTAL_Z      4
ACL_FORMAT_NC1HWC0_C04    12
ACL_FORMAT_HWCN           16
ACL_FORMAT_NDHWC          27
ACL_FORMAT_FRACTAL_NZ     29
ACL_FORMAT_NCDHW          30
ACL_FORMAT_NDC1HWC0       32
ACL_FRACTAL_Z_3D          33
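The mapping in Table 2 can be kept in code so that format values are passed by name rather than as bare integers. The sketch below is only an illustration: AclFormat and format_name are hypothetical helper names, not part of any NPU library, and the values are copied from Table 2.

```python
from enum import IntEnum


class AclFormat(IntEnum):
    """Hypothetical helper mirroring Table 2 (not provided by any NPU library)."""
    ACL_FORMAT_UNDEFINED = -1
    ACL_FORMAT_NCHW = 0
    ACL_FORMAT_NHWC = 1
    ACL_FORMAT_ND = 2
    ACL_FORMAT_NC1HWC0 = 3
    ACL_FORMAT_FRACTAL_Z = 4
    ACL_FORMAT_NC1HWC0_C04 = 12
    ACL_FORMAT_HWCN = 16
    ACL_FORMAT_NDHWC = 27
    ACL_FORMAT_FRACTAL_NZ = 29
    ACL_FORMAT_NCDHW = 30
    ACL_FORMAT_NDC1HWC0 = 32
    ACL_FRACTAL_Z_3D = 33


def format_name(value):
    """Return the Table 2 parameter name for an integer format value."""
    return AclFormat(value).name
```

With such a helper, int(AclFormat.ACL_FORMAT_FRACTAL_NZ) could be supplied wherever an operator such as npu_format_cast or empty_with_format expects an acl_format integer.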

Detailed Operator Interface Description

npu_apply_adam(beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad, use_locking, use_nesterov, out = (var, m, v))

Computes the Adam update result.
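The parameter list of npu_apply_adam matches the commonly used Adam formulation with bias correction folded into the learning rate. The pure-Python sketch below assumes that formulation; it is a reference for the arithmetic only, not the NPU kernel, and adam_step is a hypothetical name. The use_locking flag has no counterpart here.

```python
import math


def adam_step(var, m, v, grad, beta1_power, beta2_power, lr,
              beta1, beta2, epsilon, use_nesterov=False):
    """One Adam update on plain lists (assumed formulation, reference only)."""
    # Bias-corrected learning rate.
    lr_t = lr * math.sqrt(1.0 - beta2_power) / (1.0 - beta1_power)
    new_var, new_m, new_v = [], [], []
    for w, m_i, v_i, g in zip(var, m, v, grad):
        m_i = beta1 * m_i + (1.0 - beta1) * g          # first moment
        v_i = beta2 * v_i + (1.0 - beta2) * g * g      # second moment
        if use_nesterov:
            numerator = beta1 * m_i + (1.0 - beta1) * g
        else:
            numerator = m_i
        new_var.append(w - lr_t * numerator / (math.sqrt(v_i) + epsilon))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_var, new_m, new_v
```

The three returned lists correspond to the out = (var, m, v) tuple in the signature.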

npu_convolution_transpose(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor

Applies a 2D or 3D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.

npu_conv_transpose2d(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor

Applies a 2D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.

npu_convolution(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Applies a 2D or 3D convolution over an input image composed of several input planes.

npu_conv2d(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Applies a 2D convolution over an input image composed of several input planes.

npu_conv3d(input, weight, bias, stride, padding, dilation, groups) -> Tensor

Applies a 3D convolution over an input image composed of several input planes.

one_(self) -> Tensor

Fills self tensor with ones.

npu_sort_v2(self, dim=-1, descending=False, out=None) -> Tensor

Sorts the elements of the input tensor along a given dimension in ascending order by value without indices.

If dim is not given, the last dimension of the input is chosen.

If descending is True then the elements are sorted in descending order by value.

npu_format_cast(self, acl_format) -> Tensor

Changes the format of an NPU tensor.

npu_format_cast_(self, src) -> Tensor

Changes the format of self in place so that it matches the format of src.

npu_transpose(self, perm) -> Tensor

Returns a view of the original tensor with its dimensions permuted, and makes the result contiguous.

npu_broadcast(self, size) -> Tensor

Returns a new view of the self tensor with singleton dimensions expanded to a larger size, and makes the result contiguous.

The tensor can also be expanded to a larger number of dimensions; the new dimensions are added at the front.

npu_dtype_cast(input, dtype) -> Tensor

Performs Tensor dtype conversion.

empty_with_format(size, dtype, layout, device, pin_memory, acl_format) -> Tensor

Returns a tensor filled with uninitialized data. The shape of the tensor is defined by the variable argument size. The format of the tensor is defined by the variable argument acl_format.

copy_memory_(dst, src, non_blocking=False) -> Tensor

Copies the elements from src into self tensor and returns self.

npu_one_hot(input, num_classes=-1, depth=1, on_value=1, off_value=0) -> Tensor

Returns a one-hot tensor. The locations represented by the indices in "input" take the value "on_value", while all other locations take the value "off_value".
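The on/off behaviour described above can be sketched in plain Python. This is an illustration of the semantics for a 1-D list of indices, assuming that an index outside the range [0, depth) yields a row filled with the off value; one_hot here is a hypothetical helper, not the NPU operator.

```python
def one_hot(indices, depth, on_value=1, off_value=0):
    """Build one row of length `depth` per index (reference sketch only)."""
    return [
        [on_value if col == idx else off_value for col in range(depth)]
        for idx in indices
    ]
```

For example, one_hot([0, 2], depth=3) produces the rows [1, 0, 0] and [0, 0, 1].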

npu_stride_add(x1, x2, offset1, offset2, c1_len) -> Tensor

Adds the partial values of two tensors in format NC1HWC0.

npu_softmax_cross_entropy_with_logits(features, labels) -> Tensor

Computes softmax cross entropy cost.
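As a reference for the quantity involved, the sketch below computes the usual softmax cross entropy per row, loss = -sum(labels * log_softmax(features)), on plain Python lists. It assumes this standard definition and says nothing about how the NPU operator is implemented or which auxiliary outputs it may produce.

```python
import math


def softmax_cross_entropy_with_logits(features, labels):
    """Per-row softmax cross entropy on nested lists (reference sketch only)."""
    losses = []
    for logits, target in zip(features, labels):
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(-sum(t * (z - log_sum) for z, t in zip(logits, target)))
    return losses
```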

npu_ps_roi_pooling(x, rois, spatial_scale, group_size, output_dim) -> Tensor

Performs position-sensitive ROI (PS ROI) pooling.

npu_roi_align(features, rois, spatial_scale, pooled_height, pooled_width, sample_num, roi_end_mode) -> Tensor

Obtains the ROI feature matrix from the feature map. It is a customized FasterRcnn operator.

npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold, pad_to_max_output_size=False) -> (Tensor, Tensor)

Greedily selects a subset of bounding boxes in descending order of score.
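The greedy selection can be sketched in plain Python. The version below assumes boxes given as (x1, y1, x2, y2) corners and the usual rule of dropping any box whose IoU with an already kept box exceeds iou_threshold; it returns kept indices only and ignores pad_to_max_output_size. Both function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, max_output_size, iou_threshold, scores_threshold):
    """Return indices of kept boxes, highest score first (reference sketch only)."""
    candidates = [i for i, s in enumerate(scores) if s > scores_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    for i in candidates:
        if len(keep) >= max_output_size:
            break
        # Keep the box only if it does not overlap too much with any kept box.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```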

npu_nms_rotated(dets, scores, iou_threshold, scores_threshold=0, max_output_size=-1, mode=0) -> (Tensor, Tensor)

Greedily selects a subset of the rotated bounding boxes in descending order of score.

npu_lstm(x, weight, bias, seq_len, h, c, has_biases, num_layers, dropout, train, bidirectional, batch_first, flag_seq, direction)

DynamicRNN calculation.

npu_iou(bboxes, gtboxes, mode=0) -> Tensor
npu_ptiou(bboxes, gtboxes, mode=0) -> Tensor

Computes the intersection over union (IoU) or the intersection over foreground (IoF) based on the ground-truth and predicted regions.
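The difference between the two measures is only the denominator. The sketch below works on a single pair of axis-aligned (x1, y1, x2, y2) boxes and rests on two assumptions that should be checked against the operator: mode=0 selects IoU and mode=1 selects IoF, and the "foreground" is the ground-truth box. box_overlap is a hypothetical name.

```python
def box_overlap(bbox, gtbox, mode=0):
    """IoU (mode=0) or IoF (mode=1) of one box pair (reference sketch only)."""
    iw = max(0.0, min(bbox[2], gtbox[2]) - max(bbox[0], gtbox[0]))
    ih = max(0.0, min(bbox[3], gtbox[3]) - max(bbox[1], gtbox[1]))
    inter = iw * ih
    area_bbox = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    area_gt = (gtbox[2] - gtbox[0]) * (gtbox[3] - gtbox[1])
    if mode == 0:
        denom = area_bbox + area_gt - inter   # union
    else:
        denom = area_gt                       # assumed foreground area
    return inter / denom if denom > 0 else 0.0
```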

npu_pad(input, paddings) -> Tensor

Pads a tensor.

npu_nms_with_mask(input, iou_threshold) -> (Tensor, Tensor, Tensor)

Generates 0/1 values for the NMS operator to mark which entries are valid.

npu_bounding_box_encode(anchor_box, ground_truth_box, means0, means1, means2, means3, stds0, stds1, stds2, stds3) -> Tensor

Computes the coordinate variations between bboxes and ground truth boxes. It is a customized FasterRcnn operator.

npu_bounding_box_decode(rois, deltas, means0, means1, means2, means3, stds0, stds1, stds2, stds3, max_shape, wh_ratio_clip) -> Tensor

Generates bounding boxes based on "rois" and "deltas". It is a customized FasterRcnn operator.
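The relationship between npu_bounding_box_encode and npu_bounding_box_decode can be illustrated with the delta parameterization commonly used in Faster R-CNN style detectors. The sketch below assumes that parameterization with (x1, y1, x2, y2) boxes; it ignores max_shape and wh_ratio_clip, and the exact pixel conventions of the NPU operators may differ. encode_box and decode_box are hypothetical names.

```python
import math


def encode_box(anchor, gt, means=(0.0, 0.0, 0.0, 0.0), stds=(1.0, 1.0, 1.0, 1.0)):
    """Map an anchor and a ground-truth box to normalized (dx, dy, dw, dh)."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    ax, ay = anchor[0] + 0.5 * aw, anchor[1] + 0.5 * ah
    gx, gy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    raw = ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))
    return tuple((d - m) / s for d, m, s in zip(raw, means, stds))


def decode_box(roi, deltas, means=(0.0, 0.0, 0.0, 0.0), stds=(1.0, 1.0, 1.0, 1.0)):
    """Invert encode_box: apply normalized deltas to a region of interest."""
    dx, dy, dw, dh = (d * s + m for d, m, s in zip(deltas, means, stds))
    rw, rh = roi[2] - roi[0], roi[3] - roi[1]
    cx = roi[0] + 0.5 * rw + dx * rw
    cy = roi[1] + 0.5 * rh + dy * rh
    w, h = rw * math.exp(dw), rh * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```

Under these assumptions, decoding the encoded deltas against the same anchor recovers the ground-truth box, which is a convenient sanity check.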

npu_gru(input, hx, weight_input, weight_hidden, bias_input, bias_hidden, seq_length, has_biases, num_layers, dropout, train, bidirectional, batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor)

DynamicGRUV2 calculation.

npu_random_choice_with_mask(x, count=256, seed=0, seed2=0) -> (Tensor, Tensor)

Shuffles the indices of non-zero elements.

npu_batch_nms(self, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size, change_coordinate_frame=False, transpose_box=False) -> (Tensor, Tensor, Tensor, Tensor)

Computes NMS for the input boxes and scores, with support for multiple batches and classes. It performs clip to window, score filtering, top_k selection, and NMS.

npu_slice(self, offsets, size) -> Tensor

Extracts a slice from a tensor.

npu_dropoutV2(self, seed, p) -> (Tensor, Tensor, Tensor(a!))

Computes the dropout result with a seed.

_npu_dropout(self, p) -> (Tensor, Tensor)

Computes the dropout result without a seed.

_npu_dropout_inplace(result, p) -> (Tensor(a!), Tensor)

Computes the dropout result in place.

npu_indexing(self, begin, end, strides, begin_mask=0, end_mask=0, ellipsis_mask=0, new_axis_mask=0, shrink_axis_mask=0) -> Tensor

Computes the indexing result from the begin, end, and strides arrays.
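With all mask arguments left at 0, begin, end, and strides presumably act like a per-dimension Python slice. The sketch below illustrates that reading on a 2-D nested list; the mask parameters are not modelled, and strided_slice_2d is a hypothetical name.

```python
def strided_slice_2d(rows, begin, end, strides):
    """Take rows[b0:e0:s0] and, within each, columns [b1:e1:s1] (sketch only)."""
    selected_rows = rows[begin[0]:end[0]:strides[0]]
    return [row[begin[1]:end[1]:strides[1]] for row in selected_rows]
```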

npu_ifmr(Tensor data, Tensor data_min, Tensor data_max, Tensor cumsum, float min_percentile, float max_percentile, float search_start, float search_end, float search_step, bool with_offset) -> (Tensor, Tensor)

Computes the IFMR (Input Feature Map Reconstruction) result.

npu_max(self, dim, keepdim=False) -> (Tensor, Tensor)

Computes the maximum along dim.

npu_min(self, dim, keepdim=False) -> (Tensor, Tensor)

Computes the minimum along dim.
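Both npu_max and npu_min return a pair of tensors, which by analogy with the usual reduce-along-dimension convention are taken here to be the reduced values and their indices. The sketch below shows that convention for a 2-D nested list; keepdim is not modelled, and reduce_with_index is a hypothetical name.

```python
def reduce_with_index(rows, dim, pick=max):
    """Reduce a 2-D list along dim (0 or 1); return (values, indices)."""
    if dim == 0:
        # Reducing over rows is the same as reducing each column.
        rows = [list(col) for col in zip(*rows)]
    values, indices = [], []
    for row in rows:
        best = pick(row)
        values.append(best)
        indices.append(row.index(best))  # first occurrence
    return values, indices
```

Passing pick=min gives the behaviour corresponding to npu_min.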

npu_scatter(self, indices, updates, dim) -> Tensor

Computes the scatter result along dim.

npu_layer_norm_eval(input, normalized_shape, weight=None, bias=None, eps=1e-05) -> Tensor

Computes the layer normalization result.
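For reference, layer normalization over the last dimension is usually defined as (x - mean) / sqrt(var + eps) * weight + bias, with the biased variance. The sketch below assumes that definition and a normalized_shape covering only the last dimension; it is not the NPU kernel, and layer_norm_last_dim is a hypothetical name.

```python
import math


def layer_norm_last_dim(rows, weight=None, bias=None, eps=1e-05):
    """Normalize each row of a 2-D list (reference sketch only)."""
    out = []
    for row in rows:
        n = len(row)
        mean = sum(row) / n
        var = sum((x - mean) ** 2 for x in row) / n   # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        normed = [(x - mean) * inv_std for x in row]
        if weight is not None:
            normed = [y * w for y, w in zip(normed, weight)]
        if bias is not None:
            normed = [y + b for y, b in zip(normed, bias)]
        out.append(normed)
    return out
```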

npu_alloc_float_status(self) -> Tensor

Produces eight numbers with a value of zero.

npu_get_float_status(self) -> Tensor

Computes NPU get float status operator function.

npu_clear_float_status(self) -> Tensor

Sets the value of address 0x40000 to 0 in each core.

npu_confusion_transpose(self, perm, shape, transpose_first) -> Tensor

Fuses reshape and transpose.

npu_bmmV2(self, mat2, output_sizes) -> Tensor

Multiplies matrix "a" by matrix "b", producing "a * b".

fast_gelu(self) -> Tensor

Computes the fast_gelu of "x".

npu_deformable_conv2d(self, weight, offset, bias, kernel_size, stride, padding, dilation=[1,1,1,1], groups=1, deformable_groups=1, modulated=True) -> (Tensor, Tensor)

Computes the deformed convolution output with the expected input.

npu_mish(self) -> Tensor

Computes the Mish activation of "x" element-wise.
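Mish is commonly defined as x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The scalar sketch below assumes that definition and uses a numerically stable form of softplus; it is a reference for the formula only.

```python
import math


def mish(x):
    """Mish activation of a scalar: x * tanh(ln(1 + exp(x)))."""
    # Stable softplus: max(x, 0) + log1p(exp(-|x|)) avoids overflow for large x.
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)
```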

npu_anchor_response_flags(self, featmap_size, stride, num_base_anchors) -> Tensor

Generates the responsible flags of the anchors in a single feature map.

npu_yolo_boxes_encode(self, gt_bboxes, stride, performance_mode=False) -> Tensor

Generates bounding boxes based on yolo's "anchor" and "ground-truth" boxes. It is a customized mmdetection operator.

npu_grid_assign_positive(self, overlaps, box_responsible_flags, max_overlaps, argmax_overlaps, gt_max_overlaps, gt_argmax_overlaps, num_gts, pos_iou_thr, min_pos_iou, gt_max_assign_all) -> Tensor

Performs Position Sensitive PS ROI Pooling Grad.

npu_normalize_batch(self, seq_len, normalize_type=0) -> Tensor

Performs batch normalization.

npu_masked_fill_range(self, start, end, value, axis=-1) -> Tensor

Fills a tensor with a masked value along one axis, within a given range. It is a customized masked fill range operator.

npu_linear(input, weight, bias=None) -> Tensor

Multiplies matrix "a" by matrix "b", producing "a * b".

npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, step_size=None, adam_mode=0, *, out=(var,m,v))

Computes the Adam update result.

npu_giou(self, gtboxes, trans=False, is_cross=False, mode=0) -> Tensor

First calculates the minimum closure area of the two boxes and their IoU, then the proportion of the closure area that belongs to neither box, and finally subtracts this proportion from the IoU to get the GIoU.
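The three steps in this description translate directly into arithmetic: GIoU = IoU - (closure - union) / closure. The sketch below applies it to one pair of axis-aligned (x1, y1, x2, y2) boxes, which is an assumed box layout; the trans, is_cross, and mode arguments are not modelled, and giou is a hypothetical name.

```python
def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes (reference sketch only)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box enclosing both inputs.
    closure = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (closure - union) / closure
```

Unlike plain IoU, the result can be negative: two boxes that barely overlap or do not overlap at all are penalized by the empty part of their enclosing box.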

npu_silu(self) -> Tensor

Computes the Swish of "x".
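Swish, also known as SiLU, is commonly defined as x * sigmoid(x). The scalar sketch below assumes that definition and branches on the sign of x so that the exponential never overflows; it is a reference for the formula only.

```python
import math


def silu(x):
    """SiLU / Swish activation of a scalar: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x is negative, so e is in (0, 1)
    return x * e / (1.0 + e)
```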

npu_reshape(self, shape, can_refresh=False) -> Tensor

Reshapes a tensor. Only the tensor shape is changed, without changing the data.

npu_rotated_overlaps(self, query_boxes, trans=False) -> Tensor

Calculates the overlapping area of the rotated boxes.

npu_rotated_iou(self, query_boxes, trans=False, mode=0, is_cross=True) -> Tensor

Calculates the IoU of the rotated boxes.

npu_rotated_box_encode(anchor_box, gt_bboxes, weight) -> Tensor

Rotate Bounding Box Encoding.

npu_rotated_box_decode(anchor_boxes, deltas, weight) -> Tensor

Rotate Bounding Box Decoding.
  • Parameters:
    • anchor_boxes (Tensor) - A 3D input tensor of anchor boxes with shape (B, 5, N). "B" indicates the batch size, "N" indicates the number of bounding boxes, and the value "5" refers to "x0", "x1", "y0", "y1" and "angle".
    • deltas (Tensor) - A 3D Tensor of float32 (float16) with shape (B, 5, N).
    • weight (Tensor) - A float list for "x0", "x1", "y0", "y1" and "angle", defaults to [1.0, 1.0, 1.0, 1.0, 1.0].
  • Constraints:

    None

  • Examples:
        >>> anchor_boxes = torch.tensor([[[4.137],[33.72],[29.4], [54.06], [41.28]]], dtype=torch.float16).to("npu")
        >>> deltas = torch.tensor([[[0.0244], [-1.992], [0.2109], [0.315], [-37.25]]], dtype=torch.float16).to("npu")
        >>> weight = torch.tensor([1., 1., 1., 1., 1.], dtype=torch.float16).npu()
        >>> out = torch.npu_rotated_box_decode(anchor_boxes, deltas, weight)
        >>> out
        tensor([[[  1.7861],
                [-10.5781],
                [ 33.0000],
                [ 17.2969],
                [-88.4375]]], device='npu:0', dtype=torch.float16)