No. | Operator Name
---|---
1 | _npu_dropout
2 | _npu_dropout_inplace
3 | copy_memory_
4 | empty_with_format
5 | fast_gelu
6 | npu_alloc_float_status
7 | npu_anchor_response_flags
8 | npu_apply_adam
9 | npu_batch_nms
10 | npu_bert_apply_adam
11 | npu_bmmV2
12 | npu_bounding_box_decode
13 | npu_bounding_box_encode
14 | npu_broadcast
15 | npu_clear_float_status
16 | npu_confusion_transpose
17 | npu_conv_transpose2d
18 | npu_conv2d
19 | npu_conv3d
20 | npu_convolution
21 | npu_convolution_transpose
22 | npu_deformable_conv2d
23 | npu_dropoutV2
24 | npu_dtype_cast
25 | npu_format_cast
26 | npu_format_cast_
27 | npu_get_float_status
28 | npu_giou
29 | npu_grid_assign_positive
30 | npu_gru
31 | npu_ifmr
32 | npu_indexing
33 | npu_iou
34 | npu_layer_norm_eval
35 | npu_linear
36 | npu_lstm
37 | npu_masked_fill_range
38 | npu_max
39 | npu_min
40 | npu_mish
41 | npu_nms_rotated
42 | npu_nms_v4
43 | npu_nms_with_mask
44 | npu_normalize_batch
45 | npu_one_hot
46 | npu_pad
47 | npu_ps_roi_pooling
48 | npu_ptiou
49 | npu_random_choice_with_mask
50 | npu_reshape
51 | npu_roi_align
52 | npu_rotated_box_decode
53 | npu_rotated_box_encode
54 | npu_rotated_iou
55 | npu_scatter
56 | npu_silu
57 | npu_slice
58 | npu_softmax_cross_entropy_with_logits
59 | npu_sort_v2
60 | npu_stride_add
61 | npu_transpose
62 | npu_yolo_boxes_encode
63 | one_
Some parameters of the NPU custom operators have mapping relationships, as listed in the table below.
Parameter | Mapped Value | Description
---|---|---
ACL_FORMAT_UNDEFINED | -1 | Format parameter mapping value.
ACL_FORMAT_NCHW | 0 |
ACL_FORMAT_NHWC | 1 |
ACL_FORMAT_ND | 2 |
ACL_FORMAT_NC1HWC0 | 3 |
ACL_FORMAT_FRACTAL_Z | 4 |
ACL_FORMAT_NC1HWC0_C04 | 12 |
ACL_FORMAT_HWCN | 16 |
ACL_FORMAT_NDHWC | 27 |
ACL_FORMAT_FRACTAL_NZ | 29 |
ACL_FORMAT_NCDHW | 30 |
ACL_FORMAT_NDC1HWC0 | 32 |
ACL_FRACTAL_Z_3D | 33 |
npu_apply_adam(beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad, use_locking, use_nesterov, out = (var, m, v))
Computes the Adam optimizer result.
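No example is given in the source; the call below is a minimal sketch derived from the signature above, with hypothetical shapes and placeholder hyperparameter values (modeled on the npu_bert_apply_adam example further down):
>>> var = torch.rand(4).npu()
>>> m = torch.zeros(4).npu()
>>> v = torch.zeros(4).npu()
>>> grad = torch.rand(4).npu()
>>> var_out, m_out, v_out = torch.npu_apply_adam(0.9, 0.99, 1e-3, 0.9, 0.99, 1e-8, grad, False, False, out=(var, m, v))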
npu_convolution_transpose(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor
Applies a 2D or 3D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.
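No example is given in the source; the following minimal sketch assumes NCHW layouts, an (in_channels, out_channels/groups, kH, kW) weight, and standard transposed-convolution shape arithmetic (the shape shown follows from those assumptions):
>>> input = torch.rand(1, 16, 8, 8).npu()
>>> weight = torch.rand(16, 8, 3, 3).npu()
>>> output = torch.npu_convolution_transpose(input, weight, None, [1, 1], [0, 0], [2, 2], [1, 1], 1)
>>> output.shape
torch.Size([1, 8, 15, 15])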
npu_conv_transpose2d(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor
Applies a 2D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.
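No example is given in the source; a minimal sketch under the same layout assumptions as above:
>>> input = torch.rand(1, 16, 8, 8).npu()
>>> weight = torch.rand(16, 8, 3, 3).npu()
>>> output = torch.npu_conv_transpose2d(input, weight, None, [1, 1], [0, 0], [2, 2], [1, 1], 1)
>>> output.shape
torch.Size([1, 8, 15, 15])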
npu_convolution(input, weight, bias, stride, padding, dilation, groups) -> Tensor
Applies a 2D or 3D convolution over an input image composed of several input planes.
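No example is given in the source; a minimal sketch of the 2D case, assuming NCHW input and standard convolution shape rules:
>>> input = torch.rand(1, 8, 16, 16).npu()
>>> weight = torch.rand(16, 8, 3, 3).npu()
>>> bias = torch.rand(16).npu()
>>> output = torch.npu_convolution(input, weight, bias, [1, 1], [1, 1], [1, 1], 1)
>>> output.shape
torch.Size([1, 16, 16, 16])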
npu_conv2d(input, weight, bias, stride, padding, dilation, groups) -> Tensor
Applies a 2D convolution over an input image composed of several input planes.
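No example is given in the source; a minimal sketch assuming NCHW input and standard convolution shape rules:
>>> input = torch.rand(1, 8, 16, 16).npu()
>>> weight = torch.rand(16, 8, 3, 3).npu()
>>> bias = torch.rand(16).npu()
>>> output = torch.npu_conv2d(input, weight, bias, [1, 1], [1, 1], [1, 1], 1)
>>> output.shape
torch.Size([1, 16, 16, 16])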
npu_conv3d(input, weight, bias, stride, padding, dilation, groups) -> Tensor
Applies a 3D convolution over an input image composed of several input planes.
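No example is given in the source; a minimal sketch assuming NCDHW input and standard convolution shape rules:
>>> input = torch.rand(1, 8, 8, 16, 16).npu()
>>> weight = torch.rand(16, 8, 3, 3, 3).npu()
>>> bias = torch.rand(16).npu()
>>> output = torch.npu_conv3d(input, weight, bias, [1, 1, 1], [1, 1, 1], [1, 1, 1], 1)
>>> output.shape
torch.Size([1, 16, 8, 16, 16])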
one_(self) -> Tensor
Fills self tensor with ones.
>>> x = torch.rand(2, 3).npu()
>>> x
tensor([[0.6072, 0.9726, 0.3475],
        [0.3717, 0.6135, 0.6788]], device='npu:0')
>>> x.one_()
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='npu:0')
npu_sort_v2(self, dim=-1, descending=False, out=None) -> Tensor
Sorts the elements of the input tensor along a given dimension in ascending order by value without indices.
If dim is not given, the last dimension of the input is chosen.
If descending is True then the elements are sorted in descending order by value.
>>> x = torch.randn(3, 4).npu()
>>> x
tensor([[-0.0067,  1.7790,  0.5031, -1.7217],
        [ 1.1685, -1.0486, -0.2938,  1.3241],
        [ 0.1880, -2.7447,  1.3976,  0.7380]], device='npu:0')
>>> sorted_x = torch.npu_sort_v2(x)
>>> sorted_x
tensor([[-1.7217, -0.0067,  0.5029,  1.7793],
        [-1.0488, -0.2937,  1.1689,  1.3242],
        [-2.7441,  0.1880,  0.7378,  1.3975]], device='npu:0')
npu_format_cast(self, acl_format) -> Tensor
Changes the format of an NPU tensor.
>>> x = torch.rand(2, 3, 4, 5).npu()
>>> x.storage().npu_format()
0
>>> x1 = x.npu_format_cast(29)
>>> x1.storage().npu_format()
29
npu_format_cast_(self, src) -> Tensor
In-place version of npu_format_cast: changes the format of self to match the format of src.
>>> x = torch.rand(2, 3, 4, 5).npu()
>>> x.storage().npu_format()
0
>>> x.npu_format_cast_(29).storage().npu_format()
29
npu_transpose(self, perm) -> Tensor
Returns a view of the original tensor with its dimensions permuted, and makes the result contiguous.
>>> x = torch.randn(2, 3, 5).npu()
>>> x.shape
torch.Size([2, 3, 5])
>>> x1 = torch.npu_transpose(x, (2, 0, 1))
>>> x1.shape
torch.Size([5, 2, 3])
>>> x2 = x.npu_transpose(2, 0, 1)
>>> x2.shape
torch.Size([5, 2, 3])
npu_broadcast(self, perm) -> Tensor
Returns a new view of the self tensor with singleton dimensions expanded to a larger size, and makes the result contiguous.
Tensor can be also expanded to a larger number of dimensions, and the new ones will be appended at the front.
>>> x = torch.tensor([[1], [2], [3]]).npu()
>>> x.shape
torch.Size([3, 1])
>>> x.npu_broadcast(3, 4)
tensor([[1, 1, 1, 1],
        [2, 2, 2, 2],
        [3, 3, 3, 3]], device='npu:0')
npu_dtype_cast(input, dtype) -> Tensor
Performs Tensor dtype conversion.
>>> torch.npu_dtype_cast(torch.tensor([0, 0.5, -1.]).npu(), dtype=torch.int)
tensor([ 0,  0, -1], device='npu:0', dtype=torch.int32)
empty_with_format(size, dtype, layout, device, pin_memory, acl_format) -> Tensor
Returns a tensor filled with uninitialized data. The shape of the tensor is defined by the variable argument size. The format of the tensor is defined by the variable argument acl_format.
>>> torch.empty_with_format((2, 3), dtype=torch.float32, device="npu")
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='npu:0')
copy_memory_(dst, src, non_blocking=False) -> Tensor
Copies the elements from src into self tensor and returns self.
copy_memory_ only supports NPU tensors.
The input tensors of copy_memory_ should have the same dtype and the same device index.
>>> a = torch.IntTensor([0, 0, -1]).npu()
>>> b = torch.IntTensor([1, 1, 1]).npu()
>>> a.copy_memory_(b)
tensor([1, 1, 1], device='npu:0', dtype=torch.int32)
npu_one_hot(input, num_classes=-1, depth=1, on_value=1, off_value=0) -> Tensor
Returns a one-hot tensor. The locations represented by indices in "x" take the value "on_value", while all other locations take the value "off_value".
>>> a = torch.IntTensor([5, 3, 2, 1]).npu()
>>> b = torch.npu_one_hot(a, depth=5)
>>> b
tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.],
        [0., 1., 0., 0., 0.]], device='npu:0')
npu_stride_add(x1, x2, offset1, offset2, c1_len) -> Tensor
Adds the partial values of two tensors in format NC1HWC0.
>>> a = torch.tensor([[[[[1.]]]]]).npu()
>>> b = torch.npu_stride_add(a, a, 0, 0, 1)
>>> b
tensor([[[[[2.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]],
         [[[0.]]]]], device='npu:0')
npu_softmax_cross_entropy_with_logits(features, labels) -> Tensor
Computes softmax cross entropy cost.
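No example is given in the source; the sketch below assumes the same contract as the TensorFlow operator of the same name (logits and one-hot labels of shape (N, C), one loss value per sample):
>>> features = torch.rand(2, 5).npu()
>>> labels = torch.tensor([[0., 0., 1., 0., 0.], [0., 1., 0., 0., 0.]]).npu()
>>> loss = torch.npu_softmax_cross_entropy_with_logits(features, labels)
>>> loss.shape
torch.Size([2])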
npu_ps_roi_pooling(x, rois, spatial_scale, group_size, output_dim) -> Tensor
Performs Position-Sensitive ROI Pooling.
>>> roi = torch.tensor([[[1], [2], [3], [4], [5]], [[6], [7], [8], [9], [10]]], dtype=torch.float16).npu()
>>> x = torch.tensor([[[[1]], [[2]], [[3]], [[4]], [[5]], [[6]], [[7]], [[8]]], [[[9]], [[10]], [[11]], [[12]], [[13]], [[14]], [[15]], [[16]]]], dtype=torch.float16).npu()
>>> out = torch.npu_ps_roi_pooling(x, roi, 0.5, 2, 2)
>>> out
tensor([[[[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]]],
        [[[0., 0.],
          [0., 0.]],
         [[0., 0.],
          [0., 0.]]]], device='npu:0', dtype=torch.float16)
npu_roi_align(features, rois, spatial_scale, pooled_height, pooled_width, sample_num, roi_end_mode) -> Tensor
Obtains the ROI feature matrix from the feature map. It is a customized FasterRcnn operator.
>>> x = torch.FloatTensor([[[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18], [19, 20, 21, 22, 23, 24], [25, 26, 27, 28, 29, 30], [31, 32, 33, 34, 35, 36]]]]).npu()
>>> rois = torch.tensor([[0, -2.0, -2.0, 22.0, 22.0]]).npu()
>>> out = torch.npu_roi_align(x, rois, 0.25, 3, 3, 2, 0)
>>> out
tensor([[[[ 4.5000,  6.5000,  8.5000],
          [16.5000, 18.5000, 20.5000],
          [28.5000, 30.5000, 32.5000]]]], device='npu:0')
npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold, pad_to_max_output_size=False) -> (Tensor, Tensor)
Greedily selects a subset of bounding boxes in descending order of score.
>>> boxes = torch.randn(100, 4).npu()
>>> scores = torch.randn(100).npu()
>>> boxes.uniform_(0, 100)
>>> scores.uniform_(0, 1)
>>> max_output_size = 20
>>> iou_threshold = torch.tensor(0.5).npu()
>>> scores_threshold = torch.tensor(0.3).npu()
>>> npu_output = torch.npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold)
>>> npu_output
(tensor([57, 65, 25, 45, 43, 12, 52, 91, 23, 78, 53, 11, 24, 62, 22, 67,  9, 94,
        54, 92], device='npu:0', dtype=torch.int32),
 tensor(20, device='npu:0', dtype=torch.int32))
npu_nms_rotated(dets, scores, iou_threshold, scores_threshold=0, max_output_size=-1, mode=0) -> (Tensor, Tensor)
Greedily selects a subset of the rotated bounding boxes in descending order of score.
>>> dets = torch.randn(100, 5).npu()
>>> scores = torch.randn(100).npu()
>>> dets.uniform_(0, 100)
>>> scores.uniform_(0, 1)
>>> output1, output2 = torch.npu_nms_rotated(dets, scores, 0.2, 0, -1, 1)
>>> output1
tensor([76, 48, 15, 65, 91, 82, 21, 96, 62, 90, 13, 59,  0, 18, 47, 23,  8, 56,
        55, 63, 72, 39, 97, 81, 16, 38, 17, 25, 74, 33, 79, 44, 36, 88, 83, 37,
        64, 45, 54, 41, 22, 28, 98, 40, 30, 20,  1, 86, 69, 57, 43,  9, 42, 27,
        71, 46, 19, 26, 78, 66,  3, 52], device='npu:0', dtype=torch.int32)
>>> output2
tensor([62], device='npu:0', dtype=torch.int32)
npu_lstm(x, weight, bias, seq_len, h, c, has_biases, num_layers, dropout, train, bidirectional, batch_first, flag_seq, direction)
DynamicRNN calculation.
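No example is given in the source, and the tensor layouts are not documented here; the sketch below is hypothetical, assuming a time-major input (T, B, input_size), a fused (input_size + hidden_size, 4 * hidden_size) weight, and boolean flag_seq/direction arguments:
>>> x = torch.rand(2, 4, 8, dtype=torch.float16).npu()
>>> weight = torch.rand(24, 64, dtype=torch.float16).npu()
>>> bias = torch.rand(64, dtype=torch.float16).npu()
>>> h = torch.rand(1, 4, 16, dtype=torch.float16).npu()
>>> c = torch.rand(1, 4, 16, dtype=torch.float16).npu()
>>> seq_len = torch.tensor([2, 2, 2, 2], dtype=torch.int32).npu()
>>> outputs = torch.npu_lstm(x, weight, bias, seq_len, h, c, True, 1, 0.0, False, False, False, False, False)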
npu_iou(bboxes, gtboxes, mode=0) -> Tensor
npu_ptiou(bboxes, gtboxes, mode=0) -> Tensor
Computes the intersection over union (IoU) or the intersection over foreground (IoF) based on the ground-truth and predicted regions.
>>> bboxes = torch.tensor([[0, 0, 10, 10], [10, 10, 20, 20], [32, 32, 38, 42]], dtype=torch.float16).to("npu")
>>> gtboxes = torch.tensor([[0, 0, 10, 20], [0, 10, 10, 10], [10, 10, 20, 20]], dtype=torch.float16).to("npu")
>>> output_iou = torch.npu_iou(bboxes, gtboxes, 0)
>>> output_iou
tensor([[0.4985, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.9961, 0.0000]], device='npu:0', dtype=torch.float16)
npu_pad(input, paddings) -> Tensor
Pads a tensor.
>>> input = torch.tensor([[20, 20, 10, 10]], dtype=torch.float16).to("npu")
>>> paddings = [1, 1, 1, 1]
>>> output = torch.npu_pad(input, paddings)
>>> output
tensor([[ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0., 20., 20., 10., 10.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.]], device='npu:0', dtype=torch.float16)
npu_nms_with_mask(input, iou_threshold) -> (Tensor, Tensor, Tensor)
Performs non-maximum suppression; a 0/1 value is generated for each box to mark whether it is valid.
>>> input = torch.tensor([[0.0, 1.0, 2.0, 3.0, 0.6], [6.0, 7.0, 8.0, 9.0, 0.4]], dtype=torch.float16).to("npu")
>>> iou_threshold = 0.5
>>> output1, output2, output3 = torch.npu_nms_with_mask(input, iou_threshold)
>>> output1
tensor([[0.0000, 1.0000, 2.0000, 3.0000, 0.6001],
        [6.0000, 7.0000, 8.0000, 9.0000, 0.3999]], device='npu:0',
       dtype=torch.float16)
>>> output2
tensor([0, 1], device='npu:0', dtype=torch.int32)
>>> output3
tensor([1, 1], device='npu:0', dtype=torch.uint8)
npu_bounding_box_encode(anchor_box, ground_truth_box, means0, means1, means2, means3, stds0, stds1, stds2, stds3) -> Tensor
Computes the coordinate variations between bboxes and ground truth boxes. It is a customized FasterRcnn operator.
>>> anchor_box = torch.tensor([[1., 2., 3., 4.], [3., 4., 5., 6.]], dtype=torch.float32).to("npu")
>>> ground_truth_box = torch.tensor([[5., 6., 7., 8.], [7., 8., 9., 6.]], dtype=torch.float32).to("npu")
>>> output = torch.npu_bounding_box_encode(anchor_box, ground_truth_box, 0, 0, 0, 0, 0.1, 0.1, 0.2, 0.2)
>>> output
tensor([[13.3281, 13.3281,  0.0000,  0.0000],
        [13.3281,  6.6641,  0.0000, -5.4922]], device='npu:0')
npu_bounding_box_decode(rois, deltas, means0, means1, means2, means3, stds0, stds1, stds2, stds3, max_shape, wh_ratio_clip) -> Tensor
Generates bounding boxes based on "rois" and "deltas". It is a customized FasterRcnn operator .
>>> rois = torch.tensor([[1., 2., 3., 4.], [3., 4., 5., 6.]], dtype=torch.float32).to("npu")
>>> deltas = torch.tensor([[5., 6., 7., 8.], [7., 8., 9., 6.]], dtype=torch.float32).to("npu")
>>> output = torch.npu_bounding_box_decode(rois, deltas, 0, 0, 0, 0, 1, 1, 1, 1, (10, 10), 0.1)
>>> output
tensor([[2.5000, 6.5000, 9.0000, 9.0000],
        [9.0000, 9.0000, 9.0000, 9.0000]], device='npu:0')
npu_gru(input, hx, weight_input, weight_hidden, bias_input, bias_hidden, seq_length, has_biases, num_layers, dropout, train, bidirectional, batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor)
DynamicGRUV2 calculation.
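No example is given in the source; the sketch below is hypothetical, assuming a time-major input (T, B, input_size) and fused (input_size, 3 * hidden_size) / (hidden_size, 3 * hidden_size) gate weights in the DynamicGRUV2 style:
>>> input = torch.rand(2, 4, 8, dtype=torch.float16).npu()
>>> hx = torch.rand(1, 4, 16, dtype=torch.float16).npu()
>>> weight_input = torch.rand(8, 48, dtype=torch.float16).npu()
>>> weight_hidden = torch.rand(16, 48, dtype=torch.float16).npu()
>>> bias_input = torch.rand(48, dtype=torch.float16).npu()
>>> bias_hidden = torch.rand(48, dtype=torch.float16).npu()
>>> seq_length = torch.tensor([2, 2, 2, 2], dtype=torch.int32).npu()
>>> outputs = torch.npu_gru(input, hx, weight_input, weight_hidden, bias_input, bias_hidden, seq_length, True, 1, 0.0, False, False, False)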
npu_random_choice_with_mask(x, count=256, seed=0, seed2=0) -> (Tensor, Tensor)
Shuffles the indices of non-zero elements.
>>> x = torch.tensor([1, 0, 1, 0], dtype=torch.bool).to("npu")
>>> result, mask = torch.npu_random_choice_with_mask(x, 2, 1, 0)
>>> result
tensor([[0],
        [2]], device='npu:0', dtype=torch.int32)
>>> mask
tensor([True, True], device='npu:0')
npu_batch_nms(self, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size, change_coordinate_frame=False, transpose_box=False) -> (Tensor, Tensor, Tensor, Tensor)
Computes NMS for input boxes and scores, supporting multiple batches and classes. It performs clip-to-window, score filtering, top_k, and NMS.
>>> boxes = torch.randn(8, 2, 4, 4, dtype=torch.float32).to("npu")
>>> scores = torch.randn(3, 2, 4, dtype=torch.float32).to("npu")
>>> nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = torch.npu_batch_nms(boxes, scores, 0.3, 0.5, 3, 4)
>>> nmsed_boxes
>>> nmsed_scores
>>> nmsed_classes
>>> nmsed_num
npu_slice(self, offsets, size) -> Tensor
Extracts a slice from a tensor.
>>> input = torch.tensor([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype=torch.float16).to("npu")
>>> offsets = [0, 0]
>>> size = [2, 2]
>>> output = torch.npu_slice(input, offsets, size)
>>> output
tensor([[1., 2.],
        [6., 7.]], device='npu:0', dtype=torch.float16)
npu_dropoutV2(self, seed, p) -> (Tensor, Tensor, Tensor(a!))
Computes the dropout result with a seed.
>>> input = torch.tensor([1., 2., 3., 4.]).npu()
>>> input
tensor([1., 2., 3., 4.], device='npu:0')
>>> seed = torch.rand((32,), dtype=torch.float32).npu()
>>> seed
tensor([0.4368, 0.7351, 0.8459, 0.4657, 0.6783, 0.8914, 0.8995, 0.4401, 0.4408,
        0.4453, 0.2404, 0.9680, 0.0999, 0.8665, 0.2993, 0.5787, 0.0251, 0.6783,
        0.7411, 0.0670, 0.9430, 0.9165, 0.3983, 0.5849, 0.7722, 0.4659, 0.0486,
        0.2693, 0.6451, 0.2734, 0.3176, 0.0176], device='npu:0')
>>> prob = 0.3
>>> output, mask, out_seed = torch.npu_dropoutV2(input, seed, prob)
>>> output
tensor([0.4408, 0.4453, 0.2404, 0.9680], device='npu:0')
>>> mask
tensor([0., 0., 0., 0.], device='npu:0')
>>> out_seed
tensor([0.4408, 0.4453, 0.2404, 0.9680, 0.0999, 0.8665, 0.2993, 0.5787, 0.0251,
        0.6783, 0.7411, 0.0670, 0.9430, 0.9165, 0.3983, 0.5849, 0.7722, 0.4659,
        0.0486, 0.2693, 0.6451, 0.2734, 0.3176, 0.0176, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000], device='npu:0')
_npu_dropout(self, p) -> (Tensor, Tensor)
Computes the dropout result without a seed.
Similar to torch.dropout, with the implementation optimized for the NPU device.
>>> input = torch.tensor([1., 2., 3., 4.]).npu()
>>> input
tensor([1., 2., 3., 4.], device='npu:0')
>>> prob = 0.3
>>> output, mask = torch._npu_dropout(input, prob)
>>> output
tensor([0.0000, 2.8571, 0.0000, 0.0000], device='npu:0')
>>> mask
tensor([ 98, 255, 188, 186, 120, 157, 175, 159,  77, 223, 127,  79, 247, 151,
        253, 255], device='npu:0', dtype=torch.uint8)
_npu_dropout_inplace(result, p) -> (Tensor(a!), Tensor)
Computes the dropout result in place.
Similar to torch.dropout_, with the implementation optimized for the NPU device.
>>> input = torch.tensor([1., 2., 3., 4.]).npu()
>>> input
tensor([1., 2., 3., 4.], device='npu:0')
>>> prob = 0.3
>>> output, mask = torch._npu_dropout_inplace(input, prob)
>>> output
tensor([0.0000, 2.8571, 0.0000, 0.0000], device='npu:0')
>>> input
tensor([0.0000, 2.8571, 4.2857, 5.7143], device='npu:0')
>>> mask
tensor([ 98, 255, 188, 186, 120, 157, 175, 159,  77, 223, 127,  79, 247, 151,
        253, 255], device='npu:0', dtype=torch.uint8)
npu_indexing(self, begin, end, strides, begin_mask=0, end_mask=0, ellipsis_mask=0, new_axis_mask=0, shrink_axis_mask=0) -> Tensor
Computes the indexing result from the begin, end, and strides arrays.
>>> input = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=torch.int32).to("npu")
>>> input
tensor([[1, 2, 3, 4],
        [5, 6, 7, 8]], device='npu:0', dtype=torch.int32)
>>> output = torch.npu_indexing(input, [0, 0], [2, 2], [1, 1])
>>> output
tensor([[1, 2],
        [5, 6]], device='npu:0', dtype=torch.int32)
npu_ifmr(Tensor data, Tensor data_min, Tensor data_max, Tensor cumsum, float min_percentile, float max_percentile, float search_start, float search_end, float search_step, bool with_offset) -> (Tensor, Tensor)
Computes the IFMR (Input Feature Map Reconstruction) result, searching for the quantization scale and offset.
>>> input = torch.rand((2, 2, 3, 4), dtype=torch.float32).npu()
>>> input
tensor([[[[0.4508, 0.6513, 0.4734, 0.1924],
          [0.0402, 0.5502, 0.0694, 0.9032],
          [0.4844, 0.5361, 0.9369, 0.7874]],
         [[0.5157, 0.1863, 0.4574, 0.8033],
          [0.5986, 0.8090, 0.7605, 0.8252],
          [0.4264, 0.8952, 0.2279, 0.9746]]],
        [[[0.0803, 0.7114, 0.8773, 0.2341],
          [0.6497, 0.0423, 0.8407, 0.9515],
          [0.1821, 0.5931, 0.7160, 0.4968]],
         [[0.7977, 0.0899, 0.9572, 0.0146],
          [0.2804, 0.8569, 0.2292, 0.1118],
          [0.5747, 0.4064, 0.8370, 0.1611]]]], device='npu:0')
>>> min_value = torch.min(input)
>>> min_value
tensor(0.0146, device='npu:0')
>>> max_value = torch.max(input)
>>> max_value
tensor(0.9746, device='npu:0')
>>> hist = torch.histc(input.to('cpu'), bins=128, min=min_value.to('cpu'), max=max_value.to('cpu'))
>>> hist
tensor([1., 0., 0., 2., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
        0., 1., 0., 0., 2., 1., 0., 0., 0., 0., 2., 1., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1.,
        0., 0., 1., 0., 0., 2., 0., 0., 0., 0., 0., 0., 2., 0., 0., 0., 0., 0.,
        0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 2., 0., 0.,
        1., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1.,
        0., 1.])
>>> cdf = torch.cumsum(hist, dim=0).int().npu()
>>> cdf
tensor([ 1,  1,  1,  3,  3,  3,  3,  4,  5,  5,  6,  6,  7,  7,  7,  7,  7,  7,
         7,  8,  8,  8, 10, 11, 11, 11, 11, 11, 13, 14, 14, 14, 14, 14, 14, 15,
        15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16,
        17, 17, 17, 17, 18, 19, 19, 20, 21, 21, 22, 22, 23, 23, 23, 24, 24, 25,
        25, 25, 26, 26, 26, 28, 28, 28, 28, 28, 28, 28, 30, 30, 30, 30, 30, 30,
        30, 30, 31, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 35, 37, 37, 37,
        38, 39, 40, 40, 41, 41, 41, 42, 42, 43, 44, 44, 44, 44, 45, 45, 46, 47,
        47, 48], device='npu:0', dtype=torch.int32)
>>> scale, offset = torch.npu_ifmr(input, min_value, max_value, cdf, min_percentile=0.999999, max_percentile=0.999999, search_start=0.7, search_end=1.3, search_step=0.01, with_offset=False)
>>> scale
tensor(0.0080, device='npu:0')
>>> offset
tensor(0., device='npu:0')
npu_max(self, dim, keepdim=False) -> (Tensor, Tensor)
Computes the max result along the given dim.
Similar to torch.max, with the implementation optimized for the NPU device.
>>> input = torch.randn(2, 2, 2, 2, dtype=torch.float32).npu()
>>> input
tensor([[[[-1.8135,  0.2078],
          [-0.6678,  0.7846]],
         [[ 0.6458, -0.0923],
          [-0.2124, -1.9112]]],
        [[[-0.5800, -0.4979],
          [ 0.2580,  1.1335]],
         [[ 0.6669,  0.1876],
          [ 0.1160, -0.1061]]]], device='npu:0')
>>> outputs, indices = torch.npu_max(input, 2)
>>> outputs
tensor([[[-0.6678,  0.7846],
         [ 0.6458, -0.0923]],
        [[ 0.2580,  1.1335],
         [ 0.6669,  0.1876]]], device='npu:0')
>>> indices
tensor([[[1, 1],
         [0, 0]],
        [[1, 1],
         [0, 0]]], device='npu:0', dtype=torch.int32)
npu_min(self, dim, keepdim=False) -> (Tensor, Tensor)
Computes the min result along the given dim.
Similar to torch.min, with the implementation optimized for the NPU device.
>>> input = torch.randn(2, 2, 2, 2, dtype=torch.float32).npu()
>>> input
tensor([[[[-0.9909, -0.2369],
          [-0.9569, -0.6223]],
         [[ 0.1157, -0.3147],
          [-0.7761,  0.1344]]],
        [[[ 1.6292,  0.5953],
          [ 0.6940, -0.6367]],
         [[-1.2335,  0.2131],
          [ 1.0748, -0.7046]]]], device='npu:0')
>>> outputs, indices = torch.npu_min(input, 2)
>>> outputs
tensor([[[-0.9909, -0.6223],
         [-0.7761, -0.3147]],
        [[ 0.6940, -0.6367],
         [-1.2335, -0.7046]]], device='npu:0')
>>> indices
tensor([[[0, 1],
         [1, 0]],
        [[1, 1],
         [0, 1]]], device='npu:0', dtype=torch.int32)
npu_scatter(self, indices, updates, dim) -> Tensor
Computes the scatter result along the given dim.
Similar to torch.scatter, with the implementation optimized for the NPU device.
>>> input = torch.tensor([[1.6279, 0.1226], [0.9041, 1.0980]]).npu()
>>> input
tensor([[1.6279, 0.1226],
        [0.9041, 1.0980]], device='npu:0')
>>> indices = torch.tensor([0, 1], dtype=torch.int32).npu()
>>> indices
tensor([0, 1], device='npu:0', dtype=torch.int32)
>>> updates = torch.tensor([-1.1993, -1.5247]).npu()
>>> updates
tensor([-1.1993, -1.5247], device='npu:0')
>>> dim = 0
>>> output = torch.npu_scatter(input, indices, updates, dim)
>>> output
tensor([[-1.1993,  0.1226],
        [ 0.9041, -1.5247]], device='npu:0')
npu_layer_norm_eval(input, normalized_shape, weight=None, bias=None, eps=1e-05) -> Tensor
Computes the layer norm result.
The same as torch.nn.functional.layer_norm, with the implementation optimized for the NPU device.
>>> input = torch.rand((6, 4), dtype=torch.float32).npu()
>>> input
tensor([[0.1863, 0.3755, 0.1115, 0.7308],
        [0.6004, 0.6832, 0.8951, 0.2087],
        [0.8548, 0.0176, 0.8498, 0.3703],
        [0.5609, 0.0114, 0.5021, 0.1242],
        [0.3966, 0.3022, 0.2323, 0.3914],
        [0.1554, 0.0149, 0.1718, 0.4972]], device='npu:0')
>>> normalized_shape = input.size()[1:]
>>> normalized_shape
torch.Size([4])
>>> weight = torch.Tensor(*normalized_shape).npu()
>>> weight
tensor([        nan,  6.1223e-41, -8.3159e-20,  9.1834e-41], device='npu:0')
>>> bias = torch.Tensor(*normalized_shape).npu()
>>> bias
tensor([5.6033e-39, 6.1224e-41, 6.1757e-39, 6.1224e-41], device='npu:0')
>>> output = torch.npu_layer_norm_eval(input, normalized_shape, weight, bias, 1e-5)
>>> output
tensor([[        nan,  6.7474e-41,  8.3182e-20,  2.0687e-40],
        [        nan,  8.2494e-41, -9.9784e-20, -8.2186e-41],
        [        nan, -2.6695e-41, -7.7173e-20,  2.1353e-41],
        [        nan, -1.3497e-41, -7.1281e-20, -6.9827e-42],
        [        nan,  3.5663e-41,  1.2002e-19,  1.4314e-40],
        [        nan, -6.2792e-42,  1.7902e-20,  2.1050e-40]], device='npu:0')
npu_alloc_float_status(self) -> Tensor
Allocates a float status tensor of eight zeros.
>>> input = torch.randn([1, 2, 3]).npu()
>>> output = torch.npu_alloc_float_status(input)
>>> input
tensor([[[ 2.2324,  0.2478, -0.1056],
         [ 1.1273, -0.2573,  1.0558]]], device='npu:0')
>>> output
tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')
npu_get_float_status(self) -> Tensor
Gets the NPU float status.
>>> x = torch.rand(2).npu()
>>> torch.npu_get_float_status(x)
tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')
npu_clear_float_status(self) -> Tensor
Sets the value at address 0x40000 to 0 on each core, clearing the float status.
>>> x = torch.rand(2).npu()
>>> torch.npu_clear_float_status(x)
tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')
npu_confusion_transpose(self, perm, shape, transpose_first) -> Tensor
Performs a fused reshape and transpose operation.
>>> x = torch.rand(2, 3, 4, 6).npu()
>>> x.shape
torch.Size([2, 3, 4, 6])
>>> y = torch.npu_confusion_transpose(x, (0, 2, 1, 3), (2, 4, 18), True)
>>> y.shape
torch.Size([2, 4, 18])
>>> y2 = torch.npu_confusion_transpose(x, (0, 2, 1), (2, 12, 6), False)
>>> y2.shape
torch.Size([2, 6, 12])
npu_bmmV2(self, mat2, output_sizes) -> Tensor
Multiplies matrix "a" by matrix "b", producing "a * b" .
>>> mat1 = torch.randn(10, 3, 4).npu()
>>> mat2 = torch.randn(10, 4, 5).npu()
>>> res = torch.npu_bmmV2(mat1, mat2, [])
>>> res.shape
torch.Size([10, 3, 5])
fast_gelu(self) -> Tensor
Computes the fast_gelu activation of "x".
>>> x = torch.rand(2).npu()
>>> x
tensor([0.5991, 0.4094], device='npu:0')
>>> torch.fast_gelu(x)
tensor([0.4403, 0.2733], device='npu:0')
npu_deformable_conv2d(self, weight, offset, bias, kernel_size, stride, padding, dilation=[1,1,1,1], groups=1, deformable_groups=1, modulated=True) -> (Tensor, Tensor)
Computes the deformed convolution output with the expected input.
>>> x = torch.rand(16, 32, 32, 32).npu()
>>> weight = torch.rand(32, 32, 5, 5).npu()
>>> offset = torch.rand(16, 75, 32, 32).npu()
>>> output, _ = torch.npu_deformable_conv2d(x, weight, offset, None, kernel_size=[5, 5], stride=[1, 1, 1, 1], padding=[2, 2, 2, 2])
>>> output.shape
torch.Size([16, 32, 32, 32])
npu_mish(self) -> Tensor
Computes the Mish activation of "x" element-wise.
>>> x = torch.rand(10, 30, 10).npu()
>>> y = torch.npu_mish(x)
>>> y.shape
torch.Size([10, 30, 10])
npu_anchor_response_flags(self, featmap_size, stride, num_base_anchors) -> Tensor
Generates the responsible flags of anchors in a single feature map.
>>> x = torch.rand(100, 4).npu()
>>> y = torch.npu_anchor_response_flags(x, [60, 60], [2, 2], 9)
>>> y.shape
torch.Size([32400])
npu_yolo_boxes_encode(self, gt_bboxes, stride, performance_mode=False) -> Tensor
Generates bounding boxes based on YOLO's "anchor" and "ground-truth" boxes. It is a customized mmdetection operator.
>>> anchor_boxes = torch.rand(2, 4).npu()
>>> gt_bboxes = torch.rand(2, 4).npu()
>>> stride = torch.tensor([2, 2], dtype=torch.int32).npu()
>>> output = torch.npu_yolo_boxes_encode(anchor_boxes, gt_bboxes, stride, False)
>>> output.shape
torch.Size([2, 4])
npu_grid_assign_positive(self, overlaps, box_responsible_flags, max_overlaps, argmax_overlaps, gt_max_overlaps, gt_argmax_overlaps, num_gts, pos_iou_thr, min_pos_iou, gt_max_assign_all) -> Tensor
Performs positive sample assignment on grid anchors. It is a customized mmdetection operator.
>>> assigned_gt_inds = torch.rand(4).npu()
>>> overlaps = torch.rand(2, 4).npu()
>>> box_responsible_flags = torch.tensor([1, 1, 1, 0], dtype=torch.uint8).npu()
>>> max_overlap = torch.rand(4).npu()
>>> argmax_overlap = torch.tensor([1, 0, 1, 0], dtype=torch.int32).npu()
>>> gt_max_overlaps = torch.rand(2).npu()
>>> gt_argmax_overlaps = torch.tensor([1, 0], dtype=torch.int32).npu()
>>> output = torch.npu_grid_assign_positive(assigned_gt_inds, overlaps, box_responsible_flags, max_overlap, argmax_overlap, gt_max_overlaps, gt_argmax_overlaps, 128, 0.5, 0., True)
>>> output.shape
torch.Size([4])
npu_normalize_batch(self, seq_len, normalize_type=0) -> Tensor
Performs batch normalization.
>>> a = np.random.uniform(1, 10, (2, 3, 6)).astype(np.float32)
>>> b = np.random.uniform(3, 6, (2)).astype(np.int32)
>>> x = torch.from_numpy(a).to("npu")
>>> seqlen = torch.from_numpy(b).to("npu")
>>> out = torch.npu_normalize_batch(x, seqlen, 0)
>>> out
tensor([[[ 1.1496, -0.6685, -0.4812,  1.7611, -0.5187,  0.7571],
         [ 1.1445, -0.4393, -0.7051,  1.0474, -0.2646, -0.1582],
         [ 0.1477,  0.9179, -1.0656, -6.8692, -6.7437,  2.8621]],
        [[-0.6880,  0.1337,  1.3623, -0.8081, -1.2291, -0.9410],
         [ 0.3070,  0.5489, -1.4858,  0.6300,  0.6428,  0.0433],
         [-0.5387,  0.8204, -1.1401,  0.8584, -0.3686,  0.8444]]],
       device='npu:0')
npu_masked_fill_range(self, start, end, value, axis=-1) -> Tensor
Fills the tensor along one axis over the given ranges with a value. It is a customized masked-fill-range operator.
>>> a = torch.rand(4, 4).npu()
>>> a
tensor([[0.9419, 0.4919, 0.2874, 0.6560],
        [0.6691, 0.6668, 0.0330, 0.1006],
        [0.3888, 0.7011, 0.7141, 0.7878],
        [0.0366, 0.9738, 0.4689, 0.0979]], device='npu:0')
>>> start = torch.tensor([[0, 1, 2]], dtype=torch.int32).npu()
>>> end = torch.tensor([[1, 2, 3]], dtype=torch.int32).npu()
>>> value = torch.tensor([1], dtype=torch.float).npu()
>>> out = torch.npu_masked_fill_range(a, start, end, value, 1)
>>> out
tensor([[1.0000, 0.4919, 0.2874, 0.6560],
        [0.6691, 1.0000, 0.0330, 0.1006],
        [0.3888, 0.7011, 1.0000, 0.7878],
        [0.0366, 0.9738, 0.4689, 0.0979]], device='npu:0')
npu_linear(input, weight, bias=None) -> Tensor
Multiplies matrix "a" by matrix "b", producing "a * b" .
>>> x = torch.rand(2, 16).npu()
>>> w = torch.rand(4, 16).npu()
>>> b = torch.rand(4).npu()
>>> output = torch.npu_linear(x, w, b)
>>> output
tensor([[3.6335, 4.3713, 2.4440, 2.0081],
        [5.3273, 6.3089, 3.9601, 3.2410]], device='npu:0')
npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, step_size=None, adam_mode=0, *, out=(var,m,v))
Computes the Adam optimizer result.
>>> var_in = torch.rand(321538).uniform_(-32., 21.).npu()
>>> m_in = torch.zeros(321538).npu()
>>> v_in = torch.zeros(321538).npu()
>>> grad = torch.rand(321538).uniform_(-0.05, 0.03).npu()
>>> max_grad_norm = -1.
>>> beta1 = 0.9
>>> beta2 = 0.99
>>> weight_decay = 0.
>>> lr = 0.
>>> epsilon = 1e-06
>>> global_grad_norm = 0.
>>> var_out, m_out, v_out = torch.npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, out=(var_in, m_in, v_in))
>>> var_out
tensor([ 14.7733, -30.1218,  -1.3647,  ..., -16.6840,   7.1518,   8.4872],
       device='npu:0')
npu_giou(self, gtboxes, trans=False, is_cross=False, mode=0) -> Tensor
First calculates the minimum enclosing area of the two boxes and the IoU, then the proportion of the enclosing area that does not belong to either box, and finally subtracts this proportion from the IoU to obtain the GIoU.
>>> a = np.random.uniform(0, 1, (4, 10)).astype(np.float16)
>>> b = np.random.uniform(0, 1, (4, 10)).astype(np.float16)
>>> box1 = torch.from_numpy(a).to("npu")
>>> box2 = torch.from_numpy(a).to("npu")
>>> output = torch.npu_giou(box1, box2, trans=True, is_cross=False, mode=0)
>>> output
tensor([[1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.]], device='npu:0', dtype=torch.float16)
npu_silu(self) -> Tensor
Computes the Swish (SiLU) activation of "x".
>>> a = torch.rand(2, 8).npu()
>>> output = torch.npu_silu(a)
>>> output
tensor([[0.4397, 0.7178, 0.5190, 0.2654, 0.2230, 0.2674, 0.6051, 0.3522],
        [0.4679, 0.1764, 0.6650, 0.3175, 0.0530, 0.4787, 0.5621, 0.4026]],
       device='npu:0')
npu_reshape(self, shape, bool can_refresh=False) -> Tensor
Reshapes a tensor. Only the tensor shape is changed, without changing the data.
This operator cannot be directly called by the aclopExecute API.
>>> a = torch.rand(2, 8).npu()
>>> out = torch.npu_reshape(a, (4, 4))
>>> out
tensor([[0.6657, 0.9857, 0.7614, 0.4368],
        [0.3761, 0.4397, 0.8609, 0.5544],
        [0.7002, 0.3063, 0.9279, 0.5085],
        [0.1009, 0.7133, 0.8118, 0.6193]], device='npu:0')
npu_rotated_overlaps(self, query_boxes, trans=False) -> Tensor
Calculates the overlapping area of rotated boxes.
>>> a = np.random.uniform(0, 1, (1, 3, 5)).astype(np.float16)
>>> b = np.random.uniform(0, 1, (1, 2, 5)).astype(np.float16)
>>> box1 = torch.from_numpy(a).to("npu")
>>> box2 = torch.from_numpy(a).to("npu")
>>> output = torch.npu_rotated_overlaps(box1, box2, trans=False)
>>> output
tensor([[[0.0000, 0.1562, 0.0000],
         [0.1562, 0.3713, 0.0611],
         [0.0000, 0.0611, 0.0000]]], device='npu:0', dtype=torch.float16)
npu_rotated_iou(self, query_boxes, trans=False, mode=0, is_cross=True) -> Tensor
Calculates the IoU of rotated boxes.
>>> a = np.random.uniform(0, 1, (2, 2, 5)).astype(np.float16)
>>> b = np.random.uniform(0, 1, (2, 3, 5)).astype(np.float16)
>>> box1 = torch.from_numpy(a).to("npu")
>>> box2 = torch.from_numpy(a).to("npu")
>>> output = torch.npu_rotated_iou(box1, box2, trans=False, mode=0, is_cross=True)
>>> output
tensor([[[3.3325e-01, 1.0162e-01],
         [1.0162e-01, 1.0000e+00]],
        [[0.0000e+00, 0.0000e+00],
         [0.0000e+00, 5.9605e-08]]], device='npu:0', dtype=torch.float16)
npu_rotated_box_encode(anchor_box, gt_bboxes, weight) -> Tensor
Rotate Bounding Box Encoding.
>>> anchor_boxes = torch.tensor([[[30.69], [32.6], [45.94], [59.88], [-44.53]]], dtype=torch.float16).to("npu")
>>> gt_bboxes = torch.tensor([[[30.44], [18.72], [33.22], [45.56], [8.5]]], dtype=torch.float16).to("npu")
>>> weight = torch.tensor([1., 1., 1., 1., 1.], dtype=torch.float16).npu()
>>> out = torch.npu_rotated_box_encode(anchor_boxes, gt_bboxes, weight)
>>> out
tensor([[[-0.4253],
         [-0.5166],
         [-1.7021],
         [-0.0162],
         [ 1.1328]]], device='npu:0', dtype=torch.float16)
npu_rotated_box_decode(anchor_boxes, deltas, weight) -> Tensor
Rotate Bounding Box Decoding.
>>> anchor_boxes = torch.tensor([[[4.137], [33.72], [29.4], [54.06], [41.28]]], dtype=torch.float16).to("npu")
>>> deltas = torch.tensor([[[0.0244], [-1.992], [0.2109], [0.315], [-37.25]]], dtype=torch.float16).to("npu")
>>> weight = torch.tensor([1., 1., 1., 1., 1.], dtype=torch.float16).npu()
>>> out = torch.npu_rotated_box_decode(anchor_boxes, deltas, weight)
>>> out
tensor([[[  1.7861],
         [-10.5781],
         [ 33.0000],
         [ 17.2969],
         [-88.4375]]], device='npu:0', dtype=torch.float16)