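NPU custom operators can be called only through torch_npu.xxx. The torch.xxx and torch.Tensor.xxx call forms have been deprecated, and using them may raise an error, as shown in Figure 1.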

Figure 1 Error message

Table 1 NPU custom operators
No.	Operator name
1	torch_npu._npu_dropout
2	torch_npu.copy_memory_
3	torch_npu.empty_with_format
4	torch_npu.fast_gelu
5	torch_npu.npu_alloc_float_status
6	torch_npu.npu_anchor_response_flags
7	torch_npu.npu_apply_adam
8	torch_npu.npu_batch_nms
9	torch_npu.npu_bert_apply_adam
10	torch_npu.npu_bmmV2
11	torch_npu.npu_bounding_box_decode
12	torch_npu.npu_bounding_box_encode
13	torch_npu.npu_broadcast
14	torch_npu.npu_ciou
15	torch_npu.npu_clear_float_status
16	torch_npu.npu_confusion_transpose
17	torch_npu.npu_conv_transpose2d
18	torch_npu.npu_conv2d
19	torch_npu.npu_conv3d
20	torch_npu.npu_convolution
21	torch_npu.npu_convolution_transpose
22	torch_npu.npu_deformable_conv2d
23	torch_npu.npu_diou
24	torch_npu.npu_dtype_cast
25	torch_npu.npu_format_cast
26	torch_npu.npu_format_cast_
27	torch_npu.npu_get_float_status
28	torch_npu.npu_giou
29	torch_npu.npu_grid_assign_positive
30	torch_npu.npu_gru
31	torch_npu.npu_ifmr
32	torch_npu.npu_indexing
33	torch_npu.npu_iou
34	torch_npu.npu_layer_norm_eval
35	torch_npu.npu_linear
36	torch_npu.npu_lstm
37	torch_npu.npu_masked_fill_range
38	torch_npu.npu_max
39	torch_npu.npu_min
40	torch_npu.npu_mish
41	torch_npu.npu_nms_rotated
42	torch_npu.npu_nms_v4
43	torch_npu.npu_nms_with_mask
44	torch_npu.npu_normalize_batch
45	torch_npu.npu_one_hot
46	torch_npu.npu_pad
47	torch_npu.npu_ps_roi_pooling
48	torch_npu.npu_ptiou
49	torch_npu.npu_random_choice_with_mask
50	torch_npu.npu_reshape
51	torch_npu.npu_roi_align
52	torch_npu.npu_rotated_box_decode
53	torch_npu.npu_rotated_box_encode
54	torch_npu.npu_rotated_iou
55	torch_npu.npu_rotated_overlaps
56	torch_npu.npu_scatter
57	torch_npu.npu_sign_bits_pack
58	torch_npu.npu_sign_bits_unpack
59	torch_npu.npu_silu
60	torch_npu.npu_slice
61	torch_npu.npu_softmax_cross_entropy_with_logits
62	torch_npu.npu_sort_v2
63	torch_npu.npu_stride_add
64	torch_npu.npu_transpose
65	torch_npu.npu_yolo_boxes_encode
66	torch_npu.one_

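Mapping

Some parameters of the NPU custom operators take mapped values, as listed in the following table.

Table 2 Mapping table
Parameter	Mapped value	Description
ACL_FORMAT_UNDEFINED	-1	Mapped value of the Format parameter.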
ACL_FORMAT_NCHW	0
ACL_FORMAT_NHWC	1
ACL_FORMAT_ND	2
ACL_FORMAT_NC1HWC0	3
ACL_FORMAT_FRACTAL_Z	4
ACL_FORMAT_NC1HWC0_C04	12
ACL_FORMAT_HWCN	16
ACL_FORMAT_NDHWC	27
ACL_FORMAT_FRACTAL_NZ	29
ACL_FORMAT_NCDHW	30
ACL_FORMAT_NDC1HWC0	32
ACL_FRACTAL_Z_3D	33
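
The mapping in Table 2 can be kept in code so that integer format values passed to or returned by the operators are easier to read. The sketch below is only an illustration: the class name AclFormat and the helper format_name are hypothetical and are not part of torch_npu; the member names and values are copied from Table 2.

```python
from enum import IntEnum


class AclFormat(IntEnum):
    """Format values copied from Table 2 (hypothetical helper, not a torch_npu API)."""
    ACL_FORMAT_UNDEFINED = -1
    ACL_FORMAT_NCHW = 0
    ACL_FORMAT_NHWC = 1
    ACL_FORMAT_ND = 2
    ACL_FORMAT_NC1HWC0 = 3
    ACL_FORMAT_FRACTAL_Z = 4
    ACL_FORMAT_NC1HWC0_C04 = 12
    ACL_FORMAT_HWCN = 16
    ACL_FORMAT_NDHWC = 27
    ACL_FORMAT_FRACTAL_NZ = 29
    ACL_FORMAT_NCDHW = 30
    ACL_FORMAT_NDC1HWC0 = 32
    ACL_FRACTAL_Z_3D = 33


def format_name(value: int) -> str:
    """Return the Table 2 name for an integer format value."""
    return AclFormat(value).name


print(format_name(29))
print(int(AclFormat.ACL_FORMAT_NCHW))
```

An integer such as 29 can then be written as AclFormat.ACL_FORMAT_FRACTAL_NZ wherever an Int format argument is expected, since IntEnum members behave as plain integers.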

Detailed operator interface descriptions

torch_npu.npu_apply_adam(beta1_power, beta2_power, lr, beta1, beta2, epsilon, grad, use_locking, use_nesterov, out = (var, m, v))

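Computes the Adam update result.

Parameters:
- beta1_power (Scalar) - Power of beta1.
- beta2_power (Scalar) - Power of beta2.
- lr (Scalar) - Learning rate.
- beta1 (Scalar) - Exponential decay rate of the first-moment estimate.
- beta2 (Scalar) - Exponential decay rate of the second-moment estimate.
- epsilon (Scalar) - Term added to the denominator to improve numerical stability.
- grad (Tensor) - Gradient.
- use_locking (Bool, optional) - If True, the update is performed under a lock.
- use_nesterov (Bool, optional) - If True, the Nesterov update is used.
- var (Tensor) - Variable to be optimized.
- m (Tensor) - Mean of the variable.
- v (Tensor) - Variance of the variable.
Constraints:
None
Example:
None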
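
The source does not give the update formula used by the NPU operator. As a rough illustration of how the parameters above relate to each other, the following pure-Python sketch applies one step of the commonly used Adam rule (without Nesterov momentum and without locking). The function name adam_step and the exact formula are assumptions, not the operator's documented behavior.

```python
import math


def adam_step(var, m, v, grad, beta1_power, beta2_power, lr, beta1, beta2, epsilon):
    """One standard Adam step on plain Python lists (assumed formula, see lead-in)."""
    # Bias-corrected step size derived from the running powers of beta1 and beta2.
    lr_t = lr * math.sqrt(1.0 - beta2_power) / (1.0 - beta1_power)
    # First-moment (m) and second-moment (v) estimates.
    new_m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    new_v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Parameter update.
    new_var = [
        x - lr_t * mi / (math.sqrt(vi) + epsilon)
        for x, mi, vi in zip(var, new_m, new_v)
    ]
    return new_var, new_m, new_v


# First step: beta1_power == beta1 and beta2_power == beta2.
var, m, v = adam_step(
    var=[1.0], m=[0.0], v=[0.0], grad=[1.0],
    beta1_power=0.9, beta2_power=0.999,
    lr=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8,
)
print(var, m, v)
```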

torch_npu.npu_convolution_transpose(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor

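Applies a 2D or 3D transposed convolution operator over an input image composed of several input planes. This operation is sometimes also called "deconvolution".

Parameters:
- input (Tensor) - Input tensor of shape (minibatch, in_channels, iH, iW) or (minibatch, in_channels, iT, iH, iW).
- weight (Tensor) - Filters of shape (in_channels, out_channels/groups, kH, kW) or (in_channels, out_channels/groups, kT, kH, kW).
- bias (Tensor, optional) - Bias of shape (out_channels).
- padding (ListInt) - (dilation * (kernel_size - 1) - padding) zeros are added to both sides of each dimension of the input.
- output_padding (ListInt) - Additional size added to one side of each dimension of the output shape.
- stride (ListInt) - Stride of the convolution kernel.
- dilation (ListInt) - Spacing between kernel elements.
- groups (Int) - Splits the input into groups. in_channels must be divisible by the number of groups.
Constraints:
None
Example:
None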
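
The parameter descriptions above match the usual transposed-convolution conventions, so the output size along one spatial dimension can be estimated with the standard formula used by PyTorch's ConvTranspose layers. The sketch below assumes the NPU operator follows that formula; the helper names are hypothetical.

```python
def effective_padding(kernel_size, padding, dilation=1):
    """Zeros implicitly added to each side of the input, as stated for `padding`."""
    return dilation * (kernel_size - 1) - padding


def conv_transpose_out_size(in_size, kernel_size, stride=1, padding=0,
                            output_padding=0, dilation=1):
    """Output length along one spatial dimension (standard formula, assumed here)."""
    return ((in_size - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# A 4-pixel input, 3x3 kernel, stride 2, padding 1, output_padding 1.
print(conv_transpose_out_size(4, 3, stride=2, padding=1, output_padding=1))
print(effective_padding(3, 1))
```

With these values the 4-pixel dimension is upsampled to 8, which is the common "double the resolution" configuration.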

torch_npu.npu_conv_transpose2d(input, weight, bias, padding, output_padding, stride, dilation, groups) -> Tensor

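Applies a 2D transposed convolution operator over an input image composed of several input planes. This operation is sometimes also called "deconvolution".

Parameters:
- input (Tensor) - Input tensor of shape (minibatch, in_channels, iH, iW).
- weight (Tensor) - Filters of shape (in_channels, out_channels/groups, kH, kW).
- bias (Tensor, optional) - Bias of shape (out_channels).
- padding (ListInt) - (dilation * (kernel_size - 1) - padding) zeros are added to both sides of each dimension of the input.
- output_padding (ListInt) - Additional size added to one side of each dimension of the output shape.
- stride (ListInt) - Stride of the convolution kernel.
- dilation (ListInt) - Spacing between kernel elements.
- groups (Int) - Splits the input into groups. in_channels must be divisible by the number of groups.
Constraints:
None
Example:
None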

torch_npu.npu_convolution(input, weight, bias, stride, padding, dilation, groups) -> Tensor

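Applies a 2D or 3D convolution over an input image composed of several input planes.

Parameters:
- input (Tensor) - Input tensor of shape (minibatch, in_channels, iH, iW) or (minibatch, in_channels, iT, iH, iW).
- weight (Tensor) - Filters of shape (out_channels, in_channels/groups, kH, kW) or (out_channels, in_channels/groups, kT, kH, kW).
- bias (Tensor, optional) - Bias of shape (out_channels).
- stride (ListInt) - Stride of the convolution kernel.
- padding (ListInt) - Implicit padding on both sides of the input.
- dilation (ListInt) - Spacing between kernel elements.
- groups (Int) - Splits the input into groups. in_channels must be divisible by the number of groups.
Constraints:
None
Example:
None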
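
To check shapes before calling a convolution operator, the expected weight shape and output size can be computed from the parameters listed above. The sketch below uses the standard convolution arithmetic and assumes the NPU operators follow it; the helper names are hypothetical.

```python
def conv_out_size(in_size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension (standard formula, assumed here)."""
    return (in_size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def conv2d_weight_shape(out_channels, in_channels, groups, kernel_hw):
    """Weight shape (out_channels, in_channels/groups, kH, kW) from the parameter list."""
    if in_channels % groups != 0:
        raise ValueError("in_channels must be divisible by groups")
    k_h, k_w = kernel_hw
    return (out_channels, in_channels // groups, k_h, k_w)


print(conv_out_size(32, 3, stride=1, padding=1))   # same-size convolution
print(conv_out_size(32, 3, stride=2, padding=1))   # stride-2 downsampling
print(conv2d_weight_shape(64, 32, groups=4, kernel_hw=(3, 3)))
```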

torch_npu.npu_conv2d(input, weight, bias, stride, padding, dilation, groups) -> Tensor

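Applies a 2D convolution over an input image composed of several input planes.

Parameters:
- input (Tensor) - Input tensor of shape (minibatch, in_channels, iH, iW).
- weight (Tensor) - Filters of shape (out_channels, in_channels/groups, kH, kW).
- bias (Tensor, optional) - Bias of shape (out_channels).
- stride (ListInt) - Stride of the convolution kernel.
- padding (ListInt) - Implicit padding on both sides of the input.
- dilation (ListInt) - Spacing between kernel elements.
- groups (Int) - Splits the input into groups. in_channels must be divisible by the number of groups.
Constraints:
None
Example:
None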

torch_npu.npu_conv3d(input, weight, bias, stride, padding, dilation, groups) -> Tensor

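Applies a 3D convolution over an input image composed of several input planes.

Parameters:
- input (Tensor) - Input tensor of shape (minibatch, in_channels, iT, iH, iW).
- weight (Tensor) - Filters of shape (out_channels, in_channels/groups, kT, kH, kW).
- bias (Tensor, optional) - Bias of shape (out_channels).
- stride (ListInt) - Stride of the convolution kernel.
- padding (ListInt) - Implicit padding on both sides of the input.
- dilation (ListInt) - Spacing between kernel elements.
- groups (Int) - Splits the input into groups. in_channels must be divisible by the number of groups.
Constraints:
None
Example:
None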

torch_npu.one_(self) -> Tensor

Fills the self tensor with ones.

Parameters:
- self (Tensor) - Input tensor.
Constraints:
None

Example:

>>> x = torch.rand(2, 3).npu()
>>> x
tensor([[0.6072, 0.9726, 0.3475],
        [0.3717, 0.6135, 0.6788]], device='npu:0')
>>> x.one_()
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='npu:0')

torch_npu.npu_sort_v2(self, dim=-1, descending=False, out=None) -> Tensor

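Sorts the elements of the input tensor in ascending order by value along the given dimension, without returning indices. If dim is not set, the last dimension of the input is used. If descending is True, the elements are sorted in descending order by value.

Parameters:
- self (Tensor) - Input tensor.
- dim (Int, optional, default -1) - Dimension to sort along.
- descending (Bool, optional, default False) - Controls the sort order (ascending or descending).
Constraints:
Currently only the last dimension of the input (dim=-1) is supported.

Example: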

>>> x = torch.randn(3, 4).npu()
>>> x
tensor([[-0.0067,  1.7790,  0.5031, -1.7217],
        [ 1.1685, -1.0486, -0.2938,  1.3241],
        [ 0.1880, -2.7447,  1.3976,  0.7380]], device='npu:0')
>>> sorted_x = torch_npu.npu_sort_v2(x)
>>> sorted_x
tensor([[-1.7217, -0.0067,  0.5029,  1.7793],
        [-1.0488, -0.2937,  1.1689,  1.3242],
        [-2.7441,  0.1880,  0.7378,  1.3975]], device='npu:0')

torch_npu.npu_format_cast(self, acl_format) -> Tensor

Changes the format of an NPU tensor.

Parameters:
- self (Tensor) - Input tensor.
- acl_format (Int) - Target format.
Constraints:
None

Example:

>>> x = torch.rand(2, 3, 4, 5).npu()
>>> torch_npu.get_npu_format(x)
0
>>> x1 = x.npu_format_cast(29)
>>> torch_npu.get_npu_format(x1)
29

torch_npu.npu_format_cast_(self, src) -> Tensor

Changes the format of the self tensor in place so that it matches the format of src.

Parameters:
- self (Tensor) - Input tensor.
- src (Tensor, int) - Target format.
Constraints:
None

Example:

>>> x = torch.rand(2, 3, 4, 5).npu()
>>> torch_npu.get_npu_format(x)
0
>>> torch_npu.get_npu_format(x.npu_format_cast_(29))
29

torch_npu.npu_transpose(self, perm, require_contiguous=True) -> Tensor

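Returns a view of the original tensor with its dimensions permuted. The result is contiguous.

Parameters:
- self (Tensor) - Input tensor.
- perm (ListInt) - Desired ordering of the dimensions.
- require_contiguous (Bool, default True) - Whether the user explicitly asks the npu_contiguous operator adaptation to make the input Tensor contiguous. False is described as the default, low-performance mode. Set it to True for the high-performance mode only when the input Tensor is known to be contiguous or transposed.
Constraints:
None

Example: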

>>> x = torch.randn(2, 3, 5).npu()
>>> x.shape
torch.Size([2, 3, 5])
>>> x1 = torch_npu.npu_transpose(x, (2, 0, 1))
>>> x1.shape
torch.Size([5, 2, 3])
>>> x2 = x.npu_transpose(2, 0, 1)
>>> x2.shape
torch.Size([5, 2, 3])

torch_npu.npu_broadcast(self, size) -> Tensor

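Returns a new view of the self tensor with its singleton dimensions expanded. The result is contiguous.

The tensor can also be expanded to more dimensions. The new dimensions are added at the front.

Parameters:
- self (Tensor) - Input tensor.
- size (ListInt) - Desired expanded size.
Constraints:
None

Example: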

>>> x = torch.tensor([[1], [2], [3]]).npu()
>>> x.shape
torch.Size([3, 1])
>>> x.npu_broadcast(3, 4)
tensor([[1, 1, 1, 1],
        [2, 2, 2, 2],
        [3, 3, 3, 3]], device='npu:0')

torch_npu.npu_dtype_cast(input, dtype) -> Tensor

Converts the data type (dtype) of a tensor.

Parameters:
- input (Tensor) - Input tensor.
- dtype (torch.dtype) - Target data type of the returned tensor.
Constraints:
None

Example:

>>> torch_npu.npu_dtype_cast(torch.tensor([0, 0.5, -1.]).npu(), dtype=torch.int)
tensor([ 0,  0, -1], device='npu:0', dtype=torch.int32)

torch_npu.empty_with_format(size, dtype, layout, device, pin_memory, acl_format)

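Returns a tensor filled with uninitialized data.

Parameters:
- size (ListInt) - Sequence of integers defining the shape of the output tensor. It can be a variable number of arguments or a collection such as a list or tuple.
- dtype (torch.dtype, optional, default None) - Desired data type of the returned tensor. If None, the global default is used (see torch.set_default_tensor_type()).
- layout (torch.layout, optional, default torch.strided) - Desired layout of the returned tensor.
- device (torch.device, optional, default None) - Desired device of the returned tensor.
- pin_memory (Bool, optional, default False) - If set, the returned tensor is allocated in pinned memory.
- acl_format (Int, default 2) - Desired memory format of the returned tensor.

Constraints:
None

Example: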

>>> torch_npu.empty_with_format((2, 3), dtype=torch.float32, device="npu")
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='npu:0')

torch_npu.copy_memory_(dst, src, non_blocking=False) -> Tensor

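Copies the elements of src into the self tensor and returns self.

Parameters:
- dst (Tensor) - Destination tensor of the copy.
- src (Tensor) - Source tensor to copy from.
- non_blocking (Bool, default False) - If True and the copy is between CPU and NPU, the copy may occur asynchronously with respect to the host. In other cases this parameter has no effect.
Constraints:
copy_memory_ supports NPU tensors only. The input tensors of copy_memory_ must have the same dtype and device index.

Example: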

>>> a=torch.IntTensor([0,  0, -1]).npu()
>>> b=torch.IntTensor([1, 1, 1]).npu()
>>> a.copy_memory_(b)
tensor([1, 1, 1], device='npu:0', dtype=torch.int32)

torch_npu.npu_one_hot(input, num_classes=-1, depth=1, on_value=1, off_value=0) -> Tensor

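Returns a one-hot tensor. The positions given by the indices in input take the value on_value, and all other positions take the value off_value.

Parameters:
- input (Tensor) - Class values of any shape.
- num_classes (Int, default -1) - Axis to fill.
- depth (Int, default 1) - Depth of the one_hot dimension.
- on_value (Scalar, default 1) - Value to fill in the output when indices[j] == i.
- off_value (Scalar, default 0) - Value to fill in the output when indices[j] != i.
Constraints:
None

Example: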

>>> a=torch.IntTensor([5, 3, 2, 1]).npu()
>>> b=torch_npu.npu_one_hot(a, depth=5)
>>> b
tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.],
        [0., 1., 0., 0., 0.]], device='npu:0')
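
The on_value/off_value rule described for this operator can be written out in a few lines of plain Python. The sketch below is only a reference for the semantics, not the NPU implementation; the function name one_hot is hypothetical. It reproduces the example above, including the all-zero first row for the out-of-range index 5.

```python
def one_hot(indices, depth, on_value=1.0, off_value=0.0):
    """Row j holds on_value at column indices[j] and off_value elsewhere."""
    return [
        [on_value if idx == i else off_value for i in range(depth)]
        for idx in indices
    ]


for row in one_hot([5, 3, 2, 1], depth=5):
    print(row)
```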

torch_npu.npu_stride_add(x1, x2, offset1, offset2, c1_len) -> Tensor

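Adds the partial values of two tensors in NC1HWC0 format.

Parameters:
- x1 (Tensor) - 5HD tensor.
- x2 (Tensor) - Tensor with the same type and the same shape as "x1", except for the C1 value.
- offset1 (Scalar) - Offset value of C1 in "x1".
- offset2 (Scalar) - Offset value of C1 in "x2".
- c1_len (Scalar) - C1 length of "y". The value must be less than the difference between C1 and the offset in "x1" and "x2".
Constraints:
None

Example: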

>>> a=torch.tensor([[[[[1.]]]]]).npu()
>>> b=torch_npu.npu_stride_add(a, a, 0, 0, 1)
>>> b
tensor([[[[[2.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]],
        [[[0.]]]]], device='npu:0')

torch_npu.npu_softmax_cross_entropy_with_logits(features, labels) -> Tensor

Computes the softmax cross-entropy cost.

Parameters:
- features (Tensor) - Tensor, a "batch_size * num_classes" matrix.
- labels (Tensor) - Tensor of the same type as "features", a "batch_size * num_classes" matrix.
Constraints:
None
Example:
None

torch_npu.npu_ps_roi_pooling(x, rois, spatial_scale, group_size, output_dim) -> Tensor

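Performs Position Sensitive ROI Pooling.

Parameters:
- x (Tensor) - NC1HWC0 tensor describing the feature map. Dimension C1 must equal (int(output_dim+15)/C0)) group_size.
- rois (Tensor) - Tensor of shape [batch, 5, rois_num] describing the ROIs. Each ROI consists of five elements: "batch_id", "x1", "y1", "x2", and "y2". "batch_id" is the index of the input feature map, and "x1", "y1", "x2", and "y2" must be greater than or equal to "0.0".
- spatial_scale (Float32) - Scaling factor that maps the input coordinates to the ROI coordinates.
- group_size (Int32) - Number of groups used to encode the position-sensitive score maps. The value must be in the range (0, 128).
- output_dim (Int32) - Number of output channels. Must be greater than 0.
Constraints:
None

Example: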

>>> roi = torch.tensor([[[1], [2], [3], [4], [5]],
                        [[6], [7], [8], [9], [10]]], dtype = torch.float16).npu()
>>> x = torch.tensor([[[[ 1]], [[ 2]], [[ 3]], [[ 4]],
                      [[ 5]], [[ 6]], [[ 7]], [[ 8]]],
                      [[[ 9]], [[10]], [[11]], [[12]],
                      [[13]], [[14]], [[15]], [[16]]]], dtype = torch.float16).npu()
>>> out = torch_npu.npu_ps_roi_pooling(x, roi, 0.5, 2, 2)
>>> out
tensor([[[[0., 0.],
          [0., 0.]],
        [[0., 0.],
          [0., 0.]]],
        [[[0., 0.],
          [0., 0.]],
        [[0., 0.],
          [0., 0.]]]], device='npu:0', dtype=torch.float16)

torch_npu.npu_roi_align(features, rois, spatial_scale, pooled_height, pooled_width, sample_num, roi_end_mode) -> Tensor

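Obtains the ROI feature matrix from the feature map. Custom FasterRcnn operator.

Parameters:
- features (Tensor) - 5HD tensor.
- rois (Tensor) - ROI positions, a 2-D tensor of shape (N, 5). "N" is the number of ROIs. "5" covers the index of the image where the ROI is located, followed by "x0", "y0", "x1", and "y1".
- spatial_scale (Float32) - Scaling ratio of "features" to the original image.
- pooled_height (Int32) - Size of the H dimension.
- pooled_width (Int32) - Size of the W dimension.
- sample_num (Int32, default 2) - Horizontal and vertical sampling frequency for each output. If set to 0, the sampling frequency equals the rounded-up value of "rois" (a floating-point number).
- roi_end_mode (Int32, default 1)
Constraints:
None

Example: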

>>> x = torch.FloatTensor([[[[1, 2, 3 , 4, 5, 6],
                            [7, 8, 9, 10, 11, 12],
                            [13, 14, 15, 16, 17, 18],
                            [19, 20, 21, 22, 23, 24],
                            [25, 26, 27, 28, 29, 30],
                            [31, 32, 33, 34, 35, 36]]]]).npu()
>>> rois = torch.tensor([[0, -2.0, -2.0, 22.0, 22.0]]).npu()
>>> out = torch_npu.npu_roi_align(x, rois, 0.25, 3, 3, 2, 0)
>>> out
tensor([[[[ 4.5000,  6.5000,  8.5000],
          [16.5000, 18.5000, 20.5000],
          [28.5000, 30.5000, 32.5000]]]], device='npu:0')

torch_npu.npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold, pad_to_max_output_size=False) -> (Tensor, Tensor)

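Selects a subset of bounding boxes in descending order of score.

Parameters:
- boxes (Tensor) - 2-D float tensor of shape [num_boxes, 4].
- scores (Tensor) - 1-D float tensor of shape [num_boxes], holding one score for each box (each row of boxes).
- max_output_size (Scalar) - Scalar giving the maximum number of boxes to be selected by non-max suppression.
- iou_threshold (Tensor) - 0-D float tensor giving the IoU threshold above which boxes are considered to overlap too much.
- scores_threshold (Tensor) - 0-D float tensor giving the score threshold that decides when boxes are removed.
- pad_to_max_output_size (Bool, default False) - If True, the output selected_indices is padded to length max_output_size.
Returns:
- selected_indices (Tensor) - 1-D integer tensor of shape [M] holding the indices selected from the boxes tensor, where M <= max_output_size.
- valid_outputs (Tensor) - 0-D integer tensor giving the number of valid elements in selected_indices. Valid elements appear first.
Constraints:
None

Example: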

>>> boxes=torch.randn(100,4).npu()
>>> scores=torch.randn(100).npu()
>>> boxes.uniform_(0,100)
>>> scores.uniform_(0,1)
>>> max_output_size = 20
>>> iou_threshold = torch.tensor(0.5).npu()
>>> scores_threshold = torch.tensor(0.3).npu()
>>> npu_output = torch_npu.npu_nms_v4(boxes, scores, max_output_size, iou_threshold, scores_threshold)
>>> npu_output
(tensor([57, 65, 25, 45, 43, 12, 52, 91, 23, 78, 53, 11, 24, 62, 22, 67,  9, 94,
        54, 92], device='npu:0', dtype=torch.int32), tensor(20, device='npu:0', dtype=torch.int32))
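
The selection rule described above corresponds to the widely used greedy non-max suppression procedure. The following pure-Python sketch illustrates that procedure; it is not the NPU kernel. The (x1, y1, x2, y2) box layout, the strict comparisons, and the function names are assumptions made for the illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2); the layout is an assumption."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, max_output_size, iou_threshold, scores_threshold):
    """Return (selected_indices, valid_outputs) using greedy NMS."""
    # Drop low-scoring boxes, then visit the rest from highest to lowest score.
    order = sorted(
        (i for i, s in enumerate(scores) if s > scores_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    selected = []
    for i in order:
        if len(selected) >= max_output_size:
            break
        # Keep a box only if it does not overlap too much with any kept box.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in selected):
            selected.append(i)
    return selected, len(selected)


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 5, 5]]
scores = [0.9, 0.8, 0.7, 0.1]
print(greedy_nms(boxes, scores, max_output_size=20,
                 iou_threshold=0.5, scores_threshold=0.3))
```

In this toy input, box 1 is suppressed because it overlaps box 0 too much, and box 3 is dropped by the score threshold.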

torch_npu.npu_nms_rotated(dets, scores, iou_threshold, scores_threshold=0, max_output_size=-1, mode=0) -> (Tensor, Tensor)

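Selects a subset of rotated bounding boxes in descending order of score.

Parameters:
- dets (Tensor) - 2-D float tensor of shape [num_boxes, 5].
- scores (Tensor) - 1-D float tensor of shape [num_boxes], holding one score for each box (each row of boxes).
- iou_threshold (Float) - Scalar giving the IoU threshold above which boxes are considered to overlap too much.
- scores_threshold (Float, default 0) - Scalar giving the score threshold that decides when boxes are removed.
- max_output_size (Int, default -1) - Scalar integer giving the maximum number of boxes to be selected by non-max suppression. A value of -1 imposes no limit.
- mode (Int, default 0) - Layout type of dets. If mode is 0, the dets values are x, y, w, h, and angle. If mode is 1, the dets values are x1, y1, x2, y2, and angle.
Returns:
- selected_index (Tensor) - 1-D integer tensor of shape [M] holding the indices selected from the dets tensor, where M <= max_output_size.
- selected_num (Tensor) - 0-D integer tensor giving the number of valid elements in selected_index.
Constraints:
The mode=1 scenario is currently not supported.

Example: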

>>> dets=torch.randn(100,5).npu()
>>> scores=torch.randn(100).npu()
>>> dets.uniform_(0,100)
>>> scores.uniform_(0,1)
>>> output1, output2 = torch_npu.npu_nms_rotated(dets, scores, 0.2, 0, -1, 1)
>>> output1
tensor([76, 48, 15, 65, 91, 82, 21, 96, 62, 90, 13, 59,  0, 18, 47, 23,  8, 56,
        55, 63, 72, 39, 97, 81, 16, 38, 17, 25, 74, 33, 79, 44, 36, 88, 83, 37,
        64, 45, 54, 41, 22, 28, 98, 40, 30, 20,  1, 86, 69, 57, 43,  9, 42, 27,
        71, 46, 19, 26, 78, 66,  3, 52], device='npu:0', dtype=torch.int32)
>>> output2
tensor([62], device='npu:0', dtype=torch.int32)

torch_npu.npu_lstm(x, weight, bias, seqMask, h, c, has_biases, num_layers, dropout, train, bidirectional, batch_first, flag_seq, direction)

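Computes DynamicRNN.

Parameters:
- x (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- weight (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_ZN_LSTM.
- bias (Tensor) - 1-D tensor. Data type: float16, float32. Format: ND.
- seqMask (Tensor) - Tensor. Only float16 in FRACTAL_NZ format and int32 in ND format are supported.
- h (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- c (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- has_biases (Bool) - If True, a bias is present.
- num_layers (Int) - Number of recurrent layers. Currently only a single layer is supported.
- dropout (Float) - If non-zero, a dropout layer is introduced on the outputs of each LSTM layer except the last one, with a dropout probability equal to dropout. Currently not supported.
- train (Bool, default True) - Bool flag indicating whether the op runs in training mode.
- bidirectional (Bool) - If True, the LSTM is bidirectional. Currently not supported.
- batch_first (Bool) - If True, the input and output tensors are provided as (batch, seq, feature). Currently not supported.
- flag_seq (Bool) - If True, the input is a PackSequence. Currently not supported.
- direction (Bool) - If True, the direction is "REDIRECTIONAL"; otherwise it is "UNIDIRECTIONAL".
Returns:
- y (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- output_h (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- output_c (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- i (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- j (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- f (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- o (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
- tanhct (Tensor) - 4-D tensor. Data type: float16, float32. Format: FRACTAL_NZ.
Constraints:
None
Example:
None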

torch_npu.npu_iou(bboxes, gtboxes, mode=0) -> Tensor 
torch_npu.npu_ptiou(bboxes, gtboxes, mode=0) -> Tensor

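Computes the intersection over union (IoU) or the intersection over foreground (IoF) from the ground-truth and predicted regions.

Parameters:
- bboxes (Tensor) - Input tensor.
- gtboxes (Tensor) - Input tensor.
- mode (Int, default 0) - 0 selects IoU mode, 1 selects IoF mode.
Constraints:
None

Example: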

>>> bboxes = torch.tensor([[0, 0, 10, 10],
                           [10, 10, 20, 20],
                           [32, 32, 38, 42]], dtype=torch.float16).to("npu")
>>> gtboxes = torch.tensor([[0, 0, 10, 20],
                            [0, 10, 10, 10],
                            [10, 10, 20, 20]], dtype=torch.float16).to("npu")
>>> output_iou = torch_npu.npu_iou(bboxes, gtboxes, 0)
>>> output_iou
tensor([[0.4985, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000],
        [0.0000, 0.9961, 0.0000]], device='npu:0', dtype=torch.float16)
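
As a cross-check of the example above, the overlap ratios can be recomputed in plain Python. This is a sketch under stated assumptions: boxes are taken as (x1, y1, x2, y2), each output row is taken to correspond to one gtbox and each column to one bbox (inferred from the example output, not stated in the text), and in IoF mode the denominator is assumed to be the bbox area. The exact results are 0.5 and 1.0 where the float16 output shows 0.4985 and 0.9961.

```python
def overlap_ratio(bbox, gtbox, mode=0):
    """IoU (mode 0) or IoF (mode 1) of two (x1, y1, x2, y2) boxes; layout assumed."""
    inter_w = max(0.0, min(bbox[2], gtbox[2]) - max(bbox[0], gtbox[0]))
    inter_h = max(0.0, min(bbox[3], gtbox[3]) - max(bbox[1], gtbox[1]))
    inter = inter_w * inter_h
    bbox_area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    gt_area = (gtbox[2] - gtbox[0]) * (gtbox[3] - gtbox[1])
    # Assumption: IoF divides by the bbox (foreground) area.
    denom = bbox_area if mode == 1 else bbox_area + gt_area - inter
    return inter / denom if denom > 0 else 0.0


def overlap_matrix(bboxes, gtboxes, mode=0):
    """Rows follow gtboxes and columns follow bboxes (orientation inferred)."""
    return [[overlap_ratio(b, g, mode) for b in bboxes] for g in gtboxes]


bboxes = [[0, 0, 10, 10], [10, 10, 20, 20], [32, 32, 38, 42]]
gtboxes = [[0, 0, 10, 20], [0, 10, 10, 10], [10, 10, 20, 20]]
for row in overlap_matrix(bboxes, gtboxes):
    print(row)
```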

torch_npu.npu_pad(input, paddings) -> Tensor

Pads a tensor.

Parameters:
- input (Tensor) - Input tensor.
- paddings (ListInt) - Data type: int32, int64.
Constraints:
None

Example:

>>> input = torch.tensor([[20, 20, 10, 10]], dtype=torch.float16).to("npu")
>>> paddings = [1, 1, 1, 1]
>>> output = torch_npu.npu_pad(input, paddings)
>>> output
tensor([[ 0.,  0.,  0.,  0.,  0.,  0.],
        [ 0., 20., 20., 10., 10.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.]], device='npu:0', dtype=torch.float16)
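
The example above can be reproduced with a small zero-padding helper for 2-D data. The text does not say how the entries of paddings map to dimensions, so the order used here (rows before, rows after, columns before, columns after) is an assumption that happens to be consistent with the example; the function name pad_2d is hypothetical.

```python
def pad_2d(rows, paddings, fill=0.0):
    """Zero-pad a 2-D list. Assumed order: [top, bottom, left, right]."""
    top, bottom, left, right = paddings
    width = len(rows[0]) + left + right
    padded = [[fill] * width for _ in range(top)]
    for row in rows:
        padded.append([fill] * left + list(row) + [fill] * right)
    padded.extend([fill] * width for _ in range(bottom))
    return padded


for row in pad_2d([[20.0, 20.0, 10.0, 10.0]], [1, 1, 1, 1]):
    print(row)
```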

torch_npu.npu_nms_with_mask(input, iou_threshold) -> (Tensor, Tensor, Tensor)

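Generates the values 0 or 1, which the nms operator uses to determine the valid bits.

Parameters:
- input (Tensor) - Input tensor.
- iou_threshold (Scalar) - Threshold. If the value exceeds this threshold, the output is 1; otherwise it is 0.
Returns:
- selected_boxes (Tensor) - 2-D tensor of shape [N,5] holding the filtered boxes, including the proposal boxes and the corresponding confidence scores.
- selected_idx (Tensor) - 1-D tensor of shape [N] holding the indices of the input proposal boxes.
- selected_mask (Tensor) - 1-D tensor of shape [N] indicating whether each output proposal box is valid.
Constraints:
The 2nd-dim of the input box_scores must be equal to 8.

Example: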

>>> input = torch.tensor([[0.0, 1.0, 2.0, 3.0, 0.6], [6.0, 7.0, 8.0, 9.0, 0.4]], dtype=torch.float16).to("npu")
>>> iou_threshold = 0.5
>>> output1, output2, output3, = torch_npu.npu_nms_with_mask(input, iou_threshold)
>>> output1
tensor([[0.0000, 1.0000, 2.0000, 3.0000, 0.6001],
        [6.0000, 7.0000, 8.0000, 9.0000, 0.3999]], device='npu:0', dtype=torch.float16)
>>> output2
tensor([0, 1], device='npu:0', dtype=torch.int32)
>>> output3
tensor([1, 1], device='npu:0', dtype=torch.uint8)

torch_npu.npu_bounding_box_encode(anchor_box, ground_truth_box, means0, means1, means2, means3, stds0, stds1, stds2, stds3) -> Tensor

Computes the coordinate variations between the bounding boxes and the ground-truth boxes. Custom FasterRcnn operator.

Parameters:
- anchor_box (Tensor) - Input tensor. Anchor boxes. 2-D tensor of shape (N, 4) and data type float32. "N" is the number of bounding boxes, and "4" stands for "x0", "x1", "y0", and "y1".
- ground_truth_box (Tensor) - Input tensor. Ground-truth boxes. 2-D tensor of shape (N, 4) and data type float32. "N" is the number of bounding boxes, and "4" stands for "x0", "x1", "y0", and "y1".
- means0 (Float) - index.
- means1 (Float) - index.
- means2 (Float) - index.
- means3 (Float, default [0,0,0,0]) - index. "deltas" = "deltas" x "stds" + "means"
- stds0 (Float) - index.
- stds1 (Float) - index.
- stds2 (Float) - index.
- stds3 (Float, default [1.0,1.0,1.0,1.0]) - index. "deltas" = "deltas" x "stds" + "means"
Constraints:
None

Example:

>>> anchor_box = torch.tensor([[1., 2., 3., 4.], [3.,4., 5., 6.]], dtype = torch.float32).to("npu")
>>> ground_truth_box = torch.tensor([[5., 6., 7., 8.], [7.,8., 9., 6.]], dtype = torch.float32).to("npu")
>>> output = torch_npu.npu_bounding_box_encode(anchor_box, ground_truth_box, 0, 0, 0, 0, 0.1, 0.1, 0.2, 0.2)
>>> output
tensor([[13.3281, 13.3281,  0.0000,  0.0000],
        [13.3281,  6.6641,  0.0000, -5.4922]], device='npu:0')

torch_npu.npu_bounding_box_decode(rois, deltas, means0, means1, means2, means3, stds0, stds1, stds2, stds3, max_shape, wh_ratio_clip) -> Tensor

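Generates bounding boxes from rois and deltas. Custom FasterRcnn operator.

Parameters:
- rois (Tensor) - Regions of interest (ROIs) generated by the region proposal network (RPN). 2-D tensor of shape (N, 4) and data type float32 or float16. "N" is the number of ROIs, and "4" stands for "x0", "x1", "y0", and "y1".
- deltas (Tensor) - Absolute variations between the ROIs generated by the RPN and the ground-truth boxes. 2-D tensor of shape (N, 4) and data type float32 or float16. "N" is the number of errors, and "4" stands for "dx", "dy", "dw", and "dh".
- means0 (Float) - index.
- means1 (Float) - index.
- means2 (Float) - index.
- means3 (Float, default [0,0,0,0]) - index. "deltas" = "deltas" x "stds" + "means"
- stds0 (Float) - index.
- stds1 (Float) - index.
- stds2 (Float) - index.
- stds3 (Float, default [1.0,1.0,1.0,1.0]) - index. "deltas" = "deltas" x "stds" + "means"
- max_shape (ListInt of length 2) - Shape [h, w] giving the size of the image passed to the network. Used to ensure that the converted bbox shape does not exceed "max_shape".
- wh_ratio_clip (Float) - The values of "dw" and "dh" fall within (-wh_ratio_clip, wh_ratio_clip).
Constraints:
None

Example: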

>>> rois = torch.tensor([[1., 2., 3., 4.], [3.,4., 5., 6.]], dtype = torch.float32).to("npu")
>>> deltas = torch.tensor([[5., 6., 7., 8.], [7.,8., 9., 6.]], dtype = torch.float32).to("npu")
>>> output = torch_npu.npu_bounding_box_decode(rois, deltas, 0, 0, 0, 0, 1, 1, 1, 1, (10, 10), 0.1)
>>> output
tensor([[2.5000, 6.5000, 9.0000, 9.0000],
        [9.0000, 9.0000, 9.0000, 9.0000]], device='npu:0')

torch_npu.npu_gru(input, hx, weight_input, weight_hidden, bias_input, bias_hidden, seq_length, has_biases, num_layers, dropout, train, bidirectional, batch_first) -> (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor)

Computes DynamicGRUV2.

Parameters:
- input (Tensor) - Data type: float16. Format: FRACTAL_NZ.
- hx (Tensor) - Data type: float16, float32. Format: FRACTAL_NZ.
- weight_input (Tensor) - Data type: float16. Format: FRACTAL_Z.
- weight_hidden (Tensor) - Data type: float16. Format: FRACTAL_Z.
- bias_input (Tensor) - Data type: float16, float32. Format: ND.
- bias_hidden (Tensor) - Data type: float16, float32. Format: ND.
- seq_length (Tensor) - Data type: int32. Format: ND.
- has_biases (Bool, default True)
- num_layers (Int)
- dropout (Float)
- train (Bool, default True) - Bool flag indicating whether the op runs in training mode.
- bidirectional (Bool, default True)
- batch_first (Bool, default True)
Returns:
- y (Tensor) - Data type: float16, float32. Format: FRACTAL_NZ.
- output_h (Tensor) - Data type: float16, float32. Format: FRACTAL_NZ.
- update (Tensor) - Data type: float16, float32. Format: FRACTAL_NZ.
- reset (Tensor) - Data type: float16, float32. Format: FRACTAL_NZ.
- new (Tensor) - Data type: float16, float32. Format: FRACTAL_NZ.
- hidden_new (Tensor) - Data type: float16, float32. Format: FRACTAL_NZ.
Constraints:
None
Example:
None

torch_npu.npu_random_choice_with_mask(x, count=256, seed=0, seed2=0) -> (Tensor, Tensor)

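Shuffles the indices of the non-zero elements.

Parameters:
- x (Tensor) - Input tensor.
- count (Int, default 256) - Output count. If 0, all non-zero elements are output.
- seed (Int, default 0) - Data type: int32, int64.
- seed2 (Int, default 0) - Data type: int32, int64.
Returns:
- y (Tensor) - 2-D tensor holding the indices of the non-zero elements.
- mask (Tensor) - 1-D tensor indicating whether the corresponding index is valid.
Constraints:
None

Example: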

>>> x = torch.tensor([1, 0, 1, 0], dtype=torch.bool).to("npu")
>>> result, mask = torch_npu.npu_random_choice_with_mask(x, 2, 1, 0)
>>> result
tensor([[0],
        [2]], device='npu:0', dtype=torch.int32)
>>> mask
tensor([True, True], device='npu:0')

torch_npu.npu_batch_nms(self, scores, score_threshold, iou_threshold, max_size_per_class, max_total_size, change_coordinate_frame=False, transpose_box=False) -> (Tensor, Tensor, Tensor, Tensor)

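Computes scores for the input boxes by batch and class, sorts them by score, and removes the boxes whose overlap exceeds the threshold (iou_threshold). Multi-batch and multi-class processing is supported. The NonMaxSuppression (nms) operation removes redundant input boxes and improves detection accuracy. NonMaxSuppression suppresses elements that are not maxima and searches for local maxima. It is commonly used in detection models for computer vision tasks.

Parameters:
- self (Tensor) - Required. Tensor of input boxes, including the batch size. Data type: Float16. Example input: [batch_size, num_anchors, q, 4], where q=1 or q=num_classes.
- scores (Tensor) - Required. Input tensor. Data type: Float16. Example input: [batch_size, num_anchors, num_classes].
- score_threshold (Float32) - Required. Threshold of the score filter, used to screen the boxes and remove those with lower scores. Data type: Float32.
- iou_threshold (Float32) - Required. iou_threshold of nms, used to remove the boxes above the threshold. Data type: Float32.
- max_size_per_class (Int) - Required. Maximum number of boxes that can be selected per class. Data type: Int.
- max_total_size (Int) - Required. Maximum number of boxes that can be selected per batch. Data type: Int.
- change_coordinate_frame (Bool, default False) - Optional. Whether to normalize the coordinate matrix of the output boxes. Data type: Bool.
- transpose_box (Bool, default False) - Optional. Whether a transpose is inserted before this op. Data type: Bool. True means the boxes use the 4,N layout. False means the boxes use the N,4 layout.
Returns:
- nmsed_boxes (Tensor) - 3-D tensor of shape (batch, max_total_size, 4) holding the nms boxes output for each batch. Data type: Float16.
- nmsed_scores (Tensor) - 2-D tensor of shape (batch, max_total_size) holding the nms scores output for each batch. Data type: Float16.
- nmsed_classes (Tensor) - 2-D tensor of shape (batch, max_total_size) holding the nms classes output for each batch. Data type: Float16.
- nmsed_num (Tensor) - 1-D tensor of shape (batch) giving the number of valid nmsed_boxes. Data type: Int32.
Constraints:
None

Example: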

>>> boxes = torch.randn(8, 2, 4, 4, dtype = torch.float32).to("npu")
>>> scores = torch.randn(3, 2, 4, dtype = torch.float32).to("npu")
>>> nmsed_boxes, nmsed_scores, nmsed_classes, nmsed_num = torch_npu.npu_batch_nms(boxes, scores, 0.3, 0.5, 3, 4)
>>> nmsed_boxes
>>> nmsed_scores
>>> nmsed_classes
>>> nmsed_num

torch_npu.npu_slice(self, offsets, size) -> Tensor

Extracts a slice from a tensor.

Parameters:
- self (Tensor) - Input tensor.
- offsets (ListInt) - Data type: int32, int64.
- size (ListInt) - Data type: int32, int64.
Constraints:
None

Example:

>>> input = torch.tensor([[1,2,3,4,5], [6,7,8,9,10]], dtype=torch.float16).to("npu")
>>> offsets = [0, 0]
>>> size = [2, 2]
>>> output = torch_npu.npu_slice(input, offsets, size)
>>> output
tensor([[1., 2.],
        [6., 7.]], device='npu:0', dtype=torch.float16)
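
The roles of offsets and size in the example above can be illustrated with ordinary Python slicing on a 2-D list: offsets gives the starting index in each dimension and size gives the number of elements taken from there. This reading is inferred from the example, and the function name slice_2d is hypothetical.

```python
def slice_2d(rows, offsets, size):
    """Take size[d] elements starting at offsets[d] in each of the two dimensions."""
    row_start, col_start = offsets
    n_rows, n_cols = size
    return [
        row[col_start:col_start + n_cols]
        for row in rows[row_start:row_start + n_rows]
    ]


data = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
print(slice_2d(data, offsets=[0, 0], size=[2, 2]))
print(slice_2d(data, offsets=[1, 2], size=[1, 3]))
```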

torch_npu._npu_dropout(self, p) -> (Tensor, Tensor)

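Computes the dropout result without using a seed. Similar to torch.dropout, with an implementation optimized for NPU devices.

Parameters:
- self (Tensor) - Input tensor.
- p (Float) - Dropout probability.
Constraints:
None

Example: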

>>> input = torch.tensor([1.,2.,3.,4.]).npu()
>>> input
tensor([1., 2., 3., 4.], device='npu:0')
>>> prob = 0.3
>>> output, mask = torch_npu._npu_dropout(input, prob)
>>> output
tensor([0.0000, 2.8571, 0.0000, 0.0000], device='npu:0')
>>> mask
tensor([ 98, 255, 188, 186, 120, 157, 175, 159,  77, 223, 127,  79, 247, 151,
      253, 255], device='npu:0', dtype=torch.uint8)
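
The value 2.8571 in the output above is 2 / (1 - 0.3): elements that are kept are scaled by 1 / (1 - p) and the rest are set to zero. The sketch below shows only this scaling rule on a Python list with an explicit keep mask. The Boolean-list mask and the function name are assumptions for the illustration; the operator itself returns a uint8 mask, whose encoding is not described in the text.

```python
def dropout_with_mask(values, p, keep):
    """Scale kept elements by 1 / (1 - p) and zero the dropped ones."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    scale = 1.0 / (1.0 - p)
    return [x * scale if k else 0.0 for x, k in zip(values, keep)]


# Keep only the second element, as in the example output above.
output = dropout_with_mask([1.0, 2.0, 3.0, 4.0], 0.3, [False, True, False, False])
print([round(x, 4) for x in output])
```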

torch_npu.npu_indexing(self, begin, end, strides, begin_mask=0, end_mask=0, ellipsis_mask=0, new_axis_mask=0, shrink_axis_mask=0) -> Tensor

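Computes the indexing result using the "begin,end,strides" arrays.

Parameters:
- self (Tensor) - Input tensor.
- begin (ListInt) - Index of the first value to select.
- end (ListInt) - Index of the last value to select.
- strides (ListInt) - Index increment.
- begin_mask (Int, default 0) - Bitmask in which bit "i" being "1" means that the begin value is ignored and the largest possible interval is used instead.
- end_mask (Int, default 0) - Analogous to "begin_mask".
- ellipsis_mask (Int, default 0) - Bitmask in which bit "i" being "1" means that the "i"th position is actually an ellipsis.
- new_axis_mask (Int, default 0) - Bitmask in which bit "i" being "1" means that a new 1D shape is created at the "i"th position.
- shrink_axis_mask (Int, default 0) - Bitmask in which bit "i" means that the "i"th position should shrink the dimensionality.
Constraints:
None

Example: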

>>> input = torch.tensor([[1, 2, 3, 4],[5, 6, 7, 8]], dtype=torch.int32).to("npu")
>>> input
tensor([[1, 2, 3, 4],
        [5, 6, 7, 8]], device='npu:0', dtype=torch.int32)
>>> output = torch_npu.npu_indexing(input, [0, 0], [2, 2], [1, 1])
>>> output
tensor([[1, 2],
        [5, 6]], device='npu:0', dtype=torch.int32)

torch_npu.npu_ifmr(Tensor data, Tensor data_min, Tensor data_max, Tensor cumsum, float min_percentile, float max_percentile, float search_start, float search_end, float search_step, bool with_offset) -> (Tensor, Tensor)

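Computes the ifmr result.

Parameters:
- data (Tensor) - Feature map tensor.
- data_min (Tensor) - Tensor holding the minimum value of the feature map.
- data_max (Tensor) - Tensor holding the maximum value of the feature map.
- cumsum (Tensor) - Tensor of cumsum bin data.
- min_percentile (Float) - Minimum initialization percentile.
- max_percentile (Float) - Maximum initialization percentile.
- search_start (Float) - Search start.
- search_end (Float) - Search end.
- search_step (Float) - Search step.
- with_offset (Bool) - Whether an offset is used.
Returns:
- scale (Tensor) - Optimal scale.
- offset (Tensor) - Optimal offset.
Constraints:
None

Example: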

>>> input = torch.rand((2,2,3,4),dtype=torch.float32).npu()
>>> input
tensor([[[[0.4508, 0.6513, 0.4734, 0.1924],
          [0.0402, 0.5502, 0.0694, 0.9032],
          [0.4844, 0.5361, 0.9369, 0.7874]],
        [[0.5157, 0.1863, 0.4574, 0.8033],
          [0.5986, 0.8090, 0.7605, 0.8252],
          [0.4264, 0.8952, 0.2279, 0.9746]]],
        [[[0.0803, 0.7114, 0.8773, 0.2341], 
         [0.6497, 0.0423, 0.8407, 0.9515], 
         [0.1821, 0.5931, 0.7160, 0.4968]],
          [[0.7977, 0.0899, 0.9572, 0.0146],
          [0.2804, 0.8569, 0.2292, 0.1118],
          [0.5747, 0.4064, 0.8370, 0.1611]]]], device='npu:0')  
>>> min_value = torch.min(input)  
>>> min_value  
tensor(0.0146, device='npu:0')
>>> max_value = torch.max(input)  
>>> max_value  
tensor(0.9746, device='npu:0')  
>>> hist = torch.histc(input.to('cpu'),
                         bins=128,
                         min=min_value.to('cpu'),
                         max=max_value.to('cpu'))  
>>> hist  
tensor([1., 0., 0., 2., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
          0., 1., 0., 0., 2., 1., 0., 0., 0., 0., 2., 1., 0., 0., 0., 0., 0., 1.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
          1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1.,
          0., 0., 1., 0., 0., 2., 0., 0., 0., 0., 0., 0., 2., 0., 0., 0., 0., 0., 
         0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 2., 0., 0.,
          1., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1.,
          0., 1.])
>>> cdf = torch.cumsum(hist,dim=0).int().npu()
>>> cdf  
tensor([ 1,  1,  1,  3,  3,  3,  3,  4,  5,  5,  6,  6,  7,  7,  7,  7,  7,  7,
          7,  8,  8,  8, 10, 11, 11, 11, 11, 11, 13, 14, 14, 14, 14, 14, 14, 15,
          15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16,
          17, 17, 17, 17, 18, 19, 19, 20, 21, 21, 22, 22, 23, 23, 23, 24, 24, 25, 
         25, 25, 26, 26, 26, 28, 28, 28, 28, 28, 28, 28, 30, 30, 30, 30, 30, 30, 
         30, 30, 31, 32, 32, 32, 32, 32, 32, 33, 33, 33, 33, 34, 35, 37, 37, 37,
          38, 39, 40, 40, 41, 41, 41, 42, 42, 43, 44, 44, 44, 44, 45, 45, 46, 47,
          47, 48], device='npu:0', dtype=torch.int32)  
>>> scale, offset = torch_npu.npu_ifmr(input,
                                     min_value, 
                                    max_value,  
                                   cdf, 
                                    min_percentile=0.999999, 
                                    max_percentile=0.999999,
                                     search_start=0.7, 
                                    search_end=1.3, 
                                    search_step=0.01,
                                     with_offset=False)  
>>> scale
tensor(0.0080, device='npu:0')
>>> offset
tensor(0., device='npu:0')

torch_npu.npu_max(self, dim, keepdim=False) -> (Tensor, Tensor)

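Computes the maximum along dim. Similar to torch.max, with an implementation optimized for NPU devices.

Parameters:
- self (Tensor) - Input tensor.
- dim (Int) - Dimension to reduce.
- keepdim (Bool, default False) - Whether the output tensor retains dim.
Returns:
- values (Tensor) - Maximum values in the input tensor.
- indices (Tensor) - Indices of the maximum values in the input tensor.
Constraints:
None

Example: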

>>> input = torch.randn(2, 2, 2, 2, dtype = torch.float32).npu()
>>> input
tensor([[[[-1.8135,  0.2078],
          [-0.6678,  0.7846]],

        [[ 0.6458, -0.0923],
          [-0.2124, -1.9112]]],

        [[[-0.5800, -0.4979], 
         [ 0.2580,  1.1335]],

          [[ 0.6669,  0.1876],
          [ 0.1160, -0.1061]]]], device='npu:0')
>>> outputs, indices = torch_npu.npu_max(input, 2)
>>> outputs
tensor([[[-0.6678,  0.7846],
        [ 0.6458, -0.0923]],

        [[ 0.2580,  1.1335],
        [ 0.6669,  0.1876]]], device='npu:0')
>>> indices
tensor([[[1, 1],
        [0, 0]],

        [[1, 1],
        [0, 0]]], device='npu:0', dtype=torch.int32)

torch_npu.npu_min(self, dim, keepdim=False) -> (Tensor, Tensor)

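Computes the minimum values along dimension dim. Similar to torch.min, with an implementation optimized for NPU devices.

Parameters:
- self (Tensor) - Input tensor.
- dim (Int) - Dimension to reduce.
- keepdim (Bool, default: False) - Whether the output tensors retain dim.
Returns:
- values (Tensor) - Minimum values of the input tensor along dim.
- indices (Tensor) - Indices of the minimum values in the input tensor.
Constraints:
None

Example: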

>>> input = torch.randn(2, 2, 2, 2, dtype = torch.float32).npu()
>>> input
tensor([[[[-0.9909, -0.2369],
          [-0.9569, -0.6223]],

         [[ 0.1157, -0.3147],
          [-0.7761,  0.1344]]],

        [[[ 1.6292,  0.5953],
          [ 0.6940, -0.6367]],

         [[-1.2335,  0.2131],
          [ 1.0748, -0.7046]]]], device='npu:0')
>>> outputs, indices = torch_npu.npu_min(input, 2)
>>> outputs
tensor([[[-0.9909, -0.6223],
        [-0.7761, -0.3147]],

        [[ 0.6940, -0.6367],
        [-1.2335, -0.7046]]], device='npu:0')
>>> indices
tensor([[[0, 1],
        [1, 0]],

        [[1, 1],
        [0, 1]]], device='npu:0', dtype=torch.int32)

torch_npu.npu_scatter(self, indices, updates, dim) -> Tensor

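Computes the scatter result along dimension dim. Similar to torch.scatter, with an implementation optimized for NPU devices.

Parameters:
- self (Tensor) - Input tensor.
- indices (Tensor) - Indices of the elements to scatter. It may be empty or have the same number of dimensions as the source tensor updates. If it is empty, the operation returns self unchanged.
- updates (Tensor) - Source elements to scatter.
- dim (Int) - Axis along which to index.
Constraints:
None

Example: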

>>> input = torch.tensor([[1.6279, 0.1226], [0.9041, 1.0980]]).npu()
>>> input
tensor([[1.6279, 0.1226],
        [0.9041, 1.0980]], device='npu:0')
>>> indices = torch.tensor([0, 1], dtype=torch.int32).npu()
>>> indices
tensor([0, 1], device='npu:0', dtype=torch.int32)
>>> updates = torch.tensor([-1.1993, -1.5247]).npu()
>>> updates
tensor([-1.1993, -1.5247], device='npu:0')
>>> dim = 0
>>> output = torch_npu.npu_scatter(input, indices, updates, dim)
>>> output
tensor([[-1.1993,  0.1226],
        [ 0.9041, -1.5247]], device='npu:0')

torch_npu.npu_layer_norm_eval(input, normalized_shape, weight=None, bias=None, eps=1e-05) -> Tensor

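Computes the layer normalization result. Equivalent to torch.nn.functional.layer_norm, with an implementation optimized for NPU devices.

Parameters:
- input (Tensor) - Input tensor.
- normalized_shape (ListInt) - Shape of the trailing input dimensions to normalize over, taken from an expected input of that size.
- weight (Tensor, optional, default: None) - gamma tensor.
- bias (Tensor, optional, default: None) - beta tensor.
- eps (Float, default: 1e-5) - Value ε added to the denominator for numerical stability.
Constraints:
None

Example: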

>>> input = torch.rand((6, 4), dtype=torch.float32).npu()
>>> input
tensor([[0.1863, 0.3755, 0.1115, 0.7308],
        [0.6004, 0.6832, 0.8951, 0.2087],
        [0.8548, 0.0176, 0.8498, 0.3703],
        [0.5609, 0.0114, 0.5021, 0.1242],
        [0.3966, 0.3022, 0.2323, 0.3914],
        [0.1554, 0.0149, 0.1718, 0.4972]], device='npu:0')
>>> normalized_shape = input.size()[1:]
>>> normalized_shape
torch.Size([4])
>>> weight = torch.Tensor(*normalized_shape).npu()
>>> weight
tensor([        nan,  6.1223e-41, -8.3159e-20,  9.1834e-41], device='npu:0')
>>> bias = torch.Tensor(*normalized_shape).npu()
>>> bias
tensor([5.6033e-39, 6.1224e-41, 6.1757e-39, 6.1224e-41], device='npu:0')
>>> output = torch_npu.npu_layer_norm_eval(input, normalized_shape, weight, bias, 1e-5)
>>> output
tensor([[        nan,  6.7474e-41,  8.3182e-20,  2.0687e-40],
        [        nan,  8.2494e-41, -9.9784e-20, -8.2186e-41],
        [        nan, -2.6695e-41, -7.7173e-20,  2.1353e-41],
        [        nan, -1.3497e-41, -7.1281e-20, -6.9827e-42],
        [        nan,  3.5663e-41,  1.2002e-19,  1.4314e-40],
        [        nan, -6.2792e-42,  1.7902e-20,  2.1050e-40]], device='npu:0')
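
In the example above, weight and bias are created with torch.Tensor(*normalized_shape), which leaves their memory uninitialized, so the printed output contains nan and arbitrary tiny values. The following is a minimal pure-Python sketch of the standard layer-norm formula over the last dimension, assuming the operator matches torch.nn.functional.layer_norm as stated. The helper name layer_norm_rows is hypothetical, and the weight and bias values are chosen for illustration.

```python
import math


def layer_norm_rows(rows, weight=None, bias=None, eps=1e-5):
    """Normalize each row of a 2-D nested list (hypothetical host-side reference)."""
    out = []
    for row in rows:
        n = len(row)
        mean = sum(row) / n
        var = sum((v - mean) ** 2 for v in row) / n   # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        normed = [(v - mean) * inv_std for v in row]
        if weight is not None:
            normed = [a * w for a, w in zip(normed, weight)]
        if bias is not None:
            normed = [a + b for a, b in zip(normed, bias)]
        out.append(normed)
    return out


# First two rows of the input shown above, with gamma = 1 and beta = 0.
x = [[0.1863, 0.3755, 0.1115, 0.7308],
     [0.6004, 0.6832, 0.8951, 0.2087]]
y = layer_norm_rows(x, weight=[1.0] * 4, bias=[0.0] * 4)
print(y)
```

With gamma set to 1 and beta set to 0, each output row has a mean close to 0 and a variance close to 1.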

torch_npu.npu_alloc_float_status(self) -> Tensor

Produces a 1-D tensor containing eight zeros.

Parameters:
- self (Tensor) - Any tensor.
Constraints:
None

Example:

>>> input = torch.randn([1, 2, 3]).npu()
>>> output = torch_npu.npu_alloc_float_status(input)
>>> input
tensor([[[ 2.2324,  0.2478, -0.1056],
        [ 1.1273, -0.2573,  1.0558]]], device='npu:0')
>>> output
tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')

torch_npu.npu_get_float_status(self) -> Tensor

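Runs the npu_get_float_status operator and returns the float status tensor.

Parameters:
- self (Tensor) - Tensor holding the data memory address; data type float32.
Constraints:
None

Example: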

>>> x = torch.rand(2).npu()
>>> torch_npu.npu_get_float_status(x)
tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')

torch_npu.npu_clear_float_status(self) -> Tensor

Sets the value at address 0x40000 to 0 in each core.

Parameters:
- self (Tensor) - Tensor of data type float32.
Constraints:
None

Example:

>>> x = torch.rand(2).npu()
>>> torch_npu.npu_clear_float_status(x)
tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='npu:0')

torch_npu.npu_confusion_transpose(self, perm, shape, transpose_first) -> Tensor

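Combines the reshape and transpose operations into a single operator.

Parameters:
- self (Tensor) - Data types: float16, float32, int8, int16, int32, int64, uint8, uint16, uint32, uint64.
- perm (ListInt) - Permutation of the dimensions of the self tensor.
- shape (ListInt) - Shape used by the reshape step.
- transpose_first (Bool) - If True, transpose is performed first; otherwise reshape is performed first.
Constraints:
None

Example: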

>>> x = torch.rand(2, 3, 4, 6).npu()
>>> x.shape
torch.Size([2, 3, 4, 6])
>>> y = torch_npu.npu_confusion_transpose(x, (0, 2, 1, 3), (2, 4, 18), True)
>>> y.shape
torch.Size([2, 4, 18])
>>> y2 = torch_npu.npu_confusion_transpose(x, (0, 2, 1), (2, 12, 6), False)
>>> y2.shape
torch.Size([2, 6, 12])
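
The two output shapes in the example follow from the order of the two steps. The sketch below computes only the resulting shape, assuming the operator behaves as a plain reshape combined with a plain permute. The helper name confusion_transpose_shape is hypothetical.

```python
import math


def confusion_transpose_shape(in_shape, perm, shape, transpose_first):
    """Return the output shape of a combined reshape/transpose (hypothetical helper)."""
    if math.prod(in_shape) != math.prod(shape):
        raise ValueError("reshape target must keep the element count")
    if transpose_first:
        # permute the input dimensions first, then reshape to `shape`
        if len(perm) != len(in_shape):
            raise ValueError("perm must cover every input dimension")
        return list(shape)
    # reshape to `shape` first, then permute the reshaped dimensions
    if len(perm) != len(shape):
        raise ValueError("perm must cover every reshaped dimension")
    return [shape[p] for p in perm]


y_shape = confusion_transpose_shape((2, 3, 4, 6), (0, 2, 1, 3), (2, 4, 18), True)
y2_shape = confusion_transpose_shape((2, 3, 4, 6), (0, 2, 1), (2, 12, 6), False)
print(y_shape, y2_shape)
```

With transpose_first=True, perm has one entry per input dimension; with transpose_first=False, it has one entry per dimension of shape, which is why the second call in the example passes a 3-element perm.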

torch_npu.npu_bmmV2(self, mat2, output_sizes) -> Tensor

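Multiplies matrix "a" by matrix "b" to produce "a*b".

Parameters:
- self (Tensor) - Matrix tensor with 2 or more dimensions. Data types: float16, float32, int32. Formats: [ND, NHWC, FRACTAL_NZ].
- mat2 (Tensor) - Matrix tensor with 2 or more dimensions. Data types: float16, float32, int32. Formats: [ND, NHWC, FRACTAL_NZ].
- output_sizes (ListInt, default: []) - Output shape, used for the backward pass of matmul.
Constraints:
None

Example: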

>>> mat1 = torch.randn(10, 3, 4).npu()
>>> mat2 = torch.randn(10, 4, 5).npu()
>>> res = torch_npu.npu_bmmV2(mat1, mat2, [])
>>> res.shape
torch.Size([10, 3, 5])

torch_npu.fast_gelu(self) -> Tensor

Computes fast_gelu of the input tensor.

Parameters:
- self (Tensor) - Data types: float16, float32.
Constraints:
None

Example:

>>> x = torch.rand(2).npu()
>>> x
tensor([0.5991, 0.4094], device='npu:0')
>>> torch_npu.fast_gelu(x)
tensor([0.4403, 0.2733], device='npu:0')
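
The values printed above are consistent with the sigmoid approximation x * sigmoid(1.702 * x). This is inferred from the example output and is an assumption, not a statement about the exact formula used by the kernel. The helper name fast_gelu_ref is hypothetical.

```python
import math


def fast_gelu_ref(x):
    """x * sigmoid(1.702 * x), written to avoid overflow for large |x|."""
    z = 1.702 * x
    if z >= 0:
        return x / (1.0 + math.exp(-z))
    e = math.exp(z)
    return x * e / (1.0 + e)


# Inputs copied from the example above.
outputs = [fast_gelu_ref(v) for v in (0.5991, 0.4094)]
print([round(v, 4) for v in outputs])
```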

torch_npu.npu_deformable_conv2d(self, weight, offset, bias, kernel_size, stride, padding, dilation=[1,1,1,1], groups=1, deformable_groups=1, modulated=True) -> (Tensor, Tensor)

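Computes the deformed convolution output from the expected input.

Parameters:
- self (Tensor) - 4-D tensor of the input image. Format "NHWC"; data is stored in the order [batch, in_height, in_width, in_channels].
- weight (Tensor) - 4-D tensor of learnable filters. Must have the same data type as self. Format "HWCN"; data is stored in the order [filter_height, filter_width, in_channels / groups, out_channels].
- offset (Tensor) - 4-D tensor of x-y coordinate offsets and masks. Format "NHWC"; data is stored in the order [batch, out_height, out_width, deformable_groups * filter_height * filter_width * 3].
- bias (Tensor, optional) - 1-D tensor of additive biases for the filter outputs, stored in the order [out_channels].
- kernel_size (ListInt of length 2) - Kernel size, a tuple/list of 2 integers.
- stride (ListInt) - List of 4 integers giving the sliding-window stride for each input dimension. The dimension order is interpreted according to the data format of self. The N and C dimensions must be set to 1.
- padding (ListInt) - List of 4 integers giving the number of pixels to add to each side of the input (top, bottom, left, right).
- dilation (ListInt, default: [1, 1, 1, 1]) - List of 4 integers giving the dilation factor for each input dimension. The dimension order is interpreted according to the data format of self. The N and C dimensions must be set to 1.
- groups (Int, default: 1) - A single int32 integer giving the number of blocked connections from input channels to output channels. in_channels and out_channels must both be divisible by groups.
- deformable_groups (Int, default: 1) - A single int32 integer giving the number of deformable group partitions. in_channels must be divisible by deformable_groups.
- modulated (Bool, optional, default: True) - Specifies the DeformableConv2D version. True means v2 and False means v1; currently only v2 is supported.
Constraints:
None

Example: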

>>> x = torch.rand(16, 32, 32, 32).npu()
>>> weight = torch.rand(32, 32, 5, 5).npu()
>>> offset = torch.rand(16, 75, 32, 32).npu()
>>> output, _ = torch_npu.npu_deformable_conv2d(x, weight, offset, None, kernel_size=[5, 5], stride = [1, 1, 1, 1], padding = [2, 2, 2, 2])
>>> output.shape
torch.Size([16, 32, 32, 32])
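
The channel count of offset in the example follows from the documented layout deformable_groups * filter_height * filter_width * 3. The sketch below checks that arithmetic, together with the spatial output size under the standard convolution size formula, which is an assumption here. Both helper names are hypothetical.

```python
def offset_channels(kernel_size, deformable_groups=1):
    """Channels needed by offset: x offsets, y offsets and masks per kernel tap."""
    kh, kw = kernel_size
    return deformable_groups * kh * kw * 3


def conv_out_size(in_size, kernel, stride=1, pad_begin=0, pad_end=0, dilation=1):
    """Standard convolution output size along one spatial dimension (assumed)."""
    return (in_size + pad_begin + pad_end - dilation * (kernel - 1) - 1) // stride + 1


channels = offset_channels([5, 5], deformable_groups=1)
out_h = conv_out_size(32, 5, stride=1, pad_begin=2, pad_end=2, dilation=1)
print(channels, out_h)
```

For the 5x5 kernel with padding 2 used above, this gives 75 offset channels and a 32-pixel output height, which agrees with the tensor shapes in the example.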

torch_npu.npu_mish(self) -> Tensor

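Computes the Mish activation of self element-wise, that is, self multiplied by the hyperbolic tangent of softplus(self).

Parameters:
- self (Tensor) - Data types: float16, float32.
Constraints:
None

Example: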

>>> x = torch.rand(10, 30, 10).npu()
>>> y = torch_npu.npu_mish(x)
>>> y.shape
torch.Size([10, 30, 10])
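
For reference, the usual definition of Mish is x * tanh(softplus(x)). The sketch below is a scalar pure-Python version of that definition, assuming the NPU operator follows it. The helper names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish_ref(x):
    """Mish activation for a single float (host-side reference)."""
    return x * math.tanh(softplus(x))


samples = [mish_ref(v) for v in (0.0, 1.0)]
print(samples)
```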

torch_npu.npu_anchor_response_flags(self, featmap_size, stride, num_base_anchors) -> Tensor

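Generates the responsible flags of the anchors in a single feature map.

Parameters:
- self (Tensor) - Ground-truth boxes, a 2-D tensor of shape [batch, 4].
- featmap_size (ListInt of length 2) - Size of the feature map.
- stride (ListInt of length 2) - Stride of the current level.
- num_base_anchors (Int) - Number of base anchors.
Constraints:
None

Example: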

>>> x = torch.rand(100, 4).npu()
>>> y = torch_npu.npu_anchor_response_flags(x, [60, 60], [2, 2], 9)
>>> y.shape
torch.Size([32400])

torch_npu.npu_yolo_boxes_encode(self, gt_bboxes, stride, performance_mode=False) -> Tensor

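Generates bounding boxes from YOLO anchor boxes and ground-truth boxes. A custom mmdetection operator.

Parameters:
- self (Tensor) - Anchor boxes generated from the YOLO training set. A 2-D tensor of shape (N, 4) with data type float32 or float16. "N" is the number of ROIs, and "4" stands for (tx, ty, tw, th).
- gt_bboxes (Tensor) - Transformation targets, such as ground-truth boxes. A 2-D tensor of shape (N, 4) with data type float32 or float16. "N" is the number of ROIs, and "4" stands for "dx", "dy", "dw", and "dh".
- stride (Tensor) - Scale of each box. A 1-D tensor of shape (N,) with data type int32. "N" is the number of ROIs.
- performance_mode (Bool, default: False) - Selects the performance mode, "high_precision" or "high_performance". True selects "high_performance"; False selects "high_precision". With float32 input, "high_precision" keeps the output error below 0.0001. With float16 input, "high_performance" gives the best operator performance, but the error is only guaranteed to stay below 0.005.
Constraints:
The maximum N supported for the input anchor boxes is 20480.

Example: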

>>> anchor_boxes = torch.rand(2, 4).npu()
>>> gt_bboxes = torch.rand(2, 4).npu()
>>> stride = torch.tensor([2, 2], dtype=torch.int32).npu()
>>> output = torch_npu.npu_yolo_boxes_encode(anchor_boxes, gt_bboxes, stride, False)
>>> output.shape
torch.Size([2, 4])

torch_npu.npu_grid_assign_positive(self, overlaps, box_responsible_flags, max_overlaps, argmax_overlaps, gt_max_overlaps, gt_argmax_overlaps, num_gts, pos_iou_thr, min_pos_iou, gt_max_assign_all) -> Tensor

Performs the position-sensitive region-of-interest pooling gradient computation.

Parameters:
- self (Tensor) - Tensor of type float16 or float32 with shape (n, ). Referred to below as assigned_gt_inds.
- overlaps (Tensor) - Same data type as assigned_gt_inds. IoU between gt_bboxes and bboxes, with shape (k, n).
- box_responsible_flags (Tensor) - Supports uint8. Flags indicating whether each box is responsible.
- max_overlaps (Tensor) - Same data type as assigned_gt_inds; equals overlaps.max(axis=0).
- argmax_overlaps (Tensor) - Supports uint32; equals overlaps.argmax(axis=0).
- gt_max_overlaps (Tensor) - Same data type as assigned_gt_inds; equals overlaps.max(axis=1).
- gt_argmax_overlaps (Tensor) - Supports uint32; equals overlaps.argmax(axis=1).
- num_gts (Tensor) - Supports uint32; the real k, with shape (1, ).
- pos_iou_thr (Float) - IoU threshold for positive boxes.
- min_pos_iou (Float) - Minimum IoU for a box to be considered positive.
- gt_max_assign_all (Bool) - Whether to assign all boxes that share the same highest overlap with a gt to that gt.
Constraints:
None

Example:

>>> assigned_gt_inds = torch.rand(4).npu()
>>> overlaps = torch.rand(2,4).npu()
>>> box_responsible_flags = torch.tensor([1, 1, 1, 0], dtype=torch.uint8).npu()
>>> max_overlap = torch.rand(4).npu()
>>> argmax_overlap = torch.tensor([1, 0, 1, 0], dtype=torch.int32).npu()
>>> gt_max_overlaps = torch.rand(2).npu()
>>> gt_argmax_overlaps = torch.tensor([1, 0],dtype=torch.int32).npu()
>>> output = torch_npu.npu_grid_assign_positive(assigned_gt_inds, overlaps, box_responsible_flags, max_overlap, argmax_overlap, gt_max_overlaps, gt_argmax_overlaps, 128, 0.5, 0., True)
>>> output.shape
torch.Size([4])

torch_npu.npu_normalize_batch(self, seq_len, normalize_type=0) -> Tensor

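Performs batch normalization.

Parameters:
- self (Tensor) - Supports float32, with shape (n, c, d).
- seq_len (Tensor) - Supports int32, with shape (n, ); gives the amount of data to normalize in each batch.
- normalize_type (Int, default: 0) - Supports "per_feature" or "all_features". 0 means "per_feature" and 1 means "all_features".
Constraints:
None

Example: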

>>> a=np.random.uniform(1,10,(2,3,6)).astype(np.float32)
>>> b=np.random.uniform(3,6,(2)).astype(np.int32)
>>> x=torch.from_numpy(a).to("npu")
>>> seqlen=torch.from_numpy(b).to("npu")
>>> out = torch_npu.npu_normalize_batch(x, seqlen, 0)
>>> out
tensor([[[ 1.1496, -0.6685, -0.4812,  1.7611, -0.5187,  0.7571],
        [ 1.1445, -0.4393, -0.7051,  1.0474, -0.2646, -0.1582],
        [ 0.1477,  0.9179, -1.0656, -6.8692, -6.7437,  2.8621]],

        [[-0.6880,  0.1337,  1.3623, -0.8081, -1.2291, -0.9410],
        [ 0.3070,  0.5489, -1.4858,  0.6300,  0.6428,  0.0433],
        [-0.5387,  0.8204, -1.1401,  0.8584, -0.3686,  0.8444]]],
      device='npu:0')

torch_npu.npu_masked_fill_range(self, start, end, value, axis=-1) -> Tensor

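Fills the parts of a tensor that are masked by range boxes along the same axis. A custom masked-fill-range operator.

Parameters:
- self (Tensor) - ND tensor of type float32/float16/int32/int8 with shape 1-D (D,), 2-D (N, D), or 3-D (N, D).
- start (Tensor) - Start positions of the masked fill. An int32 2-D tensor of shape (num, N).
- end (Tensor) - End positions of the masked fill. An int32 2-D tensor of shape (num, N).
- value (Tensor) - Fill values. A float32/float16/int32/int8 1-D tensor of shape (num,).
- axis (Int, default: -1) - Axis along which the masked fill is applied (int32).
Constraints:
None

Example: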

>>> a=torch.rand(4,4).npu()
>>> a
tensor([[0.9419, 0.4919, 0.2874, 0.6560],
        [0.6691, 0.6668, 0.0330, 0.1006],
        [0.3888, 0.7011, 0.7141, 0.7878],
        [0.0366, 0.9738, 0.4689, 0.0979]], device='npu:0')
>>> start = torch.tensor([[0,1,2]], dtype=torch.int32).npu()
>>> end = torch.tensor([[1,2,3]], dtype=torch.int32).npu()
>>> value = torch.tensor([1], dtype=torch.float).npu()
>>> out = torch_npu.npu_masked_fill_range(a, start, end, value, 1)
>>> out
tensor([[1.0000, 0.4919, 0.2874, 0.6560],
        [0.6691, 1.0000, 0.0330, 0.1006],
        [0.3888, 0.7011, 1.0000, 0.7878],
        [0.0366, 0.9738, 0.4689, 0.0979]], device='npu:0')
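
The example output can be reproduced on the host with the sketch below. The semantics are inferred from this single example and are an assumption: for each k and each row n listed in start[k], the half-open range [start[k][n], end[k][n]) along axis 1 is filled with value[k]. The helper name masked_fill_range_2d is hypothetical, and it covers only 2-D input with axis=1.

```python
def masked_fill_range_2d(rows, start, end, value):
    """Fill per-row column ranges of a 2-D nested list (hypothetical reference)."""
    out = [list(r) for r in rows]            # copy so the input stays unchanged
    for k, fill in enumerate(value):
        for n, (s, e) in enumerate(zip(start[k], end[k])):
            for j in range(s, e):            # half-open range [s, e)
                out[n][j] = fill
    return out


# Values copied from the example above.
a = [[0.9419, 0.4919, 0.2874, 0.6560],
     [0.6691, 0.6668, 0.0330, 0.1006],
     [0.3888, 0.7011, 0.7141, 0.7878],
     [0.0366, 0.9738, 0.4689, 0.0979]]
out = masked_fill_range_2d(a, [[0, 1, 2]], [[1, 2, 3]], [1.0])
print(out)
```

Only the first three rows are listed in start and end, so the fourth row is left untouched, as in the example output.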

torch_npu.npu_linear(input, weight, bias=None) -> Tensor

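Multiplies matrix "a" by matrix "b" to produce "a*b".

Parameters:
- input (Tensor) - 2-D matrix tensor. Data types: float32, float16, int32, int8. Formats: [ND, NHWC, FRACTAL_NZ].
- weight (Tensor) - 2-D matrix tensor. Data types: float32, float16, int32, int8. Formats: [ND, NHWC, FRACTAL_NZ].
- bias (Tensor, optional, default: None) - 1-D tensor. Data types: float32, float16, int32. Formats: [ND, NHWC].
Constraints:
None

Example: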

>>> x=torch.rand(2,16).npu()
>>> w=torch.rand(4,16).npu()
>>> b=torch.rand(4).npu()
>>> output = torch_npu.npu_linear(x, w, b)
>>> output
tensor([[3.6335, 4.3713, 2.4440, 2.0081],
        [5.3273, 6.3089, 3.9601, 3.2410]], device='npu:0')
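
The shapes in the example (input 2x16, weight 4x16, output 2x4) are consistent with the convention of torch.nn.functional.linear, where the output is input times the transpose of weight, plus bias. The sketch below assumes that convention. The helper name linear_ref and the small integer-valued inputs are hypothetical.

```python
def linear_ref(x, w, b=None):
    """Compute x @ w^T + b on nested lists (hypothetical host-side reference)."""
    out = []
    for row in x:
        # one dot product per weight row, i.e. per output feature
        y = [sum(a * c for a, c in zip(row, w_row)) for w_row in w]
        if b is not None:
            y = [v + bias for v, bias in zip(y, b)]
        out.append(y)
    return out


x = [[1.0, 2.0], [3.0, 4.0]]                # shape (2, 2)
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # shape (3, 2)
b = [0.5, 0.5, 0.5]                         # shape (3,)
y = linear_ref(x, w, b)
print(y)
```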

torch_npu.npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, step_size=None, adam_mode=0, *, out=(var, m, v))

Computes the Adam update result.

Parameters:
- var (Tensor) - Tensor of type float16 or float32.
- m (Tensor) - Same data type and shape as exp_avg.
- v (Tensor) - Same data type and shape as exp_avg.
- lr (Scalar) - Same data type as exp_avg.
- beta1 (Scalar) - Same data type as exp_avg.
- beta2 (Scalar) - Same data type as exp_avg.
- epsilon (Scalar) - Same data type as exp_avg.
- grad (Tensor) - Same data type and shape as exp_avg.
- max_grad_norm (Scalar) - Same data type as exp_avg.
- global_grad_norm (Scalar) - Same data type as exp_avg.
- weight_decay (Scalar) - Same data type as exp_avg.
- step_size (Tensor, optional, default: None) - Shape (1, ); same data type as exp_avg.
- adam_mode (Int, default: 0) - Selects the Adam mode. 0 means "adam" and 1 means "mbert_adam".
Keyword arguments:
- out (Tensor, optional) - Output tensors.
Constraints:
None

Example:

>>> var_in = torch.rand(321538).uniform_(-32., 21.).npu()
>>> m_in = torch.zeros(321538).npu()
>>> v_in = torch.zeros(321538).npu()
>>> grad = torch.rand(321538).uniform_(-0.05, 0.03).npu()
>>> max_grad_norm = -1.
>>> beta1 = 0.9
>>> beta2 = 0.99
>>> weight_decay = 0.
>>> lr = 0.
>>> epsilon = 1e-06
>>> global_grad_norm = 0.
>>> var_out, m_out, v_out = torch_npu.npu_bert_apply_adam(lr, beta1, beta2, epsilon, grad, max_grad_norm, global_grad_norm, weight_decay, out=(var_in, m_in, v_in))
>>> var_out
tensor([ 14.7733, -30.1218,  -1.3647,  ..., -16.6840,   7.1518,   8.4872],
      device='npu:0')

torch_npu.npu_giou(self, gtboxes, trans=False, is_cross=False, mode=0) -> Tensor

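First computes the smallest enclosing area of the two boxes and their IoU, then the fraction of the enclosing area that belongs to neither box, and finally subtracts that fraction from the IoU to obtain the GIoU.

Parameters:
- self (Tensor) - Bounding boxes, a 2-D tensor of shape (N, 4) with data type float16 or float32. "N" is the number of bounding boxes, and "4" stands for [x1, y1, x2, y2] or [x, y, w, h].
- gtboxes (Tensor) - Ground-truth boxes, a 2-D tensor of shape (M, 4) with data type float16 or float32. "M" is the number of ground-truth boxes, and "4" stands for [x1, y1, x2, y2] or [x, y, w, h].
- trans (Bool, default: False) - True means "xywh"; False means "xyxy".
- is_cross (Bool, default: False) - Controls whether the output shape is [M, N] or [1, N]. If True, the output shape is [M, N]; if False, it is [1, N].
- mode (Int, default: 0) - Computation mode, 0 or 1. 0 means IoU and 1 means IoF.
Constraints:
None

Example: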

>>> a=np.random.uniform(0,1,(4,10)).astype(np.float16)
>>> b=np.random.uniform(0,1,(4,10)).astype(np.float16)
>>> box1=torch.from_numpy(a).to("npu")
>>> box2=torch.from_numpy(a).to("npu")
>>> output = torch_npu.npu_giou(box1, box2, trans=True, is_cross=False, mode=0)
>>> output
tensor([[1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.]], device='npu:0', dtype=torch.float16)
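
The three steps in the description can be sketched for a single pair of boxes as follows. This is a pure-Python illustration in the "xyxy" layout with boxes of positive area, not the NPU kernel. The helper name giou_xyxy is hypothetical.

```python
def giou_xyxy(a, b):
    """GIoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # intersection and union give the IoU
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box and the part of it covered by neither box
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose


same = giou_xyxy((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
apart = giou_xyxy((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
print(same, apart)
```

Identical boxes give a GIoU of 1, which matches the all-ones output above, where box1 and box2 are built from the same array. Disjoint boxes give a negative value.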

torch_npu.npu_silu(self) -> Tensor

Computes the Swish (SiLU) activation of self.

Parameters:
- self (Tensor) - Data types: float16, float32.
Constraints:
None

Example:

>>> a=torch.rand(2,8).npu()
>>> output = torch_npu.npu_silu(a)
>>> output
tensor([[0.4397, 0.7178, 0.5190, 0.2654, 0.2230, 0.2674, 0.6051, 0.3522],
        [0.4679, 0.1764, 0.6650, 0.3175, 0.0530, 0.4787, 0.5621, 0.4026]],
       device='npu:0')
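
Swish with unit slope, also known as SiLU, is defined as x * sigmoid(x). The sketch below is a scalar pure-Python version of that definition, assuming the NPU operator follows it. The helper name silu_ref is hypothetical.

```python
import math


def silu_ref(x):
    """x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


samples = [silu_ref(v) for v in (0.0, 1.0, -1.0)]
print(samples)
```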

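torch_npu.npu_reshape(self, shape, can_refresh=False) -> Tensor

Reshapes a tensor. Only the tensor shape changes; its data is unchanged.

Parameters:
- self (Tensor) - Input tensor.
- shape (ListInt) - Shape of the output tensor.
- can_refresh (Bool, default: False) - Whether to refresh the reshape in place.
Constraints:
This operator cannot be called directly through the aclopExecute API.

Example: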

>>> a=torch.rand(2,8).npu()
>>> out=torch_npu.npu_reshape(a,(4,4))
>>> out
tensor([[0.6657, 0.9857, 0.7614, 0.4368],
        [0.3761, 0.4397, 0.8609, 0.5544],
        [0.7002, 0.3063, 0.9279, 0.5085],
        [0.1009, 0.7133, 0.8118, 0.6193]], device='npu:0')

torch_npu.npu_rotated_overlaps(self, query_boxes, trans=False) -> Tensor

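Computes the overlapping area of rotated boxes.

Parameters:
- self (Tensor) - Gradient delta data, a 3-D tensor of shape (B, 5, N) with data type float32.
- query_boxes (Tensor) - Bounding boxes, a 3-D tensor of shape (B, 5, K) with data type float32.
- trans (Bool, default: False) - True means "xyxyt"; False means "xywht".
Constraints:
None

Example: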

>>> a=np.random.uniform(0,1,(1,3,5)).astype(np.float16)
>>> b=np.random.uniform(0,1,(1,2,5)).astype(np.float16)
>>> box1=torch.from_numpy(a).to("npu")
>>> box2=torch.from_numpy(a).to("npu")
>>> output = torch_npu.npu_rotated_overlaps(box1, box2, trans=False)
>>> output
tensor([[[0.0000, 0.1562, 0.0000],
        [0.1562, 0.3713, 0.0611],
        [0.0000, 0.0611, 0.0000]]], device='npu:0', dtype=torch.float16)

torch_npu.npu_rotated_iou(self, query_boxes, trans=False, mode=0, is_cross=True, v_threshold=0.0, e_threshold=0.0) -> Tensor

Computes the IoU of rotated boxes.

Parameters:
- self (Tensor) - Gradient delta data, a 3-D tensor of shape (B, 5, N) with data type float32.
- query_boxes (Tensor) - Bounding boxes, a 3-D tensor of shape (B, 5, K) with data type float32.
- trans (Bool, default: False) - True means "xyxyt"; False means "xywht".
- is_cross (Bool, default: True) - True means cross computation; False means one-to-one computation.
- mode (Int, default: 0) - Computation mode, 0 or 1. 0 means IoU and 1 means IoF.
- v_threshold (Float, optional, default: 0.0) - Provides condition relaxation for the intersection calculation.
- e_threshold (Float, optional, default: 0.0) - Provides condition relaxation for the intersection calculation.
Constraints:
None

Example:

>>> a=np.random.uniform(0,1,(2,2,5)).astype(np.float16)
>>> b=np.random.uniform(0,1,(2,3,5)).astype(np.float16)
>>> box1=torch.from_numpy(a).to("npu")
>>> box2=torch.from_numpy(a).to("npu")
>>> output = torch_npu.npu_rotated_iou(box1, box2, trans=False, mode=0, is_cross=True)
>>> output
tensor([[[3.3325e-01, 1.0162e-01],
        [1.0162e-01, 1.0000e+00]],

        [[0.0000e+00, 0.0000e+00],
        [0.0000e+00, 5.9605e-08]]], device='npu:0', dtype=torch.float16)

torch_npu.npu_rotated_box_encode(anchor_box, gt_bboxes, weight) -> Tensor

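Encodes rotated bounding boxes.

Parameters:
- anchor_box (Tensor) - 3-D input tensor of shape (B, 5, N) representing the anchor boxes. "B" is the batch size, "N" is the number of bounding boxes, and "5" stands for "x0", "x1", "y0", "y1", and "angle".
- gt_bboxes (Tensor) - 3-D tensor of shape (B, 5, N) with data type float32 (float16).
- weight (Tensor, default: [1.0, 1.0, 1.0, 1.0, 1.0]) - List of floats for "x0", "x1", "y0", "y1", and "angle".
Constraints:
None

Example: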

>>> anchor_boxes = torch.tensor([[[30.69], [32.6], [45.94], [59.88], [-44.53]]], dtype=torch.float16).to("npu")
>>> gt_bboxes = torch.tensor([[[30.44], [18.72], [33.22], [45.56], [8.5]]], dtype=torch.float16).to("npu")
>>> weight = torch.tensor([1., 1., 1., 1., 1.], dtype=torch.float16).npu()
>>> out = torch_npu.npu_rotated_box_encode(anchor_boxes, gt_bboxes, weight)
>>> out
tensor([[[-0.4253],
         [-0.5166],
         [-1.7021],
         [-0.0162],
         [ 1.1328]]], device='npu:0', dtype=torch.float16)

torch_npu.npu_rotated_box_decode(anchor_boxes, deltas, weight) -> Tensor

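Decodes rotated bounding boxes.

Parameters:
- anchor_boxes (Tensor) - 3-D input tensor of shape (B, 5, N) representing the anchor boxes. "B" is the batch size, "N" is the number of bounding boxes, and "5" stands for "x0", "x1", "y0", "y1", and "angle".
- deltas (Tensor) - 3-D tensor of shape (B, 5, N) with data type float32 (float16).
- weight (Tensor, default: [1.0, 1.0, 1.0, 1.0, 1.0]) - List of floats for "x0", "x1", "y0", "y1", and "angle".
Constraints:
None

Example: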

>>> anchor_boxes = torch.tensor([[[4.137], [33.72], [29.4], [54.06], [41.28]]], dtype=torch.float16).to("npu")
>>> deltas = torch.tensor([[[0.0244], [-1.992], [0.2109], [0.315], [-37.25]]], dtype=torch.float16).to("npu")
>>> weight = torch.tensor([1., 1., 1., 1., 1.], dtype=torch.float16).npu()
>>> out = torch_npu.npu_rotated_box_decode(anchor_boxes, deltas, weight)
>>> out
tensor([[[  1.7861],
         [-10.5781],
         [ 33.0000],
         [ 17.2969],
         [-88.4375]]], device='npu:0', dtype=torch.float16)

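torch_npu.npu_ciou(self, gtboxes, trans=False, is_cross=True, mode=0, atan_sub_flag=False) -> Tensor

Applies an NPU-based CIoU operation. CIoU is proposed by adding a penalty term on top of DIoU.

Note:
In the current version, CIoU backward supports only trans==True, is_cross==False, and mode==0 ('iou'). If backpropagation is needed, make sure the parameters are set accordingly.
Parameters:
- self (Tensor) - Predicted boxes in xywh format, with shape (4, n).
- gtboxes (Tensor) - Corresponding ground-truth boxes, with shape (4, n).
- trans (Bool, default: False) - Whether there is an offset.
- is_cross (Bool, default: True) - Whether there is a cross operation between box1 and box2.
- mode (Int, default: 0) - Selects the CIoU computation mode. 0 means IoU and 1 means IoF.
- atan_sub_flag (Bool, default: False) - Whether to pass the second value of the forward pass to the backward pass.
Returns:
torch.Tensor - Result of the CIoU operation.

Example: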

>>> box1 = torch.randn(4, 32).npu()
>>> box1.requires_grad = True
>>> box2 = torch.randn(4, 32).npu()
>>> box2.requires_grad = True
>>> ciou = torch_npu.contrib.function.npu_ciou(box1, box2)
>>> l = ciou.sum()
>>> l.backward()

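torch_npu.npu_diou(self, gtboxes, trans=False, is_cross=False, mode=0) -> Tensor

Applies an NPU-based DIoU operation. It accounts for the distance between targets and for the overlap ratio of distance and extent, so that the regression of different targets or boundaries tends to be stable.

Note:
In the current version, DIoU backward supports only trans==True, is_cross==False, and mode==0 ('iou'). If backpropagation is needed, make sure the parameters are set accordingly.
Parameters:
- self (Tensor) - Predicted boxes in xywh format, with shape (4, n).
- gtboxes (Tensor) - Corresponding ground-truth boxes, with shape (4, n).
- trans (Bool, default: False) - Whether there is an offset.
- is_cross (Bool, default: False) - Whether there is a cross operation between box1 and box2.
- mode (Int, default: 0) - Selects the DIoU computation mode. 0 means IoU and 1 means IoF.
Returns:
torch.Tensor - Result of the DIoU operation.

Example: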

>>> box1 = torch.randn(4, 32).npu()
>>> box1.requires_grad = True
>>> box2 = torch.randn(4, 32).npu()
>>> box2.requires_grad = True
>>> diou = torch_npu.contrib.function.npu_diou(box1, box2)
>>> l = diou.sum()
>>> l.backward()
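
For reference, DIoU is commonly defined as the IoU minus the squared distance between the box centers divided by the squared diagonal of the smallest enclosing box. The sketch below computes it for one pair of boxes in "xyxy" layout with positive area. The layout and the helper name diou_xyxy are assumptions for illustration, not the operator's interface.

```python
def diou_xyxy(a, b):
    """DIoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # squared distance between the two box centers
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # squared diagonal of the smallest enclosing box
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2
    return inter / union - d2 / c2


same = diou_xyxy((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
shifted = diou_xyxy((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
print(same, shifted)
```

Here the shifted pair has IoU 1/3, center distance 1, and an enclosing diagonal squared of 13, giving 1/3 - 1/13.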

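torch_npu.npu_sign_bits_pack(self, size) -> Tensor

Packs float-type 1-bit Adam data into uint8.

Parameters:
- self (Tensor) - 1-D float tensor.
- size (Int) - First dimension of the output tensor after reshape.
Constraints:
The packed output size must be divisible by size. If the size of self is divisible by 8, the output size is (size of self) / 8; otherwise it is (size of self // 8) + 1, and float values of -1 are added at the little-endian positions as padding. Atlas training series products support float32 and float16 inputs. Atlas inference series products (Ascend 310P processor) support float32 and float16 inputs. Atlas 200/300/500 inference products support only float16 inputs.

Example: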

>>> a = torch.tensor([5, 4, 3, 2, 0, -1, -2, 4, 3, 2, 1, 0, -1, -2], dtype=torch.float32).npu()
>>> b = torch_npu.npu_sign_bits_pack(a, 2)
>>> b
tensor([[159],[15]], device='npu:0')
(The binary form of 159 is 0b10011111; from the most significant bit, its bits correspond to 4, -2, -1, 0, 2, 3, 4, 5 respectively.)
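
The packing rule can be reproduced on the host with the sketch below. The bit convention is inferred from the example and is an assumption: non-negative values (including 0) map to bit 1, negative values map to bit 0, the first element of each group of eight lands in the least significant bit, and padding uses the sign bit of -1. The helper name sign_bits_pack_ref is hypothetical.

```python
def sign_bits_pack_ref(values, size):
    """Pack the signs of a 1-D float list into bytes, reshaped to `size` rows."""
    bits = [1 if v >= 0 else 0 for v in values]
    bits += [0] * (-len(bits) % 8)           # pad with the sign bit of -1
    packed = []
    for i in range(0, len(bits), 8):
        byte = 0
        for pos, bit in enumerate(bits[i:i + 8]):
            byte |= bit << pos               # first element -> least significant bit
        packed.append(byte)
    if len(packed) % size:
        raise ValueError("packed length must be divisible by size")
    per_row = len(packed) // size
    return [packed[r * per_row:(r + 1) * per_row] for r in range(size)]


# Input values copied from the example above.
a = [5, 4, 3, 2, 0, -1, -2, 4, 3, 2, 1, 0, -1, -2]
b = sign_bits_pack_ref(a, 2)
print(b)
```

The 14 input values are padded to 16, so two bytes are produced and reshaped into two rows of one byte each.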

torch_npu.npu_sign_bits_unpack(x, dtype, size) -> Tensor

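Unpacks uint8-type 1-bit Adam data into float.

Parameters:
- x (Tensor) - 1-D uint8 tensor.
- dtype (torch.dtype) - A value of 1 sets the output type to float16; a value of 0 sets it to float32.
- size (Int) - First dimension of the output tensor after reshape.
Constraints:
The unpacked output size must be divisible by size. The output size is (size of x) * 8.

Example: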

>>> a = torch.tensor([159, 15], dtype=torch.uint8).npu()
>>> b = torch_npu.npu_sign_bits_unpack(a, 0, 2)
>>> b
tensor([[1., 1., 1., 1., 1., -1., -1., 1.],
        [1., 1., 1., 1., -1., -1., -1., -1.]], device='npu:0')
(The binary form of 159 is 0b10011111, and the binary form of 15 is 0b00001111.)
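
The unpacking rule can be reproduced on the host with the sketch below. The bit convention is inferred from the example and is an assumption: each byte is read from its least significant bit, bit 1 becomes 1.0 and bit 0 becomes -1.0. The helper name sign_bits_unpack_ref is hypothetical.

```python
def sign_bits_unpack_ref(packed, size):
    """Unpack bytes into +1.0 / -1.0 signs, reshaped to `size` rows."""
    signs = []
    for byte in packed:
        for pos in range(8):                 # least significant bit first
            signs.append(1.0 if (byte >> pos) & 1 else -1.0)
    if len(signs) % size:
        raise ValueError("unpacked length must be divisible by size")
    per_row = len(signs) // size
    return [signs[r * per_row:(r + 1) * per_row] for r in range(size)]


# Input values copied from the example above.
b = sign_bits_unpack_ref([159, 15], 2)
print(b)
```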