Error "terminate called after throwing an instance of 'c10::Error' what(): 0 INTERNAL ASSERT" when running a model - Ascend Community

Source	Product category	Keywords
Official	Model training	--

Problem description

Error text:

terminate called after throwing an instance of 'c10::Error'
  what(): 0 INTERNAL ASSERT FAILED at /***/pytorch/c10/npu/NPUStream.cpp:146, please report a bug to PyTorch. Could not compute stream ID for 0xffff9f77fd28 on device -1 (something has gone horribly wrong!) (NPUStream_getStreamId at /***/pytorch/c10/npu/NPUStream.cpp:146
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x74 (0xffffa0c11fe4 in /usr/local/lib64/python3.7/site-packages/torch/lib/libc10.so)
……

Cause analysis

The error occurs when the following code is run.

import torch
import torch_npu

def test_cpu():
    input = torch.randn(2000, 1000).detach().requires_grad_()
    output = torch.sum(input)
    output.backward(torch.ones_like(output))

def test_npu():
    input = torch.randn(2000, 1000).detach().requires_grad_().npu()
    output = torch.sum(input)
    output.backward(torch.ones_like(output))

if __name__ == "__main__":
    test_cpu()
    torch_npu.npu.set_device("npu:0")
    test_npu()

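If no device has been set when a backward operation runs, the program automatically initializes the device to 0 by default, which is equivalent to calling set_device("npu:0"). Switching devices for computation is not currently supported, so manually setting the device with set_device() after that point may trigger this error. In the code above, test_cpu() runs backward before torch_npu.npu.set_device("npu:0") is called, so the explicit call arrives after the implicit initialization.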

Solution

Set the device manually with set_device() before running any backward operation.

Original code:

if __name__ == "__main__":
    test_cpu()
    torch_npu.npu.set_device("npu:0")
    test_npu()

Modified code:

if __name__ == "__main__":
    torch_npu.npu.set_device("npu:0")
    test_cpu()
    test_npu()
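
The ordering rule in the fix (bind the device first, then run everything else) can also be enforced with a small wrapper. The following is only a minimal sketch, not part of torch_npu: DeviceOnce and run are hypothetical names introduced here for illustration. The device-setting function is passed in as a parameter so the sketch runs without NPU hardware; on a real system it would presumably be torch_npu.npu.set_device, the call used in this article. The wrapper only enforces call order in your own code and cannot detect the implicit initialization performed by backward.

```python
from typing import Callable, Iterable, Optional


class DeviceOnce:
    """Hypothetical helper: bind the process to one device, exactly once."""

    def __init__(self, set_device: Callable[[str], None]) -> None:
        # set_device is injected by the caller. Assumption: on a real
        # system this would be torch_npu.npu.set_device.
        self._set_device = set_device
        self.bound: Optional[str] = None

    def bind(self, device: str) -> None:
        if self.bound is None:
            # First call: actually set the device.
            self._set_device(device)
            self.bound = device
        elif self.bound != device:
            # Fail early in Python instead of attempting a device switch.
            raise RuntimeError(
                f"device already bound to {self.bound!r}; "
                f"refusing to switch to {device!r}"
            )
        # Binding the same device again is a no-op.


def run(set_device: Callable[[str], None],
        device: str,
        steps: Iterable[Callable[[], None]]) -> None:
    """Bind the device first, then run each step in order."""
    guard = DeviceOnce(set_device)
    guard.bind(device)  # must happen before any backward operation
    for step in steps:
        step()
```

With the functions from this article, a call such as run(torch_npu.npu.set_device, "npu:0", [test_cpu, test_npu]) would reproduce the order used in the modified code.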
