
ATB Whole-Graph Sinking

Product Support

Hardware Model                                               Supported
Atlas A3 Inference Series / Atlas A3 Training Series         √
Atlas A2 Training Series / Atlas 800I A2 Inference Product   √
Atlas Training Series                                        x
Atlas Inference Series                                       x
Atlas 200I/500 A2 Inference Product                          x

Feature Overview

The whole-graph sinking feature launches all Device-side tasks of an entire graph operator (GraphOperation) in one pass and fixes them on the Device side. Compared with single-operator launch mode, the first capture of a model instance costs about the same as single-operator launch, but subsequent replays no longer need to re-launch anything: the tasks already fixed on the Device side are executed directly. Once capture is complete, replays require no Host-side involvement, which greatly reduces Host-side time and thus optimizes end-to-end time.

Feature Description

Depending on how it is used, the library's whole-graph sinking feature supports two approaches: external Capture and internal Capture.

  • External Capture

    External Capture means that the user calls the aclmdlRI-related interfaces outside of the library operation's Setup and Execute calls. With external Capture, if the PreLaunch phase of the library's Execute interface should be used to update parameters (variantPack, workspace), the SetLaunchMode interface of Context must be called to set the launch mode to atb::GRAPH_LAUNCH_MODE.

    The procedure is as follows (a condensed code sketch follows this list):

    Figure 1 External Capture flowchart
    • Create the model

      Create a model runtime instance (model):

      aclmdlRI model = nullptr
    • Capture the tasks

      Call the aclmdlRICaptureBegin and aclmdlRICaptureEnd interfaces to delimit the Device-side tasks to be captured on the specified Stream. Between aclmdlRICaptureBegin and aclmdlRICaptureEnd, Device-side tasks launched on that Stream are not executed immediately; instead they are staged in an internal model runtime instance (aclmdlRI model).

      • aclmdlRICaptureBegin

        Call aclmdlRICaptureBegin to start capturing the Device-side tasks launched on the Stream.

      • Launch the tasks to be captured

        Using the library's graph launch mode, record the Device-side tasks to be launched into the model runtime instance (model).

        • Set the library launch mode:
          context->SetLaunchMode(atb::GRAPH_LAUNCH_MODE)
        • Run Setup and Execute (PreLaunch)
      • aclmdlRICaptureEnd

        Call aclmdlRICaptureEnd to stop capturing Device-side tasks on the Stream and obtain the aclmdlRI model runtime instance.

    • Execute the model

      Actually run all the Device-side tasks saved into the aclmdlRI model runtime instance during the capture phase.

      • aclmdlRIExecuteAsync

        Call aclmdlRIExecuteAsync to execute all the previously captured Device-side tasks saved in the aclmdlRI model runtime instance.

    • Subsequent replays

      After capture is complete, as long as the library graph structure and the input tensors remain unchanged, only Setup, PreLaunch, and aclmdlRIExecuteAsync need to be called. Setup then performs only simple checks, and PreLaunch updates the data (workspace, operator Args, and Tiling).
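
      A minimal sketch of the external Capture flow, condensed from the full example at the end of this page (error handling omitted; operation, pack, workSpace, workspaceSize, context, and exeStream are assumed to be prepared as in that example):

        // First iteration: capture the Device-side tasks launched on the stream
        aclmdlRI model = nullptr;
        aclmdlRICaptureBegin(exeStream, ACL_MODEL_RI_CAPTURE_MODE_RELAXED);
        operation->Setup(pack, workspaceSize, context);
        context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
        operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
        context->SetExecuteType(atb::EXECUTE_LAUNCH);
        operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
        aclmdlRICaptureEnd(exeStream, &model);
        aclmdlRIExecuteAsync(model, exeStream);   // run the captured tasks once

        // Subsequent replays: Setup (checks only) + PreLaunch (parameter update), then replay the model instance
        operation->Setup(pack, workspaceSize, context);
        context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
        operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
        aclmdlRIExecuteAsync(model, exeStream);
        aclrtSynchronizeStream(exeStream);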

  • Internal Capture

    In the internal Capture scenario, the library wraps the aclmdlRI interfaces, so the calling flow looks the same to the user whether single-operator launch or whole-graph sinking is in effect: you keep using the library's Setup and Execute interfaces (Execute split into the PreLaunch and Launch phases), and the library itself creates the model, captures the tasks, and executes them. Compared with external Capture, the advantage is that the library's calling convention is preserved, which makes it more convenient to use.

    Figure 2 Internal Capture flowchart
    • Set the launch mode

      Call the SetLaunchMode interface of the library's Context class to set the launch mode to graph launch:

      SetLaunchMode(atb::GRAPH_LAUNCH_MODE)
    • Subsequent replays

      After capture is complete, as long as the library graph structure and the input tensors remain unchanged, simply call the library's Setup and Execute interfaces repeatedly. Because all Device-side tasks are already fixed on the Device after capture, later Setup calls perform only some checks, and the Execute phase only updates data and launches the whole graph, which greatly reduces Host-side time in these phases. A condensed code sketch follows this list.
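
      A minimal sketch of the internal Capture flow, condensed from the full example at the end of this page (error handling omitted; the first iteration triggers the capture inside the library, later iterations replay the fixed graph; numSteps is an illustrative placeholder):

        context->SetLaunchMode(atb::GRAPH_LAUNCH_MODE);           // enable graph launch (internal Capture)
        for (int step = 0; step < numSteps; ++step) {
            operation->Setup(pack, workspaceSize, context);        // after the first capture: validation only
            context->SetExecuteType(atb::EXECUTE_PRELAUNCH);       // update tensor addresses / tiling / args
            operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
            context->SetExecuteType(atb::EXECUTE_LAUNCH);          // launch the whole captured graph
            operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
            aclrtSynchronizeStream(exeStream);                     // wait for Device-side tasks to finish
        }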

Notes

  • If the Device-side tasks captured during the capture phase change (that is, the graph structure inside the library GraphOperation changes) or the size of the input tensors changes, capture must be performed again.
  • aclmdlRICaptureBegin and aclmdlRICaptureEnd must be used in pairs, and both calls must use the same Stream.
  • During capture, all asynchronous Device-side tasks are recorded into the model. If a Device-side memory problem occurs, first check whether an asynchronous Device-side task that should not have been captured was captured (for example, aclrtMemcpyAsync).
  • If synchronous interfaces (for example, aclrtMalloc or aclrtMemcpy) need to be called during capture, set the mode parameter of aclmdlRICaptureBegin to ACL_MODEL_RI_CAPTURE_MODE_RELAXED, as shown in the short sketch after this list.
  • Currently only tensor addresses and the tiling content of some built-in operators can be updated; other inputs such as workspace addresses and tensor shapes cannot be updated.
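
A minimal sketch of the relaxed capture mode mentioned above, consistent with the full example below: the synchronous aclrtMalloc for the workspace happens between the capture begin/end calls, which is why the relaxed mode is required.

    aclmdlRICaptureBegin(exeStream, ACL_MODEL_RI_CAPTURE_MODE_RELAXED); // relaxed mode tolerates synchronous calls
    operation->Setup(pack, workspaceSize, context);
    aclrtMalloc(&workSpace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);  // synchronous allocation during capture
    // ... launch the Device-side tasks to be captured (Execute PreLaunch/Launch) ...
    aclmdlRICaptureEnd(exeStream, &model);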

Usage Example

Before compiling this example, set the ascend-toolkit and atb environment variables:
source ${toolkit_install_dir}/set_env.sh  # e.g. source /usr/local/Ascend/ascend-toolkit/set_env.sh
source ${nnal_install_dir}/atb/set_env.sh  # e.g. source /usr/local/Ascend/nnal/atb/set_env.sh
#include <atb/operation.h>
#include "atb/infer_op_params.h"
#include <atb/types.h>
#include <acl/acl.h>
#include <iostream>
#include <vector>
#include <atb/utils.h>
#include "experiment/runtime/runtime/rt_model.h"
#include "experiment/runtime/runtime/stream.h"
#include <unistd.h>
#include <acl/acl_mdl.h>
#include <cstdlib>  // for std::getenv
#include <cstring>  // for std::strcmp (string comparison)

// Set the descriptor (dtype/format/shape) of each input tensor
void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs) 
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Set each input tensor and allocate device memory for it. The tensors are filled manually here; in a real project they could be converted from torch tensors or other simple data structures.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs, uint32_t value)
{
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        std::vector<uint16_t> hostData(atb::Utils::GetTensorNumel(inTensors.at(i)), value);   // host buffer filled with `value`
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, hostData.data(), hostData.size() * sizeof(uint16_t), ACL_MEMCPY_HOST_TO_DEVICE); // copy the host buffer to the NPU
    }
}

// Set each output tensor and allocate device memory for it, in the same way as the input tensors
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

// Build a subgraph (three ELEWISE_ADD nodes)
void CreateGraphOperation(atb::GraphParam &opGraph, atb::Operation **operation)
{
    opGraph.inTensorNum = 4;
    opGraph.outTensorNum = 1;
    opGraph.internalTensorNum = 2;
    opGraph.nodes.resize(3);

    enum InTensorId {
        IN_TENSOR_A = 0,
        IN_TENSOR_B,
        IN_TENSOR_C,
        IN_TENSOR_D,
        ADD3_OUT,
        ADD1_OUT,
        ADD2_OUT
    };

    size_t nodeId = 0;
    atb::Node &addNode = opGraph.nodes.at(nodeId++);
    atb::Node &addNode2 = opGraph.nodes.at(nodeId++);
    atb::Node &addNode3 = opGraph.nodes.at(nodeId++);

    atb::infer::ElewiseParam addParam;
    addParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    atb::Status status = atb::CreateOperation(addParam, &addNode.operation);
    addNode.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B};
    addNode.outTensorIds = {ADD1_OUT};

    atb::infer::ElewiseParam addParam2;
    addParam2.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam2, &addNode2.operation);
    addNode2.inTensorIds = {IN_TENSOR_C, IN_TENSOR_D};
    addNode2.outTensorIds = {ADD2_OUT};

    atb::infer::ElewiseParam addParam3;
    addParam3.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = CreateOperation(addParam3, &addNode3.operation);
    addNode3.inTensorIds = {ADD1_OUT, ADD2_OUT};
    addNode3.outTensorIds = {ADD3_OUT};

    status = atb::CreateOperation(opGraph, operation);
}

// Build the full graph by composing subgraphs
void CreateMultiGraphOperation(atb::GraphParam &opGraph, atb::Operation **operation)
{
    opGraph.inTensorNum = 4;
    opGraph.outTensorNum = 1;
    opGraph.internalTensorNum = 4;
    opGraph.nodes.resize(5);

    enum InTensorId {
        IN_TENSOR_A = 0,
        IN_TENSOR_B,
        IN_TENSOR_C,
        IN_TENSOR_D,
        ADD5_OUT,
        ADD1_INTER,
        ADD2_INTER,
        ADD3_INTER,
        ADD4_INTER
    };

    size_t nodeId = 0;
    atb::Node &addGraphNode1 = opGraph.nodes.at(nodeId++);
    atb::Node &addGraphNode2 = opGraph.nodes.at(nodeId++);
    atb::Node &addGraphNode3 = opGraph.nodes.at(nodeId++);
    atb::Node &addGraphNode4 = opGraph.nodes.at(nodeId++);
    atb::Node &addGraphNode5 = opGraph.nodes.at(nodeId++);

    atb::GraphParam addGraphParam1;
    CreateGraphOperation(addGraphParam1, &addGraphNode1.operation);
    addGraphNode1.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D};
    addGraphNode1.outTensorIds = {ADD1_INTER};

    atb::GraphParam addGraphParam2;
    CreateGraphOperation(addGraphParam2, &addGraphNode2.operation);
    addGraphNode2.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D};
    addGraphNode2.outTensorIds = {ADD2_INTER};

    atb::GraphParam addGraphParam3;
    CreateGraphOperation(addGraphParam3, &addGraphNode3.operation);
    addGraphNode3.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D};
    addGraphNode3.outTensorIds = {ADD3_INTER};

    atb::GraphParam addGraphParam4;
    CreateGraphOperation(addGraphParam4, &addGraphNode4.operation);
    addGraphNode4.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D};
    addGraphNode4.outTensorIds = {ADD4_INTER};

    atb::GraphParam addGraphParam5;
    CreateGraphOperation(addGraphParam5, &addGraphNode5.operation);
    addGraphNode5.inTensorIds = {ADD1_INTER, ADD2_INTER, ADD3_INTER, ADD4_INTER};
    addGraphNode5.outTensorIds = {ADD5_OUT};

    atb::Status status = atb::CreateOperation(opGraph, operation);
}

// Print the result
void PrintOutTensorValue(atb::Tensor &outTensor)
{
    std::vector<uint16_t> outBuffer(atb::Utils::GetTensorNumel(outTensor));
    int ret = aclrtMemcpy(outBuffer.data(), outBuffer.size() * sizeof(uint16_t), outTensor.deviceData, outTensor.dataSize, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != 0) {
        std::cout << "copy error!";
        exit(0);
    }
    for (size_t i = 0; i < outBuffer.size(); i = i + 1) {
        std::cout << "out[" << i << "] = " << (uint32_t)outBuffer.at(i) << std::endl;
    }
}

int main()
{
    // Use the environment variable ATB_CAPTURE_TYPE to choose between internal and external Capture
    // ATB_CAPTURE_TYPE=in (or unset): internal Capture; any other value: external Capture
    const char* captureType = std::getenv("ATB_CAPTURE_TYPE");
    aclInit(nullptr);
    // Set the device, create the stream, create the context, and bind the stream to it
    uint32_t deviceId = 1;
    aclrtSetDevice(deviceId);
    aclrtStream exeStream = nullptr;
    aclrtCreateStream(&exeStream);

    atb::Context *context = nullptr;
    atb::CreateContext(&context);
    context->SetExecuteStream(exeStream);
    context->SetLaunchMode(atb::GRAPH_LAUNCH_MODE);

    // Create the graph operator
    atb::Operation *operation = nullptr;
    atb::GraphParam opGraph;
    CreateMultiGraphOperation(opGraph, &operation);

    // Prepare input and output tensors
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;

    uint32_t inTensorNum = opGraph.inTensorNum;
    uint32_t outTensorNum = opGraph.outTensorNum;
    inTensorNum = operation->GetInputNum();
    outTensorNum = operation->GetOutputNum();

    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);

    CreateInTensorDescs(intensorDescs);

    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    operation->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // Initialize the workspaces
    uint64_t workspaceSize = 0;
    void *workSpace = nullptr;
    void *workSpace1 = nullptr;
    void *workSpace2 = nullptr;
    int ret = 0;
    // Graph execution
    // Internal Capture
    if (!captureType || std::strcmp(captureType, "in") == 0) {
        std::cout << "start test capture graph inside" << std::endl;
        for (size_t i = 0; i < 10; i++) {
            CreateInTensors(pack.inTensors, intensorDescs, i + 1);
            if (i == 0) {
                // Capture the Device-side tasks
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace == nullptr) {
                    ret = aclrtMalloc(&workSpace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workspace:" << workSpace << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
            } else if (i > 3 && i < 6) {
                // Subsequent replay; a new workspace may be passed in
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace1 == nullptr) {
                    ret = aclrtMalloc(&workSpace1, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workspace1:" << workSpace1 << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace1, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace1, workspaceSize, context);
            } else if (i >= 6 && i < 8) {
                // Subsequent replay; a new workspace may be passed in
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace2 == nullptr) {
                    ret = aclrtMalloc(&workSpace2, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workSpace2:" << workSpace2 << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace2, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace2, workspaceSize, context);
            } else {
                // Subsequent replay
                std::cout << "workspace:" << workSpace << ", workspaceSize:" << workspaceSize << std::endl;
                operation->Setup(pack, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
            }
            // Synchronize the stream to wait for the Device-side tasks to finish
            ret = aclrtSynchronizeStream(exeStream);
            if (ret != 0) {
                std::cout << "sync error!" << std::endl;
                exit(0);
            }
            std::cout << "aclrtSynchronizeStream success!" << std::endl;

            // Print the output tensor values
            PrintOutTensorValue(pack.outTensors.at(0));
        }
    } else {
        // External Capture
        std::cout << "start test capture graph outside" << std::endl;
        // Create the model runtime instance
        aclmdlRI model = nullptr;
        for (size_t i = 0; i < 10; i++) {
            CreateInTensors(pack.inTensors, intensorDescs, i + 1);
            if (i == 0) {
                // Capture the Device-side tasks
                // Begin capture
                aclmdlRICaptureBegin(exeStream, ACL_MODEL_RI_CAPTURE_MODE_RELAXED);
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace == nullptr) {
                    ret = aclrtMalloc(&workSpace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workspace:" << workSpace << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
                // End capture
                aclmdlRICaptureEnd(exeStream, &model);
                // Execute the model runtime instance
                aclmdlRIExecuteAsync(model, exeStream);
            } else if (i > 3 && i < 6) {
                // Subsequent replay; a new workspace may be passed in
                // The Setup phase only validates the variantPack
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace1 == nullptr) {
                    ret = aclrtMalloc(&workSpace1, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workspace1:" << workSpace1 << ", workspaceSize:" << workspaceSize << std::endl;
                // Call PreLaunch to update the parameters of the model runtime instance
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace1, workspaceSize, context);
                // Replay the model runtime instance
                aclmdlRIExecuteAsync(model, exeStream);
            } else if (i >= 6 && i < 8) {
                // Subsequent replay; a new workspace may be passed in
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace2 == nullptr) {
                    ret = aclrtMalloc(&workSpace2, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workSpace2:" << workSpace2 << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace2, workspaceSize, context);
                aclmdlRIExecuteAsync(model, exeStream);
            } else {
                // Subsequent replay
                std::cout << "workspace:" << workSpace << ", workspaceSize:" << workspaceSize << std::endl;
                operation->Setup(pack, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t*)workSpace, workspaceSize, context);
                aclmdlRIExecuteAsync(model, exeStream);
            }
            // Synchronize the stream to wait for the Device-side tasks to finish
            ret = aclrtSynchronizeStream(exeStream);
            if (ret != 0) {
                std::cout << "sync error!" << std::endl;
                exit(0);
            }
            std::cout << "aclrtSynchronizeStream success!" << std::endl;
            // Print the output tensor values
            PrintOutTensorValue(pack.outTensors.at(0));
        }
        // Destroy the model runtime instance
        aclmdlRIDestroy(model);
    }

    // Release resources
    atb::DestroyOperation(operation);
    atb::DestroyContext(context);
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    aclrtFree(workSpace);
    aclrtFree(workSpace1);
    aclrtFree(workSpace2);
    aclrtDestroyStream(exeStream);
    aclrtResetDevice(deviceId);
    aclFinalize();
}