ATB Whole-Graph Sinking
Product Support
| Hardware Model | Supported |
| --- | --- |
|  | √ |
|  | √ |
|  | x |
|  | x |
|  | x |
Feature Overview
The whole-graph sinking feature launches all device-side tasks of a graph operator (GraphOperation) in a single pass and fixes them on the device. Compared with per-operator launch, the first capture of a model instance costs the same as per-operator launch, but subsequent replays no longer re-launch tasks: they directly execute the tasks already fixed on the device. Replays after capture therefore need no host-side participation, which greatly reduces host time and thus optimizes end-to-end time.
Feature Description
Depending on how it is used, the ATB acceleration library's whole-graph sinking supports two modes: external capture and internal capture.
- External Capture
External capture means the user calls the aclmdlRI-related interfaces themselves, outside the library operation's Setup and Execute calls. With external capture, if you need the PreLaunch phase of the library's Execute interface to update parameters (variantPack, workspace), call the Context interface SetLaunchMode to set the launch mode to atb::GRAPH_LAUNCH_MODE.
The procedure is as follows (a condensed sketch follows this list):
Figure 1 External capture flow
- Create a model
Create a model runtime instance (model):
`aclmdlRI model = nullptr;`
- Capture tasks
Call the aclmdlRICaptureBegin and aclmdlRICaptureEnd interfaces to delimit the device-side tasks to capture on the specified Stream. Between aclmdlRICaptureBegin and aclmdlRICaptureEnd, device-side tasks launched on that Stream are not executed immediately; they are staged in the framework-internal model runtime instance (aclmdlRI model).
- Execute the model
Actually run all the device-side tasks that were saved into the aclmdlRI model runtime instance during the capture phase.
- Subsequent replays
Once capture is complete, and provided the graph built with the library and the input tensors stay unchanged, only Setup, PreLaunch, and aclmdlRIExecuteAsync need to be called. Setup performs only light validation; PreLaunch updates the data (workspace, operator Args, and Tiling).
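Condensed, the external capture flow looks like the sketch below. This is a minimal sketch, not a complete program: error handling is omitted, and `op`, `pack`, `workspace`, `workspaceSize`, `stream`, and `context` are assumed to be prepared as in the full example under "Usage Example".

```cpp
// Minimal sketch of external capture; names mirror the full example below.
context->SetLaunchMode(atb::GRAPH_LAUNCH_MODE);                   // required for PreLaunch updates
aclmdlRI model = nullptr;                                         // model runtime instance
aclmdlRICaptureBegin(stream, ACL_MODEL_RI_CAPTURE_MODE_RELAXED);  // start capture on `stream`
op->Setup(pack, workspaceSize, context);
context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);  // PreLaunch: prepare args/tiling
context->SetExecuteType(atb::EXECUTE_LAUNCH);
op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);  // tasks are staged, not run
aclmdlRICaptureEnd(stream, &model);                               // stop capture, obtain the model
aclmdlRIExecuteAsync(model, stream);                              // run the captured tasks

// Replay: refresh inputs, then only Setup + PreLaunch + async execute.
op->Setup(pack, workspaceSize, context);                          // light validation only
context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);  // update tensor addresses/tiling
aclmdlRIExecuteAsync(model, stream);
aclrtSynchronizeStream(stream);                                   // wait for device-side completion
```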
- Internal Capture
In the internal capture scenario, the library wraps the aclmdlRI interfaces, so the calling flow is the same for per-operator launch and whole-graph sinking: the user keeps calling the library's Setup and Execute interfaces (the latter split into the PreLaunch and Launch phases), and the library internally creates the model, captures the tasks, and executes them. Compared with external capture, this preserves the library's usual calling convention and is more convenient. A condensed sketch follows Figure 2.
Figure 2 Internal capture flow
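A minimal sketch of one iteration under the same assumed setup as above (`op`, `pack`, `workspace`, `workspaceSize`, `stream`, `context` prepared as in the full example): the first Launch call captures and executes, and later iterations replay.

```cpp
// Minimal sketch of internal capture; the library manages the aclmdlRI model itself.
context->SetLaunchMode(atb::GRAPH_LAUNCH_MODE);                   // switch to whole-graph launch
op->Setup(pack, workspaceSize, context);                          // light validation after first capture
context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);  // update workspace/args/tiling
context->SetExecuteType(atb::EXECUTE_LAUNCH);
op->Execute(pack, (uint8_t *)workspace, workspaceSize, context);  // captures on 1st call, replays afterwards
aclrtSynchronizeStream(stream);                                   // wait for device-side tasks
```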
Notes
- If the device-side tasks captured during the capture phase change (the graph structure inside the library's GraphOperation changes) or the sizes of the input tensors change, the graph must be recaptured.
- aclmdlRICaptureBegin and aclmdlRICaptureEnd must be used in pairs, and both calls must be given the same Stream.
- During capture, every asynchronous device-side task is recorded into the model. If device-side memory problems occur, first check whether an asynchronous device-side task that should not have been captured was captured (for example, aclrtMemcpyAsync).
- If synchronous interfaces (such as aclrtMalloc or aclrtMemcpy) must be called during capture, set the mode argument of aclmdlRICaptureBegin to ACL_MODEL_RI_CAPTURE_MODE_RELAXED.
- Currently only tensor addresses and the tiling content of some built-in operators can be updated; other input updates, such as the workspace address or tensor shapes, are not supported.
Usage Example
Set up the environment variables first:

```shell
source ${toolkit_install_dir}/set_env.sh    # e.g. source /usr/local/Ascend/ascend-toolkit/set_env.sh
source ${nnal_install_dir}/atb/set_env.sh   # e.g. source /usr/local/Ascend/nnal/atb/set_env.sh
```
```cpp
#include <atb/operation.h>
#include "atb/infer_op_params.h"
#include <atb/types.h>
#include <acl/acl.h>
#include <iostream>
#include <vector>
#include <atb/utils.h>
#include "experiment/runtime/runtime/rt_model.h"
#include "experiment/runtime/runtime/stream.h"
#include <unistd.h>
#include <acl/acl_mdl.h>
#include <cstdlib>   // std::getenv
#include <cstring>   // std::strcmp (string comparison)

// Set the attributes of every input tensor.
void CreateInTensorDescs(atb::SVector<atb::TensorDesc> &intensorDescs)
{
    for (size_t i = 0; i < intensorDescs.size(); i++) {
        intensorDescs.at(i).dtype = ACL_FLOAT16;
        intensorDescs.at(i).format = ACL_FORMAT_ND;
        intensorDescs.at(i).shape.dimNum = 2;
        intensorDescs.at(i).shape.dims[0] = 2;
        intensorDescs.at(i).shape.dims[1] = 2;
    }
}

// Fill every input tensor and allocate device memory for it. The tensors are built by hand
// here; in real code they can be converted from torch tensors or other simple structures.
void CreateInTensors(atb::SVector<atb::Tensor> &inTensors, atb::SVector<atb::TensorDesc> &intensorDescs, uint32_t value)
{
    for (size_t i = 0; i < inTensors.size(); i++) {
        inTensors.at(i).desc = intensorDescs.at(i);
        inTensors.at(i).dataSize = atb::Utils::GetTensorSize(inTensors.at(i));
        std::vector<uint16_t> hostData(atb::Utils::GetTensorNumel(inTensors.at(i)), value); // host buffer filled with `value`
        int ret = aclrtMalloc(&inTensors.at(i).deviceData, inTensors.at(i).dataSize,
                              ACL_MEM_MALLOC_HUGE_FIRST); // allocate NPU memory
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
        ret = aclrtMemcpy(inTensors.at(i).deviceData, inTensors.at(i).dataSize, hostData.data(),
                          hostData.size() * sizeof(uint16_t), ACL_MEMCPY_HOST_TO_DEVICE); // copy host memory to the NPU
    }
}

// Fill every output tensor and allocate device memory for it, as for the input tensors.
void CreateOutTensors(atb::SVector<atb::Tensor> &outTensors, atb::SVector<atb::TensorDesc> &outtensorDescs)
{
    for (size_t i = 0; i < outTensors.size(); i++) {
        outTensors.at(i).desc = outtensorDescs.at(i);
        outTensors.at(i).dataSize = atb::Utils::GetTensorSize(outTensors.at(i));
        int ret = aclrtMalloc(&outTensors.at(i).deviceData, outTensors.at(i).dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
        if (ret != 0) {
            std::cout << "alloc error!";
            exit(0);
        }
    }
}

// Build a subgraph: out = (A + B) + (C + D).
void CreateGraphOperation(atb::GraphParam &opGraph, atb::Operation **operation)
{
    opGraph.inTensorNum = 4;
    opGraph.outTensorNum = 1;
    opGraph.internalTensorNum = 2;
    opGraph.nodes.resize(3);

    enum InTensorId { IN_TENSOR_A = 0, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D, ADD3_OUT, ADD1_OUT, ADD2_OUT };

    size_t nodeId = 0;
    atb::Node &addNode = opGraph.nodes.at(nodeId++);
    atb::Node &addNode2 = opGraph.nodes.at(nodeId++);
    atb::Node &addNode3 = opGraph.nodes.at(nodeId++);

    atb::infer::ElewiseParam addParam;
    addParam.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    atb::Status status = atb::CreateOperation(addParam, &addNode.operation);
    addNode.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B};
    addNode.outTensorIds = {ADD1_OUT};

    atb::infer::ElewiseParam addParam2;
    addParam2.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam2, &addNode2.operation);
    addNode2.inTensorIds = {IN_TENSOR_C, IN_TENSOR_D};
    addNode2.outTensorIds = {ADD2_OUT};

    atb::infer::ElewiseParam addParam3;
    addParam3.elewiseType = atb::infer::ElewiseParam::ElewiseType::ELEWISE_ADD;
    status = atb::CreateOperation(addParam3, &addNode3.operation);
    addNode3.inTensorIds = {ADD1_OUT, ADD2_OUT};
    addNode3.outTensorIds = {ADD3_OUT};

    status = atb::CreateOperation(opGraph, operation);
}

// Build the full graph out of five subgraph nodes.
void CreateMultiGraphOperation(atb::GraphParam &opGraph, atb::Operation **operation)
{
    opGraph.inTensorNum = 4;
    opGraph.outTensorNum = 1;
    opGraph.internalTensorNum = 4;
    opGraph.nodes.resize(5);

    enum InTensorId { IN_TENSOR_A = 0, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D, ADD5_OUT,
                      ADD1_INTER, ADD2_INTER, ADD3_INTER, ADD4_INTER };

    size_t nodeId = 0;
    atb::Node &addGraphNode1 = opGraph.nodes.at(nodeId++);
    atb::Node &addGraphNode2 = opGraph.nodes.at(nodeId++);
    atb::Node &addGraphNode3 = opGraph.nodes.at(nodeId++);
    atb::Node &addGraphNode4 = opGraph.nodes.at(nodeId++);
    atb::Node &addGraphNode5 = opGraph.nodes.at(nodeId++);

    atb::GraphParam addGraphParam1;
    CreateGraphOperation(addGraphParam1, &addGraphNode1.operation);
    addGraphNode1.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D};
    addGraphNode1.outTensorIds = {ADD1_INTER};

    atb::GraphParam addGraphParam2;
    CreateGraphOperation(addGraphParam2, &addGraphNode2.operation);
    addGraphNode2.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D};
    addGraphNode2.outTensorIds = {ADD2_INTER};

    atb::GraphParam addGraphParam3;
    CreateGraphOperation(addGraphParam3, &addGraphNode3.operation);
    addGraphNode3.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D};
    addGraphNode3.outTensorIds = {ADD3_INTER};

    atb::GraphParam addGraphParam4;
    CreateGraphOperation(addGraphParam4, &addGraphNode4.operation);
    addGraphNode4.inTensorIds = {IN_TENSOR_A, IN_TENSOR_B, IN_TENSOR_C, IN_TENSOR_D};
    addGraphNode4.outTensorIds = {ADD4_INTER};

    atb::GraphParam addGraphParam5;
    CreateGraphOperation(addGraphParam5, &addGraphNode5.operation);
    addGraphNode5.inTensorIds = {ADD1_INTER, ADD2_INTER, ADD3_INTER, ADD4_INTER};
    addGraphNode5.outTensorIds = {ADD5_OUT};

    atb::Status status = atb::CreateOperation(opGraph, operation);
}

// Print the result.
void PrintOutTensorValue(atb::Tensor &outTensor)
{
    std::vector<uint16_t> outBuffer(atb::Utils::GetTensorNumel(outTensor));
    int ret = aclrtMemcpy(outBuffer.data(), outBuffer.size() * sizeof(uint16_t), outTensor.deviceData,
                          outTensor.dataSize, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != 0) {
        std::cout << "copy error!";
        exit(0);
    }
    for (size_t i = 0; i < outBuffer.size(); i++) {
        std::cout << "out[" << i << "] = " << (uint32_t)outBuffer.at(i) << std::endl;
    }
}

int main()
{
    // The environment variable ATB_CAPTURE_TYPE selects internal or external capture.
    // ATB_CAPTURE_TYPE=in : internal capture
    const char *captureType = std::getenv("ATB_CAPTURE_TYPE");
    aclInit(nullptr);

    // Select the device, create the stream and context, and attach the stream.
    uint32_t deviceId = 1;
    aclrtSetDevice(deviceId);
    aclrtStream exeStream = nullptr;
    aclrtCreateStream(&exeStream);
    atb::Context *context = nullptr;
    atb::CreateContext(&context);
    context->SetExecuteStream(exeStream);
    context->SetLaunchMode(atb::GRAPH_LAUNCH_MODE);

    // Create the graph operator.
    atb::Operation *operation = nullptr;
    atb::GraphParam opGraph;
    CreateMultiGraphOperation(opGraph, &operation);

    // Prepare the input and output tensors.
    atb::VariantPack pack;
    atb::SVector<atb::TensorDesc> intensorDescs;
    atb::SVector<atb::TensorDesc> outtensorDescs;
    uint32_t inTensorNum = operation->GetInputNum();
    uint32_t outTensorNum = operation->GetOutputNum();
    pack.inTensors.resize(inTensorNum);
    intensorDescs.resize(inTensorNum);
    CreateInTensorDescs(intensorDescs);
    outtensorDescs.resize(outTensorNum);
    pack.outTensors.resize(outTensorNum);
    operation->InferShape(intensorDescs, outtensorDescs);
    CreateOutTensors(pack.outTensors, outtensorDescs);

    // Initialize the workspaces.
    uint64_t workspaceSize = 0;
    void *workSpace = nullptr;
    void *workSpace1 = nullptr;
    void *workSpace2 = nullptr;
    int ret = 0;

    // Graph execution.
    if (!captureType || std::strcmp(captureType, "in") == 0) {
        // Internal capture.
        std::cout << "start test capture graph inside" << std::endl;
        for (size_t i = 0; i < 10; i++) {
            CreateInTensors(pack.inTensors, intensorDescs, i + 1);
            if (i == 0) {
                // Capture the device-side tasks.
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace == nullptr) {
                    ret = aclrtMalloc(&workSpace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workspace:" << workSpace << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace, workspaceSize, context);
            } else if (i > 3 && i < 6) {
                // Subsequent replay; a new workspace may be passed in.
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace1 == nullptr) {
                    ret = aclrtMalloc(&workSpace1, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workspace1:" << workSpace1 << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace1, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace1, workspaceSize, context);
            } else if (i >= 6 && i < 8) {
                // Subsequent replay; a new workspace may be passed in.
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace2 == nullptr) {
                    ret = aclrtMalloc(&workSpace2, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workSpace2:" << workSpace2 << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace2, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace2, workspaceSize, context);
            } else {
                // Subsequent replay.
                std::cout << "workspace:" << workSpace << ", workspaceSize:" << workspaceSize << std::endl;
                operation->Setup(pack, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace, workspaceSize, context);
            }
            // Synchronize the stream to wait for device-side tasks to finish.
            ret = aclrtSynchronizeStream(exeStream);
            if (ret != 0) {
                std::cout << "sync error!" << std::endl;
                exit(0);
            }
            std::cout << "aclrtSynchronizeStream success!" << std::endl;
            // Print the output tensor values.
            PrintOutTensorValue(pack.outTensors.at(0));
        }
    } else {
        // External capture.
        std::cout << "start test capture graph outside" << std::endl;
        // Create the model runtime instance.
        aclmdlRI model = nullptr;
        for (size_t i = 0; i < 10; i++) {
            CreateInTensors(pack.inTensors, intensorDescs, i + 1);
            if (i == 0) {
                // Capture the device-side tasks.
                // Begin capture.
                aclmdlRICaptureBegin(exeStream, ACL_MODEL_RI_CAPTURE_MODE_RELAXED);
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace == nullptr) {
                    ret = aclrtMalloc(&workSpace, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workspace:" << workSpace << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_LAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace, workspaceSize, context);
                // End capture.
                aclmdlRICaptureEnd(exeStream, &model);
                // Execute the model runtime instance.
                aclmdlRIExecuteAsync(model, exeStream);
            } else if (i > 3 && i < 6) {
                // Subsequent replay; a new workspace may be passed in.
                // Setup only validates the variantPack.
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace1 == nullptr) {
                    ret = aclrtMalloc(&workSpace1, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workspace1:" << workSpace1 << ", workspaceSize:" << workspaceSize << std::endl;
                // Call PreLaunch to update the parameters of the model runtime instance.
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace1, workspaceSize, context);
                // Replay the model runtime instance.
                aclmdlRIExecuteAsync(model, exeStream);
            } else if (i >= 6 && i < 8) {
                // Subsequent replay; a new workspace may be passed in.
                operation->Setup(pack, workspaceSize, context);
                if (workspaceSize != 0 && workSpace2 == nullptr) {
                    ret = aclrtMalloc(&workSpace2, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
                    if (ret != 0) {
                        std::cout << "alloc error!";
                        exit(0);
                    }
                }
                std::cout << "workSpace2:" << workSpace2 << ", workspaceSize:" << workspaceSize << std::endl;
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace2, workspaceSize, context);
                aclmdlRIExecuteAsync(model, exeStream);
            } else {
                // Subsequent replay.
                std::cout << "workspace:" << workSpace << ", workspaceSize:" << workspaceSize << std::endl;
                operation->Setup(pack, workspaceSize, context);
                context->SetExecuteType(atb::EXECUTE_PRELAUNCH);
                operation->Execute(pack, (uint8_t *)workSpace, workspaceSize, context);
                aclmdlRIExecuteAsync(model, exeStream);
            }
            // Synchronize the stream to wait for device-side tasks to finish.
            ret = aclrtSynchronizeStream(exeStream);
            if (ret != 0) {
                std::cout << "sync error!" << std::endl;
                exit(0);
            }
            std::cout << "aclrtSynchronizeStream success!" << std::endl;
            // Print the output tensor values.
            PrintOutTensorValue(pack.outTensors.at(0));
        }
        // Destroy the model runtime instance.
        aclmdlRIDestroy(model);
    }

    // Release resources.
    atb::DestroyOperation(operation);
    atb::DestroyContext(context);
    for (size_t i = 0; i < pack.inTensors.size(); i++) {
        aclrtFree(pack.inTensors.at(i).deviceData);
    }
    for (size_t i = 0; i < pack.outTensors.size(); i++) {
        aclrtFree(pack.outTensors.at(i).deviceData);
    }
    aclrtFree(workSpace);
    aclrtFree(workSpace1);
    aclrtFree(workSpace2);
    aclrtDestroyStream(exeStream);
    aclrtResetDevice(deviceId);
    aclFinalize();
    return 0;
}
```
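To build and run the sample, something like the following should work; this is a sketch under assumptions, not a verified build line. It assumes `ATB_HOME_PATH` is exported by atb/set_env.sh and `ASCEND_TOOLKIT_HOME` by the toolkit's set_env.sh, and that the libraries are named libatb.so and libascendcl.so; adjust paths, library names, and the device ID in the source to your environment.

```shell
# Assumed environment: ATB_HOME_PATH and ASCEND_TOOLKIT_HOME set by the set_env.sh scripts above.
g++ -std=c++17 demo.cpp -o demo \
    -I"${ATB_HOME_PATH}/include" -I"${ASCEND_TOOLKIT_HOME}/include" \
    -L"${ATB_HOME_PATH}/lib" -L"${ASCEND_TOOLKIT_HOME}/lib64" \
    -latb -lascendcl

ATB_CAPTURE_TYPE=in ./demo    # internal capture; any other value exercises external capture
```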