Building a Model Running Instance Based on the Capture Mode

This feature is for trial use and may change in later versions. Do not use it in commercial products.

Principles

In eager mode (adopted by mainstream frameworks such as PyTorch), each operation or task is executed as soon as it is delivered, and no computational graph needs to be built. This mode offers immediate execution and convenient debugging, but it also incurs task delivery overhead on the host. As other performance optimizations progress, this host overhead gradually becomes a bottleneck that cannot be ignored.

On the Ascend AI Processor, you can offload the related tasks to the device for execution to reduce the host overhead. For this purpose, acl provides APIs for capturing the tasks in a stream into a model and then performing model inference. As shown in Figure 1, tasks delivered to the specified stream between aclmdlRICaptureBegin and aclmdlRICaptureEnd are not executed immediately. Instead, they are temporarily stored in a model running instance and are executed only when aclmdlRIExecuteAsync is called to perform model inference. If the tasks in a stream need to be executed multiple times, you do not need to deliver them again. You only need to call aclmdlRIExecuteAsync multiple times, which reduces the task delivery overhead on the host. When the model running instance is no longer required, call aclmdlRIDestroy to destroy it and release its resources promptly.

Figure 1 Capturing tasks to a model and performing model inference
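
The capture-and-replay behavior described above can be pictured with a small host-only model. The sketch below is not the acl API: MockStream, MockModel, Launch, CaptureBegin, CaptureEnd, and Replay are hypothetical names used purely for illustration. It shows that tasks delivered during capture are stored rather than run, and that the stored sequence can be executed repeatedly without being delivered again.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Host-only illustration; all names here are hypothetical, not acl APIs.
using Task = std::function<void()>;
using MockModel = std::vector<Task>;

class MockStream {
public:
    // Outside capture the task runs immediately (eager behavior);
    // during capture it is only stored.
    void Launch(Task task) {
        if (capturing_) {
            captured_.push_back(std::move(task));
        } else {
            task();
        }
    }
    void CaptureBegin() {
        assert(!capturing_);
        captured_.clear();
        capturing_ = true;
    }
    // Ends capture and hands the stored tasks over as a "model".
    MockModel CaptureEnd() {
        assert(capturing_);
        capturing_ = false;
        return std::move(captured_);
    }

private:
    bool capturing_ = false;
    MockModel captured_;
};

// Executes every stored task once, in delivery order.
void Replay(const MockModel &model) {
    for (const Task &task : model) {
        task();
    }
}

// Usage: deliver two tasks once, then replay them three times.
int DemoCaptureReplay() {
    int counter = 0;
    MockStream stream;
    stream.Launch([&counter] { counter += 100; });  // eager: runs now
    stream.CaptureBegin();
    stream.Launch([&counter] { counter += 1; });    // stored only
    stream.Launch([&counter] { counter += 2; });    // stored only
    MockModel model = stream.CaptureEnd();
    assert(counter == 100);  // nothing captured has run yet
    for (int i = 0; i < 3; ++i) {
        Replay(model);
    }
    return counter;  // 100 + 3 * (1 + 2)
}
```

In this model the host pays the delivery cost (building the task objects) once, while each replay only walks the stored list, which is the saving that the capture mode aims for.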

The following restrictions apply when tasks are captured into a model before model inference is performed:

  1. Before a stream enters the capture state, tasks in the stream are still executed immediately.
  2. During the capture, operations on the default stream are invalid.
  3. The device memory used by the captured tasks must remain unchanged during the capture. Release the related resources only after the model is no longer used and has been destroyed.
  4. During the capture, if ACL_MODEL_RI_CAPTURE_MODE_GLOBAL is used (in this mode, no thread is allowed to call unsafe functions), synchronous memory functions (such as aclrtMemset, aclrtMemcpy, and aclrtMemcpy2d) are invalid. Calling them reports an error, and the capture fails.

    However, if your application determines that calling these functions does not affect task capture, you can call aclmdlRICaptureThreadExchangeMode to switch the capture mode of the current thread to ACL_MODEL_RI_CAPTURE_MODE_RELAXED to remove the restriction.

  5. While tasks in a stream are being captured, they are offloaded to the device but not executed immediately. As a result, querying or synchronizing the stream or an event is invalid. Querying or synchronizing the device or context is also invalid, because the device and context contain the synchronization information of the stream.

    This applies in every capture mode: during the capture, streams, events, devices, and contexts cannot be synchronized or queried.

  6. If a captured asynchronous memory copy task involves host memory, that memory must be page-locked host memory allocated through an acl API (for example, aclrtMallocHost). Otherwise, an error is reported during the capture.
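
Restriction 4 relies on a per-thread capture mode that can be swapped temporarily. The following host-only sketch models an exchange-style call under the assumption (suggested by the sample later in this section, which calls aclmdlRICaptureThreadExchangeMode twice with the same variable) that the call installs the requested mode and writes the previous mode back through the pointer. CaptureMode, ExchangeMode, and SyncCopyAllowed are hypothetical stand-ins, not acl types or functions.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for illustration; not the acl types or functions.
enum class CaptureMode { Global, Relaxed };

// Capture mode of the calling thread in this model.
thread_local CaptureMode g_threadMode = CaptureMode::Global;

// Installs *mode for the current thread and writes the previous mode back,
// so calling it twice with the same variable restores the original mode.
void ExchangeMode(CaptureMode *mode) {
    std::swap(g_threadMode, *mode);
}

// In this model, synchronous memory functions are rejected in Global mode.
bool SyncCopyAllowed() {
    return g_threadMode == CaptureMode::Relaxed;
}

// Usage: relax the mode around one synchronous call, then restore it.
bool DemoExchange() {
    CaptureMode mode = CaptureMode::Relaxed;
    bool before = SyncCopyAllowed();      // false: thread starts in Global
    ExchangeMode(&mode);                  // thread -> Relaxed
    assert(mode == CaptureMode::Global);  // previous mode was written back
    bool during = SyncCopyAllowed();      // true
    ExchangeMode(&mode);                  // thread -> Global again
    bool after = SyncCopyAllowed();       // false
    return !before && during && !after;
}
```

Keeping the relaxed window as short as possible limits the number of calls that bypass the check.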

After each API call, add exception handling branches and log messages at the error and information levels as needed. The following snippet shows key steps only and is not ready to be built or run.

#include <stdio.h>
#include <vector>
#include "acl/acl.h"
#include "aclnnop/aclnn_add.h"

#define ACL_LOG(fmt, args...) fprintf(stdout, "[INFO]  " fmt "\n", ##args)

int64_t GetShapeSize(const std::vector<int64_t> &shape)
{
    int64_t shape_size = 1;
    for (auto i : shape) {
        shape_size *= i;
    }
    return shape_size;
}

int CreateAclTensor(const std::vector<int64_t> &shape, void **deviceAddr,
    aclDataType dataType, aclTensor **tensor)
{
    auto size = GetShapeSize(shape) * sizeof(float);
    //Allocate device memory.
    auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
    if (ret != ACL_SUCCESS) {
        return ret;
    }
    //Compute the strides of the contiguous tensor.
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = shape.size() - 2; i >= 0; i--) {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    //Call aclCreateTensor to create an ACL Tensor.
    *tensor = aclCreateTensor(shape.data(),
        shape.size(),
        dataType,
        strides.data(),
        0,
        aclFormat::ACL_FORMAT_ND,
        shape.data(),
        shape.size(),
        *deviceAddr);
    return 0;
}

int main()
{
    int devID = 0;
    void *self_d = nullptr;
    void *other_d = nullptr;
    void *out_d = nullptr;
    aclTensor *self = nullptr;
    aclTensor *other = nullptr;
    aclScalar *alpha = nullptr;
    aclTensor *out = nullptr;
    /* aclnnAdd: out = self  +  other * alpha */
    float *self_h = nullptr;
    float *other_h = nullptr;
    std::vector<int64_t> shape = {4, 2};
    float alphaValue = 1.1f;
    uint64_t workspaceSize = 0;
    aclOpExecutor *executor;
    auto size = GetShapeSize(shape);
	
    //Perform initialization.
    aclInit(NULL);
    //Specify the compute device.
    aclrtSetDevice(devID);

    //Prepare the input and output parameters of the aclnnAdd operator.
    CreateAclTensor(shape, &self_d, aclDataType::ACL_FLOAT, &self);
    CreateAclTensor(shape, &other_d, aclDataType::ACL_FLOAT, &other);
    alpha = aclCreateScalar(&alphaValue, aclDataType::ACL_FLOAT);
    CreateAclTensor(shape, &out_d, aclDataType::ACL_FLOAT, &out);

    //Obtain the workspace size required for operator computation and the executor that contains the operator computation process.
    aclnnAddGetWorkspaceSize(self, other, alpha, out, &workspaceSize, &executor);
    void *workspaceAddr = nullptr;
    if (workspaceSize > 0) {
        aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    }
	
    //Allocate page-locked memory by using aclrtMallocHost.
    aclrtMallocHost((void **)&self_h, size * sizeof(float));
    aclrtMallocHost((void **)&other_h, size * sizeof(float));
    for (int64_t i = 0; i < size; i++) {
        self_h[i] = static_cast<float>(0);
        other_h[i] = static_cast<float>(1);
    }

    aclmdlRI modelRI;
    aclrtStream stream;
    aclrtCreateStream(&stream);

    //======== Start the capture task. ========
    aclmdlRICaptureBegin(stream, ACL_MODEL_RI_CAPTURE_MODE_GLOBAL);
    //Asynchronous copy, which is used to copy the input data of the operator's self parameter from the host to the device
    aclrtMemcpyAsync(self_d, size * sizeof(float), self_h, size * sizeof(float), ACL_MEMCPY_HOST_TO_DEVICE, stream);
    //Switch the capture mode of the current thread to RELAXED to allow the call to aclrtMemcpy.
    //The call is expected to write the previous mode (GLOBAL) back to mode.
    aclmdlRICaptureMode mode = ACL_MODEL_RI_CAPTURE_MODE_RELAXED;
    aclmdlRICaptureThreadExchangeMode(&mode);
    //Synchronous copy, which is used to copy the input data of the operator's other parameter from the host to the device. This step is performed only once.
    aclrtMemcpy(other_d, size * sizeof(float), other_h, size * sizeof(float), ACL_MEMCPY_HOST_TO_DEVICE);
    //Switch the capture mode back to GLOBAL by passing in the mode saved by the previous call.
    aclmdlRICaptureThreadExchangeMode(&mode);
    //Execute the aclnnAdd operator.
    aclnnAdd(workspaceAddr, workspaceSize, executor, stream);    
    //Asynchronous copy, which is used to copy the output data of the operator from the device to the host
    aclrtMemcpyAsync(self_h, size * sizeof(float), out_d, size * sizeof(float), ACL_MEMCPY_DEVICE_TO_HOST, stream);
    //======== End the capture task. ========
    aclmdlRICaptureEnd(stream, &modelRI);
	
    //Print model information, which is used in the maintenance and test scenario.
    const char *jsonPath = "./modelRI.json";
    aclmdlRIDebugJsonPrint(modelRI, jsonPath, 0);

    //Execute the model multiple times.
    for (int i = 0; i < 8; i++) {
        aclmdlRIExecuteAsync(modelRI, stream);
        aclrtSynchronizeStream(stream);
        //Print each output of the operator. ACL_LOG already appends a newline.
        ACL_LOG("%f %f %f %f %f %f %f %f",
            self_h[0],
            self_h[1],
            self_h[2],
            self_h[3],
            self_h[4],
            self_h[5],
            self_h[6],
            self_h[7]);
    }

    //Destroy allocations.
    aclmdlRIDestroy(modelRI);
    aclrtDestroyStream(stream);
    aclDestroyTensor(self);
    aclDestroyTensor(other);
    aclDestroyTensor(out);
    aclDestroyScalar(alpha);
    aclrtFree(self_d);
    aclrtFree(other_d);
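    aclrtFree(out_d);
    //Free the page-locked host memory allocated by aclrtMallocHost.
    aclrtFreeHost(self_h);
    aclrtFreeHost(other_h);
    if (workspaceAddr != nullptr) {
        aclrtFree(workspaceAddr);
    }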
    //Deallocate the resources of the compute device.
    aclrtResetDevice(devID);
    //Perform deinitialization.
    aclFinalize();
}
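
The sample above omits return-code checks for brevity. One possible way to add the exception handling and logging recommended earlier is a small checking macro. The sketch below is self-contained and uses a hypothetical Status type, kSuccess constant, and FakeCall function in place of real acl calls. In real code the checked expression would be an acl API call and the comparison would use the acl success code.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-ins so the sketch builds without the acl headers.
using Status = int;
constexpr Status kSuccess = 0;

// Logs at error level and returns the failing code from the enclosing function.
#define CHECK_RET(expr)                                                   \
    do {                                                                  \
        Status checkRet_ = (expr);                                        \
        if (checkRet_ != kSuccess) {                                      \
            std::fprintf(stderr, "[ERROR] %s failed, ret = %d (%s:%d)\n", \
                         #expr, checkRet_, __FILE__, __LINE__);           \
            return checkRet_;                                             \
        }                                                                 \
    } while (0)

// Placeholder for an API call: succeeds when code is 0.
Status FakeCall(int code) { return code; }

// Stops at the first failing step and reports how many steps completed.
Status RunSteps(int failAtStep, int *stepsDone) {
    assert(stepsDone != nullptr);
    *stepsDone = 0;
    CHECK_RET(FakeCall(failAtStep == 1 ? 507 : 0));
    ++*stepsDone;
    CHECK_RET(FakeCall(failAtStep == 2 ? 507 : 0));
    ++*stepsDone;
    std::fprintf(stdout, "[INFO]  all steps succeeded\n");
    return kSuccess;
}
```

Because the macro returns early, a production version would also need to release any resources acquired before the failing call, for example through a cleanup function or RAII wrappers.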

Cross-Stream Task Capture

When tasks in a stream are captured, aclmdlRICaptureBegin and aclmdlRICaptureEnd must specify the same stream (the main stream). To capture tasks across streams, call aclrtRecordEvent to deliver an Event Record task in the main stream and call aclrtStreamWaitEvent to deliver an Event Wait task in another stream. This associates the other stream with the main stream so that tasks in both can be captured into the same model. The event enters the capture state as well, and any other stream that waits for the event also enters the capture state. As shown in Figure 2, stream2 waits for task1 in the main stream to complete, and stream3 waits for task2 in stream2 to complete. stream2 therefore depends on the main stream directly and stream3 depends on it indirectly, so both streams enter the capture state, and task2 in stream2 and task3 in stream3 are captured into the model.

Streams that enter the capture state through events must return to the main stream through events, directly or indirectly. Otherwise, an error is reported when the capture ends. As shown in Figure 2, you can call aclrtRecordEvent to deliver an Event Record task in stream2 and in stream3, and call aclrtStreamWaitEvent to deliver the corresponding Event Wait tasks in the main stream, so that stream2 and stream3 return to the main stream directly. Alternatively, for a stream that depends on the main stream indirectly (such as stream3), you can deliver an Event Record task in stream3 and an Event Wait task in stream2 so that stream3 returns to stream2, and then deliver an Event Record task in stream2 and an Event Wait task in the main stream so that both streams return to the main stream. After the streams have returned to the main stream and before the capture ends, do not deliver further tasks (for example, task5 in Figure 2) to stream2 or stream3. Otherwise, an error is reported when the capture ends because these tasks are not associated with the main stream.

In the cross-stream task capture scenario, a task that resets the event status is inserted when the capture ends. This ensures that the model running instance can be executed repeatedly.

After the capture ends, tasks in the streams are executed immediately.

Figure 2 Cross-stream task capture
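
The rule that every stream pulled into the capture must return to the main stream before the capture ends can be reasoned about with a simplified host-only model. The sketch below is an assumption-laden illustration, not the runtime's actual bookkeeping: CaptureTracker and its methods are hypothetical names, streams and events are plain integers, and Event Record and Event Wait are counted as tasks on the stream they are delivered to.

```cpp
#include <algorithm>
#include <cassert>
#include <map>

// Host-only model of the rejoin rule; every name here is hypothetical.
class CaptureTracker {
public:
    explicit CaptureTracker(int mainStream) : main_(mainStream) {
        delivered_[main_] = 0;
    }
    bool IsCapturing(int stream) const { return delivered_.count(stream) != 0; }

    // Delivers an ordinary task; only captured streams are tracked.
    void Launch(int stream) {
        if (IsCapturing(stream)) {
            ++delivered_[stream];
        }
    }
    // Event Record: remembers the stream and how many tasks it has so far.
    void Record(int event, int stream) {
        if (!IsCapturing(stream)) {
            return;
        }
        ++delivered_[stream];
        events_[event] = {stream, delivered_[stream]};
    }
    // Event Wait: pulls a new stream into the capture, or joins the
    // recording stream into a stream that is already captured.
    void Wait(int stream, int event) {
        auto it = events_.find(event);
        if (it == events_.end()) {
            return;  // the event was not recorded in a captured stream
        }
        const bool joins = IsCapturing(stream);
        ++delivered_[stream];  // the wait is a task (and may start capture)
        if (joins) {
            int &covered = joined_[it->second.stream];
            covered = std::max(covered, it->second.count);
        }
    }
    // Capture may end only if no extra stream has tasks after its last join.
    bool CanEnd() const {
        for (const auto &entry : delivered_) {
            if (entry.first == main_) {
                continue;
            }
            auto it = joined_.find(entry.first);
            const int covered = (it == joined_.end()) ? 0 : it->second;
            if (covered != entry.second) {
                return false;
            }
        }
        return true;
    }

private:
    struct Recorded { int stream; int count; };
    int main_;
    std::map<int, int> delivered_;    // tasks delivered per captured stream
    std::map<int, int> joined_;       // tasks covered by a join, per stream
    std::map<int, Recorded> events_;  // where and when each event was recorded
};

// Usage mirroring the text: stream 2 forks from main stream 1 and rejoins.
bool DemoRejoin(bool lateTask) {
    CaptureTracker tracker(1);
    tracker.Launch(1);
    tracker.Record(10, 1);  // event 10 recorded in the main stream
    tracker.Wait(2, 10);    // stream 2 enters the capture state
    assert(tracker.IsCapturing(2));
    tracker.Launch(2);
    tracker.Record(11, 2);  // event 11 recorded in stream 2
    tracker.Wait(1, 11);    // stream 2 returns to the main stream
    if (lateTask) {
        tracker.Launch(2);  // task delivered after the return: not allowed
    }
    return tracker.CanEnd();
}
```

In this model DemoRejoin(false) reports that the capture can end, while DemoRejoin(true) does not, which corresponds to the task5 case in Figure 2.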

After each API call, add exception handling branches and log messages at the error and information levels as needed. The following snippet shows key steps only and is not ready to be built or run.

#include <stdio.h>
#include <vector>
#include "acl/acl.h"
#include "aclnnop/aclnn_add.h"

#define ACL_LOG(fmt, args...) fprintf(stdout, "[INFO]  " fmt "\n", ##args)

int64_t GetShapeSize(const std::vector<int64_t> &shape)
{
    int64_t shape_size = 1;
    for (auto i : shape) {
        shape_size *= i;
    }
    return shape_size;
}

int CreateAclTensor(const std::vector<int64_t> &shape, void **deviceAddr,
    aclDataType dataType, aclTensor **tensor)
{
    auto size = GetShapeSize(shape) * sizeof(float);
    //Allocate device memory.
    auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
    if (ret != ACL_SUCCESS) {
        return ret;
    }
    //Compute the strides of the contiguous tensor.
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = shape.size() - 2; i >= 0; i--) {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    //Call aclCreateTensor to create an ACL Tensor.
    *tensor = aclCreateTensor(shape.data(),
        shape.size(),
        dataType,
        strides.data(),
        0,
        aclFormat::ACL_FORMAT_ND,
        shape.data(),
        shape.size(),
        *deviceAddr);
    return 0;
}

int main()
{
    int devID = 0;
    void *self_d = nullptr;
    void *other_d = nullptr;
    void *out_d = nullptr;
    aclTensor *self = nullptr;
    aclTensor *other = nullptr;
    aclScalar *alpha = nullptr;
    aclTensor *out = nullptr;
    /* aclnnAdd: out = self  +  other * alpha */
    float *self_h = nullptr;
    float *other_h = nullptr;
    std::vector<int64_t> shape = {4, 2};
    float alphaValue = 1.1f;
    uint64_t workspaceSize = 0;
    aclOpExecutor *executor;
    auto size = GetShapeSize(shape);
	
    //Perform initialization.
    aclInit(NULL);
    //Specify the compute device.
    aclrtSetDevice(devID);

    //Prepare the input and output parameters of the aclnnAdd operator.
    CreateAclTensor(shape, &self_d, aclDataType::ACL_FLOAT, &self);
    CreateAclTensor(shape, &other_d, aclDataType::ACL_FLOAT, &other);
    alpha = aclCreateScalar(&alphaValue, aclDataType::ACL_FLOAT);
    CreateAclTensor(shape, &out_d, aclDataType::ACL_FLOAT, &out);

    //Obtain the workspace size required for operator computation and the executor that contains the operator computation process.
    aclnnAddGetWorkspaceSize(self, other, alpha, out, &workspaceSize, &executor);
    void *workspaceAddr = nullptr;
    if (workspaceSize > 0) {
        aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    }

    //Allocate page-locked memory by using aclrtMallocHost.
    aclrtMallocHost((void **)&self_h, size * sizeof(float));
    aclrtMallocHost((void **)&other_h, size * sizeof(float));
    for (int64_t i = 0; i < size; i++) {
        self_h[i] = static_cast<float>(0);
        other_h[i] = static_cast<float>(1);
    }

    aclmdlRI modelRI;
    aclrtStream stream1, stream2;
    aclrtEvent event1, event2;
    aclrtCreateStream(&stream1);
    aclrtCreateStream(&stream2);
    aclrtCreateEvent(&event1);
    aclrtCreateEvent(&event2);

    //======== Start the capture task. ========
    aclmdlRICaptureBegin(stream1, ACL_MODEL_RI_CAPTURE_MODE_GLOBAL);
    //Asynchronous copy, which is used to copy the input data of the operator's self parameter from the host to the device
    aclrtMemcpyAsync(self_d, size * sizeof(float), self_h, size * sizeof(float), ACL_MEMCPY_HOST_TO_DEVICE, stream1);
    //Switch the capture mode of the current thread to RELAXED to allow the call to aclrtMemcpy.
    //The call is expected to write the previous mode (GLOBAL) back to mode.
    aclmdlRICaptureMode mode = ACL_MODEL_RI_CAPTURE_MODE_RELAXED;
    aclmdlRICaptureThreadExchangeMode(&mode);
    //Synchronous copy, which is used to copy the input data of the operator's other parameter from the host to the device. This step is performed only once.
    aclrtMemcpy(other_d, size * sizeof(float), other_h, size * sizeof(float), ACL_MEMCPY_HOST_TO_DEVICE);
    //Switch the capture mode back to GLOBAL by passing in the mode saved by the previous call.
    aclmdlRICaptureThreadExchangeMode(&mode);
    //Add stream2 to the capture state using event1.
    aclrtRecordEvent(event1, stream1);
    aclrtStreamWaitEvent(stream2, event1);
    //Execute the aclnnAdd operator.
    aclnnAdd(workspaceAddr, workspaceSize, executor, stream2);
    //After the tasks in stream2 are executed, use event2 to return stream2 to the main stream (stream1).
    aclrtRecordEvent(event2, stream2);
    aclrtStreamWaitEvent(stream1, event2);
    //Asynchronous copy, which is used to copy the output data of the operator from the device to the host
    aclrtMemcpyAsync(self_h, size * sizeof(float), out_d, size * sizeof(float), ACL_MEMCPY_DEVICE_TO_HOST, stream1);
    //======== End the capture task. ========
    aclmdlRICaptureEnd(stream1, &modelRI);

    //Execute the model multiple times.
    for (int i = 0; i < 8; i++) {
        aclmdlRIExecuteAsync(modelRI, stream1);
        aclrtSynchronizeStream(stream1);
        //Print each output of the operator. ACL_LOG already appends a newline.
        ACL_LOG("%f %f %f %f %f %f %f %f",
            self_h[0],
            self_h[1],
            self_h[2],
            self_h[3],
            self_h[4],
            self_h[5],
            self_h[6],
            self_h[7]);
    }

    //Destroy allocations.
    aclmdlRIDestroy(modelRI);
    aclrtDestroyStream(stream1);
    aclrtDestroyStream(stream2);
    aclrtDestroyEvent(event1);
    aclrtDestroyEvent(event2);
    aclDestroyTensor(self);
    aclDestroyTensor(other);
    aclDestroyTensor(out);
    aclDestroyScalar(alpha);
    aclrtFree(self_d);
    aclrtFree(other_d);
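    aclrtFree(out_d);
    //Free the page-locked host memory allocated by aclrtMallocHost.
    aclrtFreeHost(self_h);
    aclrtFreeHost(other_h);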
    if (workspaceAddr != nullptr) {
        aclrtFree(workspaceAddr);
    }
    //Deallocate the resources of the compute device.
    aclrtResetDevice(devID);
    //Perform deinitialization.
    aclFinalize();
}

Task Update Function

After the tasks in the stream are captured and temporarily stored in the model, you can update the tasks (including the tasks themselves and task parameter information) in either of the following ways:

  1. Method 1: Use the task to be updated as a dividing point. Capture the tasks before it and the tasks after it between separate aclmdlRICaptureBegin and aclmdlRICaptureEnd calls, so that they are temporarily stored in different models and executed separately.

    This method is suitable for scenarios where a large number of tasks need to be updated (for example, a model has inputs of two shapes). The API call logic is simple, but the number of models that store captured tasks increases. If the number of models exceeds the hardware resource limit, an error is reported.

    The following figure shows the basic process of this method.

  2. Method 2: Deliver the tasks to be captured in the main stream between aclmdlRICaptureBegin and aclmdlRICaptureEnd. Within that range, use aclmdlRICaptureTaskGrpBegin and aclmdlRICaptureTaskGrpEnd to mark the tasks to be updated as a task group and obtain the handle of the task group. Then update the tasks between aclmdlRICaptureTaskUpdateBegin and aclmdlRICaptureTaskUpdateEnd.

    This method is suitable for scenarios where a small number of single-operator calling tasks need to be updated. You can update the tasks and then execute the tasks in the model instance in sequence, or update the tasks while other tasks in the model instance are executed concurrently. However, updating tasks takes longer than delivering the tasks separately. The following restrictions also apply: The number and types of tasks between aclmdlRICaptureTaskGrpBegin and aclmdlRICaptureTaskGrpEnd must be the same as those between aclmdlRICaptureTaskUpdateBegin and aclmdlRICaptureTaskUpdateEnd. In the cross-stream task capture scenario, tasks cannot be delivered to other streams in the capture state between aclmdlRICaptureTaskGrpBegin and aclmdlRICaptureTaskGrpEnd. A task group is similar to a critical resource and must not be updated concurrently from multiple threads or streams. Otherwise, the update result may be unexpected.

    • The following figure shows the process of updating tasks and then executing tasks in the aclmdlRI instance in sequence.

    • The following figure shows the process of updating tasks and executing other tasks concurrently.

      If the model running instance contains a large number of tasks, you can use an external event to update tasks while other tasks are executed concurrently, which improves performance. You also need to create a separate stream (UpdateStream) for updating tasks. An external event is an event created through aclrtCreateEventWithFlag with the flag set to ACL_EVENT_EXTERNAL. This type of event cannot be used for cross-stream task capture, and its specifications are limited, so reuse such events where possible. After creating the external event, deliver the update tasks in UpdateStream and then call aclrtRecordEvent to deliver an Event Record task in UpdateStream. In the main stream, call aclrtStreamWaitEvent to deliver an Event Wait task that waits for the task update in UpdateStream to complete, and then call aclrtResetEvent to reset the external event.
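
The idea behind Method 2 can be sketched with a host-only model: a captured task list, a task group that marks a range of tasks, and an update that swaps in replacements of the same count. This is only an illustration under stated assumptions. MockModel, TaskGroup, Buffers, BuildModel, and MakeSecondAdd are hypothetical names, not acl APIs, and the type check and the concurrency with an external event are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Host-only model of Method 2; all names are hypothetical, not acl APIs.
using Task = std::function<void()>;

struct TaskGroup {
    std::size_t first;  // index of the first task in the group
    std::size_t count;  // number of tasks in the group
};

class MockModel {
public:
    void Add(Task task) { tasks_.push_back(std::move(task)); }
    std::size_t Size() const { return tasks_.size(); }

    // Replaces the tasks of a group; the count must match the original group.
    bool Update(const TaskGroup &group, std::vector<Task> replacement) {
        if (replacement.size() != group.count ||
            group.first + group.count > tasks_.size()) {
            return false;
        }
        for (std::size_t i = 0; i < group.count; ++i) {
            tasks_[group.first + i] = std::move(replacement[i]);
        }
        return true;
    }
    // Runs all stored tasks in delivery order.
    void Execute() const {
        for (const Task &task : tasks_) {
            task();
        }
    }

private:
    std::vector<Task> tasks_;
};

// Scalar stand-ins for the tensors used by the sample in this section.
struct Buffers {
    float self = 0.0f;
    float other = 1.0f;
    float outtmp = 0.0f;
    float out = 0.0f;
};

// Second add: out = outtmp + alpha * other (the task that gets updated).
Task MakeSecondAdd(Buffers *b, float alpha) {
    return [b, alpha] { b->out = b->outtmp + alpha * b->other; };
}

// Stores outtmp = self + alpha * other, then the second add marked as a group.
TaskGroup BuildModel(MockModel *model, Buffers *b, float alpha) {
    model->Add([b, alpha] { b->outtmp = b->self + alpha * b->other; });
    TaskGroup group{model->Size(), 1};  // the group starts at the next task
    model->Add(MakeSecondAdd(b, alpha));
    return group;
}

// Usage: run once, update the marked task to use a new alpha, run again.
float DemoUpdate() {
    Buffers b;
    MockModel model;
    TaskGroup group = BuildModel(&model, &b, 1.1f);
    model.Execute();  // out is about 2.2
    std::vector<Task> replacement;
    replacement.push_back(MakeSecondAdd(&b, 5.5f));
    bool ok = model.Update(group, std::move(replacement));
    assert(ok);
    model.Execute();  // out is about 6.6
    return b.out;
}
```

Rejecting a replacement whose task count differs from the group mirrors the restriction that the tasks delivered during the update must match those in the original task group.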

The following example updates tasks while other tasks are executed concurrently. It is a code snippet of the key steps for updating the input parameters of the aclnnAdd operator and is not ready to be built or run:

#include <stdio.h>
#include <vector>
#include "acl/acl.h"
#include "aclnnop/aclnn_add.h"

#define ACL_LOG(fmt, args...) fprintf(stdout, "[INFO]  " fmt "\n", ##args)

int64_t GetShapeSize(const std::vector<int64_t> &shape)
{
    int64_t shape_size = 1;
    for (auto i : shape) {
        shape_size *= i;
    }
    return shape_size;
}

int CreateAclTensor(const std::vector<int64_t> &shape, void **deviceAddr,
    aclDataType dataType, aclTensor **tensor)
{
    auto size = GetShapeSize(shape) * sizeof(float);
    //Allocate device memory.
    auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
    if (ret != ACL_SUCCESS) {
        return ret;
    }
    //Compute the strides of the contiguous tensor.
    std::vector<int64_t> strides(shape.size(), 1);
    for (int64_t i = shape.size() - 2; i >= 0; i--) {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    //Call aclCreateTensor to create an ACL Tensor.
    *tensor = aclCreateTensor(shape.data(),
        shape.size(),
        dataType,
        strides.data(),
        0,
        aclFormat::ACL_FORMAT_ND,
        shape.data(),
        shape.size(),
        *deviceAddr);
    return 0;
}

int main()
{
    int devID = 0;
    void *self_d = nullptr;
    void *other_d = nullptr;
    void *out_d = nullptr;
    void *outtmp_d = nullptr;
    aclTensor *self = nullptr;
    aclTensor *other = nullptr;
    aclScalar *alpha = nullptr;
    aclScalar *updatealpha = nullptr;
    aclTensor *out = nullptr;
    aclTensor *outtmp = nullptr;
    /* aclnnAdd: out = self + other * alpha */
    float *self_h = nullptr;
    float *other_h = nullptr;
    std::vector<int64_t> shape = {4, 2};
    float *out_h = nullptr;
    float alphaValue = 1.1f;
    float updatealphaValue = 5.5f;
    uint64_t workspaceSize = 0;
    uint64_t workspaceSize1 = 0;
    uint64_t workspaceSize2 = 0;
    aclOpExecutor *executor2;
    aclOpExecutor *executor;
    aclOpExecutor *executor1;
    auto size = GetShapeSize(shape);

    //Perform initialization.
    aclInit(NULL);
    //Specify the compute device.
    aclrtSetDevice(devID);

    //Prepare the input and output parameters of the aclnnAdd operator.
    CreateAclTensor(shape, &self_d, aclDataType::ACL_FLOAT, &self);
    CreateAclTensor(shape, &other_d, aclDataType::ACL_FLOAT, &other);
    alpha = aclCreateScalar(&alphaValue, aclDataType::ACL_FLOAT);
    updatealpha = aclCreateScalar(&updatealphaValue, aclDataType::ACL_FLOAT);
    CreateAclTensor(shape, &out_d, aclDataType::ACL_FLOAT, &out);
    CreateAclTensor(shape, &outtmp_d, aclDataType::ACL_FLOAT, &outtmp);

    //Call the first-phase API of the aclnnAdd operator to obtain the workspace size required for operator computation and the executor that contains the operator computation process.
    //If aclnnAdd is called multiple times, the first-phase API needs to be called multiple times to obtain different aclOpExecutors.
    // outtmp = self + alpha * other
    //Before the update: out = outtmp + alpha * other  After the update: out = outtmp + updatealpha * other
    aclnnAddGetWorkspaceSize(self, other, alpha, outtmp, &workspaceSize, &executor);
    void *workspaceAddr = nullptr;
    if (workspaceSize > 0) {
        aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    }
    //Before the update: out = outtmp + alpha * other
    aclnnAddGetWorkspaceSize(outtmp, other, alpha, out, &workspaceSize1, &executor1);
    void *workspaceAddr1 = nullptr;
    if (workspaceSize1 > 0) {
        aclrtMalloc(&workspaceAddr1, workspaceSize1, ACL_MEM_MALLOC_HUGE_FIRST);
    }
    //After the update: out = outtmp + updatealpha * other
    aclnnAddGetWorkspaceSize(outtmp, other, updatealpha, out, &workspaceSize2, &executor2);
    void *workspaceAddr2 = nullptr;
    if (workspaceSize2 > 0) {
        aclrtMalloc(&workspaceAddr2, workspaceSize2, ACL_MEM_MALLOC_HUGE_FIRST);
    }
	
    //Allocate page-locked memory by using aclrtMallocHost.
    aclrtMallocHost((void **)&self_h, size * sizeof(float));
    aclrtMallocHost((void **)&other_h, size * sizeof(float));
    aclrtMallocHost((void **)&out_h, size * sizeof(float));
    for (int64_t i = 0; i < size; i++) {
        self_h[i] = static_cast<float>(0);
        other_h[i] = static_cast<float>(1);
        out_h[i] = static_cast<float>(0);
    }

    aclmdlRI modelRI;
    aclrtStream stream1;
    aclrtCreateStream(&stream1);
    aclrtEvent event;

    //Create an external event.
    aclrtCreateEventWithFlag(&event, ACL_EVENT_EXTERNAL);

    //======== Start the capture task. ========
    aclmdlRICaptureBegin(stream1, ACL_MODEL_RI_CAPTURE_MODE_GLOBAL);
    //Asynchronous copy, which is used to copy the input data of the self parameter of the aclnnAdd operator from the host to the device
    aclrtMemcpyAsync(self_d, size * sizeof(float), self_h, size * sizeof(float), ACL_MEMCPY_HOST_TO_DEVICE, stream1);
    //Asynchronous copy, which is used to copy the input data of the other parameter of the aclnnAdd operator from the host to the device
    aclrtMemcpyAsync(other_d, size * sizeof(float), other_h, size * sizeof(float), ACL_MEMCPY_HOST_TO_DEVICE, stream1);
    //Execute the aclnnAdd operator.
    aclnnAdd(workspaceAddr, workspaceSize, executor, stream1);
    //Deliver an Event Wait task in the main stream (stream1) to wait for the update task to complete.
    aclrtStreamWaitEvent(stream1, event);
    aclrtResetEvent(event, stream1);
    aclrtTaskGrp handle;
    //Mark the task to be updated.
    aclmdlRICaptureTaskGrpBegin(stream1);
    aclnnAdd(workspaceAddr1, workspaceSize1, executor1, stream1);
    aclmdlRICaptureTaskGrpEnd(stream1, &handle);
    //Asynchronous copy, which is used to copy the output data of the operator from the device to the host
    aclrtMemcpyAsync(out_h, size * sizeof(float), out_d, size * sizeof(float), ACL_MEMCPY_DEVICE_TO_HOST, stream1);
    //======== End the capture task. ========
    aclmdlRICaptureEnd(stream1, &modelRI);

    aclrtStream updateStream;
    aclrtCreateStream(&updateStream);

    for (int i = 0; i < 2; i++) {
        ACL_LOG("execute model, loop: %d", i);
        aclmdlRIExecuteAsync(modelRI, stream1);
        //Start the update task and update the alpha parameter of the aclnnAdd operator to updatealpha.
        aclmdlRICaptureTaskUpdateBegin(updateStream, handle);
        if (i == 1) {
            aclnnAdd(workspaceAddr2, workspaceSize2, executor2, updateStream);
            ACL_LOG("update alpha value of aclnnAdd");
        }
        aclmdlRICaptureTaskUpdateEnd(updateStream);
        //After the update task is complete, deliver an Event Record task in UpdateStream to notify the main stream (stream1) to continue with the tasks following the Event Wait task.
        aclrtRecordEvent(event, updateStream);
        aclrtSynchronizeStream(updateStream);
        aclrtSynchronizeStream(stream1);
        //Print each output of the operator. ACL_LOG already appends a newline.
        ACL_LOG("%f %f %f %f %f %f %f %f",
            out_h[0],
            out_h[1],
            out_h[2],
            out_h[3],
            out_h[4],
            out_h[5],
            out_h[6],
            out_h[7]);
    }

    //Destroy allocations.
    aclmdlRIDestroy(modelRI);
    aclDestroyTensor(self);
    aclDestroyTensor(other);
    aclDestroyTensor(out);
    aclDestroyTensor(outtmp);
    aclDestroyScalar(alpha);
    aclDestroyScalar(updatealpha);
    aclrtFree(self_d);
    aclrtFree(other_d);
    aclrtFree(out_d);
    aclrtFree(outtmp_d);
    //Free the page-locked host memory allocated by aclrtMallocHost.
    aclrtFreeHost(self_h);
    aclrtFreeHost(other_h);
    aclrtFreeHost(out_h);
    aclrtDestroyStream(stream1);
    aclrtDestroyStream(updateStream);
    aclrtDestroyEvent(event);
    if (workspaceAddr != nullptr) {
        aclrtFree(workspaceAddr);
    }
    if (workspaceAddr1 != nullptr) {
        aclrtFree(workspaceAddr1);
    }
    if (workspaceAddr2 != nullptr) {
        aclrtFree(workspaceAddr2);
    }
    //Deallocate the resources of the compute device.
    aclrtResetDevice(devID);
    //Perform deinitialization.
    aclFinalize();
}