Single-Operator with Dynamic Shape (Operator Selector Registered)
Prerequisites
Before loading and executing a dynamic-shape operator, develop the custom operator and generate the corresponding binary file by referring to TIK Custom Operator with Dynamic Shape in the TBE&AI CPU Operator Developer Guide.
Principles
The procedure of loading and executing a dynamic-shape operator is as follows:
- Initialize resources, including initializing the pyACL, setting the loading directory of the single-operator model file, and specifying the compute device.
- Call acl.init to initialize pyACL.
- Call the pyACL APIs to register the custom operator to be built.
- Call acl.op.register_compile_func to register the operator selector (that is, the function that selects the tiling policy). Different tiling policies are applied for different shapes when the operator is executed.
The operator selector needs to be defined and implemented in advance.
- Prototype
```python
def op_selector(in_num, in_desc, out_num, out_desc, op_attr, op_kernel_desc):
    """
    Operator selector: the function and parameter names can be customized;
    the number and type of parameters must match.
    :param in_num: number of input tensor descriptions
    :param in_desc: list of input tensor descriptions
    :param out_num: number of output tensor descriptions
    :param out_desc: list of output tensor descriptions
    :param op_attr: address object of the operator attributes, used to set operator attributes
    :param op_kernel_desc: address object of the operator kernel description, used to set
        the workspace parameters of the operator in dynamic-shape scenarios
    :return:
    """
```
- Function implementation
Write code logic to select a tiling policy and generate the tiling parameters, then call acl.op.set_kernel_args to set the tiling arguments and the number of blocks for concurrent execution.
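The selection logic itself is ordinary Python: inspect the input shapes, pick a kernel variant and block count, pack the tiling parameters into a byte buffer, and hand them to acl.op.set_kernel_args. The hardware-free sketch below illustrates this; choose_tiling, the kernel names, and the tiling-argument layout are all hypothetical, and the acl.op.set_kernel_args call is shown only as a comment because its exact arguments depend on your kernel's ABI.

```python
import struct

def choose_tiling(shape):
    """Hypothetical tiling policy: pick a kernel variant and block_dim by shape.

    Returns (kernel_name, block_dim, packed tiling arguments).
    """
    elems = 1
    for d in shape:
        elems *= d
    if elems <= 1024:          # small shapes: one block, simple kernel
        kernel, block_dim = "add_small__kernel0", 1
    else:                      # large shapes: split the work across 8 blocks
        kernel, block_dim = "add_large__kernel0", 8
    # Pack the tiling parameters (total elements, elements per block)
    # as two little-endian uint32 values, matching a hypothetical kernel ABI.
    args = struct.pack("<II", elems, (elems + block_dim - 1) // block_dim)
    return kernel, block_dim, args

# Inside a real op_selector you would then register the choice, for example:
# kernel, block_dim, args = choose_tiling(shape)
# ret = acl.op.set_kernel_args(op_kernel_desc, args, len(args), block_dim)
```

The policy boundary (1024 elements) and the two-field argument layout are placeholders; a real selector reads the shapes from in_desc and must pack exactly what the compiled kernel expects.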
- Call acl.op.create_kernel to register the operator's kernel implementation with the system so that it can be located when the operator is executed.
- Call acl.rt.set_device to specify the compute device.
- Call acl.rt.create_stream to create a stream explicitly.
The default stream is used if no stream is created explicitly. The default stream is implicitly created by the acl.rt.set_device call. To pass the default stream to an API, pass 0 directly.
- Construct the operator description (such as the input and output tensor description and operator attributes) and allocate memory for storing the input and output data of the operator.
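On the host side, preparing an input buffer is plain NumPy: the byte length of the buffer passed to acl.util.bytes_to_ptr must match the size reported by the corresponding tensor description (for a float16 tensor of shape [2, 1], 2 elements × 2 bytes). A minimal, hardware-free sanity check:

```python
import numpy as np

# Build a float16 host tensor of shape [2, 1], as in the sample below.
a = np.random.rand(2, 1).astype(np.float16)

# The raw bytes handed to acl.util.bytes_to_ptr must be exactly
# elements * itemsize long; a mismatch with the size reported by
# acl.get_tensor_desc_size would corrupt the later memcpy.
bytes_data = a.tobytes()
expected_size = a.size * a.itemsize  # 2 elements * 2 bytes = 4
assert len(bytes_data) == expected_size
```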
- Copy the operator input data from the host to the device.
- Compile the single operator.
Call acl.op.update_params to compile the operator, which triggers the calling logic of the operator selector.
- Execute the single operator.
- Copy the output data of the operator from the device to the host (memory on the host needs to be allocated in advance).
- Destroy streams, contexts and devices in sequence.
- Call acl.finalize to deinitialize pyACL.
Sample Code
When following the API calls in the sample, add exception handling branches and log messages at the error and info levels as appropriate. The following code snippet covers the key steps only and is not ready to be built or run.
For details about how to obtain, compile, and run the sample of a dynamic-shape operator (registering an operator selector), see Sample Usage in TBE&AI CPU Operator Developer Guide.
```python
import acl
import numpy as np

# ......
ACL_ENGINE_AICORE = 1
ACL_FLOAT16 = 1
ACL_FORMAT_ND = 2
ACL_MEM_MALLOC_HUGE_FIRST = 2
ACL_MEMCPY_HOST_TO_DEVICE = 1
ACL_MEMCPY_DEVICE_TO_HOST = 2
device_id = 0

# 1. Initialize resources.
ret = acl.init()
ret = acl.rt.set_device(device_id)
stream, ret = acl.rt.create_stream()
ret = acl.op.register_compile_func("add", op_select)
# Compile the *.o file of the operator kernel in advance, load the .o file
# with NumPy, and convert it into an address object. op_data_size_0 indicates
# the memory size occupied by the first .o file.
# If there are .o files of multiple operator kernels, call this API once for each file.
ret = acl.op.create_kernel("add",
                           "cce_add_11_33_float16_11_33_float16__kernel0",
                           "cce_add_11_33_float16_11_33_float16__kernel0",
                           np_op_0_ptr, op_data_size_0, ACL_ENGINE_AICORE, 0)

# 2. Construct the input and output tensors and tensor descriptions of the add
#    operator, and allocate memory for storing the operator's input and output data.
a = np.random.rand(2, 1).astype(np.float16)
b = np.random.rand(2, 1).astype(np.float16)
bytes_data = a.tobytes()
a_ptr = acl.util.bytes_to_ptr(bytes_data)
bytes_data = b.tobytes()
b_ptr = acl.util.bytes_to_ptr(bytes_data)
input_desc_list = [acl.create_tensor_desc(ACL_FLOAT16, [2, 1], ACL_FORMAT_ND),
                   acl.create_tensor_desc(ACL_FLOAT16, [2, 1], ACL_FORMAT_ND)]
output_desc_list = [acl.create_tensor_desc(ACL_FLOAT16, [2, 1], ACL_FORMAT_ND)]

# Allocate device memory.
size_a = acl.get_tensor_desc_size(input_desc_list[0])
size_b = acl.get_tensor_desc_size(input_desc_list[1])
size_c = acl.get_tensor_desc_size(output_desc_list[0])
dev_a, ret = acl.rt.malloc(size_a, ACL_MEM_MALLOC_HUGE_FIRST)
dev_b, ret = acl.rt.malloc(size_b, ACL_MEM_MALLOC_HUGE_FIRST)
dev_c, ret = acl.rt.malloc(size_c, ACL_MEM_MALLOC_HUGE_FIRST)

# 3. Copy the operator input data from the host to the device.
ret = acl.rt.memcpy(dev_a, size_a, a_ptr, size_a, ACL_MEMCPY_HOST_TO_DEVICE)
ret = acl.rt.memcpy(dev_b, size_b, b_ptr, size_b, ACL_MEMCPY_HOST_TO_DEVICE)

# 4. Call acl.op.update_params to compile the operator.
op_attr = acl.op.create_attr()
ret = acl.op.update_params("add", input_desc_list, output_desc_list, op_attr)

# 5. Call acl.op.execute_v2 to load and execute the operator.
in_data_list = [acl.create_data_buffer(dev_a, size_a),
                acl.create_data_buffer(dev_b, size_b)]
out_data_list = [acl.create_data_buffer(dev_c, size_c)]
ret = acl.op.execute_v2("add", input_desc_list, in_data_list,
                        output_desc_list, out_data_list, op_attr, stream)
# Execution is asynchronous: wait for the stream to finish before reading the output.
ret = acl.rt.synchronize_stream(stream)

# 6. Copy the output data of the operator from the device to the host
#    (host memory must be allocated in advance).
host_ptr, ret = acl.rt.malloc_host(size_c)
ret = acl.rt.memcpy(host_ptr, size_c, dev_c, size_c, ACL_MEMCPY_DEVICE_TO_HOST)
bytes_out = acl.util.ptr_to_bytes(host_ptr, size_c)
out_np = np.frombuffer(bytes_out, dtype=np.float16).reshape((2, 1))

# 7. Destroy resources in sequence.
# 7.1 Destroy the input and output tensor descriptions.
# 7.2 Free the host memory.
# 7.3 Free the device memory.
# 7.4 Destroy the device management resources.
ret = acl.rt.destroy_stream(stream)
ret = acl.rt.reset_device(device_id)
ret = acl.finalize()
# ......
```
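Once the output bytes are back on the host, reinterpreting them is pure NumPy. The hardware-free round trip below stands in for the device computation (the add kernel producing a + b) to show that np.frombuffer with dtype=np.float16 recovers the result from the raw byte buffer:

```python
import numpy as np

a = np.random.rand(2, 1).astype(np.float16)
b = np.random.rand(2, 1).astype(np.float16)

# Stand-in for the device computation: the add kernel produces a + b,
# and the device-to-host memcpy delivers its raw bytes.
bytes_out = (a + b).tobytes()

# Reinterpret the raw byte buffer as float16 and restore the shape,
# mirroring the end of the sample above.
out_np = np.frombuffer(bytes_out, dtype=np.float16).reshape((2, 1))
assert np.allclose(out_np, a + b)
```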