Function: execute

Applicability

Product

Supported (√/x)

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas training products

Atlas inference products

Atlas 200I/500 A2 inference products

Function Usage

Executes model inference synchronously. The call returns only after inference completes and the result is available.

Prototype

  • C Prototype
    aclError aclmdlExecute(uint32_t modelId, const aclmdlDataset *input, aclmdlDataset *output)
    
  • Python Function
    ret = acl.mdl.execute(model_id, input, output)
    

Parameter Description

Parameter

Description

model_id

Int, ID of the model to be executed for inference

You can obtain the model ID after the model is successfully loaded by calling the following APIs:

input

Int, pointer address of the input data for model inference. For details, see aclmdlDataset.

output

Int, pointer address of the output data of model inference. For details, see aclmdlDataset.

When you call acl.create_data_buffer to create the data buffer that stores the output at a given index, you can set both data and size to 0 to create an empty data buffer. During model execution, the system then calculates and allocates the memory for the output at that index.

This method saves memory. However, you must free the memory and reset the aclDataBuffer after you have finished using the output data. In addition, allocating memory during execution involves a memory copy, which may degrade performance.
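As a hedged illustration of the empty-buffer approach, the sketch below builds an output dataset in which every buffer is created with data=0 and size=0. It assumes the pyACL calls acl.mdl.create_dataset, acl.create_data_buffer, acl.mdl.add_dataset_buffer (returning a (dataset, ret) pair) and acl.destroy_data_buffer. The helper name create_empty_output_dataset is hypothetical, the acl module is passed in as a parameter only so the sketch can be exercised with a stub, and num_outputs would normally come from acl.mdl.get_num_outputs(model_desc).

```python
def create_empty_output_dataset(acl, num_outputs):
    """Hypothetical helper: build an output dataset whose buffers are all empty.

    `acl` is the pyACL module (injected so the sketch can run against a stub).
    """
    dataset = acl.mdl.create_dataset()
    for index in range(num_outputs):
        # data=0 and size=0: the runtime allocates this output's memory during execute.
        data_buffer = acl.create_data_buffer(0, 0)
        _, ret = acl.mdl.add_dataset_buffer(dataset, data_buffer)
        if ret != 0:
            # The buffer was not attached to the dataset, so release it here.
            acl.destroy_data_buffer(data_buffer)
            raise RuntimeError(
                f"acl.mdl.add_dataset_buffer failed for output {index}, error code {ret}")
    return dataset
```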

The sample code for freeing the memory and resetting the aclDataBuffer is as follows:

data_buffer = acl.mdl.get_dataset_buffer(output, 0)  # Obtain the data buffer at the given index (0 here).
data_addr = acl.get_data_buffer_addr(data_buffer)    # Obtain the device address of the data.
ret = acl.rt.free(data_addr)                         # Free the device memory.
ret = acl.update_data_buffer(data_buffer, 0, 0)      # Reset the data buffer for the next inference.
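The same steps can be applied to every output after each inference. The sketch below is a hedged generalization that assumes every buffer in the dataset was created empty (buffers you allocated yourself must not be freed this way) and that acl.mdl.get_dataset_num_buffers returns the buffer count. The helper name release_dynamic_outputs is hypothetical, and acl is passed in only so the sketch can run against a stub.

```python
def release_dynamic_outputs(acl, output_dataset):
    """Hypothetical helper: free runtime-allocated output memory and reset buffers.

    Assumes every buffer in `output_dataset` was created with data=0, size=0.
    """
    num_buffers = acl.mdl.get_dataset_num_buffers(output_dataset)
    for index in range(num_buffers):
        data_buffer = acl.mdl.get_dataset_buffer(output_dataset, index)
        data_addr = acl.get_data_buffer_addr(data_buffer)
        if data_addr != 0:  # Skip buffers that hold no allocated memory.
            ret = acl.rt.free(data_addr)
            if ret != 0:
                raise RuntimeError(f"acl.rt.free failed for output {index}, error code {ret}")
        # Reset to (0, 0) so the next execute allocates fresh memory.
        ret = acl.update_data_buffer(data_buffer, 0, 0)
        if ret != 0:
            raise RuntimeError(
                f"acl.update_data_buffer failed for output {index}, error code {ret}")
```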

Return Value Description

Return Value

Description

ret

Int, error code. 0 indicates success; any other value indicates failure.
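Because the return value is a plain integer error code, callers typically check it after every call. The following is a minimal sketch that turns a non-zero code into an exception. The name execute_checked is hypothetical, and the acl module is passed in so the sketch can run without an Ascend device.

```python
def execute_checked(acl, model_id, input_dataset, output_dataset):
    """Hypothetical wrapper: run synchronous inference and raise on a non-zero code."""
    ret = acl.mdl.execute(model_id, input_dataset, output_dataset)
    if ret != 0:
        raise RuntimeError(f"acl.mdl.execute failed for model {model_id}, error code {ret}")
```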

Restrictions

  • If the same model_id is shared by multiple threads due to service requirements, locks must be added between user threads to ensure that operations of refreshing the input and output memory and executing inference are performed continuously. For example:
    // API call sequence of thread A:
    lock(handle1) -> acl.rt.memcpy (refresh the input and output memory) -> acl.mdl.execute (execute inference) -> unlock(handle1)
    
    // API call sequence of thread B:
    lock(handle1) -> acl.rt.memcpy (refresh the input and output memory) -> acl.mdl.execute (execute inference) -> unlock(handle1)
  • The operations of loading, executing, and unloading a model must be performed in the same context. For details about how to create a context, see acl.rt.create_context.
  • You can call acl.rt.malloc, acl.rt.malloc_host, acl.rt.malloc_cached, acl.media.dvpp_malloc, or acl.himpi.dvpp_malloc to allocate the memory for storing the model input and output data.

    Note that:

    • For details about the usage scenarios and restrictions of the memory allocation APIs, see the related description.
    • If the application runs on the device, acl.rt.malloc_host allocates device memory.
    • acl.media.dvpp_malloc and acl.himpi.dvpp_malloc are dedicated memory allocation APIs for media data processing. Allocating memory with them allows the output of media data processing to be used directly as the input of model inference, which avoids a copy and enables memory reuse.
    • The hardware imposes memory alignment and padding requirements. If you use one of these APIs to allocate a large memory block and then divide and manage it yourself, each sub-block must still meet the alignment and padding restrictions of the API used. For details, see Secondary Memory Allocation.
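The locking rule for a shared model_id can be sketched with Python's threading.Lock, which plays the role of handle1 in the call sequences above. This is a hedged illustration, not an official pattern: the class SharedModelRunner and its refresh_io and consume_output callables are hypothetical, and refresh_io stands for whatever copies new data into the input and output memory (for example via acl.rt.memcpy). Each worker thread is assumed to have already bound the model's context (for example with acl.rt.set_context) before calling run.

```python
import threading


class SharedModelRunner:
    """Hypothetical wrapper that serializes refresh + execute for one shared model_id."""

    def __init__(self, acl, model_id, input_dataset, output_dataset):
        self._acl = acl
        self._model_id = model_id
        self._input = input_dataset
        self._output = output_dataset
        self._lock = threading.Lock()  # Corresponds to handle1 in the call sequences.

    def run(self, refresh_io, consume_output=None):
        """Refresh I/O memory, execute, and optionally read results under one lock."""
        with self._lock:
            refresh_io()  # Caller-supplied copy into the input/output memory.
            ret = self._acl.mdl.execute(self._model_id, self._input, self._output)
            if ret != 0:
                raise RuntimeError(f"acl.mdl.execute failed, error code {ret}")
            # Reading results inside the lock keeps another thread from overwriting them.
            return consume_output() if consume_output is not None else None
```

Reading the outputs inside the same critical section is a design choice of this sketch: the restriction above only requires refresh and execute to be continuous, but without it a second thread's inference could overwrite the shared output memory before the first thread reads it.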

Reference

For details about the API call sequence and sample code, see Executing a Model.