Terminology

Table 1 Terminology

Term

Description

Synchronous/Asynchronous

The terms synchronous and asynchronous in this document are defined from the perspective of the caller (host) and the executor (device).

  • If the host does not wait for the device to complete execution after calling an API, the call is asynchronous.
  • If the host waits for the device to complete execution after calling an API, the call is synchronous.
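
The distinction above can be sketched in plain Python, with a worker thread pool standing in for the device (this `concurrent.futures` analogy is illustrative only, not a pyACL API):

```python
# Pure-Python analogy: the "device" is a worker pool, the "host" is the caller.
# Submitting work is asynchronous; waiting for the result is synchronous.
from concurrent.futures import ThreadPoolExecutor
import time

def device_kernel(x):
    time.sleep(0.05)          # stands in for computation on the device
    return x * x

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.monotonic()
    fut = pool.submit(device_kernel, 7)   # asynchronous: returns immediately
    submit_cost = time.monotonic() - t0

    result = fut.result()                 # synchronous: blocks until complete
    total_cost = time.monotonic() - t0

print(result)
```

Submission returns before the kernel finishes, while `fut.result()` does not return until the device-side work is done.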

Process/Thread

Unless otherwise specified, the processes and threads mentioned in this document refer to the processes and threads in user applications.

Host

A host refers to an x86 or Arm server connected to the device. The host utilizes the NN computing capability provided by the device to implement services.

Device

A device is a hardware device with an Ascend AI Processor installed. It connects to the host over a PCIe interface and provides the host with NN computing capability. Memory sharing between devices is not supported.

Context

As a container, the context manages the lifetime of its objects (including streams, events, and device memory). Streams and events in different contexts are isolated and cannot synchronize with each other.

There are two types of contexts:

  • Default context: a default context created implicitly upon the acl.rt.set_device call that specifies a device for computation. One device corresponds to one default context. A default context cannot be released by calling acl.rt.destroy_context.
  • Explicitly created context: a context created explicitly upon the acl.rt.create_context call in a process or thread.

Stream

Streams maintain the execution sequence of asynchronous operations, ensuring that tasks in the same stream are executed on the device in the order in which they are issued in the application code.

Stream-based kernel execution and data transfer can overlap computing on the host, data movement between the host and the device, and computing on the device.

Streams come in two types:

  • Default stream: When the acl.rt.set_device API is called to specify a device for computation, or the acl.rt.create_context API is called to create a context, the system implicitly creates a default stream. Each device corresponds to a default stream. The default stream cannot be released by calling the acl.rt.destroy_stream API.
  • Explicitly created stream (recommended): a stream created explicitly upon the acl.rt.create_stream call in a process or thread.

Event

Events are used to synchronize tasks between streams by using pyACL API calls, including tasks between the host and the device, and tasks on the same device.

For example, if a task in stream2 needs to be executed after a task in stream1 completes, you can create an event, record it in stream1, and make stream2 wait on it.
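
As an analogy, this cross-stream dependency can be sketched in plain Python, with `threading.Event` standing in for an ACL event and two threads standing in for streams (illustrative only, not pyACL calls):

```python
# Pure-Python analogy for cross-stream synchronization: stream2's task must
# not start until stream1's task completes, enforced by an event.
import threading

log = []
event = threading.Event()

def stream1():
    log.append("stream1: task done")
    event.set()            # like recording the event in stream1

def stream2():
    event.wait()           # like making stream2 wait for the event
    log.append("stream2: task done")

t2 = threading.Thread(target=stream2)
t1 = threading.Thread(target=stream1)
t2.start()
t1.start()
t1.join()
t2.join()
print(log)   # stream1's entry always precedes stream2's
```

However the OS interleaves the two threads, stream2's task is guaranteed to run after stream1's, which is exactly the ordering an event establishes between streams.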

AIPP

The Artificial Intelligence Pre-Processing (AIPP) module is introduced for image preprocessing including Color Space Conversion (CSC), image normalization (by subtracting the mean value or multiplying a factor), image cropping (by specifying the crop start and cropping the image to the size required by the NN), and much more.

Static AIPP and dynamic AIPP modes are supported. However, the two modes are mutually exclusive.
  • Static AIPP: If you use this mode and specify the AIPP parameters when converting a model, the AIPP attribute values are saved in the offline model (.om file) after the model is generated. Fixed AIPP configurations are used in each model inference.

    In static AIPP mode, batches share the same set of AIPP parameters.

  • Dynamic AIPP: During model conversion, specify the AIPP mode to dynamic, and set different sets of dynamic AIPP parameters as required. In this way, different sets of parameters can be used for model inference.

    In dynamic AIPP mode, batches can use different sets of AIPP parameters.

Dynamic batch/image size

In some scenarios, the batch size or resolution of the model input is not fixed. For example, if the target recognition network is executed after the target is detected, the batch size of the target recognition network input is not fixed because the number of targets is not fixed.

  • Dynamic batch: The batch size is not determinable until inference time.
  • Dynamic image size: The image size (H x W) of each image is dynamically variable during inference.

Dynamic dimensions (ND format only)

To support the scenarios where the input dimension sizes are uncertain, such as the Transformer network, the dynamic dimensions feature for the ND format is supported.

ND indicates any format with up to four dimensions.

Channel

In RGB color mode, a complete image consists of three channels: red, green, and blue. HSV (hue, saturation, and value, i.e., brightness) is an alternative representation of the RGB color model.

Standard form

The device functions as an endpoint (EP) and works with the host (an x86 or Arm server) over the PCIe interface. In this case, CPU resources on the device can be accessed only by the host, and the related inference applications run on the host. The device provides only the NN computing capability for the server.

EP mode

If the PCIe interface of the Ascend AI Processor works as an endpoint (slave), the mode is called EP mode.

RC mode

If the PCIe interface of the Ascend AI Processor works as a root complex (master), the mode is called RC mode.

Devices, Contexts, and Streams

Figure 1 Devices, Contexts, and Streams
  • Device specifies the compute device.
    • Its lifetime starts with the first acl.rt.set_device call.
    • With each acl.rt.set_device call, the reference count is increased by 1. With each acl.rt.reset_device call, the reference count is decreased by 1.
    • When the reference count reaches 0, resources on the device in the process are no longer available.
  • A context belongs to a unique device.
    • Contexts can be created implicitly or explicitly.
    • The lifetime of an implicitly created context (default context) starts with the acl.rt.set_device call and ends with the acl.rt.reset_device call when the reference count is 0.

      An implicitly created context is created only once. Calling acl.rt.set_device repeatedly to specify the same device only increments the reference count of the implicitly created context.

    • The lifetime of an explicitly created context starts with the acl.rt.create_context call and ends with the acl.rt.destroy_context call.
    • If multiple contexts are created in a process, the current thread can use only one context at a time. It is recommended that acl.rt.set_context be used to specify the context of the current thread to improve program maintainability. For details, see acl.rt.set_context.
    • Contexts in a process are shared and can be switched by using the acl.rt.set_context call.
  • A stream is the execution flow on a device. The sequence of tasks in the same stream is strictly preserved.
    • Streams can be created implicitly or explicitly.
    • Each context has a default stream (implicitly created), the lifetime of which is the same as that of the corresponding context.
    • The lifetime of an explicitly created stream starts with the acl.rt.create_stream call and ends with the acl.rt.destroy_stream call. Once the context to which the explicitly created stream belongs is destroyed or the lifetime ends, the stream is no longer available.
  • A task/kernel is the real task executor on a device.
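
The device reference-count behavior described above can be sketched in plain Python (`DeviceRefCount` and its methods are illustrative stand-ins, not the real acl.rt calls):

```python
# Minimal sketch of the reference-count semantics of set_device/reset_device:
# each set_device increments the device's count, each reset_device decrements
# it, and the device's resources are released only when the count reaches 0.
class DeviceRefCount:
    def __init__(self):
        self.counts = {}

    def set_device(self, device_id):
        # each set_device call increments the device's reference count
        self.counts[device_id] = self.counts.get(device_id, 0) + 1

    def reset_device(self, device_id):
        # each reset_device call decrements it; at 0 the resources on the
        # device in this process become unavailable
        self.counts[device_id] -= 1
        if self.counts[device_id] == 0:
            del self.counts[device_id]
            return "released"
        return "still in use"

rt = DeviceRefCount()
rt.set_device(0)
rt.set_device(0)              # same device again: count becomes 2, no new context
first = rt.reset_device(0)    # count drops to 1: device still usable
second = rt.reset_device(0)   # count drops to 0: resources released
print(first, second)
```

This mirrors why a second acl.rt.set_device call on the same device does not create a second default context: it only raises the count that a matching acl.rt.reset_device call later lowers.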

Threads, Contexts, and Streams

  • A context must be bound to a user thread. The usage and scheduling of all device resources must be based on the context.
  • Only one context that is associated with the device is used in a thread at a time.
  • acl.rt.set_context can be called to quickly switch between devices. The following sample code is for reference only; do not copy and run it directly.
    ctx1, ret = acl.rt.create_context(0)        # Create a context on device 0 by passing the device ID to acl.rt.create_context.
    stream1, ret = acl.rt.create_stream()
    ret = acl.op.execute_v2(op_type, input_desc, inputs, output_desc, outputs, attr, stream1)

    ctx2, ret = acl.rt.create_context(1)
    # After ctx2 is created, it becomes the current context of this thread, and
    # subsequent tasks run on device 1. In this sample, op2 is executed on device 1.
    stream2, ret = acl.rt.create_stream()
    ret = acl.op.execute_v2(op_type2, input_desc, inputs, output_desc, outputs, attr, stream2)

    ret = acl.rt.set_context(ctx1)
    # Switch devices by switching the current thread's context so that
    # subsequent tasks run on device 0.
    ret = acl.op.execute_v2(op3, ..., stream1)

  • Multiple streams can be created in a thread, and tasks in different streams can run in parallel. In multi-thread scenarios, you can also create one stream per thread; each stream is independent on the device, and tasks in each stream are executed in the order in which they are issued.
  • Multi-thread scheduling depends on the OS scheduling of the running app. Multi-stream scheduling is performed by the scheduling component on the device.

Context Migration Between Threads in a Process

  • Multiple contexts can be created in a process, but only one context is used in a thread at a time.
  • If multiple contexts are created in a thread, the last created context is used by default.
  • If multiple contexts are created in a process, call acl.rt.set_context to set the context to be used.
    Figure 2 API call sequence
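
The rules above can be sketched with a thread-local slot standing in for the runtime's current-context bookkeeping (`create_context` and `set_context` here are pure-Python stand-ins, not the real acl.rt APIs):

```python
# Sketch of "one context per thread at a time": a newly created context
# becomes the thread's current context, and set_context switches it back.
import threading

_current = threading.local()   # one "current context" slot per thread

def create_context(device_id):
    ctx = {"device": device_id}
    _current.ctx = ctx         # a newly created context becomes current
    return ctx

def set_context(ctx):
    _current.ctx = ctx         # explicit switch within the calling thread

ctx1 = create_context(0)
ctx2 = create_context(1)       # the last created context is used by default
last_created_is_current = _current.ctx is ctx2
set_context(ctx1)              # switch back; later work targets device 0
switched_back = _current.ctx is ctx1
print(last_created_is_current, switched_back)
```

Because the slot is thread-local, a switch in one thread never affects which context another thread is using, matching the per-thread rule above.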

Application Scenarios of Default Contexts and Default Streams

  • Before operations are delivered to the device, a context and a stream must exist, which can be created implicitly or explicitly. The implicitly created context and stream are the default context and stream.

    To pass the default stream to any API call, pass 0 directly.

  • If an implicitly created context is used, acl.rt.get_context, acl.rt.set_context, or acl.rt.destroy_context is not available.
  • An implicitly created context or stream is applicable to simple apps where only one compute device is needed. For multi-thread apps, you are advised to use the explicitly created contexts and streams.

The sample code is as follows, which is for reference only. Do not directly copy and run the code.

# ...
ret = acl.init(config_path)
ret = acl.rt.set_device(device_id)

# In the default context, a default stream is created, and the stream is available in the current thread.
# ...
ret = acl.op.execute_v2(op1, input_desc, inputs, output_desc, outputs, attr, 0)  # 0 indicates that op1 is executed in the default stream.
ret = acl.op.execute_v2(op2, input_desc, inputs, output_desc, outputs, attr, 0)  # 0 indicates that op2 is executed in the default stream.
ret = acl.rt.synchronize_stream(0)

# Output the result as required when all computing tasks (op1 and op2 execution tasks) are complete.
# ...
ret = acl.rt.reset_device(device_id)  # Reset the device. The lifetimes of the corresponding default context and default stream end.

Multi-Thread and Multi-Stream Performance

  • Thread scheduling depends on the OS. The device scheduling unit schedules tasks in streams. When tasks in different streams in a process contend for resources on the device, the performance may be lower than that of a single stream scenario.
  • Currently, Ascend AI Processor has different execution components, such as AI Core, AI CPU, and Vector Core. You are advised to create multiple streams based on the operator execution engine.
  • Performance depends on the app's logic implementation. Generally, a single-thread multi-stream scenario performs slightly better than a multi-thread multi-stream scenario, because less thread scheduling is involved at the app layer.