Overview

Table 1 Terminology

Term

Description

Host

A host refers to the x86 or Arm server connected to the device. The host utilizes the NN compute capability provided by the device to implement services.

Device

The device refers to an Ascend AI Processor-powered hardware device that connects to the host over the PCIe interface, providing the host with the NN computing capability. Memory sharing between devices is not supported.

Context

As a container, the context manages the lifetime of its objects (including streams, events, and device memory). Streams and events in different contexts are isolated from each other and cannot be synchronized across contexts.

There are two types of contexts:

  • Default context: a default context created implicitly upon the acl.rt.set_device call that specifies a device for computation. One device corresponds to one default context. A default context cannot be released by calling acl.rt.destroy_context.
  • Explicitly created context: a context created explicitly upon the acl.rt.create_context call in a process or thread.

Stream

A stream preserves the order of a sequence of asynchronous operations: the operations are executed on the device in the order in which they were issued to the stream.

Stream-based kernel execution and data transfer allow computing on the host, data movement between the host and the device, and computing on the device to run in parallel.

Streams come in two types:

  • Default stream: a default stream created implicitly upon the acl.rt.set_device call that specifies a device for computation or upon the acl.rt.create_context call that creates a context. Each device corresponds to a default stream. A default stream cannot be released by calling acl.rt.destroy_stream.
  • Explicitly created stream (recommended): a stream created explicitly upon the acl.rt.create_stream call in a process or thread.

Event

An event is usually used for task synchronization between streams in a device. For example, if the tasks in stream2 depend on the tasks in stream1 and you want to ensure that the tasks in stream1 are complete first, you can create an event, insert the event into stream1 (usually called an Event Record task), and insert a task that waits for the event completion into stream2 (usually called an Event Wait task). In addition, events can record the timestamp information.

Notify

Notify is typically used for status/action communication notifications between devices. For example, after sending data to device 1, device 0 uses the Notify mechanism to notify device 1 that the data has been written. Notify does not record timestamps.

Notify supports only one-to-one notification. To notify multiple devices, you need to perform the Notify operation multiple times, as shown in the figure.

The main difference between Notify and events is that after Notify Wait is complete, the Notify status is automatically reset. Therefore, a Notify Record task can notify only one Notify Wait task. However, Event Wait does not automatically reset the event status. Therefore, an Event Record task can notify one or more Event Wait tasks.
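The reset behavior described above can be illustrated with a toy model in plain Python. This is only a sketch of the semantics, not pyACL code: the classes `ToyEvent` and `ToyNotify` and their `record`/`try_wait` methods are hypothetical names introduced for illustration, and real waits block on the device rather than returning a flag.

```python
class ToyEvent:
    """Sticky flag: a wait does not clear the recorded status."""

    def __init__(self):
        self.recorded = False

    def record(self):
        self.recorded = True

    def try_wait(self):
        # Waiting leaves the status untouched, so later waits also succeed.
        return self.recorded


class ToyNotify:
    """Auto-reset flag: a completed wait consumes the recorded status."""

    def __init__(self):
        self.recorded = False

    def record(self):
        self.recorded = True

    def try_wait(self):
        satisfied = self.recorded
        self.recorded = False  # status is reset once the wait completes
        return satisfied


event, notify = ToyEvent(), ToyNotify()
event.record()
notify.record()

# One record, two waits on each object.
event_waits = [event.try_wait(), event.try_wait()]
notify_waits = [notify.try_wait(), notify.try_wait()]

print(event_waits)   # [True, True]
print(notify_waits)  # [True, False]
```

In this model a single record satisfies every wait on the event, but only the first wait on the notify, which is why one Notify Record task can serve only one Notify Wait task.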

Devices, Contexts, and Streams

Figure 1 Devices, Contexts, and Streams
  • Device specifies the compute device.
    • Its lifetime starts with the first acl.rt.set_device call.
    • With each acl.rt.set_device call, the reference count is increased by 1. With each acl.rt.reset_device call, the reference count is decreased by 1.
    • When the reference count reaches 0, resources on the device are no longer available in the process.
  • A context belongs to a unique device.
    • Contexts can be created implicitly or explicitly.
    • The lifetime of an implicitly created context (default context) starts with the acl.rt.set_device call and ends with the acl.rt.reset_device call when the reference count is 0.

      An implicitly created context is created only once. Calling acl.rt.set_device repeatedly for the same device only increments the reference count of the implicitly created context.

    • The lifetime of an explicitly created context starts with the acl.rt.create_context call and ends with the acl.rt.destroy_context call.
    • If multiple contexts are created in a process, the current thread can use only one context at a time. It is recommended that acl.rt.set_context be used to specify the context of the current thread to improve program maintainability. The number of contexts depends on the number of streams. For details, see acl.rt.create_stream.
    • Contexts in a process are shared and can be switched by using the acl.rt.set_context call.
  • A stream is the execution flow on a device. The sequence of tasks in the same stream is strictly preserved.
    • Streams can be created implicitly or explicitly.
    • Each context has a default stream (implicitly created), the lifetime of which is the same as that of the corresponding context.
    • The lifetime of an explicitly created stream starts with the acl.rt.create_stream call and ends with the acl.rt.destroy_stream call. Once the context to which the explicitly created stream belongs is destroyed or its lifetime ends, the stream is no longer available.
  • A task/kernel is the real task executor on a device.
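The reference counting described in the list above can be sketched with a small model in plain Python. The class `ToyDevice` and its attributes are hypothetical and exist only to illustrate the bookkeeping; the sketch does not call pyACL and says nothing about how the runtime implements it.

```python
class ToyDevice:
    """Models the per-process reference count of one device."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.ref_count = 0
        self.default_context_creations = 0

    def set_device(self):
        # The default context is created only when the lifetime starts;
        # repeated calls just increment the reference count.
        if self.ref_count == 0:
            self.default_context_creations += 1
        self.ref_count += 1

    def reset_device(self):
        if self.ref_count == 0:
            raise RuntimeError("reset_device without a matching set_device")
        self.ref_count -= 1

    @property
    def available(self):
        # Device resources are usable only while the count is above 0.
        return self.ref_count > 0


dev = ToyDevice(0)
dev.set_device()
dev.set_device()          # same device again: count goes to 2
creations = dev.default_context_creations

dev.reset_device()
available_after_first_reset = dev.available

dev.reset_device()        # count reaches 0: default context lifetime ends
available_after_second_reset = dev.available
```

The point of the model is that every set_device call needs a matching reset_device call before the device resources and the default context are actually released.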

Threads, Contexts, and Streams

  • A context must be bound to a user thread. The usage and scheduling of all device resources must be based on the context.
  • Only one context that is associated with the device is used in a thread at a time.
  • acl.rt.set_context can be called to quickly switch between devices. The sample code is as follows, which is for reference only. Do not directly copy and run the code.
    …
    ctx1, ret = acl.rt.create_context(0)        # Use the acl.rt.create_context API to create a context by passing the device ID.
    stream, ret = acl.rt.create_stream()
    ret = acl.op.execute_v2(op_type, input_desc, inputs, output_desc, outputs, attr, stream)
    ctx2, ret = acl.rt.create_context(1)        
    
    # After ctx2 is created, the context used in the current thread changes to ctx2, and the corresponding tasks are computed on device 1. In this sample, the operator op_type2 is executed on device 1.
    stream2, ret = acl.rt.create_stream()
    ret = acl.op.execute_v2(op_type2, input_desc, inputs, output_desc, outputs, attr, stream2)
    ret = acl.rt.set_context(ctx1)
    
    # Switch devices by switching contexts in the current thread so that the subsequent tasks can be computed on device 0.
    ret = acl.op.execute_v2(op3, ..., stream)   # Use the stream created in ctx1; a stream belongs to the context in which it was created.
    …
  • Multiple streams can be created in a thread, and tasks in different streams can be executed in parallel. In multi-thread scenarios, you can also create one stream in each thread. The streams are independent of each other on the device, and the tasks in each stream are executed in the order in which they were issued.
  • Multi-thread scheduling depends on the OS scheduling of the running app. Multi-stream scheduling is performed by the scheduling component on the device.
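The ordering guarantee for streams can be illustrated with a toy scheduler in plain Python. The function `run_streams` and the task names are hypothetical, and the random choice merely stands in for the device scheduling component, whose actual policy is not described here. The sketch shows that tasks from different streams may interleave in any way while each stream keeps its own order.

```python
import random
from collections import deque


def run_streams(streams, seed=0):
    """Execute tasks from several FIFO streams in an arbitrary interleaving."""
    rng = random.Random(seed)
    queues = {name: deque(tasks) for name, tasks in streams.items()}
    executed = []
    while any(queues.values()):
        # Pick any stream that still has pending tasks.
        ready = [name for name, queue in queues.items() if queue]
        name = rng.choice(ready)
        # Within the chosen stream, always take the oldest task first.
        executed.append((name, queues[name].popleft()))
    return executed


streams = {"stream1": ["a1", "a2", "a3"], "stream2": ["b1", "b2"]}
trace = run_streams(streams)

# Project the global trace back onto each stream.
per_stream = {
    name: [task for stream_name, task in trace if stream_name == name]
    for name in streams
}
print(trace)
print(per_stream == streams)
```

Whatever interleaving the scheduler picks, projecting the trace onto a single stream gives back the order in which the tasks were issued to that stream.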

Context Switch Between Threads Within a Process

  • Multiple contexts can be created in a process, but only one context is used in a thread at a time.
  • If multiple contexts are created in a thread, the last created context is used by default.
  • If multiple contexts are created in a process, call acl.rt.set_context to set the context to be used.
    Figure 2 API call sequence
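The rules above (one current context per thread, the last created context becomes current, contexts are shared across threads of a process) can be modeled in plain Python with thread-local state. The functions `create_context`, `set_context`, and `current_device` below are hypothetical stand-ins written for this sketch; they only imitate the behavior described for acl.rt.create_context and acl.rt.set_context and do not call pyACL.

```python
import threading

_current = threading.local()  # each thread tracks its own current context


def create_context(device_id):
    ctx = {"device_id": device_id}
    _current.ctx = ctx  # a newly created context becomes current in this thread
    return ctx


def set_context(ctx):
    _current.ctx = ctx  # switch the current context of the calling thread only


def current_device():
    return _current.ctx["device_id"]


ctx1 = create_context(0)
ctx2 = create_context(1)
after_create = current_device()   # the last created context (ctx2) is current

set_context(ctx1)
after_switch = current_device()   # back on ctx1

seen_in_worker = []


def worker():
    # Contexts are shared within the process, so another thread may bind ctx2.
    set_context(ctx2)
    seen_in_worker.append(current_device())


thread = threading.Thread(target=worker)
thread.start()
thread.join()

main_after_worker = current_device()  # the main thread is unaffected
```

In the model, switching the context in the worker thread does not change the current context of the main thread, which is why each thread should set its context explicitly before issuing tasks.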

Application Scenarios of Default Contexts and Default Streams

  • Before operations are delivered to the device, a context and a stream must exist, which can be created implicitly or explicitly. The implicitly created context and stream are the default context and stream.

    To pass the default stream to any API call, pass 0 directly.

  • If an implicitly created context is used, acl.rt.get_context, acl.rt.set_context, and acl.rt.destroy_context are not available.
  • An implicitly created context or stream is suitable for simple apps that need only one compute device. For multi-thread apps, you are advised to use explicitly created contexts and streams.

The sample code is as follows, which is for reference only. Do not directly copy and run the code.

# ...
ret = acl.init(config_path)
ret = acl.rt.set_device(device_id)

# In the default context, a default stream is created, and the stream is available in the current thread.
# ...
ret = acl.op.execute_v2(op1, input_desc, inputs, output_desc, outputs, attr, 0)  # 0 indicates that op1 is executed in the default stream.
ret = acl.op.execute_v2(op2, input_desc, inputs, output_desc, outputs, attr, 0)  # 0 indicates that op2 is executed in the default stream.
ret = acl.rt.synchronize_stream(0)

# Output the result as required when all computing tasks (op1 and op2 execution tasks) are complete.
# ...
ret = acl.rt.reset_device(device_id)  # Reset the device. The lifetimes of the corresponding default context and default stream end.
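The samples in this section assign `ret` but never inspect it. A minimal sketch of a checking helper follows. It assumes that a return code of 0 means success and that calls return either `ret` alone or a tuple ending in `ret`, as in the samples above. The names `check`, `AclError`, and `ACL_SUCCESS` are hypothetical and not part of pyACL, and the values passed in the usage lines are stand-ins rather than real handles or error codes.

```python
ACL_SUCCESS = 0  # assumption: a return code of 0 indicates success


class AclError(RuntimeError):
    """Raised when a call reports a non-zero return code."""


def check(result, what):
    """Accept `ret` or `(value, ..., ret)`; raise on failure, return the value(s)."""
    if isinstance(result, tuple):
        *values, ret = result
    else:
        values, ret = [], result
    if ret != ACL_SUCCESS:
        raise AclError(f"{what} failed with return code {ret}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)


# Stand-in results; with pyACL these would be the values returned by the calls.
stream = check(("stream-handle", 0), "acl.rt.create_stream")
nothing = check(0, "acl.rt.synchronize_stream")

try:
    check(1, "acl.rt.synchronize_stream")  # 1 is a stand-in non-zero code
except AclError as exc:
    message = str(exc)
```

With such a helper, a line like `stream, ret = acl.rt.create_stream()` could be written as `stream = check(acl.rt.create_stream(), "acl.rt.create_stream")`, so that a failed call stops the program instead of passing unnoticed.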

Multi-Thread and Multi-Stream Performance

  • Thread scheduling depends on the OS. The device scheduling unit schedules tasks in streams. When tasks in different streams in a process contend for resources on the device, the performance may be lower than that of a single stream scenario.
  • The Ascend AI Processor provides several execution components, such as AI Core, AI CPU, and Vector Core, so that different tasks can be executed by specialized components. You are advised to create streams based on the operator execution components.
  • Performance depends on the app's logic implementation. Generally, a single-thread, multi-stream scenario performs slightly better than a multi-thread, multi-stream scenario, because less thread scheduling is involved at the app layer.