Concepts
Basic Concepts
- Host
A host refers to the x86 or Arm server connected to the device. The host utilizes the neural network (NN) compute capabilities provided by the device to implement services.
- Device (or NPU)
A device refers to a hardware device that is powered by the Ascend AI Processor. It connects to the host over the PCIe interface and provides the NN computing capability.
- Context
A context is a container that manages the lifetime of its objects (including streams, events, and more). Objects in different contexts, such as streams and events, are isolated from each other and cannot be synchronized with one another.
Contexts come in two types:
- Default context: a context implicitly created when aclrtSetDevice is called to specify a compute device. Each device corresponds to a default context. A default context cannot be destroyed through aclrtDestroyContext. Instead, it is destroyed when aclrtResetDevice or aclrtResetDeviceForce is called to reset the device.
- Explicitly created context: a context created explicitly upon the aclrtCreateContext call in a process or thread.
- Stream
A stream preserves the execution order of a series of asynchronous operations, ensuring that they are executed on the device in the order in which they were submitted.
Stream-based kernel execution and data transfer allow host computation, device computation, and data transfer between the host and the device to run in parallel.
Streams come in two types:
- Default stream: a stream implicitly created when aclrtSetDevice is called to specify a compute device or aclrtCreateContext is called to create a context. Each device corresponds to a default stream. A default stream cannot be destroyed through aclrtDestroyStream. Instead, it is destroyed when aclrtResetDevice or aclrtResetDeviceForce is called to reset the device.
- Explicitly created stream (recommended): a stream created explicitly upon the aclrtCreateStream call in a process or thread.
- Event
An event is usually used for task synchronization between streams on a device. For example, if the tasks in stream2 depend on the tasks in stream1 and you want to ensure that the tasks in stream1 are complete first, you can create an event, insert the event into stream1 (usually called an Event Record task), and insert a task that waits for the event completion into stream2 (usually called an Event Wait task). In addition, events can record the timestamp information.

- Notify
Notify is typically used for communication and synchronization, such as delivering status or action notifications between devices. For example, after sending data to device 1, device 0 uses the Notify mechanism to notify device 1 that the data has been written. Notify does not record timestamps.

Notify supports only one-to-one notification. To notify multiple devices, you need to perform the Notify operation once for each device, as shown in the figure.

The main difference between Notify and events is that after Notify Wait is complete, the Notify status is automatically reset. Therefore, a Notify Record task can notify only one Notify Wait task. However, Event Wait does not automatically reset the event status. Therefore, an Event Record task can notify one or more Event Wait tasks.
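The reset behavior described above can be illustrated with a small host-side model. This is only a sketch: the `ToyEvent` and `ToyNotify` types and the `toy_*` functions are hypothetical names invented for illustration and are not part of the acl API. The model captures just the status flag, not streams, devices, or timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types for illustration only; NOT acl API objects. */
typedef struct { bool signaled; } ToyEvent;
typedef struct { bool signaled; } ToyNotify;

/* Models an Event Record task: marks the event as complete. */
void toy_event_record(ToyEvent *e) {
    assert(e != NULL);
    e->signaled = true;
}

/* Models an Event Wait task: returns true if the waiter may proceed.
 * The event status is left unchanged, so later waits also succeed. */
bool toy_event_wait(const ToyEvent *e) {
    assert(e != NULL);
    return e->signaled;
}

/* Models a Notify Record task: marks the notify as signaled. */
void toy_notify_record(ToyNotify *n) {
    assert(n != NULL);
    n->signaled = true;
}

/* Models a Notify Wait task: returns true if the waiter may proceed.
 * A successful wait resets the status, so only one wait can consume
 * each record. */
bool toy_notify_wait(ToyNotify *n) {
    assert(n != NULL);
    if (!n->signaled) {
        return false;
    }
    n->signaled = false;
    return true;
}
```

In this model, two consecutive calls to `toy_event_wait` after one `toy_event_record` both succeed, whereas the second `toy_notify_wait` after one `toy_notify_record` fails until the notify is recorded again.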
Devices, Contexts, and Streams
- A device is the compute device. You can call an acl API, for example, aclrtSetDevice, to specify the compute device in the current thread.
- A context belongs to a unique device.
- Contexts can be created implicitly or explicitly.
- A default context is implicitly created when aclrtSetDevice is called.
- A context is explicitly created when aclrtCreateContext is called and explicitly destroyed when aclrtDestroyContext is called.
- If multiple contexts are created in a process, the current thread can use only one context at a time. It is recommended that aclrtSetCurrentContext be used to specify the context of the current thread to improve program maintainability. (The number of contexts depends on the number of streams. For details about the limits to the number of streams for different product models, see aclrtCreateStream.)
- Contexts in a process can be switched through aclrtSetCurrentContext.
- A stream is the execution flow on a device. The sequence of tasks in the same stream is strictly preserved.
- Streams can be created implicitly or explicitly.
- Each context contains a default stream, which is created implicitly.
- You can call aclrtCreateStream to explicitly create a stream and call aclrtDestroyStream to explicitly destroy a stream. Once the context to which the explicitly created stream belongs is destroyed, the stream is no longer available.
- A task/kernel is the unit of work that is actually executed on a device.
Threads, Contexts, and Streams
- A context must be bound to a user thread, and the usage and scheduling of all device resources must be context-based.
- Only one context associated with the device is used in a thread at a time.
- aclrtSetCurrentContext can be called to quickly switch between devices. The following code snippet shows key steps only, and is not ready to be built or run.
    // ......
    aclrtCreateContext(&ctx1, 0);
    aclrtCreateStream(&s1);
    /* Execute the operator. */
    aclopExecuteV2(op1,...,s1);
    aclrtCreateContext(&ctx2, 1);
    /* In the current thread, after ctx2 is created, the context corresponding to the current thread is switched to ctx2. Subsequent tasks are performed on device 1. */
    aclrtCreateStream(&s2);
    /* Execute the operator. */
    aclopExecuteV2(op2,...,s2);
    /* Switch between devices by switching between contexts in the current thread so that the subsequent tasks can be computed on device 0. */
    aclrtSetCurrentContext(ctx1);
    /* Execute the operator. */
    aclopExecuteV2(op3,...,s1);
    // ......
- Multiple streams can be created in a thread, and tasks in different streams can be executed in parallel. In multithreading scenarios, you are advised to create one stream in each thread, so that each stream is independent on the device and the tasks in each stream are executed in the order in which they were submitted.
- Multi-thread scheduling is performed by the OS on which the app runs, whereas multi-stream scheduling is performed by the scheduling component on the device.
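The ordering guarantee for streams can be sketched with a toy host-side model: tasks within one stream run in submission order, while tasks from different streams may interleave. The `ToyStream` type, the `toy_*` functions, and the round-robin policy are all assumptions made for illustration. They are not part of the acl API, and the real device scheduler may interleave streams differently.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TASKS 8

/* Hypothetical FIFO model of a stream; NOT an acl API object. */
typedef struct {
    int tasks[TOY_MAX_TASKS]; /* task IDs in submission order */
    size_t count;             /* number of submitted tasks */
    size_t next;              /* index of the next task to execute */
} ToyStream;

/* Appends a task to the stream. Returns 0 on success, -1 if full. */
int toy_stream_submit(ToyStream *s, int task_id) {
    assert(s != NULL);
    if (s->count == TOY_MAX_TASKS) {
        return -1;
    }
    s->tasks[s->count++] = task_id;
    return 0;
}

/* Assumed scheduling policy: take one pending task from each stream in
 * turn until no stream makes progress or the output buffer is full.
 * Executed task IDs are written to out; the number written is returned. */
size_t toy_run_round_robin(ToyStream *streams, size_t n_streams,
                           int *out, size_t out_cap) {
    size_t written = 0;
    int progressed = 1;
    assert(streams != NULL && out != NULL);
    while (progressed) {
        progressed = 0;
        for (size_t i = 0; i < n_streams; ++i) {
            ToyStream *s = &streams[i];
            if (s->next < s->count && written < out_cap) {
                out[written++] = s->tasks[s->next++];
                progressed = 1;
            }
        }
    }
    return written;
}
```

Whatever interleaving the scheduler chooses, the tasks taken from any single `ToyStream` always appear in the output in the order in which they were submitted, which is the property the text attributes to streams.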
Context Switch Between Threads Within a Process
- Multiple contexts can be created in a process, but only one context can be used at a time in a thread.
- If multiple contexts are created in a thread, the most recently created context is used by default.
- If multiple contexts are created in a process, call aclrtSetCurrentContext to set the context to be used.
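The per-thread context rules above can be summarized in a minimal model. The `ToyThread` and `ToyContext` types and the `toy_*` functions are hypothetical stand-ins invented for this sketch. They only mimic the documented behavior that creating a context makes it current in the creating thread and that aclrtSetCurrentContext switches it. They do not call the acl API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types for illustration only; NOT acl API objects. */
typedef struct { int device_id; } ToyContext;
typedef struct { const ToyContext *current; } ToyThread;

/* Models aclrtCreateContext: the newly created context becomes the
 * current context of the creating thread. */
void toy_create_context(ToyThread *t, ToyContext *ctx, int device_id) {
    assert(t != NULL && ctx != NULL);
    ctx->device_id = device_id;
    t->current = ctx;
}

/* Models aclrtSetCurrentContext: switches the thread's current context. */
void toy_set_current_context(ToyThread *t, const ToyContext *ctx) {
    assert(t != NULL && ctx != NULL);
    t->current = ctx;
}

/* Returns the device on which the thread's subsequent tasks would run. */
int toy_current_device(const ToyThread *t) {
    assert(t != NULL && t->current != NULL);
    return t->current->device_id;
}
```

For example, after creating one context on device 0 and then another on device 1 in the same `ToyThread`, `toy_current_device` reports device 1. Calling `toy_set_current_context` with the first context switches the thread back to device 0.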
Use Cases of Default Context and Default Stream
- A context and a stream must exist before any requests are delivered to the device, and these can be created either explicitly or implicitly (default). The implicitly created context and stream are the default context and stream.
To pass the default stream to any API call, pass NULL directly.
- If an implicitly created context is used, aclrtGetCurrentContext, aclrtSetCurrentContext, and aclrtDestroyContext are not available.
- An implicitly created context or stream is applicable to simple apps where only one compute device is needed. For multi-thread apps, you are advised to use the explicitly created contexts and streams.
The following code snippet shows key steps only, and is not ready to be built or run.
    // ......
    aclInit(...);
    aclrtSetDevice(0);
    /* In the default context, a default stream is created, and the stream is available in the current thread. */
    // ......
    aclopExecuteV2(op1,...,NULL); // NULL indicates that op1 is executed in the default stream.
    aclopExecuteV2(op2,...,NULL); // NULL indicates that op2 is executed in the default stream.
    aclrtSynchronizeStream(NULL);
    /* Output the result as required when all computing tasks (op1 and op2 execution tasks) are complete. */
    // ......
    aclrtResetDevice(0); // Reset device 0. The lifetime of the corresponding default context and default stream ends.
Multithreading and Multistreaming Performance
- Thread scheduling depends on the OS. Tasks added to streams are scheduled by Task Scheduler on the device. However, tasks in different streams within a process often contend for device resources, leading to lower performance than in certain single-stream scenarios.
- Various execution components are provided by Ascend AI Processor, such as AI Core, AI CPU, and Vector Core, so that different tasks can be executed by specialized components. You are advised to create streams based on the operator execution components.
- Performance depends on the app's logic implementation. Generally, the performance of a single-threading, multistreaming scenario is slightly better than that of a multithreading, multistreaming scenario, as less thread scheduling is involved at the app layer.