Terminology
Devices, Contexts, and Streams
Figure 1 Devices, Contexts, and Streams
- A device is the compute device on which tasks are executed.
- Its lifetime starts with the first acl.rt.set_device call.
- With each acl.rt.set_device call, the reference count is increased by 1. With each acl.rt.reset_device call, the reference count is decreased by 1.
- When the reference count reaches 0, resources on the device are no longer available to the process.
- A context belongs to a unique device.
- Contexts can be created implicitly or explicitly.
- The lifetime of an implicitly created context (default context) starts with the first acl.rt.set_device call and ends with the acl.rt.reset_device call that brings the reference count to 0.
An implicitly created context is created only once. Calling acl.rt.set_device repeatedly for the same device only increases the reference count of the implicitly created context.
- The lifetime of an explicitly created context starts with the acl.rt.create_context call and ends with the acl.rt.destroy_context call.
- If multiple contexts are created in a process, the current thread can use only one context at a time. It is recommended that acl.rt.set_context be used to explicitly specify the context of the current thread to improve program maintainability. The number of contexts depends on the number of streams. For details, see acl.rt.create_stream.
- Contexts in a process are shared and can be switched by using the acl.rt.set_context call.
- A stream is an execution flow on a device. Tasks in the same stream are executed strictly in the order in which they were delivered.
- Streams can be created implicitly or explicitly.
- Each context has a default stream (implicitly created), the lifetime of which is the same as that of the corresponding context.
- The lifetime of an explicitly created stream starts with the acl.rt.create_stream call and ends with the acl.rt.destroy_stream call. Once the context to which the explicitly created stream belongs is destroyed or its lifetime ends, the stream is no longer available.
- A task (kernel) is the unit of work that is actually executed on a device.
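The reference-counting rules for a device and its default context can be pictured with a small model. The sketch below is plain Python and does not call pyACL; the class DeviceModel and its members are hypothetical names introduced only for illustration, and raising an error on an unmatched reset is an assumption of the model, not documented behavior.

```python
class DeviceModel:
    """Toy model of per-process device reference counting (not pyACL)."""

    def __init__(self):
        self.ref_count = 0
        # Counts how many times the default context would have been created.
        self.default_context_creations = 0

    @property
    def available(self):
        # Device resources are usable only while the count is above 0.
        return self.ref_count > 0

    def set_device(self):
        # Models acl.rt.set_device: the first call creates the default
        # context, later calls only increase the reference count.
        if self.ref_count == 0:
            self.default_context_creations += 1
        self.ref_count += 1

    def reset_device(self):
        # Models acl.rt.reset_device: decrease the count by 1.
        if self.ref_count == 0:
            # Assumption of this model only.
            raise RuntimeError("reset_device without a matching set_device")
        self.ref_count -= 1


dev = DeviceModel()
dev.set_device()
dev.set_device()                  # same device again: count becomes 2
dev.reset_device()
still_available = dev.available   # count is 1, resources still usable
dev.reset_device()
released = not dev.available      # count is 0, resources released
```

In this model, every set_device call needs a matching reset_device call before the device resources are released.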
Threads, Contexts, and Streams
- A context must be bound to a user thread. The usage and scheduling of all device resources must be based on the context.
- A thread uses only one context at a time, and that context is associated with one device.
- acl.rt.set_context can be called to quickly switch between devices. The following sample code is for reference only. Do not copy and run it directly.
    import acl

    # ...
    # Use the acl.rt.create_context API to create a context by passing the device ID.
    ctx1, ret = acl.rt.create_context(0)
    stream, ret = acl.rt.create_stream()
    ret = acl.op.execute_v2(op_type, input_desc, inputs, output_desc, outputs, attr, stream)

    # After ctx2 is created, the context used in the current thread changes to ctx2,
    # and the corresponding tasks are computed on device 1.
    # In this sample, op2 is executed on device 1.
    ctx2, ret = acl.rt.create_context(1)
    stream2, ret = acl.rt.create_stream()
    ret = acl.op.execute_v2(op_type2, input_desc, inputs, output_desc, outputs, attr, stream2)

    # Switch devices by switching contexts in the current thread so that
    # the subsequent tasks can be computed on device 0.
    ret = acl.rt.set_context(ctx1)
    # Use the stream that was created under ctx1.
    ret = acl.op.execute_v2(op3, ..., stream)
    # ...
- Multiple streams can be created in a thread, and tasks in different streams can be executed in parallel. In multi-thread scenarios, you can also create one stream in each thread. Each stream is independent on the device, and tasks in each stream are executed in their original order.
- Multi-thread scheduling depends on the OS scheduling of the running app. Multi-stream scheduling is performed by the scheduling component on the device.
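The ordering guarantee described above (strict order within a stream, no ordering between streams) can be pictured with a host-side model. The sketch below is plain Python and does not call pyACL; StreamModel is a hypothetical class in which each "stream" is a FIFO queue drained by one worker thread, and synchronize plays the role that acl.rt.synchronize_stream plays for a real stream.

```python
import queue
import threading


class StreamModel:
    """Toy stand-in for a stream: a FIFO queue drained by one worker."""

    def __init__(self, name):
        self.name = name
        self.completed = []           # results in completion order
        self._tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        # One worker per stream, so tasks of this stream never overlap
        # and always finish in the order in which they were submitted.
        while True:
            task = self._tasks.get()
            self.completed.append(task())
            self._tasks.task_done()

    def submit(self, task):
        # Returns immediately; the task runs later on the worker.
        self._tasks.put(task)

    def synchronize(self):
        # Block until every task submitted so far has finished.
        self._tasks.join()


s1 = StreamModel("s1")
s2 = StreamModel("s2")
for i in range(5):
    # Tasks of s1 and s2 may interleave with each other in time.
    s1.submit(lambda i=i: ("s1", i))
    s2.submit(lambda i=i: ("s2", i))

s1.synchronize()
s2.synchronize()
```

After both synchronize calls return, each stream's completed list is in submission order, while the relative timing between the two streams is left to the scheduler.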
Context Migration Between Threads in a Process
- Multiple contexts can be created in a process, but only one context is used in a thread at a time.
- If multiple contexts are created in a thread, the last created context is used by default.
- If multiple contexts are created in a process, call acl.rt.set_context to set the context to be used.
Figure 2 API call sequence
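The rules above can be pictured as a per-thread "current context" slot. The sketch below is plain Python and does not call pyACL; ContextModel, create_context, set_context, and current_device are hypothetical stand-ins, and the assumption that a new thread starts with no current context belongs to this model only.

```python
import threading

# Each thread has its own "current context" slot.
_current = threading.local()


class ContextModel:
    """Toy context that only remembers its device ID."""

    def __init__(self, device_id):
        self.device_id = device_id


def create_context(device_id):
    # Models acl.rt.create_context: the newly created context becomes
    # the current context of the calling thread.
    ctx = ContextModel(device_id)
    _current.ctx = ctx
    return ctx


def set_context(ctx):
    # Models acl.rt.set_context: switch the calling thread to ctx.
    _current.ctx = ctx


def current_device():
    ctx = getattr(_current, "ctx", None)
    return None if ctx is None else ctx.device_id


ctx1 = create_context(0)
ctx2 = create_context(1)
after_create = current_device()   # last created context is current

set_context(ctx1)
after_switch = current_device()   # explicitly switched back to ctx1

seen_in_worker = []


def worker():
    # Model assumption: a new thread starts with no current context.
    seen_in_worker.append(current_device())
    # Contexts are shared in the process, so this thread can adopt ctx2.
    set_context(ctx2)
    seen_in_worker.append(current_device())


t = threading.Thread(target=worker)
t.start()
t.join()
main_after = current_device()     # the main thread is unaffected
```

In this model, switching the context in one thread does not change the context used by another thread.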
Application Scenarios of Default Contexts and Default Streams
- Before operations are delivered to the device, a context and a stream must exist, which can be created implicitly or explicitly. The implicitly created context and stream are the default context and stream.
To pass the default stream to any API call, pass 0 directly.
- If an implicitly created context is used, acl.rt.get_context, acl.rt.set_context, and acl.rt.destroy_context are not available.
- An implicitly created context or stream is suitable for simple apps where only one compute device is needed. For multi-thread apps, you are advised to use explicitly created contexts and streams.
The following sample code is for reference only. Do not copy and run it directly.
    import acl

    # ...
    ret = acl.init(config_path)
    ret = acl.rt.set_device(device_id)
    # In the default context, a default stream is created, and the stream is available in the current thread.
    # ...
    # 0 indicates that op1 is executed in the default stream.
    ret = acl.op.execute_v2(op1, input_desc, inputs, output_desc, outputs, attr, 0)
    # 0 indicates that op2 is executed in the default stream.
    ret = acl.op.execute_v2(op2, input_desc, inputs, output_desc, outputs, attr, 0)
    ret = acl.rt.synchronize_stream(0)
    # Output the result as required when all computing tasks (op1 and op2 execution tasks) are complete.
    # ...
    # Reset the device. The lifetimes of the corresponding default context and default stream end.
    ret = acl.rt.reset_device(device_id)
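The samples on this page store each return code in ret without checking it. A minimal sketch of one way to check return codes is shown below, assuming that a return code of 0 indicates success. The names AclError, check_ret, and unwrap are hypothetical helpers and are not part of pyACL, and the values passed in the usage lines are made-up stand-ins for real API results.

```python
ACL_SUCCESS = 0  # assumption: 0 is the success return code


class AclError(RuntimeError):
    """Raised by the hypothetical helpers when a call reports failure."""


def check_ret(api_name, ret):
    # Turn a non-zero return code into an exception that names the API.
    if ret != ACL_SUCCESS:
        raise AclError(f"{api_name} failed with ret={ret}")


def unwrap(api_name, result):
    # For calls that return a (value, ret) pair, such as
    # acl.rt.create_context or acl.rt.create_stream.
    value, ret = result
    check_ret(api_name, ret)
    return value


# Usage with stand-in values instead of real pyACL results.
check_ret("acl.rt.set_device", 0)                  # success: no exception
stream = unwrap("acl.rt.create_stream", (1234, 0))  # 1234 is a fake handle

try:
    check_ret("acl.rt.synchronize_stream", 1)      # 1 is a made-up error code
    message = None
except AclError as err:
    message = str(err)
```

With helpers of this kind, a call such as ret = acl.rt.set_device(device_id) could be followed by check_ret("acl.rt.set_device", ret) so that a failure is reported where it occurs.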
Multi-Thread and Multi-Stream Performance
- Thread scheduling depends on the OS. The scheduling unit on the device schedules tasks in streams. When tasks in different streams in a process contend for resources on the device, the performance may be lower than that of a single-stream scenario.
- Currently, the Ascend AI Processor has different execution components, such as AI Core, AI CPU, and Vector Core. You are advised to create multiple streams based on the engines on which the operators are executed.
- Performance depends on the app's logic implementation. Generally, the performance of a single-thread, multi-stream scenario is slightly better than that of a multi-thread, multi-stream scenario, as less thread scheduling is involved at the app layer.