AscendCL Architecture and Basic Concepts
This section describes AscendCL's main functions and basic concepts such as device, stream, and context, as well as their relationships.
Introduction
AscendCL is a C language API library for developing deep neural network (DNN) apps. It provides APIs for runtime management, single-operator calling, model inference, and media data processing. Through these APIs, apps can use the underlying hardware compute resources of the CANN platform for deep learning inference, graphics and image preprocessing, and accelerated single-operator computing. In short, AscendCL is a unified API framework for invoking all of these resources.
The compute resource layer is the hardware compute foundation of Ascend AI Processor. It performs the matrix computation of neural networks (NNs); the general computation and execution control of control operators, scalars, and vectors; and image and video data preprocessing. This layer is fundamental to executing DNN computing.
AscendCL application scenarios:
- App development: You can call the AscendCL APIs to develop image classification and target recognition apps, and more.
- Calling through a third-party framework: You can call the AscendCL APIs through a third-party framework to use the compute capabilities of Ascend AI Processor.
- Development of third-party libraries: You can encapsulate AscendCL into third-party libraries to offer the runtime and resource management capabilities of Ascend AI Processor.
Advantages of AscendCL:
- High-level abstraction: The APIs for operator compilation, loading, and execution are unified. In this way, AscendCL greatly cuts the number of APIs and reduces complexity.
- Backward compatibility: AscendCL is backward compatible to ensure that apps built based on an earlier version can still run on a later version.
- Smooth development experience: AscendCL works the same for Ascend AI Processor of different versions thanks to a unified set of APIs.
Terminology
Device, Context, and Stream
- Device: a computing device. You can call an AscendCL API, for example, aclrtSetDevice, to specify the computing device used by the current thread.
- Context: a context belongs to exactly one device.
  - Contexts can be created implicitly or explicitly.
    - Implicitly created context (default context): Calling aclrtSetDevice implicitly creates the default context.
    - Explicitly created context: Calling aclrtCreateContext explicitly creates a context, and calling aclrtDestroyContext explicitly destroys it.
  - If multiple contexts are created in a process, the current thread can use only one of them at a time. To keep the program maintainable, you are advised to call aclrtSetCurrentContext to specify the current thread's context explicitly. (The number of contexts depends on the number of streams. For details, see aclrtCreateStream.)
  - Contexts in a process can be switched by calling aclrtSetCurrentContext.
- Stream: the execution flow on a device. The order of tasks within the same stream is strictly preserved.
  - Streams can be created implicitly or explicitly.
    - Each context contains a default stream, which is created implicitly.
    - You can explicitly create a stream by calling aclrtCreateStream and explicitly destroy it by calling aclrtDestroyStream. If the context to which an explicitly created stream belongs is destroyed, the stream can no longer be used, even though the stream itself has not been destroyed.
- Task/Kernel: the actual task executor on a device.
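The ownership rule between contexts and streams can be pictured with a small toy model in plain C. This is only a sketch for illustration: it does not call AscendCL, and every name in it (ToyContext, ToyStream, and the toy_* functions) is hypothetical. It shows one point from the list above: a stream stops being usable once its owning context is destroyed, even if the stream itself was never destroyed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types for illustration; these are NOT AscendCL types. */
typedef struct {
    int  device_id;  /* a context belongs to exactly one device */
    bool alive;      /* false once the context has been destroyed */
} ToyContext;

typedef struct {
    const ToyContext *owner;  /* context in which the stream was created */
    bool destroyed;           /* true once the stream itself is destroyed */
} ToyStream;

static ToyContext toy_create_context(int device_id) {
    ToyContext ctx = { device_id, true };
    return ctx;
}

/* A stream can only be created inside a live context. */
static ToyStream toy_create_stream(const ToyContext *ctx) {
    assert(ctx != NULL && ctx->alive);
    ToyStream s = { ctx, false };
    return s;
}

static void toy_destroy_context(ToyContext *ctx) {
    ctx->alive = false;
}

/* A stream is usable only while both it and its owning context exist. */
static bool toy_stream_usable(const ToyStream *s) {
    return !s->destroyed && s->owner->alive;
}
```

In this model, creating a context and a stream, then destroying only the context, leaves the stream with `destroyed == false` while `toy_stream_usable` reports false, which mirrors the behavior described for explicitly created streams.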
Thread, Context, and Stream
- A context must be bound to a user thread, and the usage and scheduling of all device resources must be context-based.
- Only one context associated with the device is used in a thread at a time.
- aclrtSetCurrentContext can be called to quickly switch between devices. The following code snippet shows key steps only, and is not ready to be built or run.
```c
// ......
aclrtCreateContext(&ctx1, 0);
aclrtCreateStream(&s1);
/* Execute the operator. */
aclopExecuteV2(op1,...,s1);

aclrtCreateContext(&ctx2, 1);
/* In the current thread, after ctx2 is created, the context corresponding to
   the current thread is switched to ctx2. Subsequent tasks are performed on device 1. */
aclrtCreateStream(&s2);
/* Execute the operator. */
aclopExecuteV2(op2,...,s2);

/* Switch between devices by switching between contexts in the current thread
   so that the subsequent tasks can be computed on device 0. */
aclrtSetCurrentContext(ctx1);
/* Execute the operator. */
aclopExecuteV2(op3,...,s1);
// ......
```
- Multiple streams can be created in a thread, and tasks in different streams can be executed in parallel. In multithreading scenarios, you are advised to create one stream in each thread. Each stream is independent on the device, and the tasks in each stream are executed in the order in which they were submitted.
- Multithread scheduling is performed by the OS on which the app runs, while multistream scheduling is performed by the scheduling component on the device.
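The ordering guarantee for streams can be sketched with a toy model in plain C. The sketch does not use AscendCL, and all names (ToyStream, toy_submit, toy_run) are hypothetical. The round-robin policy in toy_run is an assumption made for illustration only; the text does not say how the device scheduler picks between streams. The point is that tasks from different streams may interleave, while tasks within one stream always run in submission order.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TASKS 8

/* Hypothetical toy stream: a FIFO queue of task IDs (NOT an AscendCL type). */
typedef struct {
    int    tasks[TOY_MAX_TASKS];
    size_t head;  /* index of the next task to execute */
    size_t tail;  /* index at which the next task is stored */
} ToyStream;

/* Append a task to the end of a stream. */
static void toy_submit(ToyStream *s, int task_id) {
    assert(s->tail < TOY_MAX_TASKS);
    s->tasks[s->tail++] = task_id;
}

/* Toy scheduler: visits the streams round-robin (an assumed policy) and takes
   one task from the head of each non-empty stream per pass. Executed task IDs
   are written to log; the number of executed tasks is returned. */
static size_t toy_run(ToyStream *streams, size_t n_streams,
                      int *log, size_t log_cap) {
    size_t n = 0;
    int progressed = 1;
    while (progressed) {
        progressed = 0;
        for (size_t i = 0; i < n_streams; ++i) {
            ToyStream *s = &streams[i];
            if (s->head < s->tail && n < log_cap) {
                log[n++] = s->tasks[s->head++];
                progressed = 1;
            }
        }
    }
    return n;
}
```

If tasks 10, 11, 12 are submitted to one stream and tasks 20, 21 to another, this model executes them as 10, 20, 11, 21, 12: the two streams interleave, but each stream's own tasks stay in order.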
Context Switch Between Threads Within a Process
- Multiple contexts can be created in a process, but only one context can be used at a time in a thread.
- If multiple contexts are created in a thread, the most recently created context is used by default.
- If multiple contexts are created in a process, call aclrtSetCurrentContext to set the context to be used.
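The rules above can be summarized in a toy model written in plain C. It is a sketch under stated assumptions, not AscendCL code: ToyContext, ToyThread, and the toy_* functions are hypothetical names, and a user thread is represented by a plain struct rather than a real OS thread. The model captures two rules: creating a context makes it the thread's current context, and an explicit call (in the spirit of aclrtSetCurrentContext) switches back to an earlier one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types for illustration; these are NOT AscendCL types. */
typedef struct {
    int device_id;  /* device to which the context belongs */
} ToyContext;

typedef struct {
    const ToyContext *current;  /* the one context the thread uses right now */
} ToyThread;

/* Creating a context in a thread also makes it that thread's current context. */
static void toy_create_context(ToyThread *t, ToyContext *ctx, int device_id) {
    ctx->device_id = device_id;
    t->current = ctx;
}

/* Explicit switch to a previously created context. */
static void toy_set_current(ToyThread *t, const ToyContext *ctx) {
    assert(ctx != NULL);
    t->current = ctx;
}

/* Device on which the thread's next task would run; -1 if no context is bound. */
static int toy_current_device(const ToyThread *t) {
    return t->current != NULL ? t->current->device_id : -1;
}
```

In this model, creating a context on device 0 and then one on device 1 leaves the thread on device 1, and calling toy_set_current with the first context moves it back to device 0.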
Use Cases of Default Context and Default Stream
- A context and a stream must exist before any requests are delivered to the device, and these can be created either explicitly or implicitly (default). The implicitly created context and stream are the default context and stream.
To use the default stream in an API call, pass NULL as the stream argument.
- If an implicitly created context is used, aclrtGetCurrentContext, aclrtSetCurrentContext, and aclrtDestroyContext are not available.
- An implicitly created context or stream suits simple apps that need only one compute device. For multithread apps, you are advised to use explicitly created contexts and streams.
The following code snippet shows key steps only, and is not ready to be built or run.
```c
// ......
aclInit(...);
aclrtSetDevice(0);
/* In the default context, a default stream is created, and the stream is
   available in the current thread. */
// ......
aclopExecuteV2(op1,...,NULL);  // NULL indicates that op1 is executed in the default stream.
aclopExecuteV2(op2,...,NULL);  // NULL indicates that op2 is executed in the default stream.
aclrtSynchronizeStream(NULL);
/* Output the result as required when all computing tasks (op1 and op2
   execution tasks) are complete. */
// ......
aclrtResetDevice(0);  // Reset device 0. The lifetimes of the corresponding default context and default stream end.
```
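The NULL convention for the default stream can be pictured with a few lines of plain C. This is a toy model only: it is not how AscendCL is implemented, and ToyStream, ToyContext, and toy_resolve_stream are hypothetical names. It shows the lookup that the convention implies: a NULL stream argument selects the default stream of the context in use, and any other value is used as given.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types for illustration; these are NOT AscendCL types. */
typedef struct {
    int id;
} ToyStream;

typedef struct {
    ToyStream default_stream;  /* created implicitly together with the context */
} ToyContext;

/* NULL selects the context's default stream; any other stream is used as given. */
static ToyStream *toy_resolve_stream(ToyContext *ctx, ToyStream *stream) {
    assert(ctx != NULL);
    return stream != NULL ? stream : &ctx->default_stream;
}
```

With this model, `toy_resolve_stream(&ctx, NULL)` returns the address of `ctx.default_stream`, while passing an explicitly created stream returns that stream unchanged.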
Multithreading and Multistreaming Performance
- Thread scheduling depends on the OS, whereas tasks added to streams are scheduled by Task Scheduler on the device. Tasks in different streams within a process often contend for device resources, so multistream performance can be lower than single-stream performance in some scenarios.
- Ascend AI Processor provides several execution components, such as AI Core, AI CPU, and Vector Core, so that different tasks can be executed by specialized components. You are advised to create streams according to the components that execute the operators.
- Performance depends on the app's logic implementation. Generally, a single-thread, multistream scenario performs slightly better than a multithread, multistream scenario, because less thread scheduling is involved at the app layer.