AscendCL Architecture and Basic Concepts

This section describes AscendCL's main functions and basic concepts such as device, stream, and context, as well as their relationships.

Introduction

AscendCL is a C-language API library for developing deep neural network (DNN) apps. It provides APIs for runtime management, single-operator calling, model inference, and media data processing. On the CANN platform, it leverages the underlying hardware compute resources to perform deep learning inference computing, graphics and image preprocessing, and single-operator accelerated computing. In a nutshell, AscendCL is a unified API framework for invoking these compute resources. The compute resource layer provides the hardware compute capability of the Ascend AI Processor: it performs the matrix-related computation of neural networks (NNs), the general computation and execution control of control operators, scalars, and vectors, as well as image and video data preprocessing. This layer is fundamental to executing DNN computing.

Figure 1 Logical architecture

AscendCL application scenarios:

  • App development: You can call the AscendCL APIs to develop image classification and target recognition apps, and more.
  • Calling through a third-party framework: You can call the AscendCL APIs through a third-party framework to use the compute capabilities of Ascend AI Processor.
  • Development of third-party libraries: You can encapsulate AscendCL into third-party libraries to offer the runtime and resource management capabilities of Ascend AI Processor.

Advantages of AscendCL:

  • High-level abstraction: The APIs for operator compilation, loading, and execution are unified. In this way, AscendCL greatly cuts the number of APIs and reduces complexity.
  • Backward compatibility: AscendCL is backward compatible to ensure that apps built based on an earlier version can still run on a later version.
  • Smooth development experience: AscendCL works the same for Ascend AI Processor of different versions thanks to a unified set of APIs.

Terminology

Table 1 Terminology

Synchronous/Asynchronous

The terms synchronous and asynchronous in this document are defined from the perspective of the caller (the application) and the executor (the device).

  • If an AscendCL API call returns immediately, without waiting for the device to complete execution, the call is asynchronous.
  • If an AscendCL API call does not return until the device completes execution, the call is synchronous.
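Both scheduling modes appear in the snippets later in this document; as a minimal illustration (key steps only, not ready to be built or run):

```c
/* Asynchronous: aclopExecuteV2 only submits the task to stream s
 * and returns immediately; the device may still be computing. */
aclopExecuteV2(op, ..., s);

/* Synchronous: aclrtSynchronizeStream does not return until all tasks
 * previously submitted to stream s have completed on the device. */
aclrtSynchronizeStream(s);
```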

Process/Thread

Unless otherwise specified, the processes and threads mentioned in this document refer to the processes and threads in user applications.

Host

A host refers to the x86 or Arm server connected to the device. The host utilizes the NN compute capability provided by the device to implement services.

Device

The device refers to a hardware device with Ascend AI Processor installed. It connects to the host over the PCIe interface and provides the NN computing capability.

Context

A context is a container that manages the lifetime of its objects (including streams, events, and device memory). The streams and events of one context are isolated from those of another and cannot be synchronized across contexts.

Contexts come in two types:
  • Default context: a context created implicitly when aclrtSetDevice is called to set the compute device. Each device corresponds to one default context, which cannot be destroyed by calling aclrtDestroyContext.
  • Explicitly created context: a context created by calling aclrtCreateContext in a process or thread.
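The two context types can be sketched as follows (key steps only, not ready to be built or run; error handling omitted):

```c
/* Implicit: setting the compute device creates the default context for device 0. */
aclrtSetDevice(0);

/* Explicit: create a context on device 0, make it current in this thread,
 * and destroy it when it is no longer needed. */
aclrtContext ctx;
aclrtCreateContext(&ctx, 0);
aclrtSetCurrentContext(ctx);
/* ... create streams and submit tasks under ctx ... */
aclrtDestroyContext(ctx);
```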

Stream

Streams maintain the execution order of asynchronous operations: tasks in the same stream are executed on the device in the order in which the application issues them.

Stream-based kernel execution and data transfer allow computation on the host, computation on the device, and data transfer between the host and the device to proceed in parallel.

Streams come in two types:
  • Default stream: When the aclrtSetDevice API is called to specify a compute device, or the aclrtCreateContext API is called to create a context, the system implicitly creates a default stream. Each context corresponds to a default stream. The default stream cannot be released by calling the aclrtDestroyStream API.
  • Explicitly created stream (recommended): a stream created explicitly upon the aclrtCreateStream call in a process or thread.
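A typical lifecycle of an explicitly created stream looks as follows (key steps only, not ready to be built or run):

```c
/* Create a stream in the current context (recommended). */
aclrtStream stream;
aclrtCreateStream(&stream);

/* Submit asynchronous tasks; tasks in one stream execute in issue order. */
aclopExecuteV2(op, ..., stream);

/* Wait until all tasks in the stream have completed, then release it. */
aclrtSynchronizeStream(stream);
aclrtDestroyStream(stream);
```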

Event

Events are used, via AscendCL API calls, to synchronize tasks between streams on the same device.

For example, to execute a task from stream 2 only after a task from stream 1 completes, create an event, record it in stream 1 after the first task, and have stream 2 wait on that event before executing its task.
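This pattern can be sketched with the runtime event APIs; aclrtCreateEvent, aclrtRecordEvent, aclrtStreamWaitEvent, and aclrtDestroyEvent are assumed here, and the snippet shows key steps only and is not ready to be built or run:

```c
aclrtEvent event;
aclrtCreateEvent(&event);

aclopExecuteV2(op1, ..., s1);    /* task in stream 1 */
aclrtRecordEvent(event, s1);     /* event completes when op1 completes */

aclrtStreamWaitEvent(s2, event); /* stream 2 waits for the event */
aclopExecuteV2(op2, ..., s2);    /* runs only after op1 has completed */

aclrtDestroyEvent(event);
```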

AIPP

The Artificial Intelligence Pre-Processing (AIPP) module implements image preprocessing on the AI Core, including Color Space Conversion (CSC), image normalization (by subtracting the mean value or multiplying a factor), image cropping (by specifying the crop start and cropping the image to the size required by the NN), and much more.

AIPP supports static and dynamic modes. These two modes are mutually exclusive.
  • Static AIPP: During model conversion, set the AIPP mode to static and set the AIPP parameters. After the model is generated, the AIPP parameter values are saved in the offline model (*.om file). The same AIPP parameter configurations are used in each model inference phase and cannot be modified.

    In static AIPP mode, batches share the same set of AIPP parameters.

  • Dynamic AIPP: If you use this mode when converting a model, you can set dynamic AIPP parameters each time before running the model for inference.

    In dynamic AIPP mode, batches can use different sets of AIPP parameters.

Dynamic batch/image size

The batch size or image size is not fixed in certain scenarios. For example, in the object detection+target recognition cascade scenario where the number of detected objects is subject to change, the batch size of the target recognition input is dynamic.

  • Dynamic batch size: The batch size is not determinable until inference time.
  • Dynamic image size: The image size (H x W) of each image is not determinable until inference time.

Dynamic dimensions (ND format only)

Dynamic dimensions for the ND format are useful in scenarios where input dimensions are unknown (such as the Transformer network).

Channel

A channel is one component of an image's color representation. An RGB image has three channels: red, green, and blue. HSV (hue, saturation, and value, i.e., brightness) is an alternative representation of the RGB color model.

Ascend EP form

The form is determined by the PCIe working mode of the Ascend AI Processor. If PCIe works in slave (endpoint) mode, the working mode is called Ascend EP mode.

In Ascend EP mode, the host functions as the primary end and the device functions as the secondary end (endpoint, EP). The customer's AI service programs run on the host, and the Ascend AI Processor connects to the host as a device over the PCIe channel. The host interacts with the device over this channel and loads AI tasks onto the Ascend AI Processor on the device for execution. The device provides only the NN compute capability to the host (such as an x86 or Arm server); the CPU resources on the device can be invoked only by the host.

Ascend RC form

The form is likewise determined by the PCIe working mode of the Ascend AI Processor. If PCIe works in master (root complex) mode and peripherals can be attached, the working mode is called Ascend RC mode.

In Ascend RC mode, the CPU of the product directly runs the AI service software specified by the user, and peripherals such as network cameras, I2C sensors, and SPI monitors are connected as slave devices.

Device, Context, and Stream

Figure 2 Device, context, and stream
  • Device: computing device. You can call an API provided by AscendCL, for example, aclrtSetDevice, to specify the computing device in the current thread.
  • A context belongs to a unique device.
    • Contexts can be created implicitly or explicitly.
    • Implicitly created context (default context). When the aclrtSetDevice API is called, the default context is implicitly created.
    • Explicitly created context. Calling the aclrtCreateContext API explicitly creates the context, and calling the aclrtDestroyContext API explicitly destroys the context.
    • If multiple contexts are created in a process, the current thread can use only one context at a time. It is recommended that aclrtSetCurrentContext be used to specify the context of the current thread to improve program maintainability. (The number of contexts depends on the number of streams. For details, see aclrtCreateStream.)
    • Contexts in a process can be switched by calling aclrtSetCurrentContext.
  • A stream is the execution flow on a device. The sequence of tasks in the same stream is strictly preserved.
    • Streams can be created implicitly or explicitly.
    • Each context contains a default stream, which is created implicitly.
    • You can explicitly create a stream by calling the aclrtCreateStream API, and explicitly destroy it by calling the aclrtDestroyStream API. After the context to which an explicitly created stream belongs is destroyed, the stream can no longer be used, even though the stream itself has not been destroyed.
  • A task/kernel is the real task executor on a device.

Thread, Context, and Stream

  • A context must be bound to a user thread, and the usage and scheduling of all device resources must be context-based.
  • Only one context associated with the device is used in a thread at a time.
  • aclrtSetCurrentContext can be called to quickly switch between devices. The following code snippet shows key steps only, and is not ready to be built or run.
    // ......
    aclrtCreateContext(&ctx1, 0);
    aclrtCreateStream(&s1);
    /* Execute the operator. */
    aclopExecuteV2(op1,...,s1);
    
    aclrtCreateContext(&ctx2,1);
    /* In the current thread, after ctx2 is created, the context corresponding to the current thread is switched to ctx2. Subsequent tasks are performed on device 1.*/
    aclrtCreateStream(&s2);
    /* Execute the operator. */
    aclopExecuteV2(op2,...,s2);
    
    /* Switch between devices by switching between contexts in the current thread so that the subsequent tasks can be computed on device 0.*/
    aclrtSetCurrentContext(ctx1);
    /* Execute the operator. */
    aclopExecuteV2(op3,...,s1);
    // ......
    
  • Multiple streams can be created in a thread, and tasks in different streams can run in parallel. In multithreading scenarios, you are advised to create one stream in each thread; each stream is independent on the device, and tasks in each stream are executed in the order they were issued.
  • Multi-thread scheduling is performed by the OS of the app, while multistreaming scheduling is performed by the scheduling component on the device.

Context Switch Between Threads Within a Process

  • Multiple contexts can be created in a process, but only one context can be used at a time in a thread.
  • If multiple contexts are created in a thread, the most recently created context is used by default.
  • If multiple contexts are created in a process, call aclrtSetCurrentContext to set the context to be used.
Figure 3 API call sequence

Use Cases of Default Context and Default Stream

  • A context and a stream must exist before any requests are delivered to the device, and these can be created either explicitly or implicitly (default). The implicitly created context and stream are the default context and stream.

    To use the default stream in an API call, pass NULL as the stream argument.

  • If an implicitly created context is used, aclrtGetCurrentContext, aclrtSetCurrentContext, and aclrtDestroyContext are not available.
  • An implicitly created context or stream is applicable to simple apps where only one compute device is needed. For multi-thread apps, you are advised to use the explicitly created contexts and streams.

The following code snippet shows key steps only, and is not ready to be built or run.

// ......
aclInit(...);
aclrtSetDevice(0); 

/* In the default context, a default stream is created, and the stream is available in the current thread. */
// ......
aclopExecuteV2(op1,...,NULL);  //NULL indicates that op1 is executed in the default stream.
aclopExecuteV2(op2,...,NULL); //NULL indicates that op2 is executed in the default stream.
aclrtSynchronizeStream(NULL); 

/* Output the result as required when all computing tasks (op1 and op2 execution tasks) are complete. */
// ......
aclrtResetDevice(0); // Reset device 0. The lifetimes of the corresponding default context and default stream end.

Multithreading and Multistreaming Performance

  • Thread scheduling depends on the OS. Tasks added to streams are scheduled by Task Scheduler on the device. However, tasks in different streams within a process often contend for device resources, leading to lower performance than in certain single-stream scenarios.
  • Various execution components are provided by Ascend AI Processor, such as AI Core, AI CPU, and Vector Core, so that different tasks can be executed by specialized components. You are advised to create streams based on the operator execution components.
  • Performance depends on the app's logic implementation. Generally, the performance of a single-threading, multistreaming scenario is slightly better than that of a multithreading, multistreaming scenario, as less thread scheduling is involved at the app layer.