Basics About Media Data Processing
This section describes the functions, API calling sequences, and sample code for image, video, and audio data processing.
Typical Functions

The following table describes the functions. For details about the media data processing functions supported by each product model, see Function Support of Different Versions. The current AIPP versions support all the functions.
| Function | Sub-Function Module | Definition |
|---|---|---|
| Obtain video data. | Image signal processing (ISP) system control | The system control function is used to register the 3A algorithm, register the sensor driver, initialize, run, and exit the ISP firmware, and configure the ISP attributes. |
| | MIPI RX ioctl command words | MIPI RX is a collection unit that supports multiple differential video input interfaces. It receives data from the MIPI, LVDS, sub-LVDS, and HiSPI interfaces through the combo PHY. By configuring different function modes, MIPI RX supports data transmission at multiple speeds and resolutions, as well as multiple external input devices. |
| | Video Input (VI) | The VI module captures video images, performs operations such as cropping, stabilization, color optimization, brightness optimization, and noise removal, and outputs YUV or RAW images. |
| Display video data. | Video Output (VO) | The VO module receives images processed by the VPSS module, controls their playing, and outputs them to peripheral video devices based on the configured output protocol (currently, only HDMI is supported). It can work with the two-dimensional engine (TDE) and HiSilicon Framebuffer (HiFB) modules to draw graphics and manage graphics layers in hardware. |
| | High Definition Multimedia Interface (HDMI) | HDMI is a fully digital video/audio interface for transmitting uncompressed audio and video signals. |
| | Two-Dimensional Engine (TDE) | The TDE is a two-dimensional graphics acceleration engine. It uses hardware to provide fast graphics drawing functions, such as quick copy, quick color filling, and pattern filling, for the on-screen display (OSD) and graphical user interface (GUI). (Currently, only alpha blending is supported.) |
| | HiSilicon Framebuffer (HiFB) | The HiFB manages overlaid graphics layers. In addition to the basic Linux framebuffer functions, it provides extended functions such as modifying the display start position of a graphics layer and inter-layer alpha blending. |
| Manage regions. | Region | Overlaid OSDs and color blocks on a video are called regions. The Region module manages region resources in a unified manner. It is used to display specific information (such as the channel ID and PTS) on a video, or to fill color blocks into a video for covering. Currently, this function must be used together with the VPSS module. |
| Process image/video data. | Video Process Sub-System (VPSS) | The VPSS module preprocesses input images in a unified manner (for example, denoising, deinterlacing, and cropping) and then processes each channel separately (for example, scaling and bordering). |
| | Artificial Intelligence Pre-Processing (AIPP) | AIPP runs on the AI Core and provides image resizing (such as cropping and padding), color space conversion (CSC), mean subtraction, and factor multiplication (for pixel changing). AIPP supports static and dynamic modes, which are mutually exclusive. |
| | Digital Vision Pre-Processing (DVPP) | DVPP is an embedded image processing unit of the Ascend AI Processor. Through AscendCL APIs, it provides powerful hardware acceleration for media processing, such as image/video decoding, encoding, cropping, and resizing. NOTE: AIPP and DVPP can be used separately or together. When they are combined, DVPP first decodes, crops, and resizes the images or videos. Due to DVPP hardware restrictions, however, the image format and resolution output by DVPP may not meet the model requirements, so AIPP is then used to further perform CSC, image cropping, and padding. |
| Obtain and output audio data. | Audio Input (AI) | The AI module captures audio data. |
| | Audio Output (AO) | The AO module plays the audio decoded by the ADEC module. |
| Encode and decode audio data. | Audio Encoder (AENC) | The AENC module encodes the audio captured by the AI module and outputs audio streams. |
| | Audio Decoder (ADEC) | The ADEC module decodes G.711a, G.711u, and other audio streams for playback through the AO module. |
Typical Scenarios
The resolution and format of a source image or video can be converted to meet the model requirements. The following are examples of typical scenarios.
- Video decoding and resizing
The input video is in H.264 encoding format with a resolution of 1920 x 1080, but the YOLOv3 model for object detection requires an RGB or YUV input image with a resolution of 416 x 416. In this case, you can process the video as follows.
Figure 2 Video decoding and resizing
- Image decoding, resizing, and format conversion
The input image is in JPEG encoding format with a resolution of 1280 x 720, but the ResNet-50 model for image classification requires an RGB input image with a resolution of 224 x 224. In this case, you can process the image as follows.
Figure 3 Image decoding, resizing, and format conversion
- Image cropping, resizing, and format conversion
The input image is in YUV420SP format with a resolution of 1280 x 720, but the ResNet-50 model for image classification requires an RGB input image with a resolution of 224 x 224. In this case, you can process the image as follows.
Figure 4 Image cropping, resizing, and format conversion
Development Workflow of Media Data Processing

- Set up the environment.
For details, see Development and Operating Environment Setup.
- Create code directories.
Create directories to store code files, build scripts, test images, and model files.
The following is an example:
├── App name
│   ├── model                // Model files
│   │   ├── xxx.json
│   ├── data
│   │   ├── xxxxxx           // Test data
│   ├── inc                  // Header files that declare functions
│   │   ├── xxx.h
│   ├── out                  // Output files
│   ├── src
│   │   ├── xxx.json         // Configuration files for system initialization
│   │   ├── CMakeLists.txt   // Build scripts
│   │   ├── xxx.cpp          // Implementation files
- (Optional) Build a model.
If model inference is involved in the app, an offline model adapted to the Ascend AI Processor (a *.om file) is required. For details, see Building a Model.
- Develop an app.
For details about the required header files and library files, see Dependent Header Files and Library Files.
If model inference is involved in an app, write code by referring to Inference with Single-Batch and Static-Shape Inputs and Additional Features.
- Build and run the app. For details, see App Build and Run.