Compute Function Implementation
The following is an example of implementing the Compute function, including the validity check and the compute logic. You can call the Dump log API at appropriate points to print debugging information. For details about the APIs used in the example, see AI CPU API.
// Obtain the input tensor from the context.
Tensor *input = ctx.Input(0);
// Perform basic verification on the input tensor.
// For example, perform null pointer check on the obtained input.
if (input == nullptr) {
return 1;
}
// Obtain the shape information of the input tensor.
auto inputShape = input->GetTensorShape();
for (int32_t i = 0; i < inputShape->GetDims(); ++i) {
// Call the Dump log API as required to print related debugging information.
CUST_KERNEL_LOG_DEBUG(ctx, "dim[%d] size:%ld.", i, inputShape->GetDimSize(i));
}
// Obtain the DataType of the input tensor.
DataType inputType = input->GetDataType();
// Obtain the data address of the input tensor.
auto inputData = input->GetData();
// Obtain the data address and shape of the output tensor.
Tensor *output = ctx.Output(0);
auto outputShape = output->GetTensorShape();
auto outputData = output->GetData();
// Save the output result.
// GetData() returns a raw void * pointer, so cast it to the actual data type
// before dereferencing. float is used here as an example.
static_cast<float *>(outputData)[0] = static_cast<float *>(inputData)[0];
Validity Check
- (Required) Performing null pointer checks on the obtained inputs
- Checking the number of inputs and outputs
- Checking the internal logic of the operator inputs
For example, for a multi-input operator, the data types of its input tensors must be the same, so consistency verification on the input data types is required. If the internal logic of the inputs has already been verified in the Verify function of Operator Prototype Definition, it does not need to be checked again in the Compute function.
- Checking data types
Check the data types as needed. If an operator supports only data types A and B, verify that the actual data type is among the supported types before implementing the operator's compute logic (see the sketch after this list).
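As an illustration, a minimal sketch combining these checks for a hypothetical two-input, one-output operator might look as follows. The PARAM_INVALID return code mirrors the switch example later in this section, and GetInputsSize/GetOutputsSize are assumed context methods; confirm the actual constants and methods against the AI CPU API reference.
// Sketch of basic validity checks for an assumed two-input, one-output
// operator whose inputs must share the same data type.
Tensor *input0 = ctx.Input(0);
Tensor *input1 = ctx.Input(1);
Tensor *output = ctx.Output(0);
// (Required) Null pointer check on the obtained inputs and output.
if (input0 == nullptr || input1 == nullptr || output == nullptr) {
  return PARAM_INVALID;
}
// Check the number of inputs and outputs.
if (ctx.GetInputsSize() != 2 || ctx.GetOutputsSize() != 1) {
  return PARAM_INVALID;
}
// Internal logic check: for this multi-input operator, the data types
// of both input tensors must be the same.
if (input0->GetDataType() != input1->GetDataType()) {
  return PARAM_INVALID;
}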
Compute Logic Implementation
// Obtain the data type of input(i).
auto data_type = ctx.Input(i)->GetDataType();
switch (data_type) {
case DT_FLOAT16:
return OpCompute<Eigen::half>(...);
case DT_FLOAT:
return OpCompute<float>(...);
case DT_DOUBLE:
return OpCompute<double>(...);
case DT_INT8:
return OpCompute<int8_t>(...);
case DT_INT16:
return OpCompute<int16_t>(...);
... ...
default:
return PARAM_INVALID;
}
Pay attention to the following points when implementing the operator's compute logic:
- C++ does not support the half-precision floating-point type. To process data of this type, use the third-party library Eigen (version 3.3.9 is advisable). For details, visit LINK.
The following takes the Less operator as an example to show how to cast its half-precision inputs using Eigen.
auto input = reinterpret_cast<Eigen::half *>(input_0->GetData());
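Building on this cast, the following sketch shows how an elementwise Less computation might then operate on the half-precision data. The second input input_1, the output tensor output_0, and the prior validation of matching shapes are assumptions for illustration; Eigen::half supports the comparison operators directly.
// Sketch: elementwise Less on half-precision inputs via Eigen::half.
auto x = reinterpret_cast<Eigen::half *>(input_0->GetData());
auto y = reinterpret_cast<Eigen::half *>(input_1->GetData());
auto out = reinterpret_cast<bool *>(output_0->GetData());
// Both inputs are assumed to have been verified as having the same shape.
int64_t n = input_0->NumElements();
for (int64_t i = 0; i < n; ++i) {
  out[i] = x[i] < y[i];
}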
Notes:
The third-party library Eigen also provides functions for linear algebra, such as matrix and vector computation. If your operator implementation involves related computations, Eigen is an excellent choice.
The following code shows how to define and initialize a matrix using the Eigen library, and obtain the determinant.
#include "Eigen/Dense"

int m = 4, n = 4;  // example dimensions
Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic> eMatrix(m, n);
for (int i = 0; i < m; i++) {
  for (int j = 0; j < n; j++) {
    eMatrix(i, j) = i * m + j * n;
  }
}
// Use Eigen to calculate the determinant.
float result = eMatrix.determinant();
- For a dynamic-shape operator, the shape of the output tensor cannot be inferred by InferShape defined in Operator Prototype Definition. Therefore, the output shape needs to be computed and updated in the Compute function. The following is an example.
std::vector<int64_t> dims = {inputData[0], inputData[1], 3, 4};
outputShape->SetDimSizes(dims);
For details, refer to the code example of the dynamic-shape operator UniqueCust in the Ascend samples repository.
- It is advisable to perform block-based parallel computing to harness the full potential of the Ascend AI Processor.
- You can decide whether to perform block-based parallel computing based on the performance requirements of the operator.
- Block-based parallel computing supports only computations that are independent of one another. If data dependencies exist between the blocks, this method is inapplicable.
- To use this function, set opInfo.flagSupportBlockDim to True and opInfo.functionName to RunCpuKernelWithBlock in Operator Information Library Definition.
- Obtain the number of blocks and the block IDs by calling GetAttr. The following is an example.
uint32_t blockdim = ctx.GetAttr("block_num")->GetInt();
uint32_t blockid = ctx.GetAttr("block_id")->GetInt();
Note: The number of blocks is automatically calculated by the system based on the configured BlockDim partition principle (opInfo.blockDimByIndex configured in the operator information library) and the number of CPU cores. Each block is indexed by a block_id whose value ranges from 0 to blockdim minus 1.
After obtaining block_num and block_id, perform basic verification on them, as shown in the sketch below.
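For example, a basic verification might confirm that both attributes were obtained and that block_id falls within the valid range. The following is a sketch under those assumptions; the null checks on the attribute pointers and the PARAM_INVALID return code mirror the conventions used earlier.
// Basic verification sketch for block_num and block_id.
auto *block_num_attr = ctx.GetAttr("block_num");
auto *block_id_attr = ctx.GetAttr("block_id");
if (block_num_attr == nullptr || block_id_attr == nullptr) {
  return PARAM_INVALID;
}
uint32_t blockdim = block_num_attr->GetInt();
uint32_t blockid = block_id_attr->GetInt();
// blockdim must be nonzero and blockid must fall in [0, blockdim - 1].
if (blockdim == 0 || blockid >= blockdim) {
  return PARAM_INVALID;
}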
- Compute the offset and data volume of the current computation (computation of the current block).
For example, if opInfo.blockDimByIndex is set to -1, the BlockDim partition is performed based on the number of elements in the first input. The code example for calculating the offset and data volume is as follows:
// Obtain the number of elements of the first input.
int64_t total = input0->NumElements();
int64_t startpos = 0;
int64_t len = total;
if (blockdim != 1) {
  // Compute the maximum data volume of each block.
  // Cast to double before dividing so that std::ceil rounds up correctly.
  uint32_t per_unit = std::ceil(static_cast<double>(total) / blockdim);
  // Obtain the offset of this computation.
  startpos = static_cast<int64_t>(blockid) * per_unit;
  // Obtain the data volume of the current computation.
  // The value range of blockid is 0 to blockdim minus 1. To prevent the last
  // data block from overrunning, when blockid points to the last block,
  // len = total - per_unit * (blockdim - 1).
  len = blockid < blockdim - 1 ? per_unit : (total - per_unit * (blockdim - 1));
}
- Implement the compute logic of the operator, as shown in the sketch below.
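As a sketch of this last step, the code below applies an independent elementwise computation to the current block's slice [startpos, startpos + len). The float data type, the output0 tensor, the doubling computation, and the success return code 0 are assumptions for illustration; any per-element logic with no data dependency between blocks fits this pattern.
// Apply the operator's computation to this block's slice only.
auto in = static_cast<float *>(input0->GetData());    // assumed float input
auto out = static_cast<float *>(output0->GetData());  // assumed float output
for (int64_t i = startpos; i < startpos + len; ++i) {
  out[i] = in[i] * 2.0f;  // placeholder for the operator's actual computation
}
return 0;  // assumed success code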
Click here for a complete sample of block-based parallel computing. For more samples, see Custom Operator Template.
Using the Log Dump Function
The log dump function records logs generated during the execution of AI CPU operators, facilitating functional debugging of AI CPU operators. To use the log dump function, perform the following steps:
- Call the log dump API in the Compute function to output logs of the specified level. For details about the API, see Dump Log APIs. A code example is as follows.
uint32_t UniqueCpuKernel::Compute(CpuKernelContext &ctx) {
  Tensor *param_tensor = ctx.Input(0);
  if (param_tensor == nullptr) {
    return 1;
  }
  auto param_shape = param_tensor->GetTensorShape();
  if (param_shape == nullptr) {
    return 1;
  }
  int64_t p_size = 1;
  for (int i = 0; i < param_shape->GetDims(); ++i) {
    p_size *= param_shape->GetDimSize(i);
  }
  CUST_KERNEL_LOG_DEBUG(ctx, "Cust UniqueCpuKernel Compute, p_size is %ld.", p_size);
  ...
}
- (Optional) Set opInfo.workspaceSize in the AI CPU operator information library to configure the memory size for recording AI CPU operator logs. The default value is 2 KB. For details about the parameters, see AI CPU Operator Information Library.
- Enable the dump function before running the operator to make the log dump function take effect. The enablement method depends on the network operating mode. Take TensorFlow online inference as an example. In sess.run mode, configure enable_dump, dump_path, and dump_mode through session. The following is an example:
import tensorflow as tf
import numpy as np
from npu_bridge.estimator import npu_ops

sess_config = tf.compat.v1.ConfigProto()
custom_op = sess_config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["enable_data_pre_proc"].b = True
custom_op.parameter_map["use_off_line"].b = True
custom_op.parameter_map["min_group_size"].b = 1
# enable_dump: specifies whether to enable the dump function.
custom_op.parameter_map["enable_dump"].b = True
# dump_path: specifies the path for storing the dumped data.
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/test/log")
# When dump_mode is set to all, AI CPU logs can be dumped.
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("all")

tensor = tf.constant([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=tf.float32)
output, idx = tf.raw_ops.Unique(x=tensor)
with tf.compat.v1.Session(config=sess_config) as sess:
    print("data = ", sess.run([output, idx]))
    print("end proc")
After the operator is executed, a log dump file is generated in the dump data storage path. The file name format is {op_type}.{op_name}.{taskid}.{stream_id}.{timestamp}, where {op_type} indicates the operator type, {op_name} indicates the operator name, {taskid} indicates the ID of the task that calls the operator Compute API, {stream_id} indicates the ID of the stream that executes the operator, and {timestamp} indicates the timestamp.
- Use the dump file parser in the development toolkit to parse the log dump file. For details about how to install the toolkit, see the toolkit installation guide. After the installation is complete, run the dump_parser.py script in the {install_path}/tools/operator_cmp/compare directory to parse the log dump file. {install_path} indicates the toolkit installation directory.
The command example and parameter description are as follows:
python3 dump_parser.py save_log -d dump_file [-out output]
Table 1 Command-line options

| Option | Description | Required |
| ------ | ----------- | -------- |
| -d, --dump_file | Dump file to be parsed. | Yes |
| -out, --output | Directory for storing the parsed dump logs. The default value is the current path. | No |
The parsed log file is named in the format {dump_file_name}.{index}.log, where {dump_file_name} indicates the name of the dump file and {index} indicates the index number of the dump file. For example, if you run the following command, a log file named Unique.Unique.3.2.1671845532156774.0.log is generated in the current directory:
python3 dump_parser.py save_log -d Unique.Unique.3.2.1671845532156774
For a normally running network with exception_dump enabled, if an exception occurs during AI CPU operator execution, log dump information of the abnormal operator is generated. The parsing method is the same as that described in the preceding parsing step.