Network-wide Accuracy Comparison

Overview

If the accuracy still does not meet expectations after the above steps, collect operator execution results (dump data) during training and compare them with the results of the corresponding benchmark operators (for example, TensorFlow on GPU/CPU). This helps quickly pinpoint operators with accuracy issues. The major steps are described as follows.

Prerequisites

Floating-point exceptions have been excluded, and the overflow/underflow detection function has been disabled.
Fusion exceptions have been excluded, and the fusion switch has been turned back on.
Model Accuracy Analyzer Deployment has been completed in the GPU/CPU/NPU training environment.
All random operations for image preprocessing have been disabled in your training script. Otherwise, the input data differs between runs and the comparison result is unusable. For details, see Disabling Random Preprocessings in the Training Script.
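
The last prerequisite can be pictured with a minimal sketch. The preprocess function and the DETERMINISTIC flag below are hypothetical and do not come from precision_tool or any training script; a real script would gate its own random augmentation calls in a similar way.

```python
import random

# Hypothetical switch: set it to True while collecting dump data.
DETERMINISTIC = True

def preprocess(pixels, deterministic=DETERMINISTIC):
    """Normalize one row of pixel values and optionally flip it at random."""
    row = [p / 255.0 for p in pixels]
    if not deterministic and random.random() < 0.5:
        row.reverse()  # random augmentation, skipped while dumping
    return row
```

With the flag set, repeated calls on the same input return identical data, which is what makes the GPU/CPU and NPU dumps comparable.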

Dumping Benchmark Data on GPU/CPU

Use the TensorFlow debugger tfdbg (by adding the tf_debug code to your CPU/GPU training script) and the precision_tool command line to generate .npy dump files.

Perform the following operations in the GPU/CPU training environment.

Install Python 3 dependencies in the GPU/CPU training environment.
pip3 install gnureadline pexpect

Edit the original training script to dump benchmark data.

This is implemented by using the print_tensor(pt) command of tf_debug. As the training code provides a flexible run() API and there is no way to inform the script of the exact run phase where tensors should be dumped, you must edit the training code. Ensure that training exits immediately when a step is complete to avoid accuracy analysis bugs.

# Import precision_tool/tf_config.py.
import precision_tool.tf_config as npu_tf_config

# If Estimator is used, add training_hooks to EstimatorSpec.
# It is equivalent to estim_specs = tf_debug.DumpingDebugHook("precision_data/tf/tf_debug").
estim_specs = tf.estimator.EstimatorSpec(training_hooks=[npu_tf_config.estimator_dump()])    

# If session.run is used, add the tf_debug wrapper to sess.
# It is equivalent to sess = tf_debug.DumpingDebugWrapperSession(sess, "precision_data/tf/tf_debug").
sess = npu_tf_config.sess_dump(sess=sess)

Perform GPU/CPU training.
A number of dump directories are generated under precision_data/tf/tf_debug/ based on the number of runs.
Use the precision_tool command line to analyze the dump files and generate the operator output tensor file.
python3 precision_tool/cli.py tf_dump
Find the extracted tensors in the precision_data/tf/dump/ directory.
To regenerate dump data, run the following command:
rm -rf precision_data/tf/dump/* && python3 precision_tool/cli.py tf_dump
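
To confirm that the extraction produced files, a short script along the following lines may help. It is only a sketch: the list_npy_files helper is hypothetical, and it assumes nothing about the dump directory beyond the .npy extension mentioned above.

```python
from pathlib import Path

def list_npy_files(dump_dir="precision_data/tf/dump"):
    """Return the relative paths of all .npy files under dump_dir, sorted."""
    root = Path(dump_dir)
    if not root.is_dir():
        return []  # nothing extracted yet, or wrong working directory
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.npy"))

if __name__ == "__main__":
    files = list_npy_files()
    print(f"{len(files)} tensor file(s) found")
```

An empty result usually means that the tf_dump step has not been run or that the script was started from a different working directory.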

Dumping User Model on NPU

Perform the following operations in the NPU training environment. Pay attention to the following points before dumping data:

Generally, the dump of the first step is enough for comparison and analysis. To avoid inaccurate comparison caused by random weights, enable checkpoint saving before training. If you find an accuracy issue with a particular step, resume the training process from the checkpoint closest to that step.

Modify config.py in the precision_tool/lib/config directory of the tool and specify the step of the data to be dumped.
# Set the steps to dump, for example '0|5|10'. To dump the input layer, retain the default value.
TF_DUMP_STEP = '0'
If TF_DUMP_STEP is not set, dump data of all iterations is collected.
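
The '0|5|10' format reads as a list of step indexes separated by vertical bars. The sketch below only illustrates that format; parse_dump_steps is a hypothetical helper, not the parser that precision_tool uses, and treating an empty value as "not set" is an assumption.

```python
def parse_dump_steps(value):
    """Split a '|'-separated step string such as '0|5|10' into sorted ints."""
    if not value:
        return None  # assumed to mean "not set": every iteration is dumped
    return sorted({int(part) for part in value.split("|")})
```

For example, parse_dump_steps('0|5|10') yields the steps 0, 5, and 10, while the default '0' selects only the first step.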

Edit the original training script to enable dumping.

With the following script, both dump data and dump graphs are generated.

# Import precision_tool/tf_config.py.
import precision_tool.tf_config as npu_tf_config

# 1. Manual network porting
# 1.1 Estimator mode
dump_config=npu_tf_config.estimator_dump_config(action='dump')
npu_config = NPURunConfig(dump_config=dump_config)
# 1.2 Session run mode
config = npu_tf_config.session_dump_config(config, action='dump')
sess = tf.Session(config=config)

# 2. Automated network porting
# If custom_op is not configured in the script, add the following statement to the script:
session_config = npu_tf_config.session_dump_config(session_config, action='dump')
# If custom_op has been configured in the script, add the last statement below (update_custom_op) to update custom_op:
custom_op = session_config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = 'NpuOptimizer'
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
custom_op = npu_tf_config.update_custom_op(custom_op, action='dump')

# 2.1 Estimator mode
run_config = tf.estimator.RunConfig(session_config=session_config,...)
# 2.2 Session run mode
with tf.Session(config=npu_config_proto(session_config)):
    ....
# 2.3 tf.keras mode
npu_keras_sess = set_keras_session_npu_config(config=session_config)

In addition to this method, you can also refer to Collecting Dump Data to modify the training script and collect dump data. However, that configuration is more complex, and you need to manually extract the dump data and save it to the required directory for analysis. Note that the two methods are mutually exclusive.

Run training. The dump graph and dump data files of GE are generated in the precision_data/npu/debug_0 directory.
For details about subsequent data analysis, see Comparing Dump Data.

Comparing Dump Data

Accuracy analysis depends on the ATC and msaccucmp.py tools in the CANN package. Perform the following operations in the CANN development environment:

Upload the precision_tool and precision_data directories (containing the benchmark and NPU dump data) to any directory in the CANN development environment. The two directories are organized as follows:

├── precision_tool              
│    ├── cli.py                   
│    ├── ...
├── precision_data              
│    ├── npu                   
│    │    ├── debug_0  // NPU dump data.
│    ├── tf
│    │    ├── dump     // Benchmark dump data
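
Before starting the comparison, the layout above can be checked with a few lines of Python. This is a sketch only: missing_paths and REQUIRED_PATHS are hypothetical names, and the list contains just the entries shown in the tree.

```python
from pathlib import Path

# Entries taken from the directory tree above.
REQUIRED_PATHS = (
    "precision_tool/cli.py",
    "precision_data/npu/debug_0",
    "precision_data/tf/dump",
)

def missing_paths(base="."):
    """Return the required entries that do not exist under base."""
    base = Path(base)
    return [rel for rel in REQUIRED_PATHS if not (base / rel).exists()]

if __name__ == "__main__":
    for rel in missing_paths():
        print(f"missing: {rel}")
```

Run it from the directory that holds precision_tool and precision_data; no output means the expected entries are present.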

Install the Python 3 dependencies.

# Graphviz is optional and needs to be installed only when you need to create operator subgraphs.
pip3 install rich graphviz
# Ubuntu/Debian
sudo apt-get install graphviz
# Fedora/CentOS
sudo yum install graphviz

Modify config.py in the precision_tool/lib/config directory.

# The tool depends on the atc and msaccucmp.py tools in the CANN package. Set this parameter to the CANN package installation path.
# By default, the CANN package is installed in /usr/local/Ascend. You can retain the path or replace the path as needed.
CMD_ROOT_PATH = '/usr/local/Ascend'
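
To check that the configured path really contains the two tools, a search like the following can be used. It is a sketch under stated assumptions: find_tools is a hypothetical helper, and it makes no assumption about where inside the CANN package atc and msaccucmp.py are located.

```python
import os

def find_tools(root, names=("atc", "msaccucmp.py")):
    """Walk root and return {tool name: first path found} for each tool."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in names:
            if name in filenames and name not in found:
                found[name] = os.path.join(dirpath, name)
    return found

if __name__ == "__main__":
    tools = find_tools("/usr/local/Ascend")
    for name in ("atc", "msaccucmp.py"):
        print(name, "->", tools.get(name, "not found"))
```

If either tool is reported as not found, CMD_ROOT_PATH most likely points to the wrong installation directory.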

Start the precision_tool command line.
python3 ./precision_tool/cli.py

Enter the command line interface.

PrecisionTool >

To exit, press Ctrl+C.
Run the ac -l [limit_num] (-c) command for network-wide accuracy comparison.
PrecisionTool > ac -c

The time consumption varies depending on the data size.

The comparison result is saved in CSV format in the precision_data/temp/vector_compare directory.

You can directly inspect the CSV file. For details, see Network Accuracy Comparison Result File.
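
If you prefer to filter the CSV file with a script, a minimal sketch is shown below. The column name CosineSimilarity and the helper low_similarity_rows are assumptions made for illustration; check the header of your result file and adjust the column argument accordingly.

```python
import csv

def low_similarity_rows(csv_path, threshold=0.98, column="CosineSimilarity"):
    """Return the rows whose value in `column` is below `threshold`."""
    rows = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                value = float(row[column])
            except (KeyError, TypeError, ValueError):
                continue  # column missing or value not numeric
            if value < threshold:
                rows.append(row)
    return rows
```

Rows with a missing or non-numeric similarity value are skipped, not reported, so treat an empty result with care.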
(Optional) Run the vcs -f [file_name] -c [cos_sim_threshold] -l [limit] command to narrow down the operators with potential accuracy issues.
By default, the vcs command returns operators with cosine similarity values less than 0.98. The threshold can be user-defined by using the -c argument.
- Left: name of the operator running on the NPU.
- Right: name of the operator running on the GPU or CPU.
- Input/Output: cosine similarity comparison result of the operator inputs/outputs. The value range is [-1, +1]. A value closer to 1 indicates higher similarity.
As shown in the preceding figure, the operator inputs are basically the same, but their first outputs are remarkably different (the cosine similarity is 0.806927, much less than 0.98). This indicates that the operator may have an accuracy drop.
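
For reference, cosine similarity between two flattened tensors can be computed as in the following sketch. The cosine_similarity function is written for illustration in plain Python and is not the implementation used by msaccucmp.py.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, +1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Identical tensors give a value of 1, orthogonal tensors give 0, and tensors pointing in opposite directions give -1.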

The list sorts operators with accuracy drop by execution sequence. Because an error in one operator propagates to the operators that follow it, analyze the top operator on the list first.
Run the ni (-n) [op_name] -g [graph] -a [attr] -s [save subgraph depth] command to query the node information of a particular operator.
The ni command outputs the following information based on the passed operator name.
1. Operator type. In this example, the operator type is Add.
  If PassName is present, the operator is a fused operator: the PassName value is the fusion pattern name, and OriginOp lists the base operators. The accuracy drop could be caused by operator fusion. In normal cases, any fusion bug should have been fixed in Fusion Exception Detection.
2. Preliminary dump analysis result (max/min/mean).
3. Subgraph of the specified depth with the current operator as the root, if the -s option is included. The following gives an example.

Analysis Principles

Network-wide data comparison provides a layer-wise cumulative comparison report between network dump data and TensorFlow benchmark data. Even for networks without accuracy drop, errors caused by hardware differences are inevitable in the comparison result, and such errors accumulate as the number of layers increases. Cosine similarity is a feasible metric to narrow down the operators with potential accuracy issues. A low cosine similarity marks an operator as a likely source of accuracy issues, whereas a high cosine similarity does not guarantee that the operator is bug-free.
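
The limitation of a high cosine similarity can be seen in a small worked example. In the sketch below, compare is a hypothetical helper and the values are made up: an output that is exactly twice the benchmark still has a cosine similarity of about 1, because the metric measures direction, not magnitude.

```python
import math

def compare(npu, benchmark):
    """Return (cosine similarity, max absolute error); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(npu, benchmark))
    norm = math.sqrt(sum(a * a for a in npu)) * math.sqrt(sum(b * b for b in benchmark))
    max_abs_err = max(abs(a - b) for a, b in zip(npu, benchmark))
    return dot / norm, max_abs_err

benchmark = [1.0, 2.0, 3.0, 4.0]
scaled = [2.0 * v for v in benchmark]  # made-up faulty output: every value doubled
cos, err = compare(scaled, benchmark)
print(f"cosine similarity = {cos:.6f}, max absolute error = {err}")
```

For this reason, it is worth looking at an absolute or relative error metric alongside cosine similarity when an operator is still under suspicion.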

Determine whether a faulty operator is a custom operator based on the operator type.
- For a custom operator, check that the implementation logic of the operator is consistent with that of the benchmark by inspecting the ni (-n) [op_name] -g [graph] -a [attr] -s [save subgraph depth] command output or the dump analysis report.
- For a built-in CANN operator, if the operator input or output type is float16, you can switch the operator type to float32 by using either of the following methods:
  1. (Recommended) Method 1: Modify the blocklist, trustlist, and graylist for the operator that uses the mixed precision mode. For details, see Modifying the Blocklist, Trustlist, and Graylist for Mixed Precision.
  2. Method 2: Use the keep_dtype_scope API to preserve the original precision for a selected operator.
    from npu_bridge.npu_init import *
    with npu_scope.keep_dtype_scope():
        y = tf.multiply(x1, x2)
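
The reason a switch from float16 to float32 can help is the difference in rounding error. The following sketch uses only the Python standard library to round a value to each format; the helper names are hypothetical, and the example says nothing about how a particular CANN operator computes internally.

```python
import struct

def round_to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

if __name__ == "__main__":
    for value in (0.1, 2049.0):
        print(value, round_to_float16(value), round_to_float32(value))
```

float16 carries an 11-bit significand, so a value such as 2049 is not representable and is rounded to a neighboring value, while float32 stores it exactly.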
If the fault persists, contact technical support.
