Network Accuracy Comparison

Overview

If the accuracy problem cannot be located in the preceding steps, dump the compute result of each operator during the training process and compare the dump data with that of each benchmark operator (such as the TensorFlow equivalents) to quickly spot the faulty operators. The major steps are described as follows.

Prerequisites

  1. Floating-point exceptions have been excluded, and the overflow/underflow detection function has been disabled.
  2. Fusion exceptions have been excluded, and the fusion switch has been restored to the enabled state.
  3. You have completed One-Click Accuracy Analyzer Deployment.
  4. All random operations for image preprocessing have been disabled in your training script. Otherwise, the comparison result is unusable because the input data is inconsistent. For details, see Disabling Random Preprocessings in the Training Script.

Dumping Benchmark Data on the GPU/CPU

  • Before obtaining the dump data or .npy data during the original TensorFlow 2.x network training or online inference, a complete, executable, standard TensorFlow model training or online inference project is required. For details about how to prepare the GPU training environment, see Quickly Creating a GPU Training Environment on an ECS. The content in the link is for reference only.
  • Install the debugger tfdbg_ascend of TensorFlow 2.x. For details, see tfdbg_ascend README.
  • Disable all random functions in the script, including but not limited to dataset shuffling, random parameter initialization, and the implicit random initialization performed by some operators (such as the dense operator). Ensure that no parameter in the script is initialized randomly.
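
In practice, disabling randomness amounts to fixing every seed so that repeated runs produce bit-identical inputs and weights. A minimal sketch of the idea using NumPy (the same pattern applies to `tf.random.set_seed` and Python's `random.seed` in a real training script; the seed value 42 is an arbitrary choice):

```python
import numpy as np

def fixed_draw(seed=42, n=3):
    """Reseed before drawing so every run produces identical values."""
    rng = np.random.default_rng(seed)
    return rng.random(n)

# Two runs with the same seed yield bit-identical data, which is what
# the dump comparison requires of its inputs.
assert (fixed_draw() == fixed_draw()).all()
```

Any remaining source of randomness (for example, dataset shuffling without a fixed seed) breaks this guarantee and invalidates the comparison.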

You can use the TensorFlow debugger (tfdbg_ascend) to generate .npy files. The major steps are as follows:

  1. Modify the TensorFlow training script that calls the model. The sample code is as follows:

    Sample 1:

    1. Import the debug plugin.
      import tfdbg_ascend as dbg
      
    2. Add the following code before the training startup code of each step. For example, to dump the data of the fifth step, add the code as follows:
      dbg.disable()
      if current_step == 5:
          dbg.enable()
          dbg.set_dump_path('/home/test/gpu_dump')
      
    Sample 2:
    1. Import the debug plugin.
      import tfdbg_ascend as dbg
      
    2. Dump the data of the fourth step (example). If you do not call dbg.enable, the dump function is enabled by default. If you do not specify a dump path, dump files are saved to the directory where the training script is located.
      class DumpConfig(tf.keras.callbacks.Callback):
          def __init__(self):
              super().__init__()
          def on_batch_begin(self, batch, logs={}):
              if batch == 4:
                  dbg.enable()
                  dbg.set_dump_path("/user/name1/pip_pkg/dump4")
              else:
                  dbg.disable()
      
    3. Register the callback functions.
      # Define callbacks.
      callbacks = [
          ModelCheckpoint(
              f'models/model_epochs-{epochs}_batch-{batch_size}_loss-{loss_function}_{Mask2FaceModel.get_datetime_string()}.h5'),
          LossHistory(batch_size),
          DumpConfig()
      ]

      # Fit the model: the model is called from here.
      history = self.model.fit(train_dataset, validation_data=valid_dataset, epochs=1, callbacks=callbacks, verbose=2)
      
  2. Execute the training script. After the training job is stopped, the .npy files are generated in the specified directory.
  3. Check that names of the generated .npy files comply with the naming rules, as shown in Figure 1.
    • An .npy file is named in the format {op_name}.{output_index}.{timestamp}.npy, where op_name must match the regular expression [A-Za-z0-9_-]+, timestamp must match [0-9]{1,255}, and output_index must be a single digit from 0 to 9.
    • If the name of an .npy file exceeds 255 characters due to a long operator name, comparison of this operator is not supported.
    Figure 1 Viewing the .npy files
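
The naming rule above can be checked programmatically before starting a comparison. A small sketch assembling the rule into one regular expression (the sample file names are hypothetical):

```python
import re

# {op_name}.{output_index}.{timestamp}.npy per the naming rule above:
# op_name matches [A-Za-z0-9_-]+, output_index is a single digit,
# timestamp matches [0-9]{1,255}.
NPY_NAME = re.compile(r'^[A-Za-z0-9_-]+\.[0-9]\.[0-9]{1,255}\.npy$')

def is_valid_npy_name(name):
    """Return True if the file name follows the dump naming rule
    and stays within the 255-character limit."""
    return len(name) <= 255 and NPY_NAME.match(name) is not None

print(is_valid_npy_name('Add_1.0.1660119920.npy'))   # True
print(is_valid_npy_name('Add 1.0.1660119920.npy'))   # False: space not allowed
```

Files that fail this check (for example, names longer than 255 characters) are skipped during comparison.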

Dumping User Model Data on the NPU

Perform the following operations in the NPU training environment. Pay attention to the following points before dumping data:

Generally, dumping the first step is enough for comparison and analysis. To avoid inaccurate comparison caused by random weights, enable checkpoint saving before training. If you find an accuracy issue at a particular step, resume training from the checkpoint closest to that step.

  1. Modify the config.py file in the precision_tool/lib/config directory and specify the step of the data to be dumped.
    # Set the steps to dump, for example '0|5|10'. To dump the input layer, retain the default value.
    TF_DUMP_STEP = '0'
    
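
TF_DUMP_STEP uses '|' as the separator between step numbers, so a value such as '0|5|10' selects three steps. Parsing it is straightforward (the helper function name is ours, not part of the tool):

```python
def parse_dump_steps(value):
    """Parse a TF_DUMP_STEP-style string such as '0|5|10' into a
    list of integer step numbers."""
    return [int(s) for s in value.split('|') if s != '']

print(parse_dump_steps('0|5|10'))  # [0, 5, 10]
```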
  2. Edit the original training script to enable dumping.
    With the following script, both dump data and dump graphs are generated.
    import precision_tool.tf_config as npu_tf_config 
    npu_tf_config.npu_device_dump_config(npu_device, action='dump')
    

    In addition to this method, the Accuracy Debugging Tool Guide describes another mode for collecting dump data. However, its configuration is complex, and you need to manually extract the dump data and save it to the required directory for analysis. Note that the two modes are mutually exclusive.

  3. Run training. The dump graph and dump data files of GE are generated in the precision_data/npu/debug_0 directory.

Comparing Dump Data

Accuracy analysis depends on the atc and msaccucmp.py tools in the CANN Toolkit. Perform the following operations in the CANN development environment where the Toolkit is installed.

  1. Upload the precision_tool directory and precision_data directory (containing the benchmark and NPU dump data) to any directory in the Toolkit environment. The two directories are organized as follows:
    ├── precision_tool
    │   ├── cli.py
    │   ├── ...
    ├── precision_data
    │   ├── npu
    │   │   ├── debug_0   // NPU dump data
    │   ├── tf
    │   │   ├── dump      // Benchmark dump data
    
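
If you assemble the layout by script rather than by hand, a short sketch with pathlib creates the expected skeleton (the directory names are the ones shown above; the precision_tool directory itself is copied in separately):

```python
from pathlib import Path

def make_layout(root='.'):
    """Create the precision_data directory skeleton expected by
    precision_tool under the given root."""
    for d in ('precision_data/npu/debug_0', 'precision_data/tf/dump'):
        Path(root, d).mkdir(parents=True, exist_ok=True)
```

Then move the NPU dump files into precision_data/npu/debug_0 and the benchmark dump files into precision_data/tf/dump.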
  2. Install the Python dependencies.
    # Graphviz is optional and needs to be installed only when you need to create operator subgraphs.
    pip3 install rich graphviz
    # Ubuntu/Debian
    sudo apt-get install graphviz
    # Fedora/CentOS
    sudo yum install graphviz
  3. Modify config.py in the precision_tool/lib/config directory.
    # The tool depends on the atc and msaccucmp.py tools in Toolkit. Set this parameter to the Toolkit installation path.
    # By default, Toolkit is installed in /usr/local/Ascend. Replace the path as needed.
    CMD_ROOT_PATH = '/usr/local/Ascend'
    
  4. Start the precision_tool command line.

    python3 ./precision_tool/cli.py

    Enter the command line interface:

    PrecisionTool >

  5. Run the ac -l [limit_num] [-c] command for network comparison.

    PrecisionTool > ac -c

    The time consumption varies depending on the data size.

    The comparison result is saved in CSV format in the precision_data/temp/vector_compare directory.

    You can directly inspect the CSV file. For details, see Network Accuracy Comparison Result File.
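
Because the result is plain CSV, you can also pre-filter it by script. A minimal sketch with the standard csv module; the column name 'CosineSimilarity' is an assumption, so check the header of your vector_compare result file and adjust accordingly:

```python
import csv

def low_similarity_ops(csv_path, threshold=0.98, column='CosineSimilarity'):
    """Return rows whose cosine similarity falls below the threshold.

    Rows without a numeric value in the similarity column are skipped.
    """
    with open(csv_path, newline='') as f:
        rows = []
        for row in csv.DictReader(f):
            try:
                if float(row[column]) < threshold:
                    rows.append(row)
            except (KeyError, ValueError, TypeError):
                continue
        return rows
```

This reproduces in script form what the vcs command in the next step does interactively.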

  6. (Optional) Run the vcs -f [file_name] -c [cos_sim_threshold] -l [limit] command to narrow down the operators with potential accuracy issues.

    By default, the vcs command returns operators with a cosine similarity of less than 0.98. The threshold can be customized with the -c argument.

    • Left: name of the operator running on the NPU.
    • Right: name of the operator running on the GPU or CPU.
    • Input/Output: cosine similarity comparison result of the operator inputs/outputs. The value range is [–1, +1]. A value closer to 1 indicates higher similarity.

    In the example shown in the preceding figure, the operator inputs are basically the same, but their first outputs differ remarkably (the cosine similarity is 0.806927, far less than 0.98). This indicates that the operator may have an accuracy drop.

    The list sorts operators with accuracy drop by execution sequence. Because successive operators are closely coupled, analyze the top operator on the list first.
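
The cosine similarity used here can be reproduced for any pair of dumped tensors, for example, to double-check a suspicious operator's .npy files by hand. A minimal sketch with NumPy (the file paths in the usage comment are hypothetical):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two flattened tensors, in [-1, 1].
    A value near 1 indicates near-identical data."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

# Typical use: compare an NPU dump against its GPU/CPU benchmark.
# npu = np.load('precision_data/npu/debug_0/<converted_dump>.npy')
# gpu = np.load('gpu_dump/<op_name>.<output_index>.<timestamp>.npy')
# print(cosine_similarity(npu, gpu))
```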

  7. Run the ni (-n) [op_name] -g [graph] -a [attr] -s [save sub graph deep] command to query the node information of a particular operator.

    The ni command outputs the following information based on the passed operator name.

    1. Operator type. In this example, the operator type is Add.

      PassName indicates that the operator is a fused operator, whose value indicates the fusion pattern name, and OriginOp indicates the base operators. The accuracy drop could be caused by operator fusion. In normal cases, any fusion bug should have been fixed in Floating-Point Exception Detection.

    2. Preliminary dump analysis result (max/min/mean).
    3. Subgraph of the specified depth with the current operator as the root, if the -s option is included.

Analysis Principles

Network comparison provides a layer-wise comparison report between the imported network and its TensorFlow benchmark. Even for networks without accuracy drop, errors caused by hardware differences are inevitable in the comparison result, and such errors accumulate as the number of layers increases. Cosine similarity is a feasible metric for narrowing down the operators with potential accuracy issues: a low cosine similarity almost always points to an accuracy bug, while a high cosine similarity does not guarantee that the operator is bug-free.

  1. Determine whether an error operator is a custom operator based on the operator type.
    • For a custom operator, check that the implementation logic of the operator is consistent with that of the benchmark by inspecting the ni (-n) [op_name] -g [graph] -a [attr] -s [save sub graph deep] command output or the dump analysis report.
    • For a built-in CANN operator, if the operator input or output type is float16, you can switch the operator type to float32 by using either of the following methods:
      1. (Recommended) Method 1: Use modify_mixlist of Performance Tuning to modify the blocklist, trustlist, and graylist for the operator that uses the mixed precision mode.
      2. Method 2: Use the npu.keep_dtype_scope API to preserve the original precision for a selected operator.
        import npu_device as npu
        with npu.keep_dtype_scope():
            v = tf.add(1, 1)
        
  2. If the problem persists, submit an issue in the Ascend community.