How Do I Fix the Training Error Caused by Placing Variable Initialization and Data Preprocessing Initialization in the Same Subgraph?

Symptom

After a network such as LeNet is ported to the Ascend AI Processor for training, the loss does not converge and the accuracy stays low without improving.

In addition, all variables are found to be initialized to 0s.

Possible Cause

Analysis of the user training script shows the following initialization code:

sess.run(tf.group(
    tf.global_variables_initializer(),  # Variable initialization
    tf.local_variables_initializer(),   # Variable initialization
    iterator.initializer                # Data preprocessing initialization
))

In the preceding code, TensorFlow graph partitioning places variable initialization and data preprocessing initialization in the same subgraph. Of these, only variable initialization can be offloaded to the device, so the entire subgraph cannot be offloaded to the Ascend AI Processor.

Because variable initialization is therefore performed on the host, all variables are initialized to zero.

Solution

Run variable initialization and data preprocessing initialization in separate sess.run calls so that TensorFlow graph partitioning places them in different subgraphs:

sess.run(tf.group(
    tf.global_variables_initializer(),  # Variable initialization
    tf.local_variables_initializer()    # Variable initialization
))
sess.run(iterator.initializer)          # Data preprocessing initialization
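For reference, a minimal self-contained sketch of the corrected pattern is shown below. The variable `v`, the dataset, and the iterator are hypothetical stand-ins for the network's real variables and input pipeline; the code uses the tf.compat.v1 API so it also runs under TensorFlow 2.x.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Hypothetical minimal graph: one variable and one dataset iterator.
v = tf.compat.v1.get_variable(
    "v", shape=[2], initializer=tf.compat.v1.ones_initializer())
dataset = tf.compat.v1.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0])
iterator = tf.compat.v1.data.make_initializable_iterator(dataset)
next_elem = iterator.get_next()

with tf.compat.v1.Session() as sess:
    # Variable initialization in its own sess.run call ...
    sess.run(tf.compat.v1.group(
        tf.compat.v1.global_variables_initializer(),
        tf.compat.v1.local_variables_initializer()))
    # ... and data preprocessing initialization separately, so the two
    # land in different subgraphs during graph partitioning.
    sess.run(iterator.initializer)

    v_value = sess.run(v)           # variables hold their real initial values
    first_elem = sess.run(next_elem)  # input pipeline works independently
```

Keeping the two sess.run calls separate is the only change required; the rest of the training loop is unaffected.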