In Estimator Mode
Automated porting
- Check whether init_resource exists in the ported script.
- If it exists, modify it by referring to the following example. After the modification is complete, go to the next step.
```python
if __name__ == '__main__':
    session_config = tf.ConfigProto(allow_soft_placement=True)
    custom_op = session_config.graph_options.rewrite_options.custom_optimizers.add()
    custom_op.name = "NpuOptimizer"
    # Enable profiling.
    custom_op.parameter_map["profiling_mode"].b = True
    # Collect only task trace data.
    custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes('{"output":"/home/HwHiAiUser/output","task_trace":"on"}')
    # Collect both task trace and iteration trace data. You can collect only the
    # task trace data first; if the problem cannot be analyzed, also collect the
    # iteration trace data.
    # custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes('{"output":"/home/HwHiAiUser/output","task_trace":"on","training_trace":"on","aicpu":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}')
    (npu_sess, npu_shutdown) = init_resource(config=session_config)
    tf.app.run()
    shutdown_resource(npu_sess, npu_shutdown)
    close_session(npu_sess)
```
Note that only the parameters supported by initialize_system can be configured in the config argument of the init_resource function. Configure the other parameters in the run_config argument of the npu_run_config_init function.
- profiling_mode: whether to enable profiling.
- output: path for storing profiling data. Create the specified directory in the training environment (container or host) in advance. The running user configured during installation must have read and write permissions on this path. It can be either an absolute or a relative path.
- task_trace: whether to collect task trace data.
- training_trace: whether to collect iteration trace data. If it is set to on, both fp_point and bp_point must be configured.
- aicpu: whether to collect details about AI CPU operators, such as the operator execution time and data copy time.
- fp_point: start point of the forward propagation operators in the iteration trace, used to record the start timestamp of forward propagation. You can leave it empty so that the system obtains the value automatically, or obtain it manually by referring to How Do I Determine fp_point and bp_point?.
- bp_point: end point of the backward propagation operators in the iteration trace, used to record the end timestamp of backward propagation. You can leave it empty so that the system obtains the value automatically, or obtain it manually by referring to How Do I Determine fp_point and bp_point?.
- aic_metrics: AI Core hardware metrics. The value PipeUtilization indicates the percentages of time taken by compute units and MTEs.
- For details about profiling configuration, see Profiling.
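Because profiling_options is a JSON string, hand-writing it invites quoting mistakes. As a sketch (not part of the original script), the same string can be assembled with json.dumps using the option names and output path from the examples above:

```python
import json

# Build the profiling_options JSON string programmatically so malformed
# quoting is caught early. Option names match the document's examples.
options = {
    "output": "/home/HwHiAiUser/output",  # must exist and be writable by the running user
    "task_trace": "on",
    "training_trace": "on",
    "aicpu": "on",
    "fp_point": "",        # empty: let the system detect it
    "bp_point": "",        # empty: let the system detect it
    "aic_metrics": "PipeUtilization",
}
profiling_options = json.dumps(options)
```

The resulting string can then be passed to tf.compat.as_bytes exactly as in the examples above.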
- If it does not exist, go to the next step.
- Search for npu_run_config_init in the ported script and locate the run configuration function, such as run_config in the following example.
If the session_config parameter does not exist in the run configuration function, add the parameter according to the following example. If the session_config parameter exists, go to the next step.
```python
session_config = tf.ConfigProto(allow_soft_placement=True)
run_config = tf.estimator.RunConfig(
    train_distribute=distribution_strategy,
    session_config=session_config,
    save_checkpoints_secs=60*60*24)
classifier = tf.estimator.Estimator(
    model_fn=model_function,
    model_dir=flags_obj.model_dir,
    config=npu_run_config_init(run_config=run_config))
```
- Add the session_config configuration to enable profiling.
```python
session_config = tf.ConfigProto(allow_soft_placement=True)
custom_op = session_config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = 'NpuOptimizer'
# Enable profiling.
custom_op.parameter_map["profiling_mode"].b = True
# Collect only task trace data.
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes('{"output":"/home/HwHiAiUser/output","task_trace":"on"}')
# Collect both task trace and iteration trace data. You can collect only the
# task trace data first; if the problem cannot be analyzed, also collect the
# iteration trace data.
# custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes('{"output":"/home/HwHiAiUser/output","task_trace":"on","training_trace":"on","aicpu":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}')
run_config = tf.estimator.RunConfig(
    train_distribute=distribution_strategy,
    session_config=session_config,
    save_checkpoints_secs=60*60*24)
classifier = tf.estimator.Estimator(
    model_fn=model_function,
    model_dir=flags_obj.model_dir,
    config=npu_run_config_init(run_config=run_config))
```
- Run the training script again to collect profile data.
Manual porting
You can try to collect task trace data by enabling task_trace.
```python
from npu_bridge.npu_init import *

# enable_profiling: whether to enable profiling.
# output: path for storing profiling data. Create the specified directory in the
#         training environment (container or host) in advance. The running user
#         configured during installation must have read and write permissions on
#         this path. It can be either an absolute or a relative path.
# task_trace: whether to collect task trace data.
profiling_options = '{"output":"/home/HwHiAiUser/output","task_trace":"on"}'
profiling_config = ProfilingConfig(enable_profiling=True, profiling_options=profiling_options)
session_config = tf.ConfigProto()
config = NPURunConfig(profiling_config=profiling_config, session_config=session_config)
```
(Optional) If the problem cannot be spotted, enable training_trace to collect iteration traces.
```python
from npu_bridge.npu_init import *

# enable_profiling: whether to enable profiling.
# output: path for storing profiling data.
# task_trace: whether to collect task trace data.
# training_trace: whether to collect iteration trace data.
# fp_point: start point of the forward propagation operators in the iteration
#           trace, recording the start timestamp of forward propagation.
# bp_point: end point of the backward propagation operators in the iteration
#           trace, recording the end timestamp of backward propagation.
#           fp_point and bp_point are used to compute the time taken by forward
#           and backward propagation.
profiling_options = '{"output":"/home/HwHiAiUser/output","task_trace":"on","training_trace":"on","aicpu":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}'
profiling_config = ProfilingConfig(enable_profiling=True, profiling_options=profiling_options)
session_config = tf.ConfigProto(allow_soft_placement=True)
config = NPURunConfig(profiling_config=profiling_config, session_config=session_config)
```
Note that fp_point (start point of the forward propagation operators in the iteration trace) and bp_point (end point of the backward propagation operators in the iteration trace) are required for collecting iteration traces. You can leave them empty so that the system obtains the values automatically, or, if a collection exception occurs, configure them by referring to How Do I Determine fp_point and bp_point?.
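When automatic detection fails, fp_point and bp_point can be set explicitly in the options string. The sketch below uses hypothetical operator names as placeholders for whatever How Do I Determine fp_point and bp_point? yields for your network:

```python
import json

# Sketch with hypothetical operator names: explicitly configure fp_point and
# bp_point when automatic detection raises a collection exception. Replace the
# placeholders with the actual first forward / last backward operators of your graph.
profiling_options = json.dumps({
    "output": "/home/HwHiAiUser/output",
    "task_trace": "on",
    "training_trace": "on",
    "fp_point": "resnet_model/conv2d/Conv2D",  # hypothetical first forward op
    "bp_point": "gradients/AddN_70",           # hypothetical last backward op
})
```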
For details about related APIs, see ProfilingConfig Constructor.