Using the msnpureport Tool to Collect More AI Core Error Information

When you contact technical support to locate hard-to-diagnose AI Core errors, the information collected in Collecting AI Core Error Information is not enough. You also need to perform the following operations to further check operator and hardware issues: obtaining the device configuration information, setting whether the TaskSchedule automatically resets the accelerator, setting serial or parallel execution of tasks on the AI Core, masking task execution on specified AI Cores or Vector Cores, setting the iCache bit flipping check range, and exporting register information.

  • The Atlas 200/300/500 Inference Product does not support the locating method in this step.
  • Do not use the locating method in this step while an Ascend AI application process is running. Otherwise, the application process may run abnormally or the commands in this step may fail. Use the locating method only after the application process exits.
  • Obtain the current device configuration information.

    The following is the command format. --device or -d specifies the device ID. This parameter is optional and the default value is 0.

    msnpureport config --get [--device <deviceId>]

    Command example:

    msnpureport config --get --device 0
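
When several devices are installed, the query above can be repeated for each device and the output saved for technical support. The following is a minimal sketch, not part of the tool: the function name dump_device_config, the MSNPUREPORT override variable, and the output file names are assumptions made for illustration. Only the msnpureport config --get --device <deviceId> command comes from this document.

```shell
# Hypothetical helper: save the configuration of each listed device to a file.
# MSNPUREPORT can be overridden (for example, with "echo") for a dry run.
MSNPUREPORT="${MSNPUREPORT:-msnpureport}"

dump_device_config() {
    # $1: output directory; remaining arguments: device IDs
    ddc_outdir="$1"
    shift
    mkdir -p "$ddc_outdir" || return 1
    for ddc_id in "$@"; do
        # Same command as above, run for one device at a time.
        "$MSNPUREPORT" config --get --device "$ddc_id" \
            > "$ddc_outdir/device_${ddc_id}_config.txt" || return 1
    done
}

# Example (assumes devices 0 and 1 exist):
# dump_device_config ./npu_config 0 1
```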
  • Set whether the TaskSchedule automatically resets the accelerator. This setting is used to export more detailed and accurate register information for fault locating, but it affects execution performance.
    The command and its parameters are described as follows:

    Command: msnpureport config --set [--device <deviceId>]

    Parameter: --accelerator_recover {0 | 1}

    Remarks:

    • --device or -d: Specify the device ID. This parameter is optional and the default value is 0.
    • --accelerator_recover: Set whether to reset the accelerator.
      • 0: The TaskSchedule does not automatically reset the accelerator.

        If you use this value, restart the operating environment after fault locating is complete to restore AI Core services. After the restart, the accelerator is automatically reset again by default.

      • 1: The accelerator is automatically reset. This option is the default value.

    Command example:

    msnpureport config --set --accelerator_recover 0 -d 1
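
Because --accelerator_recover accepts only 0 or 1 and the value 0 requires a later restart, a small wrapper can guard against typos and print a reminder. This is a sketch under assumptions: the function name set_accelerator_recover, the MSNPUREPORT override variable, and the reminder text are hypothetical. The command line it builds is the one shown in the example above.

```shell
# Hypothetical wrapper around "msnpureport config --set --accelerator_recover".
# MSNPUREPORT can be overridden (for example, with "echo") for a dry run.
MSNPUREPORT="${MSNPUREPORT:-msnpureport}"

set_accelerator_recover() {
    # $1: 0 (no automatic reset) or 1 (automatic reset, the default)
    # $2: device ID (optional, defaults to 0)
    sar_value="$1"
    sar_device="${2:-0}"
    case "$sar_value" in
        0|1) ;;
        *) echo "accelerator_recover must be 0 or 1" >&2; return 2 ;;
    esac
    "$MSNPUREPORT" config --set --accelerator_recover "$sar_value" -d "$sar_device" || return 1
    if [ "$sar_value" = 0 ]; then
        # Value 0 stays in effect until the operating environment is restarted.
        echo "Reminder: restart the operating environment after fault locating." >&2
    fi
}

# Example: disable automatic reset on device 1.
# set_accelerator_recover 0 1
```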
  • Set the singlecommit mode of the AI Core. After the setting, perform step 2 again to collect fault information and analyze problems.
    The command and its parameters are described as follows.

    Command: msnpureport config --set [--device <deviceId>]

    Parameter: --singlecommit {0 | 1}

    Remarks:

    • --device or -d: Specify the device ID. This parameter is optional and the default value is 0.
    • --singlecommit: Set the serial or parallel execution of tasks on the AI Core.
      • 0: The AI Core singlecommit mode is disabled. In this case, multiple instructions in the AI Core are executed in parallel. This is the default value.
      • 1: The AI Core singlecommit mode is enabled. In this case, multiple instructions in the AI Core are executed serially.

        This setting does not persist across restarts: after the operating environment is restarted, the AI Core singlecommit mode is disabled again by default.

    Command example:

    msnpureport config --set --singlecommit 0 -d 1
  • Mask the task execution on the specified AI Core or Vector Core to locate the faulty AI Core or Vector Core. After the setting, perform step 2 again to collect fault information and analyze problems.
    The commands and their parameters are described as follows.

    Command: msnpureport config --set [--device <deviceId>]

    Parameter: --aic_switch {0 | 1} --coreid 3,4

    Remarks: Mask the specified AI Cores. Specify --aic_switch first, and then --coreid.

    • --device or -d: Specify the device ID. This parameter is optional and the default value is 0.
    • --aic_switch
      • 0: The AI Core is masked.

        This setting does not persist across restarts: after the operating environment is restarted, the AI Core is unmasked again by default.

      • 1: The AI Core is not masked. This option is the default value.
    • --coreid: Specify the core ID. Multiple core IDs are separated by commas (,). A maximum of four AI Cores can be specified at the same time. If one core ID is invalid, an error is reported.

    To restore the initial value, set --aic_switch to 1 and --coreid to 0xFFFF.

    Command: msnpureport config --set [--device <deviceId>]

    Parameter: --aiv_switch {0 | 1} --coreid 5,6

    Remarks: Mask the specified Vector Cores. Specify --aiv_switch first, and then --coreid.

    • --device or -d: Specify the device ID. This parameter is optional and the default value is 0.
    • --aiv_switch
      • 0: The Vector Core is masked.

        This setting does not persist across restarts: after the operating environment is restarted, the Vector Core is unmasked again by default.

      • 1: The Vector Core is not masked. This option is the default value.
    • --coreid: Specify the core ID. Multiple core IDs are separated by commas (,). A maximum of four Vector Cores can be specified at the same time. If one core ID is invalid, an error is reported.

    To restore the initial value, set --aiv_switch to 1 and --coreid to 0xFFFF.

    The Atlas Training Series Product does not have a Vector Core and therefore does not support this command.

    Command examples:

    msnpureport config --set --aic_switch 0 --coreid 3,4 -d 0
    msnpureport config --set --aiv_switch 0 --coreid 5,6 -d 0
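
The --coreid rules above (comma-separated IDs, at most four per command) can be checked before the tool is called. The following sketch assumes that core IDs are written as decimal integers. The function names valid_coreid_list and mask_cores and the MSNPUREPORT override variable are hypothetical. The sketch covers masking only; the restore value 0xFFFF is intentionally not handled.

```shell
# MSNPUREPORT can be overridden (for example, with "echo") for a dry run.
MSNPUREPORT="${MSNPUREPORT:-msnpureport}"

# Hypothetical check: one to four comma-separated decimal core IDs.
valid_coreid_list() {
    vcl_list="$1"
    case "$vcl_list" in
        ''|,*|*,|*,,*) return 1 ;;   # empty list or empty field
        *[!0-9,]*) return 1 ;;       # anything other than digits and commas
    esac
    # N IDs are separated by N-1 commas, so allow at most three commas.
    vcl_commas="$(printf '%s' "$vcl_list" | tr -cd ',')"
    [ "${#vcl_commas}" -le 3 ]
}

# Hypothetical wrapper: mask AI Cores (aic) or Vector Cores (aiv).
mask_cores() {
    # $1: aic or aiv; $2: core ID list; $3: device ID (optional, defaults to 0)
    case "$1" in
        aic|aiv) ;;
        *) echo "core type must be aic or aiv" >&2; return 2 ;;
    esac
    valid_coreid_list "$2" || { echo "invalid core ID list: $2" >&2; return 2; }
    # Switch value 0 masks the cores; --coreid follows the switch.
    "$MSNPUREPORT" config --set "--${1}_switch" 0 --coreid "$2" -d "${3:-0}"
}

# Example: mask AI Cores 3 and 4 on device 0.
# mask_cores aic 3,4 0
```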
  • Set the iCache bit flipping check range to locate operator problems. After the setting, perform step 2 again to collect fault information and analyze problems.
    The command and its parameters are described as follows.

    Command: msnpureport config --set [--device <deviceId>]

    Parameter: --icachecheck <Parameter value>

    Remarks:

    • --device or -d: Specify the device ID. This parameter is optional and the default value is 0.
    • --icachecheck: Set the check range in KB. The value range is [0, 131072], that is, a maximum of 128 MB. For example, if the parameter is set to 128, the system checks whether the iCache content from 128 KB before the PC at which the error occurred to 128 KB after it is the same as that in the global memory (GM).

      Before setting this parameter, you can obtain the current iCache bit flipping check range (the value of Icache check range) by obtaining the current device configuration information with msnpureport config --get.

    Command example:

    msnpureport config --set --icachecheck 128 -d 0
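
The value passed to --icachecheck must be an integer in [0, 131072] (KB). A wrapper can reject out-of-range input before the tool is called. This is a sketch only: the function name set_icache_check and the MSNPUREPORT override variable are hypothetical, and the sketch assumes the value is written as a plain decimal number.

```shell
# MSNPUREPORT can be overridden (for example, with "echo") for a dry run.
MSNPUREPORT="${MSNPUREPORT:-msnpureport}"

# Hypothetical wrapper around "msnpureport config --set --icachecheck".
set_icache_check() {
    # $1: check range in KB, an integer in [0, 131072]
    # $2: device ID (optional, defaults to 0)
    sic_kb="$1"
    case "$sic_kb" in
        ''|*[!0-9]*) echo "icachecheck must be a non-negative integer" >&2; return 2 ;;
    esac
    # The length check keeps very long digit strings away from the numeric test.
    if [ "${#sic_kb}" -gt 6 ] || [ "$sic_kb" -gt 131072 ]; then
        echo "icachecheck must not exceed 131072 (KB)" >&2
        return 2
    fi
    "$MSNPUREPORT" config --set --icachecheck "$sic_kb" -d "${2:-0}"
}

# Example: check 128 KB on each side of the PC on device 0.
# set_icache_check 128 0
```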
  • Export register information to facilitate hardware fault locating.
    The command and its parameters are described as follows.

    Command: msnpureport report -t 2 or msnpureport report --type 2

    Parameter: N/A

    Remarks: Export register information on all devices.

    Note: Exporting all register information may take up to 10 seconds.

    Command example:

    msnpureport report --type 2