HCCL_DIAGNOSE_ENABLE

Description

Specifies whether to cache detailed information about certain tasks during collective communication. If a task fails, the cached details can be printed in the logs to help locate the fault.

The following options are supported:
  • 1: enables the function.
  • 0: disables the function.

The default value is 0.

Note that enabling this function affects performance.

Example

export HCCL_DIAGNOSE_ENABLE=1
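
Because enabling the function affects performance, it can help to confirm the current setting before a run and to limit the setting to a single debugging session. The following is a minimal sketch, not part of HCCL: the helper name hccl_diagnose_state is hypothetical, and it only mirrors the rules documented above (1 enables, 0 disables, unset defaults to 0). How HCCL itself treats an empty or otherwise undocumented value is not stated here, so the helper reports an empty value as the default and any other value as "invalid", both as assumptions.

```shell
# Hypothetical helper (not an HCCL tool): reports how HCCL_DIAGNOSE_ENABLE
# would be read under the documented rules.
hccl_diagnose_state() {
    # Unset falls back to the documented default of 0.
    # Treating an empty value the same way is an assumption.
    case "${HCCL_DIAGNOSE_ENABLE:-0}" in
        1) echo "enabled" ;;
        0) echo "disabled" ;;
        *) echo "invalid" ;;  # assumption: any other value is not documented
    esac
}

# Enable diagnostics inside a subshell only, so the setting does not
# persist in the parent shell after the debugging run.
(
    export HCCL_DIAGNOSE_ENABLE=1
    hccl_diagnose_state
    # Launch the job to be diagnosed here.
)
```

The subshell keeps the variable out of the parent environment, so later runs started from the same shell are not affected by the diagnostic setting.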

Restrictions

Information about at most the 2000 most recent operators can be saved.

Applicability

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products (Only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported.)