Environment Variables

The environment variables in bold are common environment variables.

Parameter

Description

Value Range

Default Value

TTP_LOG_PATH

MindIO TFT log path.

Soft links are not allowed. The log file name is ttp_log.log. It is recommended that the log path contain the date and time so that multiple training runs are not written to the same log file, which would cause the log to be cyclically overwritten.

It is recommended that the log path be configured in the training startup script as follows:

date_time=$(date +%Y-%m-%d-%H_%M_%S)
export TTP_LOG_PATH=logs/${date_time}

If shared storage is used, configure a separate log path for each node:

export TTP_LOG_PATH=logs/${nodeId}

Folder path

logs
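
The two TTP_LOG_PATH recommendations (a timestamp per run and a directory per node) can be combined when several nodes write to shared storage. The following is a minimal sketch, not an official MindIO TFT startup script. NODE_ID is a hypothetical variable standing in for whatever node identifier your launcher provides, and falling back to the host name is an assumption.

```shell
#!/bin/sh
# NODE_ID is a hypothetical variable; fall back to the host name if it is unset.
node_id="${NODE_ID:-$(uname -n)}"

# Timestamp in the same format as the example in the TTP_LOG_PATH description.
date_time=$(date +%Y-%m-%d-%H_%M_%S)

# One directory per run and per node, so that neither separate runs nor
# separate nodes end up writing to the same ttp_log.log file.
export TTP_LOG_PATH="logs/${date_time}/${node_id}"

echo "TTP_LOG_PATH=${TTP_LOG_PATH}"
```

Because soft links are not allowed, the resulting directory should be a real directory on the shared file system rather than a link to one.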

TTP_LOG_LEVEL

MindIO TFT log level.

  • DEBUG: detailed information, which is used only for fault diagnosis.
  • INFO: indicates that the program is running as expected.
  • WARNING: indicates that an exception has occurred or is about to occur. The program is still running as expected.
  • ERROR: indicates that some functions of the program cannot be executed due to serious problems.
  • DEBUG
  • INFO
  • WARNING
  • ERROR

INFO

TTP_LOG_MODE

MindIO TFT log mode.

  • ONLY_ONE: All MindIO TFT processes write to a single log file.
  • PER_PROC: Each MindIO TFT process writes to its own log file. The log file path is {TTP_LOG_PATH}/ttp_log.log.{pid}.
  • ONLY_ONE
  • PER_PROC (applies to any value other than ONLY_ONE)

PER_PROC
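
When TTP_LOG_MODE is PER_PROC, the log file for a given process can be located from its process ID. The sketch below only builds the documented file name {TTP_LOG_PATH}/ttp_log.log.{pid}. The helper per_proc_log is hypothetical and is not part of MindIO TFT.

```shell
#!/bin/sh
export TTP_LOG_MODE=PER_PROC

# Use the configured log directory, or the documented default "logs".
log_dir="${TTP_LOG_PATH:-logs}"

# Hypothetical helper: print the PER_PROC log file path for the process ID in $1.
per_proc_log() {
    printf '%s/ttp_log.log.%s\n' "${log_dir}" "$1"
}

# Example: show where the log of the current shell process would be written.
per_proc_log "$$"
```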

TTP_LOG_STDOUT

MindIO TFT log recording mode.

  • 0: MindIO TFT run logs are recorded in the corresponding log files.
  • 1: MindIO TFT run logs are directly printed and not stored locally.
  • 0
  • 1

0

MASTER_ADDR

IP address or domain name of the primary training node.

IPv4 address or domain name

-

MASTER_PORT

Communication port of the primary training node. The port is configurable.

[1024, 65535]

-
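
Because MASTER_ADDR and MASTER_PORT have no default value, a startup script may want to check them before launching training. The following is a minimal sketch under that assumption. The function valid_master_port is hypothetical, and the address and port shown are example values only.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if $1 is an integer in the documented range [1024, 65535].
valid_master_port() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 1024 ] && [ "$1" -le 65535 ]
}

export MASTER_ADDR="192.0.2.10"   # example address only (reserved documentation range)
export MASTER_PORT=29500          # example port only

if valid_master_port "${MASTER_PORT}"; then
    echo "MASTER_PORT=${MASTER_PORT} is within [1024, 65535]"
else
    echo "MASTER_PORT must be an integer in [1024, 65535]" >&2
fi
```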

TTP_RETRY_TIMES

Number of times that Processor attempts to set up a TCP link.

[1, 300]

10

MINDIO_WAIT_MINDX_TIME

Maximum time for Controller to wait for a response from MindCluster, in seconds.

[1, 3600]

30

TTP_ACCLINK_CHECK_PERIOD_HOURS

Interval for MindIO TFT to check the certificate validity after TLS authentication is enabled, in hours.

[24, 720]

168

TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS

Number of days before a certificate expires that MindIO TFT generates an alarm after TLS authentication is enabled. This lead time must be greater than or equal to the check period (TTP_ACCLINK_CHECK_PERIOD_HOURS) so that an expiring certificate is detected in time.

[7, 180]. The value must also meet the following requirement: TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS × 24 ≥ TTP_ACCLINK_CHECK_PERIOD_HOURS.

30
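
The relationship between TTP_ACCLINK_CHECK_PERIOD_HOURS and TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS can be checked before startup. The sketch below tests both documented ranges and the requirement that the lead time in hours is at least the check period. The function names are hypothetical, and values with leading zeros are rejected as a simplifying assumption.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if $1 is a non-negative integer without leading zeros.
is_uint() {
    case "$1" in
        ''|*[!0-9]*|0?*) return 1 ;;
    esac
}

# Hypothetical helper: $1 = check period in hours, $2 = alarm lead time in days.
cert_check_settings_ok() {
    is_uint "$1" && is_uint "$2" || return 1
    [ "$1" -ge 24 ] && [ "$1" -le 720 ] || return 1    # documented range [24, 720]
    [ "$2" -ge 7 ] && [ "$2" -le 180 ] || return 1     # documented range [7, 180]
    # Lead time converted to hours must cover at least one check period.
    [ $(( $2 * 24 )) -ge "$1" ]
}

# Fall back to the documented defaults (168 hours, 30 days) when unset.
period="${TTP_ACCLINK_CHECK_PERIOD_HOURS:-168}"
ahead="${TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS:-30}"

if cert_check_settings_ok "${period}" "${ahead}"; then
    echo "certificate check settings are consistent"
else
    echo "invalid certificate check settings: ${period} h, ${ahead} d" >&2
fi
```

With the defaults, 30 × 24 = 720 hours, which is greater than the 168-hour check period. A check period of 720 hours combined with a lead time of 7 days (168 hours) is within both ranges but violates the requirement.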

TTP_NORMAL_ACTION_TIME_LIMIT

Timeout interval for executing the rebuild, repair, or rollback callback in the fault recovery process, in seconds.

[30, 1800]

180

MINDIO_FOR_MINDSPORE

Indicates whether to enable MindSpore. When the value is True (case-insensitive) or 1, MindSpore is enabled. For any other value, MindSpore is disabled.

  • True (case-insensitive) or 1: MindSpore is enabled.
  • Other values: MindSpore is disabled.

False
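
The MINDIO_FOR_MINDSPORE rule can be mirrored in a launch script that needs to branch on the same setting. This is a sketch of the documented rule only, not the parsing code that MindIO TFT itself uses. The function mindspore_enabled is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if $1 is "1" or "true" in any letter case.
mindspore_enabled() {
    # Lowercase the value so that True, TRUE, and true are treated alike.
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    [ "${value}" = "true" ] || [ "${value}" = "1" ]
}

# An unset variable is treated as the documented default, False.
if mindspore_enabled "${MINDIO_FOR_MINDSPORE:-False}"; then
    echo "MindSpore is enabled"
else
    echo "MindSpore is disabled"
fi
```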

MINDX_TASK_ID

MindCluster task ID, which is used for MindIO ARF and is configured by ClusterD without user intervention.

String

-

TORCHELASTIC_USE_AGENT_STORE

PyTorch environment variable, which controls whether to create a TCP Store server or client. It is used when MindIO TFT saves the dying gasp checkpoint and the Torch Agent TCP Store Server connection fails.

  • True: create a client.
  • False: create a server.

-

TTP_STOP_CLEAN_BEFORE_DUMP

Used by MindIO TFT to control whether MindIO TTP performs the stop&clean operation before saving the dying gasp checkpoint.

  • 0: disable the stop&clean operation before saving the dying gasp checkpoint.
  • 1: enable the stop&clean operation before saving the dying gasp checkpoint.

0