Environment Variables
Environment variables shown in bold are the commonly used ones.
| Parameter | Description | Value Range | Default Value |
|---|---|---|---|
| TTP_LOG_PATH | MindIO TFT log path. Soft links are not allowed. The log file name is ttp_log.log. It is recommended that the log path contain the date and time so that multiple training runs are not recorded in one log file, which causes cyclic overwriting. Configure the log path in the training startup script as follows: `date_time=$(date +%Y-%m-%d-%H_%M_%S)` followed by `export TTP_LOG_PATH=logs/${date_time}`. If shared storage is used, configure the log path by node: `export TTP_LOG_PATH=logs/${nodeId}`. | Folder path | logs |
| TTP_LOG_LEVEL | MindIO TFT log level. | | INFO |
| TTP_LOG_MODE | MindIO TFT log mode. | | PER_PROC |
| TTP_LOG_STDOUT | MindIO TFT log recording mode. | | 0 |
| MASTER_ADDR | IP address or domain name of the primary training node. | IPv4 address or domain name | - |
| MASTER_PORT | Communication port of the primary training node. The port is configurable. | [1024, 65535] | - |
| TTP_RETRY_TIMES | Number of times the Processor attempts to set up the TCP link. | [1, 300] | 10 |
| MINDIO_WAIT_MINDX_TIME | Maximum time that the Controller waits for a response from MindCluster, in seconds. | [1, 3600] | 30 |
| TTP_ACCLINK_CHECK_PERIOD_HOURS | Interval at which MindIO TFT checks certificate validity after TLS authentication is enabled, in hours. | [24, 720] | 168 |
| TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS | Number of days before a certificate expires that MindIO TFT generates an alarm after TLS authentication is enabled. This lead time must be greater than or equal to the certificate check interval (TTP_ACCLINK_CHECK_PERIOD_HOURS) so that certificate expiration risks are detected in time. | [7, 180]. The value must also meet the following requirement: TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS × 24 ≥ TTP_ACCLINK_CHECK_PERIOD_HOURS. | 30 |
| TTP_NORMAL_ACTION_TIME_LIMIT | Timeout for executing the rebuild, repair, or rollback callback during fault recovery, in seconds. | [30, 1800] | 180 |
| MINDIO_FOR_MINDSPORE | Specifies whether to enable MindSpore. MindSpore is enabled when the value is True (case-insensitive) or 1, and disabled for any other value. | | False |
| MINDX_TASK_ID | MindCluster task ID, which is used for MindIO ARF and is configured by ClusterD without user intervention. | String | - |
| TORCHELASTIC_USE_AGENT_STORE | PyTorch environment variable that controls whether a TCP Store server or client is created. It is used when MindIO TFT saves the dying gasp checkpoint and the connection to the Torch Agent TCP Store server fails. | | - |
| TTP_STOP_CLEAN_BEFORE_DUMP | Controls whether MindIO TTP performs the stop&clean operation before saving the dying gasp checkpoint. | | 0 |
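
The integer ranges and the certificate lead-time rule in the table can be checked in the training startup script before launch. The following is a minimal sketch, not part of MindIO TFT: the function names `is_uint`, `in_range`, and `check_tft_env` are hypothetical, and the script assumes that an unset or empty variable takes the default value listed in the table.

```shell
#!/bin/sh
# Pre-flight sketch: validate MindIO TFT environment variables against the
# documented ranges. All function names here are hypothetical helpers.

# is_uint VALUE: succeed if VALUE is a decimal integer without a sign.
# Leading zeros are rejected so the value is never read as octal.
is_uint() {
    case $1 in
        '' | *[!0-9]* | 0?*) return 1 ;;
        *) return 0 ;;
    esac
}

# in_range NAME VALUE MIN MAX: succeed if VALUE is an integer in [MIN, MAX].
in_range() {
    if ! is_uint "$2" || [ "$2" -lt "$3" ] || [ "$2" -gt "$4" ]; then
        echo "error: $1=$2 is outside [$3, $4]" >&2
        return 1
    fi
}

# check_tft_env: check the integer-valued variables. An unset or empty
# variable is replaced by its documented default (an assumption of this sketch).
check_tft_env() {
    tft_period_hours=${TTP_ACCLINK_CHECK_PERIOD_HOURS:-168}
    tft_ahead_days=${TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS:-30}

    in_range TTP_RETRY_TIMES "${TTP_RETRY_TIMES:-10}" 1 300 || return 1
    in_range MINDIO_WAIT_MINDX_TIME "${MINDIO_WAIT_MINDX_TIME:-30}" 1 3600 || return 1
    in_range TTP_ACCLINK_CHECK_PERIOD_HOURS "$tft_period_hours" 24 720 || return 1
    in_range TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS "$tft_ahead_days" 7 180 || return 1
    in_range TTP_NORMAL_ACTION_TIME_LIMIT "${TTP_NORMAL_ACTION_TIME_LIMIT:-180}" 30 1800 || return 1

    # MASTER_PORT has no default, so it is checked only when it is set.
    if [ -n "${MASTER_PORT:-}" ]; then
        in_range MASTER_PORT "$MASTER_PORT" 1024 65535 || return 1
    fi

    # Cross-field rule: the alarm lead time, converted to hours, must be at
    # least as long as the certificate check interval.
    if [ $((tft_ahead_days * 24)) -lt "$tft_period_hours" ]; then
        echo "error: TTP_ACCLINK_CERT_CHECK_AHEAD_DAYS x 24 must be >= TTP_ACCLINK_CHECK_PERIOD_HOURS" >&2
        return 1
    fi
}
```

Calling `check_tft_env || exit 1` before the training command would stop the launch on the first out-of-range value. With a lead time of 7 days, for example, the check interval can be at most 168 hours, because 7 × 24 = 168.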