Environment Variable Description
Environment Variables Used by MindCluster Components
Table 1 describes the environment variables used by MindCluster components.
| Environment Variable | Source | Required (Yes/No) | Value | Description |
|---|---|---|---|---|
| POD_IP | Written in the YAML file of the deployed component | Yes | IP address of the pod where the current container is located | Specifies the IP address used by ClusterD and TaskD to start the gRPC service. |
| POD_UID | Written in the YAML file of the deployed component | No | UID of the pod where the current container is located | Specifies the UID used to parse the server_id field in the ranktable file. |
| ASCEND_DOCKER_RUNTIME | Written by Ascend Docker Runtime during container creation | No | true | Used by Ascend Device Plugin to determine whether the default container runtime on the current node is Ascend Docker Runtime. |
| HOSTNAME | Written when a container is created in Kubernetes | Yes | Name of the pod where the current container is located | Used by Ascend Device Plugin to obtain the name of the current pod. |
| NODE_NAME | Written in the YAML file of the deployed component | Yes | Name of the node where the current container is located | Used by Ascend Device Plugin, NodeD, and ClusterD to obtain the name of the current node. |
| LD_LIBRARY_PATH | Written in Dockerfile | Yes | File path | Used by Ascend Device Plugin and NPU Exporter to initialize the DCMI. |
| BATCH_BIND_NUM | - | No | Numeric string | Specifies the number of pods that Volcano binds in a batch. |
| MULTI_SCHEDULER_ENABLE | - | No | true or false | Specifies whether Volcano is in the multi-scheduler scenario. |
| SCHEDULER_POD_NAME | - | No | String | Specifies the pod name of the Volcano scheduler. |
| SCHEDULER_NUM | - | No | Numeric string | Specifies the number of Volcano schedulers. |
| PANIC_ON_ERROR | - | No | true or false | Specifies whether to panic when an error occurs in the Volcano scheduler. |
| KUBECONFIG | - | No | File path | Specifies the kubeconfig path that Volcano uses to connect to the Kubernetes API server. |
| HOME | Written when a container is created in Kubernetes | Yes | Folder path | Specifies the home directory of the current user, as obtained by Volcano. |
| DEBUG_SOCKET_DIR | - | No | Socket file path | Specifies the socket path that Volcano listens on. |
| HCCL_CONNECT_TIMEOUT | Written in the training script | No | Timeout interval of HCCL link establishment | Specifies the timeout interval for HCCL link establishment. |
| TTP_PORT | Written in the YAML file of the deployed component | Yes | Communication port used by MindIO TTP | Specifies the port on which MindIO Controller is started. |
| SSH_CLIENT | Set by the SSH server; contains information about the client connection | Yes | Information about the current client connection | Recorded in the operation log during Ascend Docker Runtime installation. |
| TASKD_LOG_PATH | - | No | String | Flush path of TaskD run logs. |
| MINDX_SERVER_IP | Written by Ascend Operator during container creation | Yes | String | Specifies the IP address used by a job to communicate with ClusterD. It is the same as the service IP address of clusterd-grpc-svc. |
| MINDX_TASK_ID | Written by Ascend Operator during container creation | No | For MindIE inference jobs, the value is the same as jobID in the label field of an acjob. | Specifies the task ID that Elastic Agent/TaskD provides when registering the gRPC service and the TaskD profiling function with ClusterD. |
| GROUP_BASE_DIR | Written in the job startup script | No | Folder path | Specifies the path for exporting the parallelism domain information of the TaskD component. |
| MINDIO_WAIT_MINDX_TIME | Written in the job YAML | No | Numeric string ranging from 1 to 3600 | Specifies the timeout interval for waiting for the faulty pod to be scheduled when process-level rescheduling is disabled and elastic training is enabled. |
| RAS_NET_ROOT_PATH | User-defined | No | Root path of the shared directory of ClusterD and NodeD | In the slow network diagnosis scenario, ClusterD and NodeD interact with each other through shared storage. For details, see Slow Network Diagnosis. |
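As an illustration of how Table 1 might be consumed, the following minimal Python sketch checks that a few variables marked as required are present and that MINDIO_WAIT_MINDX_TIME, when set, is a numeric string from 1 to 3600. The names `REQUIRED_VARS` and `check_component_env` are hypothetical and are not part of MindCluster. The set of required variables used here is an assumed subset chosen for illustration, because the variables a component actually reads depend on that component.

```python
import os
from typing import List, Mapping

# Hypothetical subset of the variables marked "Yes" in Table 1. Which ones a
# given component really needs depends on the component.
REQUIRED_VARS = ("POD_IP", "HOSTNAME", "NODE_NAME")


def check_component_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems found; an empty list means the check passed."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    # MINDIO_WAIT_MINDX_TIME is optional, but when present it must be a
    # numeric string in the documented range 1..3600.
    wait = env.get("MINDIO_WAIT_MINDX_TIME")
    if wait is not None:
        is_number = wait.isascii() and wait.isdigit()
        if not is_number or not 1 <= int(wait) <= 3600:
            problems.append("MINDIO_WAIT_MINDX_TIME must be a number from 1 to 3600")
    return problems
```

Calling `check_component_env()` with no argument inspects the environment of the current process. Passing a plain dictionary makes the check easy to exercise outside a container.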
Environment Variables of Ascend Operator
Ascend Operator provides environment variables for distributed training jobs (acjob) of different AI frameworks. The following table describes the environment variables.
| Framework | Environment Variable | Description | Value | Remarks |
|---|---|---|---|---|
| PyTorch | MASTER_ADDR | IP address for communicating with the master node | A valid IPv4 or IPv6 address | |
| PyTorch | MASTER_PORT | Port for communicating with the master node | A numeric string ranging from 0 to 65520 | Corresponds to the value of ascendjob-port in the Service (SVC) of the master pod. The default value is 2222. |
| PyTorch | WORLD_SIZE | Total number of NPUs used by a job | A positive integer | For example, if a job uses 64 NPUs, set the value to 64. |
| PyTorch | RANK | Node rank of the pod on the local node | An integer greater than or equal to 0 | The value is 0 for the master and increases from 1 for workers. |
| PyTorch | LOCAL_WORLD_SIZE | Number of NPUs used by pods on each node | An integer greater than or equal to 0 | For example, if a pod uses four NPUs, set this parameter to 4. |
| PyTorch | LOCAL_RANK | Logical ID list of the NPUs used by the pod on each node | String | Set this parameter based on the number of NPUs used by the pod. The value starts from 0. For example, if the pod uses four NPUs, set this parameter to {0,1,2,3}. |
| PyTorch, MindSpore, TensorFlow | HostNetwork | Value of the hostNetwork field in the YAML file of the current job | | If the cluster scale is large (more than 1000 nodes), you are advised to use the host IP address to create a pod. |
| PyTorch, MindSpore, TensorFlow | MINDX_SERVER_IP | IP address used by a job to communicate with ClusterD. It is the same as the service IP address of clusterd-grpc-svc. | A valid IPv4 or IPv6 address | - |
| PyTorch, MindSpore, TensorFlow | HCCL_LOGIC_SUPERPOD_ID | Processors with the same ID use the UnifiedBus network for communication, and processors with different IDs use the RoCE network for communication. | An integer greater than or equal to 0 | Used by HCCL for dynamic networking to restrict the communication mode between processors. NOTE: This environment variable can be used only under the following conditions: |
| PyTorch, MindSpore, TensorFlow | MINDX_TASK_ID | Elastic Agent/TaskD must provide the MINDX_TASK_ID information when registering the gRPC service with ClusterD. For MindIE inference jobs, the value is the same as jobID in the label field of an acjob. | String | Job UID |
| PyTorch, MindSpore, TensorFlow | APP_TYPE | The value is the same as that of app in the label field of an acjob. | String | - |
| MindSpore | NPU_POD | Whether a processor is mounted to the current pod | | - |
| MindSpore | MS_SERVER_NUM | Number of processes whose role is MS_PSERVER | 0 | Currently, the PS mode is not supported. The value is fixed to 0. |
| MindSpore | MS_WORKER_NUM | Total number of NPUs used by a job | A positive integer | For example, if a job uses 64 NPUs, set the value to 64. |
| MindSpore | MS_LOCAL_WORKER | Number of NPUs used by pods on each node | A positive integer | For example, if a pod uses four NPUs, set this parameter to 4. |
| MindSpore | MS_SCHED_HOST | IP address of the scheduler | A valid IP address | |
| MindSpore | MS_SCHED_PORT | Port for communicating with the scheduler | A port number ranging from 1024 to 65535 | Corresponds to the value of ascendjob-port in the Service (SVC) of the scheduler pod. The default value is 2222. |
| MindSpore | MS_ROLE | Process role | | The worker process registers with the scheduler process to complete the networking. |
| MindSpore | MS_NODE_RANK | Node rank of the pod on the local node | An integer greater than or equal to 0 | Set this parameter to 0 for the scheduler pod. |
| TensorFlow | CM_CHIEF_IP | IP address for communicating with the chief | A valid IPv4 or IPv6 address | |
| TensorFlow | CM_CHIEF_PORT | Port for communicating with the chief | A numeric string ranging from 0 to 65520 | Corresponds to the value of ascendjob-port in the Service (SVC) of the scheduler pod. The default value is 2222. |
| TensorFlow | CM_CHIEF_DEVICE | Logical ID of the device for collecting statistics on the server cluster information on the chief node | 0 | The value is fixed to 0. |
| TensorFlow | CM_WORKER_SIZE | Total number of NPUs used by a job | A value ranging from 0 to 32768 | For example, if a job uses 64 NPUs, set the value to 64. |
| TensorFlow | CM_LOCAL_WORKER | Number of NPUs used by each pod | A positive integer | For example, if a pod uses four NPUs, set this parameter to 4. |
| TensorFlow | CM_WORKER_IP | Pod IP address | A valid IPv4 or IPv6 address | IP address of the current pod. |
| TensorFlow | CM_RANK | Node rank of the pod on the local node | An integer greater than or equal to 0 | |
| PyTorch, MindSpore | PROCESS_RECOVER | Switch for process-level rescheduling, process-level online recovery, and elastic training | | This environment variable is injected in process-level rescheduling, process-level online recovery, process-level in-place recovery, and elastic training scenarios. |
| PyTorch | HIGH_AVAILABILITY | Switch for the MindSpeed-LLM process-level recovery function | Available recovery policy. | |
| PyTorch, MindSpore | ELASTIC_PROCESS_RECOVER_ENABLE | Controls process-level rescheduling, process-level online recovery, and dying gasp checkpoint recovery on Elastic Agent. | | This environment variable is injected in process-level rescheduling, process-level online recovery, and process-level in-place recovery scenarios. |
| PyTorch, MindSpore | ENABLE_RESTART_FAULT_PROCESS | Controls process-level in-place recovery on Elastic Agent/TaskD. | NOTE: | |
| MindSpore | MINDIO_FOR_MINDSPORE | Whether to enable the MindSpore switch for MindIO | 1: enables the MindSpore switch for MindIO. | |
| MindSpore | MS_ENABLE_TFT | Whether to enable MindSpore process-level recovery | '{TTP:1,UCE:1,ARF:1,HCCE:1,RSC:1}' # Enables the dying gasp, process-level online recovery for on-chip memory faults, process-level rescheduling, process-level online recovery for network faults, and pod-level rescheduling. | |
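A training script typically reads the PyTorch variables above at startup. The following Python sketch shows one possible way to load and validate them against the value ranges given in the table. The `TorchJobEnv` class and the `parse_local_rank` and `load_torch_job_env` functions are hypothetical helpers written for this example; they are not part of Ascend Operator or PyTorch. The sketch assumes LOCAL_RANK uses the list form shown in the table, such as {0,1,2,3}.

```python
import ipaddress
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class TorchJobEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int
    local_ranks: List[int]


def parse_local_rank(raw: str) -> List[int]:
    """Parse the documented list form, for example "{0,1,2,3}"."""
    inner = raw.strip().strip("{}")
    return [int(item) for item in inner.split(",") if item.strip()]


def load_torch_job_env(env: Mapping[str, str] = os.environ) -> TorchJobEnv:
    """Read the PyTorch variables and check them against the documented ranges."""
    addr = env["MASTER_ADDR"]
    ipaddress.ip_address(addr)  # raises ValueError unless valid IPv4 or IPv6

    port = int(env["MASTER_PORT"])
    if not 0 <= port <= 65520:
        raise ValueError(f"MASTER_PORT out of range: {port}")

    world_size = int(env["WORLD_SIZE"])
    if world_size <= 0:
        raise ValueError(f"WORLD_SIZE must be a positive integer: {world_size}")

    rank = int(env["RANK"])
    if rank < 0:
        raise ValueError(f"RANK must be greater than or equal to 0: {rank}")

    local_ranks = parse_local_rank(env.get("LOCAL_RANK", ""))
    return TorchJobEnv(addr, port, world_size, rank, local_ranks)
```

Failing early with a `ValueError` keeps a misconfigured pod from joining the job with an invalid address or port.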
Environment Variables of Ascend Docker Runtime
Ascend Docker Runtime injects environment variables into the container.
| Environment Variable | Description | Value | Remarks |
|---|---|---|---|
| ASCEND_DOCKER_RUNTIME | Indicates whether the Ascend Docker Runtime plugin is installed in the current environment. | True | This environment variable does not exist if Ascend Docker Runtime is not installed. |
Environment Variables of Ascend Device Plugin
Ascend Device Plugin injects environment variables into the container. For details about the environment variables, see the following table.
| Environment Variable | Description | Value | Remarks |
|---|---|---|---|
| ASCEND_VISIBLE_DEVICES | If a task requires an NPU device, use ASCEND_VISIBLE_DEVICES to specify the NPU device to be mounted to the container. Otherwise, the NPU device fails to be mounted. If the device ID is used to specify a device, you can specify one or multiple devices at a time. If the processor ID is used to specify a device, you can specify multiple processors of the same type at a time. | | - |
| ASCEND_ALLOW_LINK | Specifies whether soft links are allowed in the mounted file or directory. This parameter must be specified on the Atlas 500 A2 edge station, Atlas 200I A2 accelerator module, and Atlas 200I DK A2 developer kit. | | - |
| ASCEND_RUNTIME_OPTIONS | Restricts the processor ID specified by ASCEND_VISIBLE_DEVICES. | | - |
| WORLD_SIZE | Total number of NPUs used by a job | An integer greater than or equal to 0 | Written only in dynamic vNPU scheduling scenarios. |
| LOCAL_WORLD_SIZE | Number of NPUs used by pods on each node | An integer greater than or equal to 0 | Written only in dynamic vNPU scheduling scenarios. |
| LOCAL_RANK | Logical ID list of the NPUs used by the pod on each node | String | Written only in dynamic vNPU scheduling scenarios. The sequence number starts from 0. For example, if the pod uses four NPUs, set this parameter to {0,1,2,3}. |
| CM_WORKER_SIZE | Total number of NPUs used by a job | An integer greater than or equal to 0 | Written only in dynamic vNPU scheduling scenarios. |
| CM_LOCAL_WORKER | Number of NPUs used by pods on each node | An integer greater than or equal to 0 | Written only in dynamic vNPU scheduling scenarios. |
| MS_WORKER_NUM | Total number of NPUs used by a job | An integer greater than or equal to 0 | Written only in dynamic vNPU scheduling scenarios. |
| MS_LOCAL_WORKER | Number of NPUs used by pods on each node | An integer greater than or equal to 0 | Written only in dynamic vNPU scheduling scenarios. |
| PERF_DUMP_PATH | Path for saving iteration delay and group information | String | Written only in slow node detection scenarios. |
| PERF_DUMP_CONFIG | Switch for iteration delay and group information | String | Written only in slow node detection scenarios. |
| KUBELET_PORT | Port number of kubelet on the current node | An integer ranging from 0 to 65535 | If the default kubelet port has been changed, set this environment variable to the new port number. Otherwise, ignore this environment variable. |
| HOST_IP | Physical IP address of the current node | A valid IPv4 address | Fixed value, which is built into the initial YAML file. |
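The KUBELET_PORT rule above (use the variable only when the kubelet port has been customized) can be sketched in Python as follows. The function name `resolve_kubelet_port` is hypothetical. The fallback of 10250 is the standard kubelet port in upstream Kubernetes and is an assumption here, because the table does not state the default explicitly.

```python
import os
from typing import Mapping

# Assumption: 10250 is the standard kubelet port in upstream Kubernetes.
# The table above does not state the default value.
DEFAULT_KUBELET_PORT = 10250


def resolve_kubelet_port(env: Mapping[str, str] = os.environ) -> int:
    """Return KUBELET_PORT if set and valid, otherwise the assumed default."""
    raw = env.get("KUBELET_PORT")
    if raw is None or raw.strip() == "":
        return DEFAULT_KUBELET_PORT

    port = int(raw)  # raises ValueError for non-numeric input
    if not 0 <= port <= 65535:
        raise ValueError(f"KUBELET_PORT out of range: {port}")
    return port
```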
Environment Variables of Elastic Agent
Elastic Agent has reached its end of life, and its documentation will be deleted on December 30, 2026.
The following table lists the environment variables that can be configured when Elastic Agent is used. For details about other environment variables from the source code, see the PyTorch documentation.
| Environment Variable | Description | Value | Remarks |
|---|---|---|---|
| ELASTIC_LOG_PATH | Flush path of the run logs of Elastic Agent. | String | The path must distinguish between the nodes that store logs. Example: ELASTIC_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/elasticlogs/elastic-log$XDL_IP-$RANK (replace $XDL_IP with the actual node IP address and $RANK with the actual node rank). |
| ELASTIC_PROCESS_RECOVER_ENABLE | Controls process-level rescheduling, process-level online recovery, and dying gasp checkpoint recovery on Elastic Agent. | String | |
| ENABLE_RESTART_FAULT_PROCESS | Controls process-level in-place recovery on Elastic Agent. | String | The value can be on or other values. |
| RESTART_FAULT_PROCESS_TYPE | Type of the faulty process that Elastic Agent notifies MindIO to restart | String | The value can be worker or pod. |
| RANK_TABLE_FILE | Ranktable file path | String | Path of the hccl.json file |
| PROCESS_RECOVER | Controls process-level rescheduling or process-level online recovery. | String | The value can be on or other values. |
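The ELASTIC_LOG_PATH example in the table combines the task ID, node IP address, and node rank so that each node writes to its own log file. The following Python sketch builds the same path from the corresponding variables. The function name `build_elastic_log_path` and the `base_dir` parameter are hypothetical; the default base directory simply repeats the one used in the table's example.

```python
import os
from typing import Mapping


def build_elastic_log_path(
    env: Mapping[str, str] = os.environ,
    base_dir: str = "/job/code/alllogs",
) -> str:
    """Build a per-node log path in the form used by the table's example."""
    task_id = env["MINDX_TASK_ID"]  # job identifier
    node_ip = env["XDL_IP"]         # IP address of the node
    rank = env["RANK"]              # node rank
    return f"{base_dir}/{task_id}/elasticlogs/elastic-log{node_ip}-{rank}"
```

A missing variable raises `KeyError`, which makes an incomplete configuration visible before any log is written to a shared path.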
Environment Variables of TaskD
The following table lists the environment variables that can be configured when TaskD is used. For details about other environment variables from the source code, see the PyTorch documentation.
| Environment Variable | Description | Value | Remarks |
|---|---|---|---|
| TASKD_LOG_PATH | Flush path of TaskD run logs. | String | If the path is not specified, the default path ./taskd_log/taskd.log-worker-{RANK} is used, that is, the taskd_log directory in the current execution path. {RANK} indicates the global rank ID of the current training process. |
| TASKD_FILE_LOG_LEVEL | Level of the logs to be recorded in log files. | String | - |
| TASKD_STD_LOG_LEVEL | Level of the logs to be printed to the screen. | String | - |
| TASKD_LOG_STDOUT | Whether to print logs to the screen. | bool | The value can be True or False. |
| ENABLE_RESTART_FAULT_PROCESS | Controls process-level in-place recovery on TaskD. | String | The value can be on or other values. |
| RESTART_FAULT_PROCESS_TYPE | Type of the faulty process that TaskD notifies MindIO to restart | String | The value can be worker or pod. |
| TASKD_PROCESS_ENABLE | Whether to enable the process-level rescheduling, process-level online recovery, process-level local recovery, and elastic training functions for TaskD | String | The value can be on or off. |
| LOCAL_PROXY_ENABLE | Whether to enable the local proxy (required for security hardening) | String | The value can be on or off. The default value is off. In communication security hardening scenarios, this parameter must be set to on. |
| HCCL_ASYNC_ERROR_HANDLING | Whether to enable the watchdog function. | String | The values are as follows: The default value is 1. |
| TASKD_PROCESS_INTERVAL | Processing interval of the main process of TaskD Manager. | String | The value ranges from 100 to 1000, in milliseconds. |
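The defaults and ranges in the TaskD table can be expressed as a few small helpers. The Python sketch below resolves the log path with the documented default, validates TASKD_PROCESS_INTERVAL against the range 100 to 1000, and interprets LOCAL_PROXY_ENABLE with its documented default of off. All three function names are hypothetical and are not TaskD APIs. The sketch also assumes that a configured TASKD_LOG_PATH is used as given.

```python
import os
from typing import Mapping, Optional


def taskd_log_path(rank: int, env: Mapping[str, str] = os.environ) -> str:
    """Return TASKD_LOG_PATH, or the documented default for this rank."""
    # Assumption: a configured path is used as-is, without appending the rank.
    return env.get("TASKD_LOG_PATH") or f"./taskd_log/taskd.log-worker-{rank}"


def taskd_process_interval_ms(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return TASKD_PROCESS_INTERVAL in milliseconds, or None if unset."""
    raw = env.get("TASKD_PROCESS_INTERVAL")
    if raw is None:
        return None  # the default for an unset value is not documented above
    value = int(raw)  # raises ValueError for non-numeric input
    if not 100 <= value <= 1000:
        raise ValueError(f"TASKD_PROCESS_INTERVAL out of range: {value}")
    return value


def local_proxy_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """LOCAL_PROXY_ENABLE is on or off; the documented default is off."""
    return env.get("LOCAL_PROXY_ENABLE", "off") == "on"
```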
Environment Variables of NodeD
| Environment Variable | Description | Value | Remarks |
|---|---|---|---|
| XDL_IP | IP address of the host where the pod is located. In slow node scenarios, this environment variable is used to record and match slow node information. | String | Written in the YAML file of the NodeD component. |