Environment Variable Description

Environment Variables Used by MindCluster Components

Table 1 describes the environment variables used by MindCluster components.

**Table 1** Environment variable description
Environment Variable	Source	Required (Yes/No)	Value	Description
POD_IP	Written in the YAML file of the deployed component	Yes	IP address of the pod where the current container is located	Specifies the IP address used by ClusterD and TaskD to start the gRPC service.
POD_UID	Written in the YAML file of the deployed component	No	UID of the pod where the current container is located	Specifies the UID used to parse the server_id field in the ranktable file.
ASCEND_DOCKER_RUNTIME	Written by Ascend Docker Runtime during container creation	No	true	Used by Ascend Device Plugin to determine whether the default container runtime on the current node is Ascend Docker Runtime.
HOSTNAME	Written when a container is created in Kubernetes	Yes	Name of the pod where the current container is located	Provides the name of the current pod to Ascend Device Plugin.
NODE_NAME	Written in the YAML file of the deployed component	Yes	Name of the node where the current container is located	Provides the name of the current node to Ascend Device Plugin, NodeD, and ClusterD.
LD_LIBRARY_PATH	Written in Dockerfile	Yes	File path	Used by Ascend Device Plugin and NPU Exporter to initialize the DCMI.
BATCH_BIND_NUM	-	No	Numeric string	Specifies the number of pods that Volcano binds in one batch.
MULTI_SCHEDULER_ENABLE	-	No	true or false	Specifies whether Volcano is in the multi-scheduler scenario.
SCHEDULER_POD_NAME	-	No	String	Specifies the pod name of the Volcano scheduler.
SCHEDULER_NUM	-	No	Numeric string	Specifies the number of Volcano schedulers.
PANIC_ON_ERROR	-	No	true or false	Specifies whether to panic when an error occurs in the Volcano scheduler.
KUBECONFIG	-	No	File path	Specifies the kubeconfig path that Volcano uses to connect to the Kubernetes API server.
HOME	Written when a container is created in Kubernetes	Yes	Folder path	Specifies the home directory of the current user, which Volcano reads.
DEBUG_SOCKET_DIR	-	No	Socket file path	Specifies the socket path that Volcano listens on.
HCCL_CONNECT_TIMEOUT	Written in the training script	No	Timeout interval of HCCL link establishment	Specifies the link establishment timeout interval.
TTP_PORT	Written in the YAML file of the deployed component	Yes	Communication port used by MindIO TTP	Specifies the port to start MindIO Controller.
SSH_CLIENT	Environment variable set on the SSH server, which contains information about the client connection.	Yes	Information about the current client connection.	Records the information in the operation log during Ascend Docker Runtime installation.
TASKD_LOG_PATH	-	No	String	Specifies the path where TaskD run logs are written.
MINDX_SERVER_IP	Written by Ascend Operator during container creation	Yes	String	Specifies the IP address used by a job to communicate with ClusterD. The value is the same as the service IP address of clusterd-grpc-svc.
MINDX_TASK_ID	Written by Ascend Operator during container creation	No	For MindIE inference jobs, the value is the same as jobID in the label field of an acjob.	Specifies the task ID that Elastic Agent/TaskD provides when registering the gRPC service and the TaskD profiling function with ClusterD.
GROUP_BASE_DIR	Written in the job startup script	No	Folder path	Specifies the path for exporting the parallelism domain information of the TaskD component.
MINDIO_WAIT_MINDX_TIME	Written in the job YAML	No	A number string ranging from 1 to 3600	Specifies the timeout interval for waiting for the scheduling of the faulty pod when process-level rescheduling is disabled and elastic training is enabled.
RAS_NET_ROOT_PATH	User-defined	No	Root path of the shared directory of ClusterD and NodeD	In the slow network diagnosis scenario, ClusterD and NodeD interact with each other through shared storage. For details, see Slow Network Diagnosis.
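
As an illustration of how a container entry point might consume Table 1, the following Python sketch checks a few variables marked as required and validates the documented 1 to 3600 range of MINDIO_WAIT_MINDX_TIME. The helper names and the selection of variables in REQUIRED_VARS are assumptions made for this example; they are not part of any MindCluster component.

```python
import os

# Illustrative subset of the variables marked "Yes" in the Required column
# of Table 1. The selection is an assumption made for this sketch.
REQUIRED_VARS = ("POD_IP", "NODE_NAME", "HOSTNAME")


def missing_required(env, names=REQUIRED_VARS):
    """Return the names from `names` that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


def parse_wait_time(env):
    """Parse MINDIO_WAIT_MINDX_TIME, a number string from 1 to 3600.

    Returns None when the variable is unset, because it is optional.
    Raises ValueError when the value is not an integer in the range.
    """
    raw = env.get("MINDIO_WAIT_MINDX_TIME")
    if raw is None:
        return None
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"MINDIO_WAIT_MINDX_TIME is not a number string: {raw!r}"
        ) from None
    if not 1 <= value <= 3600:
        raise ValueError(f"MINDIO_WAIT_MINDX_TIME out of range 1-3600: {value}")
    return value


def check_environment(env=None):
    """Validate the current environment and return the parsed wait time."""
    env = os.environ if env is None else env
    missing = missing_required(env)
    if missing:
        raise RuntimeError(f"missing required variables: {', '.join(missing)}")
    return parse_wait_time(env)
```

Both helpers take a mapping rather than reading os.environ directly, so they can be exercised with a plain dictionary in unit tests.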

Environment Variables of Ascend Operator

Ascend Operator provides environment variables for distributed training jobs (acjob) of different AI frameworks. The following table describes the environment variables.

**Table 2** Training environment variables injected by Ascend Operator
Framework	Environment Variable	Description	Value	Remarks
PyTorch	MASTER_ADDR	IP address for communicating with the master node	The value is a valid IPv4 or IPv6 address.	Set this parameter to podIP for the master pod. Set this parameter to clusterIP of the SVC corresponding to the master pod for the worker pod.
	MASTER_PORT	Port for communicating with the master node	The value is a number string ranging from 0 to 65520.	The master pod corresponds to the value of ascendjob-port in SVC. The default value is 2222.
	WORLD_SIZE	Total number of NPUs used by a job	A positive integer	Total number of NPUs used by a job. For example, if a job uses 64 NPUs, set the value to 64.
	RANK	Node rank of the pod on the local node	The value must be an integer greater than or equal to 0.	The value for Master is 0, and that for Worker increases from 1.
	LOCAL_WORLD_SIZE	Number of NPUs used by pods on each node	The value must be an integer greater than or equal to 0.	For example, if a pod uses four NPUs, set this parameter to 4.
	LOCAL_RANK	Logical ID list of the NPUs used by the pod on each node	String	Set this parameter based on the number of NPUs used by the pod. The value starts from 0. For example, if the pod uses four NPUs, set this parameter to {0,1,2,3}.
PyTorch, MindSpore, TensorFlow	HostNetwork	Value of the hostNetwork field in the YAML file of the current job	true: The host IP address is used to create a pod. false: The host IP address is not used to create a pod.	If the cluster scale is large (the number of nodes is greater than 1000), you are advised to use the host IP address to create a pod.
PyTorch, MindSpore, TensorFlow	MINDX_SERVER_IP	IP address used by a job to communicate with ClusterD. The value is the same as the service IP address of clusterd-grpc-svc.	The value is a valid IPv4 or IPv6 address.	-
PyTorch, MindSpore, TensorFlow	HCCL_LOGIC_SUPERPOD_ID	Processors with the same ID use the UnifiedBus network for communication, and processors with different IDs use the RoCE network for communication.	The value must be an integer greater than or equal to 0.	Used by HCCL for dynamic networking to restrict the communication mode between processors. NOTE: This environment variable can be used only under the following conditions. Hardware: Atlas 900 A3 SuperPoD. Software: MindCluster 7.0.RC1 or later, and CANN 8.0.0 or later.
PyTorch, MindSpore, TensorFlow	MINDX_TASK_ID	Elastic Agent/TaskD needs to provide the MINDX_TASK_ID information when registering the gRPC service with ClusterD. For MindIE inference jobs, the value is the same as jobID in the label field of an acjob.	String	Job UID
PyTorch, MindSpore, TensorFlow	APP_TYPE	The value is the same as that of app in the label field of an acjob.	String	-
MindSpore	NPU_POD	Whether a processor is mounted to the current pod	true: A processor has been mounted to the current pod. false: No processor has been mounted to the current pod.	-
	MS_SERVER_NUM	Number of processes whose role is MS_PSERVER	0	Currently, the PS mode is not supported. The value is fixed to 0. NOTE: For details about MS_PSERVER and PS, see MindSpore documents.
	MS_WORKER_NUM	Total number of NPUs used by a job	A positive integer	Total number of NPUs used by a job. For example, if a job uses 64 NPUs, set the value to 64.
	MS_LOCAL_WORKER	Number of NPUs used by pods on each node	A positive integer	For example, if a pod uses four NPUs, set this parameter to 4.
	MS_SCHED_HOST	IP address of the scheduler	Valid IP address	Set this parameter to podIP for the scheduler pod. Set this parameter to clusterIP of the SVC corresponding to the scheduler pod for the worker pod.
	MS_SCHED_PORT	Port for communicating with the scheduler	The port number ranges from 1024 to 65535.	The scheduler pod corresponds to the value of ascendjob-port in SVC. The default value is 2222.
	MS_ROLE	Process role	MS_SCHED: indicates the scheduler process. Only one scheduler is started for a training job. It is responsible for networking and container recovery, but does not execute training code. MS_WORKER: indicates the worker process. Generally, the distributed training process is set to this role.	The worker process registers with the scheduler process to complete the networking.
	MS_NODE_RANK	Node rank of the pod on the local node	The value must be an integer greater than or equal to 0.	Set this parameter to 0 for the scheduler pod. If a processor is mounted to the scheduler pod, worker pod ranks start from 1. If no processor is mounted to the scheduler pod, worker pod ranks start from 0.
TensorFlow	CM_CHIEF_IP	IP address for communicating with the chief	The value is a valid IPv4 or IPv6 address.	Set this parameter to podIP for the chief pod. Set this parameter to clusterIP of the SVC corresponding to the chief pod for the worker pod.
	CM_CHIEF_PORT	Port for communicating with the chief	The value is a number string ranging from 0 to 65520.	The chief pod corresponds to the value of ascendjob-port in SVC. The default value is 2222.
	CM_CHIEF_DEVICE	Logical ID of the device for collecting statistics on the server cluster information on the chief node	0	The value is fixed to 0.
	CM_WORKER_SIZE	Total number of NPUs used by a job	The value ranges from 0 to 32768.	Total number of NPUs used by a job. For example, if a job uses 64 NPUs, set the value to 64.
	CM_LOCAL_WORKER	Number of NPUs used by each pod	A positive integer	For example, if a pod uses four NPUs, set this parameter to 4.
	CM_WORKER_IP	Pod IP address	The value is a valid IPv4 or IPv6 address.	IP address of the current pod.
	CM_RANK	Node rank of the pod on the local node	The value must be an integer greater than or equal to 0.	Set the value to 0 for the chief. The value for the worker increases from 1.
PyTorch, MindSpore	PROCESS_RECOVER	Switch for process-level rescheduling, process-level online recovery, and elastic training	on: enabled off: disabled	This environment variable is injected in process-level rescheduling, process-level online recovery, process-level in-place recovery, and elastic training scenarios.
PyTorch	HIGH_AVAILABILITY	Switch for the MindSpeed-LLM process-level recovery function	Available recovery policies. retry: process-level online recovery. recover: process-level rescheduling. dump: saving dying gasp. elastic-training: elastic training.	-
PyTorch, MindSpore	ELASTIC_PROCESS_RECOVER_ENABLE	Controls process-level rescheduling, process-level online recovery, and dying gasp checkpoint recovery on Elastic Agent.	1: enabled Other values: disabled. If disabled, the related functions of MindIO must be disabled at the same time.	This environment variable is injected in process-level rescheduling, process-level online recovery, and process-level in-place recovery scenarios.
PyTorch, MindSpore	ENABLE_RESTART_FAULT_PROCESS	Controls process-level in-place recovery on Elastic Agent/TaskD.	on: enabled. Other values: disabled.	NOTE: In the PyTorch framework, this function is provided by Elastic Agent/TaskD. In the MindSpore framework, this function is provided by TaskD.
MindSpore	MINDIO_FOR_MINDSPORE	Whether to enable the MindSpore switch for MindIO	1: enables the MindSpore switch for MindIO.	-
MindSpore	MS_ENABLE_TFT	Whether to enable MindSpore process-level recovery	'{TTP:1,UCE:1,ARF:1,HCCE:1,RSC:1}'	The example value enables the dying gasp, process-level online recovery for on-chip memory faults, process-level rescheduling, process-level online recovery for network faults, and pod-level rescheduling.
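
Two of the values in Table 2 are structured strings rather than plain scalars: LOCAL_RANK (for example, {0,1,2,3}) and MS_ENABLE_TFT (for example, '{TTP:1,UCE:1,ARF:1,HCCE:1,RSC:1}'). The following Python sketch shows one way a training script might parse them. It assumes that the values arrive exactly in the brace-wrapped, comma-separated form shown in the table. The function names are hypothetical.

```python
def parse_local_rank(raw):
    """Parse a LOCAL_RANK value such as "{0,1,2,3}" into [0, 1, 2, 3].

    Assumes the brace-wrapped, comma-separated form shown in Table 2.
    """
    body = raw.strip().strip("{}")
    return [int(item) for item in body.split(",") if item.strip()]


def parse_ms_enable_tft(raw):
    """Parse an MS_ENABLE_TFT value into a dict mapping switch name to bool.

    Example: "{TTP:1,UCE:0}" becomes {"TTP": True, "UCE": False}.
    Surrounding single or double quotes, if present, are ignored.
    """
    body = raw.strip().strip("'\"").strip("{}")
    switches = {}
    for item in body.split(","):
        if not item.strip():
            continue
        key, _, value = item.partition(":")
        switches[key.strip()] = value.strip() == "1"
    return switches


def check_local_world_size(local_ranks, local_world_size):
    """Return True if the rank list is 0..local_world_size-1 in order."""
    return local_ranks == list(range(local_world_size))
```

For a pod that uses four NPUs, parse_local_rank("{0,1,2,3}") yields a list of four logical IDs, which check_local_world_size can compare against LOCAL_WORLD_SIZE.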

Environment Variables of Ascend Docker Runtime

Ascend Docker Runtime injects environment variables into the container.

Environment Variable	Description	Value	Remarks
ASCEND_DOCKER_RUNTIME	Indicates whether the Ascend Docker Runtime plugin is installed in the current environment.	True	This environment variable does not exist if Ascend Docker Runtime is not installed.

Environment Variables of Ascend Device Plugin

Ascend Device Plugin injects environment variables into the container. For details about the environment variables, see the following table.

**Table 3** Environment variables injected by Ascend Device Plugin to the container
Environment Variable	Description	Value	Remarks
ASCEND_VISIBLE_DEVICES	If a task requires an NPU device, use ASCEND_VISIBLE_DEVICES to specify the NPU device to be mounted to the container. Otherwise, the NPU device fails to be mounted. If the device ID is used to specify a device, you can specify one or more devices at a time. If the processor ID is used to specify a device, you can specify multiple processors of the same type at a time.	Physical processors (NPUs): ASCEND_VISIBLE_DEVICES=0 indicates that NPU 0 (/dev/davinci0) is mounted to the container. ASCEND_VISIBLE_DEVICES=1,3 indicates that NPUs 1 and 3 are mounted to the container. Virtual processors (vNPUs), static virtualization: The mounting method is the same as that of physical processors. You only need to replace physical processor IDs with virtual processor IDs (vNPU IDs). Virtual processors (vNPUs), dynamic virtualization: ASCEND_VISIBLE_DEVICES=0 indicates that a certain number of AI Cores are allocated from NPU 0.	-
ASCEND_ALLOW_LINK	Specifies whether soft links are allowed in the mounted file or directory. This parameter needs to be specified in the Atlas 500 A2 edge station, Atlas 200I A2 accelerator module, and Atlas 200I DK A2 developer kit.	If ASCEND_ALLOW_LINK is set to True, driver files with soft links can be mounted in the Atlas 500 A2 edge station, Atlas 200I A2 accelerator module, and Atlas 200I DK A2 developer kit. If ASCEND_ALLOW_LINK is set to False or this parameter is not specified, the Atlas 500 A2 edge station, Atlas 200I A2 accelerator module, and Atlas 200I DK A2 developer kit cannot use Ascend Docker Runtime.	-
ASCEND_RUNTIME_OPTIONS	Restricts the processor ID specified by ASCEND_VISIBLE_DEVICES. NODRV indicates that driver-related directories are not mounted. VIRTUAL indicates that the virtual processor is mounted. NODRV,VIRTUAL indicates that the virtual processor is mounted while driver-related directories are not mounted.	ASCEND_RUNTIME_OPTIONS=NODRV ASCEND_RUNTIME_OPTIONS=VIRTUAL ASCEND_RUNTIME_OPTIONS=NODRV,VIRTUAL	-
WORLD_SIZE	Total number of NPUs used by a job	The value must be an integer greater than or equal to 0.	Written only in dynamic vNPU scheduling scenarios.
LOCAL_WORLD_SIZE	Number of NPUs used by pods on each node	The value must be an integer greater than or equal to 0.	Written only in dynamic vNPU scheduling scenarios.
LOCAL_RANK	Logical ID list of the NPUs used by the pod on each node	String	Written only in dynamic vNPU scheduling scenarios. The sequence number starts from 0. For example, if the pod uses four NPUs, set this parameter to {0,1,2,3}.
CM_WORKER_SIZE	Total number of NPUs used by a job	The value must be an integer greater than or equal to 0.	Written only in dynamic vNPU scheduling scenarios.
CM_LOCAL_WORKER	Number of NPUs used by pods on each node	The value must be an integer greater than or equal to 0.	Written only in dynamic vNPU scheduling scenarios.
MS_WORKER_NUM	Total number of NPUs used by a job	The value must be an integer greater than or equal to 0.	Written only in dynamic vNPU scheduling scenarios.
MS_LOCAL_WORKER	Number of NPUs used by pods on each node	The value must be an integer greater than or equal to 0.	Written only in dynamic vNPU scheduling scenarios.
PERF_DUMP_PATH	Path for saving iteration delay and group information	String	Written only in slow node detection scenarios.
PERF_DUMP_CONFIG	Switch of iteration delay and group information	String	Written only in slow node detection scenarios.
KUBELET_PORT	Default port number of kubelet on the current node. (If the kubelet port is not customized, you do not need to set this parameter.)	An integer ranging from 0 to 65535.	If the default kubelet port is changed, set this environment variable to the new port number. If the default kubelet port is not changed, ignore this environment variable.
HOST_IP	Physical IP address of the current node.	The value is a valid IPv4 address.	Fixed value, which is built in the initial YAML file.
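
The ASCEND_VISIBLE_DEVICES and ASCEND_RUNTIME_OPTIONS formats in Table 3 can be illustrated with a short Python sketch. It handles only the comma-separated numeric device IDs and the NODRV and VIRTUAL options shown in the table; any other value forms that Ascend Docker Runtime may accept are outside its scope. The function names are hypothetical, and the /dev/davinciN mapping applies to physical NPUs as in the table's example.

```python
# The two options listed for ASCEND_RUNTIME_OPTIONS in Table 3.
VALID_OPTIONS = {"NODRV", "VIRTUAL"}


def parse_visible_devices(raw):
    """Parse "0" into [0] and "1,3" into [1, 3]."""
    return [int(part) for part in raw.split(",") if part.strip()]


def parse_runtime_options(raw):
    """Parse "NODRV,VIRTUAL" into a set, rejecting unknown options."""
    options = {part.strip() for part in raw.split(",") if part.strip()}
    unknown = options - VALID_OPTIONS
    if unknown:
        raise ValueError(f"unsupported ASCEND_RUNTIME_OPTIONS: {sorted(unknown)}")
    return options


def physical_device_paths(device_ids):
    """Map physical NPU IDs to device nodes, for example 0 to /dev/davinci0."""
    return [f"/dev/davinci{device_id}" for device_id in device_ids]


def describe_mount(visible_devices, runtime_options=""):
    """Summarize what a given pair of variable values requests."""
    ids = parse_visible_devices(visible_devices)
    options = parse_runtime_options(runtime_options)
    return {
        "device_ids": ids,
        "virtual": "VIRTUAL" in options,
        "mount_driver_dirs": "NODRV" not in options,
    }
```

For example, describe_mount("1,3", "NODRV") reports device IDs 1 and 3, no virtual processors, and no driver directory mounts.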

Environment Variables of Elastic Agent

Elastic Agent has reached end of life, and its documentation will be removed on December 30, 2026.

The following table lists environment variables that can be configured when Elastic Agent is used. For details about other environment variables from the source code, see PyTorch documentation.

**Table 4** Environment variables of Elastic Agent
Environment Variable	Description	Value	Remarks
ELASTIC_LOG_PATH	Path where Elastic Agent run logs are written.	String	The path must distinguish between the nodes that store logs. Example: ELASTIC_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/elasticlogs/elastic-log$XDL_IP-$RANK. Replace $XDL_IP with the actual node IP address and $RANK with the actual node rank.
ELASTIC_PROCESS_RECOVER_ENABLE	Controls process-level rescheduling, process-level online recovery, and dying gasp checkpoint recovery on Elastic Agent.	String	1: enabled Other values: disabled If disabled, the related functions of MindIO must be disabled at the same time.
ENABLE_RESTART_FAULT_PROCESS	Controls process-level in-place recovery on Elastic Agent.	String	The value can be on or other values. on: enabled Other values: disabled
RESTART_FAULT_PROCESS_TYPE	Type of the faulty process that Elastic Agent notifies MindIO to restart	String	The value can be worker or pod. worker: The pod does not exit, and only the faulty process is restarted. pod: The pod is restarted.
RANK_TABLE_FILE	Ranktable file path	String	Path of the hccl.json file
PROCESS_RECOVER	Controls process-level rescheduling or process-level online recovery.	String	The value can be on or other values. on: enabled Other values: disabled
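
The ELASTIC_LOG_PATH example in Table 4 combines the task ID, the node IP address, and the node rank so that each node writes to its own log file. The following Python sketch builds the same path from a mapping of environment variables. The function name and the base parameter are hypothetical; the default base directory is taken from the example in the table.

```python
def elastic_log_path(env, base="/job/code/alllogs"):
    """Build a per-node ELASTIC_LOG_PATH value.

    Mirrors the pattern from Table 4:
    <base>/$MINDX_TASK_ID/elasticlogs/elastic-log$XDL_IP-$RANK
    Raises KeyError if any of the three variables is missing.
    """
    task_id = env["MINDX_TASK_ID"]
    node_ip = env["XDL_IP"]
    rank = env["RANK"]
    return f"{base}/{task_id}/elasticlogs/elastic-log{node_ip}-{rank}"
```

Passing os.environ as env produces the path for the current container. A KeyError signals that one of the injected variables is not available.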

Environment Variables of TaskD

The following table lists environment variables that can be configured when TaskD is used. For details about other environment variables from the source code, see PyTorch documentation.

**Table 5** Environment variables of TaskD
Environment Variable	Description	Value	Remarks
TASKD_LOG_PATH	Path where TaskD run logs are written.	String	If the path is not specified, the default path ./taskd_log/taskd.log-worker-{RANK} is used, that is, the taskd_log directory in the current execution path. {RANK} indicates the global rank ID of the current training process.
TASKD_FILE_LOG_LEVEL	Level of logs to be recorded in log files.	String	-
TASKD_STD_LOG_LEVEL	Level of the logs to be printed to the screen.	String	-
TASKD_LOG_STDOUT	Whether to print logs to the screen.	bool	The value can be True or False.
ENABLE_RESTART_FAULT_PROCESS	Controls process-level in-place recovery on TaskD.	String	The value can be on or other values. on: enabled Other values: disabled
RESTART_FAULT_PROCESS_TYPE	Type of the faulty process that TaskD notifies MindIO to restart	String	The value can be worker or pod. worker: The pod does not exit, and only the faulty process is restarted. pod: restarts the pod.
TASKD_PROCESS_ENABLE	Whether to enable the process-level rescheduling, process-level online recovery, process-level in-place recovery, and elastic training functions for TaskD	String	The value can be on or off. on: enabled off: disabled
LOCAL_PROXY_ENABLE	Whether to enable the local proxy (required for security hardening)	String	The value can be on or off. on: enabled off: disabled The default value is off. In communication security hardening scenarios, this parameter must be set to on.
HCCL_ASYNC_ERROR_HANDLING	Whether to enable the watchdog function.	String	The values are as follows: 0: disables the fault detection and process exit functions. 1: enables the fault detection and process exit functions. 2: enables only the fault detection function. The default value is 1.
TASKD_PROCESS_INTERVAL	Processing interval of the TaskD Manager main process.	String	The value ranges from 100 to 1000, in milliseconds.
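
The defaults and value ranges in Table 5 can be expressed as a small Python sketch. It resolves the TaskD log path with the documented fallback, interprets HCCL_ASYNC_ERROR_HANDLING with its default of 1, and validates the 100 to 1000 millisecond range of TASKD_PROCESS_INTERVAL. The function names are hypothetical and the code is not taken from TaskD. Because the table does not state a default for TASKD_PROCESS_INTERVAL, the sketch returns None when the variable is unset.

```python
# Meaning of each HCCL_ASYNC_ERROR_HANDLING value, as listed in Table 5.
WATCHDOG_MODES = {
    "0": {"fault_detection": False, "process_exit": False},
    "1": {"fault_detection": True, "process_exit": True},
    "2": {"fault_detection": True, "process_exit": False},
}


def taskd_log_path(env, rank):
    """Return TASKD_LOG_PATH, or the documented default for this rank."""
    return env.get("TASKD_LOG_PATH") or f"./taskd_log/taskd.log-worker-{rank}"


def watchdog_mode(env):
    """Interpret HCCL_ASYNC_ERROR_HANDLING; the documented default is 1."""
    raw = env.get("HCCL_ASYNC_ERROR_HANDLING", "1")
    if raw not in WATCHDOG_MODES:
        raise ValueError(f"HCCL_ASYNC_ERROR_HANDLING must be 0, 1, or 2: {raw!r}")
    return WATCHDOG_MODES[raw]


def process_interval_ms(env):
    """Parse TASKD_PROCESS_INTERVAL (100 to 1000 ms); None when unset."""
    raw = env.get("TASKD_PROCESS_INTERVAL")
    if raw is None:
        return None
    value = int(raw)  # raises ValueError for non-numeric strings
    if not 100 <= value <= 1000:
        raise ValueError(f"TASKD_PROCESS_INTERVAL out of range 100-1000: {value}")
    return value


def is_switch_on(env, name):
    """Return True only when an on/off style variable is exactly "on"."""
    return env.get(name) == "on"
```

The is_switch_on helper follows the "on: enabled, other values: disabled" convention used by ENABLE_RESTART_FAULT_PROCESS, so an unset variable is treated as disabled.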

Environment Variables of NodeD

**Table 6** Environment variables of NodeD
Environment Variable	Description	Value	Remarks
XDL_IP	IP address of the host where the pod is located. It is used in slow node detection to record and match slow node information.	String	Written in the YAML file of the NodeD component.
