Environment Variable Description

Environment Variables Used by MindCluster Components

Table 1 describes the environment variables used by MindCluster components.

Table 1 Environment variable description

Environment Variable

Source

Required (Yes/No)

Value

Description

POD_IP

Written in the YAML file of the deployed component

Yes

IP address of the pod where the current container is located

Specifies the IP address used by ClusterD and TaskD to start the gRPC service.

POD_UID

Written in the YAML file of the deployed component

No

UID of the pod where the current container is located

Specifies the UID used to parse the server_id field in the ranktable file.

ASCEND_DOCKER_RUNTIME

Written by Ascend Docker Runtime during container creation

No

true

Used by Ascend Device Plugin to determine whether the default container runtime on the current node is Ascend Docker Runtime.

HOSTNAME

Written when a container is created in Kubernetes

Yes

Name of the pod where the current container is located

Used by Ascend Device Plugin to obtain the name of the current pod.

NODE_NAME

Written in the YAML file of the deployed component

Yes

Name of the node where the current container is located

Used by Ascend Device Plugin, NodeD, and ClusterD to obtain the name of the current node.

LD_LIBRARY_PATH

Written in Dockerfile

Yes

File path

Used by Ascend Device Plugin and NPU Exporter to initialize the DCMI.

BATCH_BIND_NUM

-

No

Numeric string

Specifies the number of pods that Volcano binds in one batch.

MULTI_SCHEDULER_ENABLE

-

No

true or false

Specifies whether Volcano is in the multi-scheduler scenario.

SCHEDULER_POD_NAME

-

No

String

Specifies the pod name of the Volcano scheduler.

SCHEDULER_NUM

-

No

Numeric string

Specifies the number of Volcano schedulers.

PANIC_ON_ERROR

-

No

true or false

Specifies whether to panic when an error occurs in the Volcano scheduler.

KUBECONFIG

-

No

File path

Specifies the path of the kubeconfig file that Volcano uses to connect to the Kubernetes API server.

HOME

Written when a container is created in Kubernetes

Yes

Folder path

Specifies the home directory of the current user, as read by Volcano.

DEBUG_SOCKET_DIR

-

No

Socket file path

Specifies the socket path that Volcano listens on.

HCCL_CONNECT_TIMEOUT

Written in the training script

No

Timeout interval of HCCL link establishment

Specifies the link establishment timeout interval.

TTP_PORT

Written in the YAML file of the deployed component

Yes

Communication port used by MindIO TTP

Specifies the port on which MindIO Controller is started.

SSH_CLIENT

Environment variable set on the SSH server, which contains information about the client connection.

Yes

Information about the current client connection.

The connection information is recorded in the operation log during Ascend Docker Runtime installation.

TASKD_LOG_PATH

-

No

String

Specifies the path to which TaskD run logs are written.

MINDX_SERVER_IP

Written by Ascend Operator during container creation

Yes

String

Specifies the IP address used by a job to communicate with ClusterD. It is the same as the service IP address of clusterd-grpc-svc.

MINDX_TASK_ID

Written by Ascend Operator during container creation

No

For MindIE inference jobs, the value is the same as jobID in the label field of an acjob.

Specifies the task ID that Elastic Agent/TaskD provides when registering the gRPC service and the TaskD profiling function with ClusterD.

GROUP_BASE_DIR

Written in the job startup script

No

Folder path

Specifies the path for exporting the parallelism domain information of the TaskD component.

MINDIO_WAIT_MINDX_TIME

Written in the job YAML

No

A number string ranging from 1 to 3600

Specifies the timeout interval for waiting for the scheduling of the faulty pod when process-level rescheduling is disabled and elastic training is enabled.

RAS_NET_ROOT_PATH

User-defined

No

Root path of the shared directory of ClusterD and NodeD

In the slow network diagnosis scenario, ClusterD and NodeD interact with each other through shared storage. For details, see Slow Network Diagnosis.
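As a rough illustration of how a component or script might consume the variables in Table 1, the following Python sketch checks that a set of required variables is present and validates MINDIO_WAIT_MINDX_TIME against the documented range of 1 to 3600. The helper names (missing_required, parse_wait_time) and the sample values are assumptions made for this example and are not part of MindCluster.

```python
from typing import Iterable, List, Mapping, Optional


def missing_required(env: Mapping[str, str], names: Iterable[str]) -> List[str]:
    """Return the names from `names` that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


def parse_wait_time(env: Mapping[str, str]) -> Optional[int]:
    """Parse MINDIO_WAIT_MINDX_TIME, a numeric string from 1 to 3600.

    Returns None when the variable is not set, because it is optional.
    """
    raw = env.get("MINDIO_WAIT_MINDX_TIME")
    if raw is None:
        return None
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"MINDIO_WAIT_MINDX_TIME must be a numeric string, got {raw!r}")
    value = int(raw)
    if not 1 <= value <= 3600:
        raise ValueError(f"MINDIO_WAIT_MINDX_TIME must be in [1, 3600], got {value}")
    return value


if __name__ == "__main__":
    # Hypothetical environment, standing in for os.environ inside a container.
    example_env = {"POD_IP": "10.0.0.12", "MINDIO_WAIT_MINDX_TIME": "600"}
    print(missing_required(example_env, ["POD_IP", "NODE_NAME"]))
    print(parse_wait_time(example_env))
```

Which variables count as required depends on the component, so the list is passed in by the caller rather than hard-coded.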

Environment Variables of Ascend Operator

Ascend Operator provides environment variables for distributed training jobs (acjob) of different AI frameworks. The following table describes the environment variables.

Table 2 Training environment variables injected by Ascend Operator

Framework

Environment Variable

Description

Value

Remarks

PyTorch

MASTER_ADDR

IP address for communicating with the master node

The value is a valid IPv4 or IPv6 address.

  • Set this parameter to podIP for the master pod.
  • Set this parameter to clusterIP of the SVC corresponding to the master pod for the worker pod.

MASTER_PORT

Port for communicating with the master node

The value is a number string ranging from 0 to 65520.

The master pod corresponds to the value of ascendjob-port in SVC. The default value is 2222.

WORLD_SIZE

Total number of NPUs used by a job

A positive integer

Total number of NPUs used by a job. For example, if a job uses 64 NPUs, set the value to 64.

RANK

Node rank of the pod on the local node

The value must be an integer greater than or equal to 0.

The value for Master is 0, and that for Worker increases from 1.

LOCAL_WORLD_SIZE

Number of NPUs used by pods on each node

The value must be an integer greater than or equal to 0.

For example, if a pod uses four NPUs, set this parameter to 4.

LOCAL_RANK

Logical ID list of the NPUs used by the pod on each node

String

Set this parameter based on the number of NPUs used by the pod. The value starts from 0. For example, if the pod uses four NPUs, set this parameter to {0,1,2,3}.

PyTorch, MindSpore, TensorFlow

HostNetwork

Value of the hostNetwork field in the YAML file of the current job

  • true: The host IP address is used to create a pod.
  • false: The host IP address is not used to create a pod.

If the cluster scale is large (the number of nodes is greater than 1000), you are advised to use the host IP address to create a pod.

MINDX_SERVER_IP

IP address used by a job to communicate with ClusterD. It is the same as the service IP address of clusterd-grpc-svc.

The value is a valid IPv4 or IPv6 address.

-

PyTorch, MindSpore, TensorFlow

HCCL_LOGIC_SUPERPOD_ID

Processors with the same ID use the UnifiedBus network for communication, and processors with different IDs use the RoCE network for communication.

The value must be an integer greater than or equal to 0.

Used by HCCL for dynamic networking to restrict the communication mode between processors.

NOTE:

This environment variable can be used only under the following conditions:

  • Hardware: Atlas 900 A3 SuperPoD
  • Software: MindCluster 7.0.RC1 or later, CANN 8.0.0 or later

PyTorch, MindSpore, TensorFlow

MINDX_TASK_ID

Elastic Agent/TaskD needs to provide the MINDX_TASK_ID information when registering the gRPC service with ClusterD.

For MindIE inference jobs, the value is the same as jobID in the label field of an acjob.

String

Job UID

APP_TYPE

The value is the same as that of app in the label field of an acjob.

String

-

MindSpore

NPU_POD

Whether a processor is mounted to the current pod

  • true: A processor has been mounted to the current pod.
  • false: No processor has been mounted to the current pod.

-

MS_SERVER_NUM

Number of processes whose role is MS_PSERVER

0

Currently, the PS mode is not supported. The value is fixed to 0.

NOTE:

For details about MS_PSERVER and PS, see MindSpore documents.

MS_WORKER_NUM

Total number of NPUs used by a job

A positive integer

Total number of NPUs used by a job. For example, if a job uses 64 NPUs, set the value to 64.

MS_LOCAL_WORKER

Number of NPUs used by pods on each node

A positive integer

For example, if a pod uses four NPUs, set this parameter to 4.

MS_SCHED_HOST

IP address of the scheduler

Valid IP address

  • Set this parameter to podIP for the scheduler pod.
  • Set this parameter to clusterIP of the SVC corresponding to the scheduler pod for the worker pod.

MS_SCHED_PORT

Port for communicating with the scheduler

The port number ranges from 1024 to 65535.

The scheduler pod corresponds to the value of ascendjob-port in SVC. The default value is 2222.

MS_ROLE

Process role

  • MS_SCHED: indicates the scheduler process. Only one scheduler is started for a training job. It is responsible for networking and container recovery, but does not execute training code.
  • MS_WORKER: indicates the worker process. Generally, the distributed training process is set to this role.

The worker process registers with the scheduler process to complete the networking.

MS_NODE_RANK

Node rank of the pod on the local node

The value must be an integer greater than or equal to 0.

Set this parameter to 0 for the scheduler pod.

  • When a processor is mounted to the scheduler, the rank of worker pods starts from 1.
  • When no processor is mounted to the scheduler, the rank of worker pods starts from 0.

TensorFlow

CM_CHIEF_IP

IP address for communicating with the chief

The value is a valid IPv4 or IPv6 address.

  • Set this parameter to podIP for the chief pod.
  • Set this parameter to clusterIP of the SVC corresponding to the chief pod for the worker pod.

CM_CHIEF_PORT

Port for communicating with the chief

The value is a number string ranging from 0 to 65520.

The chief pod corresponds to the value of ascendjob-port in SVC. The default value is 2222.

CM_CHIEF_DEVICE

Logical ID of the device for collecting statistics on the server cluster information on the chief node

0

The value is fixed to 0.

CM_WORKER_SIZE

Total number of NPUs used by a job

The value ranges from 0 to 32768.

Total number of NPUs used by a job. For example, if a job uses 64 NPUs, set the value to 64.

CM_LOCAL_WORKER

Number of NPUs used by each pod

A positive integer

For example, if a pod uses four NPUs, set this parameter to 4.

CM_WORKER_IP

Pod IP address

The value is a valid IPv4 or IPv6 address.

IP address of the current pod.

CM_RANK

Node rank of the pod on the local node

The value must be an integer greater than or equal to 0.

  • Set the value to 0 for the chief.
  • The value for the worker increases from 1.

PyTorch, MindSpore

PROCESS_RECOVER

Switch for process-level rescheduling, process-level online recovery, and elastic training

  • on: enabled
  • off: disabled

This environment variable is injected in process-level rescheduling, process-level online recovery, process-level in-place recovery, and elastic training scenarios.

PyTorch

HIGH_AVAILABILITY

Switch for the MindSpeed-LLM process-level recovery function

Available recovery policy.

  • retry: process-level online recovery
  • recover: process-level rescheduling
  • dump: dying gasp checkpoint saving
  • elastic-training: elastic training

PyTorch, MindSpore

ELASTIC_PROCESS_RECOVER_ENABLE

Controls process-level rescheduling, process-level online recovery, and dying gasp checkpoint recovery on Elastic Agent.

  • 1: enabled
  • Other values: disabled. If disabled, the related functions of MindIO must be disabled at the same time.

This environment variable is injected in process-level rescheduling, process-level online recovery, and process-level in-place recovery scenarios.

PyTorch, MindSpore

ENABLE_RESTART_FAULT_PROCESS

Controls process-level in-place recovery on Elastic Agent/TaskD.

  • on: enabled
  • Other values: disabled

NOTE:

  • In the PyTorch framework, this function is provided by Elastic Agent/TaskD.
  • In the MindSpore framework, this function is provided by TaskD.

MindSpore

MINDIO_FOR_MINDSPORE

Whether to enable the MindSpore switch for MindIO

1: enables the MindSpore switch for MindIO.

MindSpore

MS_ENABLE_TFT

Whether to enable MindSpore process-level recovery

'{TTP:1,UCE:1,ARF:1,HCCE:1,RSC:1}'     # Enables the dying gasp, process-level online recovery for on-chip memory faults, process-level rescheduling, process-level online recovery for network faults, and pod-level rescheduling.
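To make the PyTorch rows of Table 2 more concrete, here is a minimal Python sketch of how a training script might read and validate the variables that Ascend Operator injects. The TorchJobEnv class and both helper functions are hypothetical names introduced for this example. The sketch also assumes that LOCAL_RANK literally has the brace-delimited form shown in the table, such as {0,1,2,3}.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class TorchJobEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int
    local_world_size: int
    local_rank: List[int]


def parse_local_rank(raw: str) -> List[int]:
    """Parse a logical ID list such as "{0,1,2,3}" into [0, 1, 2, 3]."""
    inner = raw.strip().strip("{}")
    return [int(part) for part in inner.split(",") if part.strip()]


def load_torch_job_env(env: Mapping[str, str]) -> TorchJobEnv:
    """Read the PyTorch variables from Table 2 and check the documented ranges."""
    # ip_address() raises ValueError unless the value is a valid IPv4 or IPv6 address.
    master_addr = str(ipaddress.ip_address(env["MASTER_ADDR"]))
    master_port = int(env["MASTER_PORT"])
    if not 0 <= master_port <= 65520:
        raise ValueError(f"MASTER_PORT out of range: {master_port}")
    world_size = int(env["WORLD_SIZE"])
    if world_size <= 0:
        raise ValueError(f"WORLD_SIZE must be a positive integer: {world_size}")
    rank = int(env["RANK"])
    if rank < 0:
        raise ValueError(f"RANK must be >= 0: {rank}")
    local_world_size = int(env["LOCAL_WORLD_SIZE"])
    if local_world_size < 0:
        raise ValueError(f"LOCAL_WORLD_SIZE must be >= 0: {local_world_size}")
    return TorchJobEnv(
        master_addr=master_addr,
        master_port=master_port,
        world_size=world_size,
        rank=rank,
        local_world_size=local_world_size,
        local_rank=parse_local_rank(env["LOCAL_RANK"]),
    )
```

In a real job the function would typically be called as load_torch_job_env(os.environ). Passing a mapping keeps the sketch easy to exercise outside a cluster.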

Environment Variables of Ascend Docker Runtime

Ascend Docker Runtime injects environment variables into the container.

Environment Variable

Description

Value

Remarks

ASCEND_DOCKER_RUNTIME

Indicates whether the Ascend Docker Runtime plugin is installed in the current environment.

True

This environment variable does not exist if Ascend Docker Runtime is not installed.

Environment Variables of Ascend Device Plugin

Ascend Device Plugin injects environment variables into the container. For details about the environment variables, see the following table.

Table 3 Environment variables injected by Ascend Device Plugin to the container

Environment Variable

Description

Value

Remarks

ASCEND_VISIBLE_DEVICES

If a task requires an NPU device, use ASCEND_VISIBLE_DEVICES to specify the NPU device to be mounted to the container. Otherwise, the NPU device fails to be mounted. If the device ID is used to specify a device, you can specify one or more devices at a time. If the processor ID is used to specify a device, you can specify multiple processors of the same type at a time.

  • Physical processors (NPUs)
    • ASCEND_VISIBLE_DEVICES=0 indicates that NPU 0 (/dev/davinci0) is mounted to the container.
    • ASCEND_VISIBLE_DEVICES=1,3 indicates that NPUs 1 and 3 are mounted to the container.
  • Virtual processors (vNPUs)
    • Static virtualization: The mounting method is the same as that of physical processors. You only need to replace physical processor IDs with virtual processor IDs (vNPU IDs).
    • Dynamic virtualization:

      ASCEND_VISIBLE_DEVICES=0 indicates that a certain number of AI Cores are allocated from NPU 0.

-

ASCEND_ALLOW_LINK

Specifies whether soft links are allowed in the mounted file or directory. This parameter needs to be specified in the Atlas 500 A2 edge station, Atlas 200I A2 accelerator module, and Atlas 200I DK A2 developer kit.

  • If ASCEND_ALLOW_LINK is set to True, driver files with soft links can be mounted in the Atlas 500 A2 edge station, Atlas 200I A2 accelerator module, and Atlas 200I DK A2 developer kit.
  • If ASCEND_ALLOW_LINK is set to False or this parameter is not specified, the Atlas 500 A2 edge station, Atlas 200I A2 accelerator module, and Atlas 200I DK A2 developer kit cannot use Ascend Docker Runtime.

-

ASCEND_RUNTIME_OPTIONS

Restricts the processor ID specified by ASCEND_VISIBLE_DEVICES.

  • NODRV indicates that driver-related directories are not mounted.
  • VIRTUAL indicates that the virtual processor is mounted.
  • NODRV,VIRTUAL indicates that the virtual processor is mounted while driver-related directories are not mounted.

Examples:

  • ASCEND_RUNTIME_OPTIONS=NODRV
  • ASCEND_RUNTIME_OPTIONS=VIRTUAL
  • ASCEND_RUNTIME_OPTIONS=NODRV,VIRTUAL

-

WORLD_SIZE

Total number of NPUs used by a job

The value must be an integer greater than or equal to 0.

Written only in dynamic vNPU scheduling scenarios.

LOCAL_WORLD_SIZE

Number of NPUs used by pods on each node

The value must be an integer greater than or equal to 0.

Written only in dynamic vNPU scheduling scenarios.

LOCAL_RANK

Logical ID list of the NPUs used by the pod on each node

String

Written only in dynamic vNPU scheduling scenarios.

The sequence number starts from 0. For example, if the pod uses four NPUs, set this parameter to {0,1,2,3}.

CM_WORKER_SIZE

Total number of NPUs used by a job

The value must be an integer greater than or equal to 0.

Written only in dynamic vNPU scheduling scenarios.

CM_LOCAL_WORKER

Number of NPUs used by pods on each node

The value must be an integer greater than or equal to 0.

Written only in dynamic vNPU scheduling scenarios.

MS_WORKER_NUM

Total number of NPUs used by a job

The value must be an integer greater than or equal to 0.

Written only in dynamic vNPU scheduling scenarios.

MS_LOCAL_WORKER

Number of NPUs used by pods on each node

The value must be an integer greater than or equal to 0.

Written only in dynamic vNPU scheduling scenarios.

PERF_DUMP_PATH

Path for saving iteration delay and group information

String

Written only in slow node detection scenarios.

PERF_DUMP_CONFIG

Switch of iteration delay and group information

String

Written only in slow node detection scenarios.

KUBELET_PORT

Port number of kubelet on the current node. Set this variable only if the kubelet port has been customized.

An integer ranging from 0 to 65535.

If the default kubelet port is changed, set this environment variable to the new port number.

If the default kubelet port is not changed, ignore this environment variable.

HOST_IP

Physical IP address of the current node.

The value is a valid IPv4 address.

Fixed value, which is built in the initial YAML file.
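The value formats of ASCEND_VISIBLE_DEVICES and ASCEND_RUNTIME_OPTIONS in Table 3 can be illustrated with a small Python sketch. It covers only the comma-separated numeric form (for example 0 or 1,3) and the NODRV and VIRTUAL options shown in the table. Any other accepted syntax is outside its scope. The function names are hypothetical and do not come from Ascend Device Plugin or Ascend Docker Runtime.

```python
from typing import FrozenSet, List

# Options listed in Table 3 for ASCEND_RUNTIME_OPTIONS.
KNOWN_RUNTIME_OPTIONS = frozenset({"NODRV", "VIRTUAL"})


def parse_visible_devices(raw: str) -> List[int]:
    """Parse a comma-separated ID list such as "1,3" into [1, 3]."""
    ids = []
    for part in raw.split(","):
        part = part.strip()
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"unsupported device ID: {part!r}")
        ids.append(int(part))
    return ids


def parse_runtime_options(raw: str) -> FrozenSet[str]:
    """Parse a value such as "NODRV,VIRTUAL" into a set of option names."""
    options = frozenset(p.strip() for p in raw.split(",") if p.strip())
    unknown = options - KNOWN_RUNTIME_OPTIONS
    if unknown:
        raise ValueError(f"unsupported ASCEND_RUNTIME_OPTIONS: {sorted(unknown)}")
    return options


if __name__ == "__main__":
    print(parse_visible_devices("1,3"))
    print(sorted(parse_runtime_options("NODRV,VIRTUAL")))
```

Rejecting unknown tokens early is a design choice of this sketch. It makes typos visible before a container is started with an unintended device set.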

Environment Variables of Elastic Agent

Elastic Agent has reached its end of life and its documentation will be deleted on the 30th of December, 2026.

The following table lists environment variables that can be configured when Elastic Agent is used. For details about other environment variables from the source code, see PyTorch documentation.

Table 4 Environment variables of Elastic Agent

Environment Variable

Description

Value

Remarks

ELASTIC_LOG_PATH

Path to which the run logs of Elastic Agent are written.

String

The log path must identify the node that stores the logs. Example:

ELASTIC_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/elasticlogs/elastic-log$XDL_IP-$RANK
# Replace $XDL_IP with the actual node IP address.
# Replace $RANK with the actual node rank.

ELASTIC_PROCESS_RECOVER_ENABLE

Controls process-level rescheduling, process-level online recovery, and dying gasp checkpoint recovery on Elastic Agent.

String

  • 1: enabled
  • Other values: disabled

    If disabled, the related functions of MindIO must be disabled at the same time.

ENABLE_RESTART_FAULT_PROCESS

Controls process-level in-place recovery on Elastic Agent.

String

The value can be on or other values.

  • on: enabled
  • Other values: disabled

RESTART_FAULT_PROCESS_TYPE

Type of the faulty process that Elastic Agent notifies MindIO to restart

String

The value can be worker or pod.

  • worker: The pod does not exit, and only the faulty process is restarted.
  • pod: The pod is restarted.

RANK_TABLE_FILE

Ranktable file path

String

Path of the hccl.json file

PROCESS_RECOVER

Controls process-level rescheduling or process-level online recovery.

String

The value can be on or other values.

  • on: enabled
  • Other values: disabled
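The per-node ELASTIC_LOG_PATH layout shown in Table 4 can also be assembled in a job startup script. The following Python sketch builds the same path from MINDX_TASK_ID, XDL_IP, and RANK. The function name and the base_dir default are assumptions taken from the example path in the table, not a fixed requirement.

```python
import posixpath
from typing import Mapping


def elastic_log_path(env: Mapping[str, str], base_dir: str = "/job/code/alllogs") -> str:
    """Build a per-node Elastic Agent log path from the job environment."""
    task_id = env["MINDX_TASK_ID"]  # job identifier
    node_ip = env["XDL_IP"]         # IP address of the node
    rank = env["RANK"]              # node rank
    # posixpath keeps "/" separators regardless of where the script runs.
    return posixpath.join(base_dir, task_id, "elasticlogs", f"elastic-log{node_ip}-{rank}")


if __name__ == "__main__":
    # Hypothetical values, standing in for os.environ inside a container.
    sample = {"MINDX_TASK_ID": "job-123", "XDL_IP": "10.0.0.5", "RANK": "0"}
    print(elastic_log_path(sample))
```

Including both the node IP address and the rank in the file name keeps the path distinct per node.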

Environment Variables of TaskD

The following table lists environment variables that can be configured when TaskD is used. For details about other environment variables from the source code, see PyTorch documentation.

Table 5 Environment variables of TaskD

Environment Variable

Description

Value

Remarks

TASKD_LOG_PATH

Path to which TaskD run logs are written.

String

If the path is not specified, the default path ./taskd_log/taskd.log-worker-{RANK} is used, that is, the taskd_log directory in the current execution path is used.

{RANK} indicates the global rank ID of the current training process.

TASKD_FILE_LOG_LEVEL

Level of logs to be recorded in log files.

String

-

TASKD_STD_LOG_LEVEL

Level of the logs to be printed to the screen.

String

-

TASKD_LOG_STDOUT

Whether to print logs to the screen.

bool

The value can be True or False.

ENABLE_RESTART_FAULT_PROCESS

Controls process-level in-place recovery on TaskD.

String

The value can be on or other values.

  • on: enabled
  • Other values: disabled

RESTART_FAULT_PROCESS_TYPE

Type of the faulty process that TaskD notifies MindIO to restart

String

The value can be worker or pod.

  • worker: The pod does not exit, and only the faulty process is restarted.
  • pod: The pod is restarted.

TASKD_PROCESS_ENABLE

Whether to enable the process-level rescheduling, process-level online recovery, process-level in-place recovery, and elastic training functions for TaskD

String

The value can be on or off.

  • on: enabled
  • off: disabled

LOCAL_PROXY_ENABLE

Whether to enable the local proxy (required for security hardening)

String

The value can be on or off.

  • on: enabled
  • off: disabled

The default value is off. In communication security hardening scenarios, this parameter must be set to on.

HCCL_ASYNC_ERROR_HANDLING

Whether to enable the watchdog function.

String

The values are as follows:

  • 0: disables the fault detection and process exit functions.
  • 1: enables the fault detection and process exit functions.
  • 2: enables only the fault detection function.

The default value is 1.

TASKD_PROCESS_INTERVAL

Processing interval of the TaskD Manager main process.

String

The value ranges from 100 to 1000, in milliseconds.
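The defaults and ranges in Table 5 can be summarized in a short Python sketch. It resolves TASKD_LOG_PATH to the documented default when the variable is unset, checks TASKD_PROCESS_INTERVAL against the range of 100 to 1000 milliseconds, and applies the default of 1 for HCCL_ASYNC_ERROR_HANDLING. The function names are hypothetical. Treating an empty TASKD_LOG_PATH the same as an unset one is an assumption of this sketch, not documented TaskD behavior.

```python
from typing import Mapping, Optional


def taskd_log_path(env: Mapping[str, str], rank: int) -> str:
    """Return TASKD_LOG_PATH, or the documented default path when it is unset."""
    # Assumption: an empty value is treated like an unset variable.
    return env.get("TASKD_LOG_PATH") or f"./taskd_log/taskd.log-worker-{rank}"


def taskd_process_interval_ms(env: Mapping[str, str]) -> Optional[int]:
    """Parse TASKD_PROCESS_INTERVAL (100 to 1000 ms); None when it is unset."""
    raw = env.get("TASKD_PROCESS_INTERVAL")
    if raw is None:
        return None
    value = int(raw)
    if not 100 <= value <= 1000:
        raise ValueError(f"TASKD_PROCESS_INTERVAL must be in [100, 1000], got {value}")
    return value


def hccl_async_error_handling(env: Mapping[str, str]) -> int:
    """Parse HCCL_ASYNC_ERROR_HANDLING (0, 1, or 2), defaulting to 1."""
    raw = env.get("HCCL_ASYNC_ERROR_HANDLING", "1")
    if raw not in {"0", "1", "2"}:
        raise ValueError(f"HCCL_ASYNC_ERROR_HANDLING must be 0, 1, or 2, got {raw!r}")
    return int(raw)
```

For example, taskd_log_path({}, 3) resolves to ./taskd_log/taskd.log-worker-3, matching the default pattern in the table with {RANK} replaced by the global rank ID.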

Environment Variables of NodeD

Table 6 Environment variables of NodeD

Environment Variable

Description

Value

Remarks

XDL_IP

IP address of the host where the pod is located. It is used in slow node detection to record and match slow node information.

String

Written in the YAML file of the NodeD component.