Environment Variable Configuration

Table 1 describes the environment variables of Rec SDK Torch.

Table 1 Environment variables

Environment Variable

Meaning

Mandatory/Optional

Description

INPUT_DIST_THREADS

Number of concurrent threads in the thread pool used by Rec SDK Torch to execute bucketing tasks.

Optional

The value is an integer ranging from 1 to 12. The default value is 6.

POST_INPUT_THREADS

Number of concurrent threads in the thread pool used by Rec SDK Torch to execute hash deduplication tasks.

Optional

The value is an integer ranging from 1 to 12. The default value is 6.
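
A launcher script can check the two thread-count variables before training starts. The following is a minimal sketch, assuming a hypothetical helper named read_int_env (it is not part of Rec SDK Torch) that applies the range 1 to 12 and the default 6 listed above. How the SDK itself reacts to out-of-range values is not stated here, so the sketch simply raises an error.

```python
import os


def read_int_env(name, default, lo, hi, env=None):
    """Hypothetical helper: read an integer variable and enforce lo <= value <= hi."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default  # variable not set: fall back to the documented default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if not lo <= value <= hi:
        raise ValueError(f"{name} must be in [{lo}, {hi}], got {value}")
    return value


# Illustrative settings passed as a dict instead of the real process environment.
settings = {"INPUT_DIST_THREADS": "8"}
input_dist_threads = read_int_env("INPUT_DIST_THREADS", 6, 1, 12, env=settings)
post_input_threads = read_int_env("POST_INPUT_THREADS", 6, 1, 12, env=settings)
```

In this example INPUT_DIST_THREADS resolves to 8 and POST_INPUT_THREADS falls back to 6 because it is not set.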

MASTER_ADDR

IP address of the master node in distributed training.

Optional

IPv4 address. 127.0.0.1 is recommended.

MASTER_PORT

Listening port number in distributed training.

Optional

The value is an integer ranging from 0 to 65520.

LOCAL_RANK

NPU ID of the current process on the local host.

Optional

The value is an integer ranging from 0 to WORLD_SIZE - 1.

WORLD_SIZE

Number of devices involved in training.

Optional

The value is an integer ranging from 1 to 8.
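
The four distributed-training variables above constrain one another: LOCAL_RANK must be smaller than WORLD_SIZE. The following is a minimal sketch of a consistency check, assuming all four variables are set and using a hypothetical function named check_dist_env (not part of Rec SDK Torch). The port number in the example is an arbitrary choice.

```python
import ipaddress


def check_dist_env(env):
    """Validate the distributed-training variables against the ranges in the table."""
    # AddressValueError (a ValueError subclass) is raised for a non-IPv4 string.
    ipaddress.IPv4Address(env["MASTER_ADDR"])
    port = int(env["MASTER_PORT"])
    world_size = int(env["WORLD_SIZE"])
    local_rank = int(env["LOCAL_RANK"])
    if not 0 <= port <= 65520:
        raise ValueError(f"MASTER_PORT out of range: {port}")
    if not 1 <= world_size <= 8:
        raise ValueError(f"WORLD_SIZE out of range: {world_size}")
    if not 0 <= local_rank <= world_size - 1:
        raise ValueError(f"LOCAL_RANK {local_rank} is not below WORLD_SIZE {world_size}")
    return port, world_size, local_rank


# Example values for a two-process run on one host.
example = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "2",
    "LOCAL_RANK": "1",
}
checked = check_dist_env(example)
```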

ASCEND_VISIBLE_DEVICES

Ascend AI Processor devices visible to the program. Use this variable to restrict the program to a subset of the devices.

Mandatory

Use this environment variable to specify the NPU devices used for training. (Run the ls /dev/ | grep davinci* command to query the NPU devices of the host.) Devices are specified by device serial number. You can specify a single device, a range of devices, or a combination of both. Example:

  • ASCEND_VISIBLE_DEVICES=0 indicates that device 0 (/dev/davinci0) is mounted to the container.
  • ASCEND_VISIBLE_DEVICES=1,3 indicates that devices 1 and 3 are mounted to the container.
  • ASCEND_VISIBLE_DEVICES=0-2 indicates that devices 0 to 2 (including devices 0 and 2) are mounted to the container. The effect is the same as that of ASCEND_VISIBLE_DEVICES=0,1,2.

  • ASCEND_VISIBLE_DEVICES=0-2,4 indicates that devices 0 to 2 and device 4 are mounted to the container. The effect is the same as that of ASCEND_VISIBLE_DEVICES=0,1,2,4.
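
The single, range, and mixed forms in the examples above can be expanded into an explicit device list. The following is a minimal sketch, assuming a hypothetical function named parse_visible_devices. It is not the parser used by the Ascend runtime, and the handling of whitespace and duplicate entries is a choice made for this sketch only.

```python
def parse_visible_devices(spec):
    """Expand a spec such as "0-2,4" into a sorted list of device IDs."""
    devices = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid device range: {part!r}")
            devices.update(range(start, end + 1))  # both ends are included
        else:
            devices.add(int(part))  # int("") raises ValueError for empty entries
    return sorted(devices)


single = parse_visible_devices("0")
pair = parse_visible_devices("1,3")
mixed = parse_visible_devices("0-2,4")
```

Here mixed is [0, 1, 2, 4], which matches the equivalence between 0-2,4 and 0,1,2,4 stated above.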

ASCEND_OPP_PATH

Root directory of the operator library.

Mandatory

This variable is set when you run the CANN environment variable configuration script. You are advised not to change the value. The default value is /usr/local/Ascend/cann/opp.

GLOO_SOCKET_IFNAME

NIC configuration for gloo communication.

Optional

Run the ifconfig or ip a command to view the NIC name of the server. The recommended value is lo.

ENABLE_FAST_HASHMAP

Whether to enable the quick hash table.

Optional

The value is a character string. The value true, yes, or 1 indicates that the function is enabled. Other values indicate that the function is disabled. The default value is false.

EMB_MEMORY_POOL_SIZE

Size of the embedding memory pool of the quick hash table.

Optional

The value is an integer in the range [1, 200000]. The default value is 102400.

FAST_HASHMAP_RESERVE_BUCKET_NUM

Number of reserved buckets in the quick hash table.

Optional

The value is an integer in the range [128, 4294967291]. The default value is 2097152.

EMB_MEMORY_POOL_THREAD_NUM

Number of processing threads in the embedding memory pool of the quick hash table.

Optional

The value is an integer in the range [1, 1024]. The default value is 4.

EMBCACHE_SIZE_ON_DEVICE_MEM

On-chip memory embedding cache size (unit: byte).

Optional

The value is an integer in the range [1, available device memory]. The default value is 17179869184 (16 GiB).
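
Because EMBCACHE_SIZE_ON_DEVICE_MEM is given in bytes, it helps to derive the value from a size in binary gigabytes rather than typing the digits by hand. The following is a small sketch, assuming a hypothetical helper named gib_to_bytes. The default 17179869184 equals 16 × 1024³ bytes.

```python
def gib_to_bytes(gib):
    """Convert a whole number of binary gigabytes (1024**3 bytes each) to bytes."""
    return gib * 1024 ** 3


default_cache_bytes = gib_to_bytes(16)
# String form suitable for assigning to EMBCACHE_SIZE_ON_DEVICE_MEM.
env_value = str(default_cache_bytes)
```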

DO_EC_LOCAL_UNIQUE

Whether to enable EC local unique for multi-level cache.

Optional

The value is a character string. The value true, 1, or yes indicates that the function is enabled, and other values indicate that the function is disabled. The default value is false.

LOCAL_UNIQUE_PARALLEL_BATCH_NUM

Number of batches for parallel processing of local unique in EmbCacheTrainPipelineSparseDist.

Optional

The value is an integer. The value ranges from 1 to 24. The default value is 2.

ENABLE_PARALLEL_GLOBAL_UNIQUE

Whether to enable parallel global unique processing.

Optional

The value is a character string. The value 1 indicates that the function is enabled, and other values indicate that the function is disabled. The default value is 0, indicating that the function is disabled.
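
The three string switches do not accept the same set of enabling values: ENABLE_FAST_HASHMAP and DO_EC_LOCAL_UNIQUE accept true, yes, or 1, whereas ENABLE_PARALLEL_GLOBAL_UNIQUE accepts only 1. The following is a minimal sketch of that difference, assuming a hypothetical helper named flag_enabled and assuming an exact, case-sensitive match (case handling is not specified in this table).

```python
# Enabling values for ENABLE_FAST_HASHMAP and DO_EC_LOCAL_UNIQUE.
TRUTHY = frozenset({"true", "yes", "1"})
# ENABLE_PARALLEL_GLOBAL_UNIQUE is enabled only by "1".
ONLY_ONE = frozenset({"1"})


def flag_enabled(name, env, enabling_values=TRUTHY):
    """Return True if the variable is set to one of the enabling values."""
    # An unset variable behaves like the documented default (disabled).
    return env.get(name, "") in enabling_values


# Illustrative settings passed as a dict instead of the real process environment.
settings = {
    "ENABLE_FAST_HASHMAP": "yes",
    "ENABLE_PARALLEL_GLOBAL_UNIQUE": "true",
}
fast_hashmap = flag_enabled("ENABLE_FAST_HASHMAP", settings)
local_unique = flag_enabled("DO_EC_LOCAL_UNIQUE", settings)
parallel_unique = flag_enabled("ENABLE_PARALLEL_GLOBAL_UNIQUE", settings, ONLY_ONE)
```

In this example only fast_hashmap is True: DO_EC_LOCAL_UNIQUE is not set, and true is not an enabling value for ENABLE_PARALLEL_GLOBAL_UNIQUE.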

GLOG_stderrthreshold

Sets the log level of the multi-level cache C++ module.

Optional

The value is an integer. The default value is 0.

Value range:
  • -2: TRACE
  • -1: DEBUG
  • 0: INFO
  • 1: WARN
  • 2: ERROR
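
The value range above maps each integer to a log level name. The following is a minimal sketch of a lookup that a launcher script could use to report the effective level, assuming a hypothetical function named log_level_name. Rejecting values outside the listed range is a choice made for this sketch, not documented behavior of the C++ module.

```python
# Integer-to-name mapping copied from the value range listed above.
GLOG_LEVELS = {-2: "TRACE", -1: "DEBUG", 0: "INFO", 1: "WARN", 2: "ERROR"}


def log_level_name(env):
    """Return the level name selected by GLOG_stderrthreshold (default 0)."""
    level = int(env.get("GLOG_stderrthreshold", "0"))
    if level not in GLOG_LEVELS:
        raise ValueError(f"GLOG_stderrthreshold must be one of {sorted(GLOG_LEVELS)}")
    return GLOG_LEVELS[level]


default_level = log_level_name({})
debug_level = log_level_name({"GLOG_stderrthreshold": "-1"})
```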