--cluster_config

Description

Specifies the configuration file describing the logical topology of the target deployment environment. It is used to generate the hcom communication group and rank IDs.

If the original foundation model contains communication operators, this option must be configured regardless of whether distributed deployment is enabled. Otherwise, an error may be reported when the communication operators are executed.

See Also

This option is required if the model contains communication operators or if algorithm-based partitioning is enabled (--enable_graph_parallel=1).

Argument

Argument: Path (including the file name) of the logical topology file.

Format: The path (including the file name) can contain letters, digits, underscores (_), hyphens (-), periods (.), and Chinese characters.

Restrictions: The content in the configuration file must be in JSON format.
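Since ATC only reports an error at conversion time, it can be convenient to pre-check the configuration file yourself. The following is a minimal sketch (not part of the ATC tooling) that checks the path against the documented character set and confirms the file parses as JSON; the regular expression is an assumption chosen to mirror the Format rule above, not an official ATC check.

```python
import json
import re

# Characters the documentation allows in the path: letters, digits,
# underscores, hyphens, periods, and Chinese characters ("/" is added
# here as the directory separator). This pattern is an illustrative
# assumption, not an official ATC validation rule.
ALLOWED = re.compile(r"^[A-Za-z0-9_\-./\u4e00-\u9fff]+$")

def check_cluster_config(path: str) -> dict:
    """Reject unsupported path characters, then parse the file as JSON."""
    if not ALLOWED.match(path):
        raise ValueError(f"path contains unsupported characters: {path!r}")
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # raises json.JSONDecodeError if malformed
```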

Suggestions and Benefits

None

Example

Upload the configuration file (for example, numa_config.json) to any directory (for example, $HOME/conf) on the server where ATC is located. An example is as follows:

atc --model=xxx.air --framework=1 --soc_version=<soc_version> --output=$HOME/out --cluster_config=$HOME/conf/numa_config.json 

The following is an example of the logical topology file:

  • Atlas Training Series Product: The number of processors in use equals the number of cluster_nodes entries multiplied by the number of item_list entries per node. Each cluster_nodes entry must contain the same number of item_ids.
    4p logical network config (the // comments are explanatory only and must not appear in the actual file, which must be valid JSON):
    {
      "cluster": [{
        "cluster_nodes": [{
          "node_id": 0,
          "node_type": "ATLAS800",
          "ipaddr": "127.0.0.1",   // (Required) IP address for communication on the control plane of a node, string type. For a training server this is the host IP address; for a SoC server it is the head node IP address.
          "port": 2509,            // (Required) Port for communication on the control plane of a node, integer type.
          "is_local": true,
          "item_list": [
            { "item_id": 0 },
            { "item_id": 1 },
            { "item_id": 2 },
            { "item_id": 3 }
          ]
        }]
      }],
      "item_def": [{
        "item_type": "<soc_version>"
      }],
      "node_def": [{               // Public attributes of nodes of the same type in a cluster.
        "item": [{
          "item_type": "<soc_version>"   // (Required) Accelerator card type on a node, string type.
        }]
      }]
    }
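The constraint stated above (equal item_id counts per node, total processors = number of cluster_nodes × item_list size) can be checked programmatically. The sketch below is illustrative only and is not part of ATC; the function name is an assumption.

```python
# Minimal sketch checking the documented constraint: every cluster_nodes
# entry must contain the same number of item_ids, and the total number of
# processors in use is len(cluster_nodes) * len(item_list).
def count_processors(config: dict) -> int:
    nodes = [n for c in config["cluster"] for n in c["cluster_nodes"]]
    sizes = {len(n["item_list"]) for n in nodes}
    if len(sizes) != 1:
        raise ValueError("all cluster_nodes must have the same number of item_ids")
    return len(nodes) * sizes.pop()
```

For the 4p example above (one node, four item_ids), the function returns 4.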

The parameters are described as follows:

Table 1 Parameters

| Parameter | Type | Required (Yes/No) | Description |
| --- | --- | --- | --- |
| cluster | - | - | Cluster configuration. |
| cluster_nodes | Array of Cluster_node | Yes | Cluster resource information. |
| node_id | Integer | Yes | ID of a node in a cluster. Generally, 0 indicates the primary node. |
| node_type | String | Yes | Node type, for example, ATLAS800. |
| ipaddr | String | Yes | IP address for communication on the control plane of a node. For example, the IP address of a training server is the host IP address, and that of a SoC server is the head node IP address. |
| port | Integer | Yes | Port for communication on the control plane of a node. |
| is_local | BOOL | No | Whether the node in the file is the local node when a cluster contains multiple nodes. Default value: false. |
| item_list | Array of item_info | Yes | Accelerator cards that execute the job, orchestrated and managed by cloud resources. |
| item_id | Integer | Yes | Accelerator card ID on a node. |
| item_def | - | - | Public attributes of accelerator cards of the same type on a node. |
| device_list | Array of device_info | No | Physical devices in a processor. You do not need to set this item for the Atlas Training Series Product. |
| device_id | Integer | Yes | Physical device ID of a processor. |
| item_type | String | Yes | Accelerator card type on a node. |
| node_def | - | - | Public attributes of nodes of the same type in a cluster. |
| item | - | - | - |
| item_type | String | Yes | Accelerator card type on a node. |
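For a single-node deployment, a topology file with the fields from Table 1 can be generated rather than written by hand. The helper below is a sketch for illustration only; `make_cluster_config` is not part of the ATC tooling, and `<soc_version>` remains a placeholder to be replaced with the actual accelerator card type.

```python
import json

def make_cluster_config(node_type: str, item_type: str, num_cards: int,
                        ipaddr: str = "127.0.0.1", port: int = 2509) -> dict:
    """Build a single-node logical topology dict using the fields from Table 1.

    Illustrative helper only; not part of the ATC tooling.
    """
    return {
        "cluster": [{
            "cluster_nodes": [{
                "node_id": 0,          # 0 generally indicates the primary node
                "node_type": node_type,
                "ipaddr": ipaddr,      # control-plane IP address of the node
                "port": port,          # control-plane port of the node
                "is_local": True,
                "item_list": [{"item_id": i} for i in range(num_cards)],
            }]
        }],
        "item_def": [{"item_type": item_type}],
        "node_def": [{"item": [{"item_type": item_type}]}],
    }

# Write a 4-card config like the example above:
with open("numa_config.json", "w", encoding="utf-8") as f:
    json.dump(make_cluster_config("ATLAS800", "<soc_version>", 4), f, indent=2)
```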

Applicability

Atlas Training Series Product

Dependencies and Restrictions

None