--cluster_config
Description
Specifies the configuration file that describes the logical topology of the target deployment environment. The file is used to generate the hcom group and rank IDs.
If the original foundation model contains communication operators, this option must be configured regardless of whether distributed deployment is enabled. Otherwise, the communication operators may report an error during execution.
See Also
This option is required if the model contains communication operators or if algorithm-based partitioning is enabled (--enable_graph_parallel=1).
Argument
Argument: Path (including the file name) of the logical topology file.
Format: The path (including the file name) can contain letters, digits, underscores (_), hyphens (-), periods (.), and Chinese characters.
Restrictions: The content in the configuration file must be in JSON format.
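The JSON restriction can be checked before ATC is invoked. The following is a minimal sketch, not part of ATC: the helper name load_cluster_config is hypothetical, and the requirement that the top-level value be an object is an assumption based on the sample file in this section. Python's strict parser rejects // annotations; whether ATC itself tolerates them is not stated here, so a file intended for real use is probably best kept free of comments.

```python
import json


def load_cluster_config(text):
    """Parse the text of a logical topology file as strict JSON.

    Hypothetical helper for illustration; raises ValueError with the
    position of the first syntax error.
    """
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"not valid JSON (line {err.lineno}, column {err.colno}): {err.msg}"
        ) from err
    # Assumption: the top-level value is an object, as in the sample file.
    if not isinstance(config, dict):
        raise ValueError("top-level JSON value must be an object")
    return config
```

To check a file on disk, read it first, for example load_cluster_config(open(path, encoding="utf-8").read()), where path is the value passed to --cluster_config.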
Suggestions and Benefits
None
Example
Upload the configuration file (for example, numa_config.json) to any directory (for example, $HOME/conf) on the server where ATC is located. An example is as follows:
atc --model=xxx.air --framework=1 --soc_version=<soc_version> --output=$HOME/out --cluster_config=$HOME/conf/numa_config.json
The following is an example of the logical topology file:
Atlas Training Series Product: The number of device processors in use is cluster_nodes*item_lists. The number of item_ids in each cluster_nodes must be the same.
4p logical network config:
{
  "cluster": [{
    "cluster_nodes": [{
      "node_id": 0,
      "node_type": "ATLAS800",
      "ipaddr": "127.0.0.1", // (Required) IP address for communication on the control plane of a node, string type. For example, the IP address of a training server is the host IP address, and that of a SoC server is the head node IP address.
      "port": 2509, // (Required) port for communication on the control plane of a node, integer type.
      "is_local": true,
      "item_list": [{ "item_id": 0 }, { "item_id": 1 }, { "item_id": 2 }, { "item_id": 3 }]
    }]
  }],
  "item_def": [{ "item_type": "<soc_version>" }],
  "node_def": [{ // Public attributes of nodes of the same type in a cluster.
    "item": [{
      "item_type": "<soc_version>" // (Required) accelerator card type on a node, string type.
    }]
  }]
}
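The device-count rule stated with the sample can be illustrated with a short Python sketch. It reads cluster_nodes*item_lists as "number of nodes multiplied by the number of item_list entries per node", which is an interpretation of the text above rather than a documented formula. The function name count_devices is hypothetical, and flattening nodes across all entries of the cluster array is an assumption.

```python
def count_devices(config):
    """Return the number of device processors described by a parsed config.

    Hypothetical helper: enforces that every node lists the same number
    of item_ids, then multiplies nodes by items per node.
    """
    # Assumption: nodes from every entry of "cluster" are counted together.
    nodes = [
        node
        for cluster in config["cluster"]
        for node in cluster["cluster_nodes"]
    ]
    if not nodes:
        raise ValueError("no cluster_nodes defined")
    items_per_node = {len(node["item_list"]) for node in nodes}
    if len(items_per_node) != 1:
        raise ValueError("every node must list the same number of item_ids")
    return len(nodes) * items_per_node.pop()


# The 4p sample: one node with four item_ids.
sample = {
    "cluster": [{
        "cluster_nodes": [{
            "node_id": 0,
            "item_list": [{"item_id": i} for i in range(4)],
        }]
    }]
}
print(count_devices(sample))  # prints 4
```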
The parameters are described as follows:
| Parameter | Type | Required (Yes/No) | Description |
|---|---|---|---|
| cluster | - | - | Cluster configuration. |
| cluster_nodes | Array of Cluster_node | Yes | Cluster resource information. |
| node_id | Integer | Yes | ID of a node in a cluster. Generally, 0 indicates the primary node. |
| node_type | String | Yes | Node type, for example, ATLAS800. |
| ipaddr | String | Yes | IP address for communication on the control plane of a node. For example, the IP address of a training server is the host IP address, and that of a SoC server is the head node IP address. |
| port | Integer | Yes | Port for communication on the control plane of a node. |
| is_local | BOOL | No | Whether the node in the file is a local node when a cluster contains multiple nodes. Default value: false |
| item_list | Array of item_info | Yes | Accelerator card that executes the job orchestrated and managed by cloud resources. |
| item_id (in item_list) | Integer | Yes | Accelerator card ID on a node. |
| item_def | - | - | Public attributes of accelerator cards of the same type on a node. |
| device_list | Array of device_info | No | Number of physical devices in a processor. You do not need to set this configuration item for the |
| device_id | Integer | Yes | Physical device ID of a processor. |
| item_type | String | Yes | Accelerator card type on a node. |
| node_def | - | - | Public attributes of nodes of the same type in a cluster. |
| item_type (in item) | String | Yes | Accelerator card type on a node. |
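The Type and Required columns of the node-level parameters above can be turned into a simple pre-check. The following Python sketch is illustrative only and is not ATC's own validation: the names NODE_REQUIRED and check_node are hypothetical, and only the fields listed in the table are inspected.

```python
# Required node-level fields and their JSON types, taken from the table.
NODE_REQUIRED = {
    "node_id": int,
    "node_type": str,
    "ipaddr": str,
    "port": int,
    "item_list": list,
}


def check_node(node):
    """Return a list of problems found in one cluster_nodes entry."""
    problems = []
    for key, expected in NODE_REQUIRED.items():
        if key not in node:
            problems.append(f"missing required field: {key}")
            continue
        value = node[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{key} should be {expected.__name__}")
    # is_local is optional; the table gives false as the default.
    if not isinstance(node.get("is_local", False), bool):
        problems.append("is_local should be bool")
    items = node.get("item_list")
    if isinstance(items, list):
        for index, item in enumerate(items):
            item_id = item.get("item_id") if isinstance(item, dict) else None
            if isinstance(item_id, bool) or not isinstance(item_id, int):
                problems.append(f"item_list[{index}] needs an integer item_id")
    return problems


node = {
    "node_id": 0,
    "node_type": "ATLAS800",
    "ipaddr": "127.0.0.1",
    "port": 2509,
    "is_local": True,
    "item_list": [{"item_id": 0}, {"item_id": 1}],
}
print(check_node(node))  # prints []
```

An empty list means that the entry has every required field with the expected type. It does not guarantee that ATC accepts the file.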
Applicability
Dependencies and Restrictions
None