numa_config.json Configuration

Table 1 describes all fields of numa_config.json.

**Table 1** Field description
Parameter	Type	Required	Description
Cluster
cluster_nodes	Array of Cluster_node. For details about Cluster_node, see Table 2.	Yes	Cluster resource information.
nodes_topology	nodes_topology. For details about nodes_topology, see Table 5.	No	Topology of communication between nodes. For example, the communication topology of the parameter plane between two AI servers.
node_def: public attributes of nodes of the same type in a cluster
node_type	String	Yes	Node type. For example: ATLAS800 for Atlas training products; ATLAS900 for Atlas A2 training products/Atlas A2 inference products.
resource_type	String	Yes	Type of the architecture and CPU resource that can be deployed by UDF. The value can be X86 or Aarch.
support_links	String	Yes	Communication mode of a node. For example, [HCCS,PCIE,ROCE].
item_type	String	Yes	Type of the accelerator card on a node. For example: Atlas training products scenario: Ascend xxx. Atlas A2 training products/Atlas A2 inference products scenario: Ascend xxx B1, Ascend xxx B2, Ascend xxx B3, and Ascend xxx B4.
inter_item_memory_access_mode	String	Yes	Mutual access mode between addresses of different devices in the server. Value: NUMA or UMA.
h2d_bw	String	No	Bandwidth between the host and device, for example, PCIE:100Gb.
item_topology	item_topology. For details about item_topology, see Table 6.	No	Link information between different accelerator cards on a node. In the Atlas A2 training products/Atlas A2 inference products cloud scenario, 1, 2, 4, or 8 accelerator cards may be provisioned. If there is only one card, the multi-card topology is not involved.
item_def: public attributes of accelerator cards of the same type on a node
item_type	String	Yes	Type of the accelerator card on a node. For example: Atlas training products scenario: Ascend xxx. Atlas A2 training products/Atlas A2 inference products scenario: Ascend xxx B1, Ascend xxx B2, Ascend xxx B3, and Ascend xxx B4.
memory	String	Yes	Total memory of the chip. Atlas training products scenario: Set it to [HBM:64GB]. Atlas A2 training products/Atlas A2 inference products scenario: Set it to [DDR:64GB].
aic_type	String	Yes	Computation core type and number of cores of the accelerator card. For example, [DAVINCI_V100:32].
resource_type	String	Yes	Type of the Ascend resource that can be deployed by UDF. Set it to Ascend.
links_mode	String	No	Link topology of a chip. It does not need to be configured in the Atlas training products scenario.
device_list	Array of device_info. For details about device_info, see Table 9.	No	Information about the physical devices contained in the chip. It does not need to be configured in the Atlas training products scenario.
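
The three sections of Table 1 refer to one another by name: the example comments state that a node's node_type is used to index node_def, and node_def and item_def share the item_type field. The following Python sketch illustrates that lookup on an already-parsed configuration. It is a minimal illustration, not the loader used by the runtime: the function name `resolve_node` is hypothetical, treating item_type as the key into item_def is an assumption, and the code assumes the flat cluster_nodes layout in which node fields sit directly in each array element.

```python
def resolve_node(config, node_id):
    """Return (node, node_def, item_def) for the node with the given node_id.

    Hypothetical helper: assumes node_type and item_type act as lookup keys.
    """
    nodes = [n for c in config["cluster"] for n in c["cluster_nodes"]]
    node = next(n for n in nodes if n["node_id"] == node_id)
    node_def = next(d for d in config["node_def"]
                    if d["node_type"] == node["node_type"])
    item_def = next(d for d in config["item_def"]
                    if d["item_type"] == node_def["item_type"])
    return node, node_def, item_def


# Trimmed-down configuration using values from the examples in this section.
config = {
    "cluster": [{"cluster_nodes": [
        {"node_id": 0, "node_type": "ATLAS900", "ipaddr": "x.x.x.x",
         "port": 2509, "item_list": [{"item_id": 0, "device_id": 0}]},
    ]}],
    "node_def": [{"node_type": "ATLAS900", "item_type": "AscendxxxB1"}],
    "item_def": [{"item_type": "AscendxxxB1", "memory": "[DDR:64GB]"}],
}

node, node_def, item_def = resolve_node(config, 0)
print(node_def["node_type"], item_def["memory"])
```

If a name has no matching definition, `next` raises StopIteration, which makes a dangling reference easy to spot before deployment.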

**Table 2** Cluster_node description
Parameter	Type	Required	Description
node_id	Integer	Yes	ID of a node in a cluster. Generally, 0 indicates the primary node.
node_type	String	Yes	Node type. For example: ATLAS800 for Atlas training products; ATLAS900 for Atlas A2 training products/Atlas A2 inference products.
ipaddr	String	Yes	IP address for communication on the control plane of a node. For example, the IP address of the training server is the host IP address, and the IP address of the SoC server is the head node IP address.
port	Integer	Yes	Port for communication on the control plane of a node.
data_panel	NetworkInfo	No	Data plane communication information.
memory	Integer	No	Available memory on the application host.
is_local	BOOL	No	Whether the node in the file is a local node when a cluster contains multiple nodes.
deploy_res_path	String	No	Path for storing the model file. Default path for the root user: /root/runtime/deploy_res. Default path for a non-root user: ${home}/runtime/deploy_res.
item_list	Array of item_info. For details about item_info, see Table 4.	Yes	Accelerator card that executes the job orchestrated and managed by cloud resources.
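
The default rule for deploy_res_path in Table 2 can be expressed as a small Python sketch. The function name `effective_deploy_res_path` and its parameters are hypothetical; the caller supplies whether the process runs as root and what the home directory is, so the logic can be checked without depending on the current user.

```python
import posixpath


def effective_deploy_res_path(node, is_root, home):
    """Return the node's deploy_res_path, or the documented default if omitted."""
    configured = node.get("deploy_res_path")
    if configured:
        return configured
    # Defaults from Table 2: /root/runtime/deploy_res for the root user,
    # ${home}/runtime/deploy_res for a non-root user.
    base = "/root" if is_root else home
    return posixpath.join(base, "runtime", "deploy_res")


print(effective_deploy_res_path({}, True, "/home/alice"))
print(effective_deploy_res_path({}, False, "/home/alice"))
print(effective_deploy_res_path({"deploy_res_path": "/home/modeldeployer/"},
                                False, "/home/alice"))
```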

**Table 3** NetworkInfo description
Parameter	Type	Required	Description
avail_ports	Array of String	No	Range of data plane communication port numbers available to the application instance on each node. The ports carry the following services: (1) Asynchronous point-to-point communication between flow gateways (FlowGWs). In the Atlas A2 training products/Atlas A2 inference products scenario, the host FlowGW communicates with the device FlowGW through PCIe and does not occupy host OS port numbers, whereas communication between the FlowGWs of devices requires a unique port number to be allocated. (2) Collective communication: point-to-point graph send/recv and ranktbl collective communication. FlowGW communication among devices supports TCP and RDMA. By default, the port numbers among devices range from 16666 to 32767; users can configure this range.
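
The examples in this section write avail_ports entries in two forms: a single port ("10023") and a range ("20000~29999"). The Python sketch below parses both forms into inclusive (low, high) tuples. Treating `~` as an inclusive range separator is an assumption drawn from the example values, and `parse_avail_ports` is a hypothetical helper name.

```python
def parse_avail_ports(entries):
    """Convert avail_ports strings into a list of inclusive (low, high) tuples."""
    ranges = []
    for entry in entries:
        low, sep, high = entry.partition("~")
        start = int(low)
        end = int(high) if sep else start  # a single port is a one-element range
        if not (0 < start <= end <= 65535):
            raise ValueError(f"invalid port range: {entry!r}")
        ranges.append((start, end))
    return ranges


print(parse_avail_ports(["10023", "20000~29999"]))
# [(10023, 10023), (20000, 29999)]
```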

**Table 4** Item_info description
Parameter	Type	Required	Description
item_id	Integer	Yes	Logical ID of the accelerator card in a node.
device_id	Integer	Yes	Physical ID of the accelerator card in a node.
ipaddr	String	No	IP address or network segment used for communication.

**Table 5** nodes_topology description
Parameter	Type	Required	Description
type	String	Yes	Type of the topology, based on which the communication library determines the communication algorithm. For example, star.
protocol	String	No	Protocol of the communication between nodes. The default value is RDMA. The TCP configuration capability is reserved. This parameter does not need to be configured if only a single server is used. Example: RDMA:200Gb.
topos	Array of plane. For details about the plane, see Table 8.	Yes	Plane information of the accelerator card links between nodes.
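
The protocol example above (RDMA:200Gb) pairs a name with a bandwidth; the links_mode and h2d_bw examples in this section (HCCS:128Gb, PCIE:100Gb) have the same shape, and the JSON examples also use a bare name such as HCCS. The following Python sketch splits such a string. The helper name `parse_link_spec` is hypothetical, and the accepted grammar (an optional `:<number>Gb` suffix) is inferred from the example values, not from a formal specification.

```python
def parse_link_spec(spec):
    """Split 'NAME:<n>Gb' into (name, n); a bare 'NAME' yields (name, None)."""
    name, sep, bandwidth = spec.partition(":")
    name = name.strip()
    if not sep:
        return name, None
    bandwidth = bandwidth.strip()
    if not bandwidth.endswith("Gb"):
        raise ValueError(f"unsupported bandwidth unit in {spec!r}")
    return name, int(bandwidth[:-2])


print(parse_link_spec("RDMA:200Gb"))  # ('RDMA', 200)
print(parse_link_spec("HCCS"))        # ('HCCS', None)
```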

**Table 6** item_topology description
Parameter	Type	Required	Description
links_mode	String	Yes	Communication mode and bandwidth between different accelerator cards on a node. For example, HCCS:128Gb.
links	Array of item_pair. For details about item_pair, see Table 7.	Yes	Link topology between different accelerator cards on a node. For example, [0, 1].

**Table 7** item_pair description
Parameter	Type	Required	Description
item_id	Integer	Yes	Accelerator card ID.
pair_item_id	Integer	Yes	ID of the accelerator card paired with item_id.
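
In the JSON examples in this section, each item_pair is written as a two-element array such as [0, 1]. The Python sketch below turns a links list in that form into a per-card neighbor table, which is a convenient way to check which cards are directly connected. It assumes links are bidirectional, which this section does not state explicitly, and `build_adjacency` is a hypothetical helper name.

```python
from collections import defaultdict


def build_adjacency(links):
    """Map each item ID to the sorted list of item IDs it is linked to."""
    adjacency = defaultdict(set)
    for a, b in links:
        if a == b:
            raise ValueError(f"self-link is not a valid pair: {[a, b]}")
        adjacency[a].add(b)  # assumption: a link works in both directions
        adjacency[b].add(a)
    return {item: sorted(peers) for item, peers in sorted(adjacency.items())}


links = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
print(build_adjacency(links))
# {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
```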

**Table 8** Plane description
Parameter	Type	Required	Description
plane_id	Integer	Yes	ID of the communication plane between servers (provided by the cloud).
devices	Array of int list	Yes	List of accelerator cards on a plane.
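
The annotated example later in this section notes that different network planes cannot communicate with each other and that an item may belong to one or more planes. The Python sketch below builds a device-to-plane index from a topos array and uses it to check whether two devices share a plane. Both helper names are hypothetical, and the check models only the plane-membership rule, not the routing performed by the communication library.

```python
def planes_by_device(topos):
    """Map each device ID to the list of plane_id values it appears in."""
    membership = {}
    for plane in topos:
        for device in plane["devices"]:
            membership.setdefault(device, []).append(plane["plane_id"])
    return membership


def share_plane(membership, a, b):
    """True if devices a and b appear together in at least one plane."""
    return bool(set(membership.get(a, [])) & set(membership.get(b, [])))


# Plane layout taken from the Atlas training products example.
topos = [
    {"plane_id": 0, "devices": [0, 4]},
    {"plane_id": 1, "devices": [1, 5]},
    {"plane_id": 2, "devices": [2, 6]},
    {"plane_id": 3, "devices": [3, 7]},
]
membership = planes_by_device(topos)
print(share_plane(membership, 0, 4))  # True: both are on plane 0
print(share_plane(membership, 0, 1))  # False: planes 0 and 1 are separate
```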

**Table 9** device_info description
Parameter	Type	Required	Description
device_id	Integer	Yes	Physical device ID of the chip.

Example of numa_config.json (Atlas training products)

The following is an example of numa_config.json. All fields are described in Table 1.

{
  "cluster":[                                             // Level 1: cluster attributes
    {                                                        // Level 2: virtual supernode 0
      "cluster_nodes": [                        // Level 3: cluster -> supernode -> nodes(Server) -> device -> DIE logically
        {
          "node": {
            "node_id": 0,                      // Global server ID of this training
            "node_type": "ATLAS800",           // Server type. Assume that hybrid deployment is supported in the future. This value is used to index node_def.
            "ipaddr": "x.x.x.x",               // Server IP address
            "port": 21,                        // Server port
            "data_panel": {
              "avail_ports": ["10023", "20000~29999"]
            },
            "memory": 25165824,
            "is_local": true,                  // The test environment does not have eth0. Therefore, it cannot be determined.
            "deploy_res_path": "/home/modeldeployer/",  // Path for storing the model file (optional)
            "item_list": [                     // Device list
              {
                "item_id": 0,                  // Logical ID
                "device_id": 0,                // Physical ID
                "ipaddr": "y.y.y.y"            // Device IP address
              },
              {
                "item_id": 1,                  // Logical ID
                "device_id": 1,                // Physical ID
                "ipaddr": "z.z.z.z"            // Device IP address
              }
            ]
          }
        }
      ],
      "nodes_topology": {    // Topology of communication between servers. (The topology of the entire cluster can be deduced by describing a single node.)
        // Not every link between nodes is listed, because:
        //   1) The scale-out between nodes may be huge.
        //   2) It is unlikely that a topology of any shape can be maintained on the cloud.
        // Instead, the topology type of the nodes on the cloud is described, such as star. (In the future, there may be topologies such as torus and dragonfly, each of which will be described separately.)

        "type": "star",
        "protocol": "RDMA:100Gb",
        "topos": [
          {
            "plane_id": 0,        // Network plane ID. Different network planes cannot communicate with each other. For communication, there are multiple link diagrams. Each plane or link diagram contains the topology specified by type.
            "devices": [0, 4]     // Items on the network plane. Each item may belong to the same or different network planes.
          },
          {
            "plane_id": 1,
            "devices": [1, 5]
          },
          {
            "plane_id": 2,
            "devices": [2, 6]
          },
          {
            "plane_id": 3,
            "devices": [3, 7]
          }
        ]
        // "type": "2D-torus". A description table must be added for torus to describe its neighbor nodes. When the cluster scale is large, a large amount of content may be involved.
        // "type": "dragonfly", which can be regarded as the hierarchical overlay of "star" and "fullmesh".
      }
    }
  ],
  
  "node_def": [                              // Level 2: node_def, indicating the public attributes of the server.
    {
      "node_type": "ATLAS800" ,              // Server type
      "resource_type": "Aarch" ,              // Arm architecture and CPU resource type that can be deployed by User-Defined FlowFunction (UDF).
      "support_links": "[HCCS,PCIE,ROCE]" ,  // Link types supported in the server
      "item_type": "Ascendxxx",            // Device type on the server
      "inter_item_memory_access_mode": "[NUMA]", // Mutual access mode between addresses of different devices in the server
      "h2d_bw": "PCIE:100Gb",
      "item_topology": [
        {
          "links_mode": "HCCS",            // Device link mode.
          "links": [ [0,1],[0,2],[0,3],
                     [1,2],[1,3],
                     [2,3],
                     [4,5],[4,6],[4,7],
                     [5,6],[5,7],
                     [6,7] ]               // Device link pairs. This example is 8P.
        }
      ]
    }
  ],
  "item_def": [                           // Level 3: device attributes
    {
      "item_type": "Ascendxxx",           // Device type
      "memory": "[HBM:64GB]",             // Memory resource
      "aic_type": "[DAVINCI_V100:32]",    // Computing resource
      "resource_type": "Ascend"           // Type of the Ascend resource that can be deployed by UDF
    }
  ]
}
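
The // annotations in the example above are explanatory and are not part of standard JSON, so a strict parser such as Python's json module rejects the text as written. Whether the runtime itself accepts such comments is not stated in this section. The following sketch, with the hypothetical helper name `strip_line_comments`, removes // comments while leaving string contents untouched, so that an annotated fragment can be checked with a standard parser.

```python
import json


def strip_line_comments(text):
    """Remove // comments that appear outside JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])  # keep the escaped character as-is
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


sample = '''
{
  "node_id": 0,                               // Global server ID
  "deploy_res_path": "/home/modeldeployer/"   // Path for storing the model file
}
'''
print(json.loads(strip_line_comments(sample)))
# {'node_id': 0, 'deploy_res_path': '/home/modeldeployer/'}
```

A plain regular expression that deletes everything after `//` would also truncate string values containing two consecutive slashes, which is why the sketch tracks whether it is inside a string.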

Example of numa_config.json (Atlas A2 training products/Atlas A2 inference products)

The following is an example of numa_config.json. All fields are described in Table 1.

{
  "cluster":[
    {
      "cluster_nodes": [
        {
          "node_id":0,
          "node_type": "ATLAS900",
          "ipaddr": "10.170.27.156",
          "port":2509,
          "memory":25165824,
          "is_local":true,
          "deploy_res_path": "/home/",
          "item_list" : [
           {
             "item_id":0,
             "device_id":0,
             "ipaddr":"29.89.133.79"
           },
           {
             "item_id":1,
             "device_id":1,
             "ipaddr":"29.89.97.58"
           },
           {
             "item_id":2,
             "device_id":2,
             "ipaddr":"29.89.96.113"
           },
           {
             "item_id":3,
             "device_id":3,
             "ipaddr":"29.89.90.77"
           },
           {
             "item_id":4,
             "device_id":4,
             "ipaddr":"29.89.34.38"
           },
           {
             "item_id":5,
             "device_id":5,
             "ipaddr":"29.89.127.83"
           },
           {
             "item_id":6,
             "device_id":6,
             "ipaddr":"29.89.7.31"
           },
           {
             "item_id":7,
             "device_id":7,
             "ipaddr":"29.89.128.229"
           }
          ]
        }
      ],
      "nodes_topology":
      {
        "protocol": "RDMA:200Gb",
        "type" : "star",
        "topos" : [
          {
            "plane_id" : 0,
            "devices" : [0, 1, 2, 3, 4, 5, 6, 7]
          }
        ]
      }
    }
  ],
  "node_def": [
    {
      "node_type": "ATLAS900" ,
      "resource_type": "Aarch",
      "support_links": "[HCCS,PCIE,ROCE]" ,
      "item_type": "AscendxxxB1",
      "inter_item_memory_access_mode": "[NUMA]",
      "h2d_bw": "PCIE:100Gb",
      "item_topology": [
        {
          "links_mode": "HCCS",
          "links": [ [0,1],[0,2],[0,3],[0,4],[0,5],[0,6],[0,7],
                   [1,2],[1,3],[1,4],[1,5],[1,6],[1,7], 
                   [2,3],[2,4],[2,5],[2,6],[2,7],
                   [3,4],[3,5],[3,6],[3,7],
                   [4,5],[4,6],[4,7],
                   [5,6],[5,7],
                   [6,7]]
        }
      ]
    }
  ],
  "item_def": [
    {
      "item_type": "AscendxxxB1",
      "resource_type": "Ascend",
      "memory": "[DDR:64GB]",
      "aic_type": "[DAVINCI_V100:32]"
    }
  ]
}
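
The links list in the example above contains every unordered pair of the eight cards (28 pairs), whereas the Atlas training products example links cards 0 to 3 and cards 4 to 7 as two separate groups (12 pairs). Lists like these are tedious to write by hand, so the Python sketch below generates them. The helper name `full_mesh_links` is hypothetical, and the sketch only reproduces the pair lists shown in the examples; it makes no claim about which topologies a given product supports.

```python
from itertools import combinations


def full_mesh_links(item_ids):
    """Return every unordered pair of item IDs as [a, b] with a < b."""
    return [list(pair) for pair in combinations(sorted(item_ids), 2)]


# Eight cards, fully interconnected (as in the Atlas A2 example).
links_8p = full_mesh_links(range(8))
print(len(links_8p))   # 28
print(links_8p[:3])    # [[0, 1], [0, 2], [0, 3]]

# Two groups of four cards (as in the Atlas training products example).
links_2x4 = full_mesh_links(range(0, 4)) + full_mesh_links(range(4, 8))
print(len(links_2x4))  # 12
```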
