Adapting to MindCluster

MindIO TFT is delivered as an SDK and has no resident processes. Its services start when a training process starts and exit when the training process finishes.

In a MindCluster deployment, MindCluster manages the Kubernetes containers. The installation and adaptation process inside a Kubernetes container is the same as on bare metal.

Procedure

  • If the Python environment is not installed on shared storage, you can build the MindIO TFT SDK into the container image to simplify use in large clusters. The SDK is then installed automatically in every pod created from that image.
  • Heartbeat packets are transmitted between the Controller and Processor modules of the MindIO TFT service. If network isolation is implemented in Kubernetes, you need to declare the communication port in the YAML file used to create the pod.
    The following uses the pod.yaml file as an example.
    1. Open the pod.yaml file.
      vim pod.yaml
    2. Press i to enter insert mode and add the following content to the container definition:
      ports:
        - containerPort: 8000    # Communication port between the Controller and Processor of the MindIO TFT service
          name: ttp-port
    3. Press Esc, type :wq!, and press Enter to save the changes and exit.
  • To adapt to the Kubernetes network, modify the pre-training script of 2 as follows:
    # Comment out the following two lines. These environment variables are configured by MindCluster.
    # MASTER_ADDR=$(hostname -I | awk '{print $1}')
    # MASTER_PORT=XXXX
    
    # MASTER_ADDR and MASTER_PORT are obtained from Kubernetes (MASTER_ADDR is the K8s service network IP address).
    CONTROLLER_ADDR=$(hostname -I | awk '{print $1}')
    PROCESSOR_ADDR=${MASTER_ADDR}
    export CONTROLLER_ADDR
    export PROCESSOR_ADDR
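
The address assignment above depends on MASTER_ADDR already being present in the container environment. As a minimal sketch, assuming a Bash pre-training script, the same assignment can be wrapped in a function that stops early when the variable is missing. The function name resolve_tft_addrs and the LOCAL_ADDR_OVERRIDE variable are hypothetical and are not part of MindIO TFT or MindCluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MindIO TFT): sets the two addresses used
# by the pre-training script and fails early if MASTER_ADDR is missing.
resolve_tft_addrs() {
    # Assumption: MindCluster injects MASTER_ADDR into the container environment.
    if [ -z "${MASTER_ADDR:-}" ]; then
        echo "MASTER_ADDR is not set; check the MindCluster job configuration." >&2
        return 1
    fi

    # First IP address reported for this container, as in the script above.
    # LOCAL_ADDR_OVERRIDE is a hypothetical variable for testing or for hosts
    # where "hostname -I" is unavailable.
    CONTROLLER_ADDR="${LOCAL_ADDR_OVERRIDE:-$(hostname -I | awk '{print $1}')}"
    PROCESSOR_ADDR="${MASTER_ADDR}"
    export CONTROLLER_ADDR PROCESSOR_ADDR
}
```

In the script, a call such as "resolve_tft_addrs || exit 1" placed before the training command would stop the job with a clear message instead of starting with an empty PROCESSOR_ADDR.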