TaskD

Application Scenario

Foundation model training and inference jobs can encounter faults and performance degradation while they run. TaskD provides status monitoring and control for training and inference jobs on Ascend devices.

In the current version, TaskD provides two service flows: (1) fast fault recovery in the PyTorch and MindSpore scenarios, and (2) training service O&M management. The two service flows currently use different installation and deployment mechanisms and have different upstream and downstream dependencies. Future versions will unify them under a single mechanism.

Component Architecture

Figure 1 Software architecture

Where:

TaskD Manager: controls the service status by managing other TaskD modules.
TaskD Proxy: forwards messages. As the message proxy in each container, it sends messages to TaskD Manager.
TaskD Agent: manages processes, including the lifetime of the service process.
TaskD Worker: manages services, including the status of the service process.
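
The forwarding role of TaskD Proxy can be pictured with a toy model. The sketch below is illustrative only and is not TaskD's API: the classes `Message`, `Manager`, and `Proxy`, their fields, and the idea that senders in a container hand messages to the local proxy are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str  # hypothetical label such as "agent" or "worker"
    body: str


@dataclass
class Manager:
    # Messages received so far, stored as (container_id, message) pairs.
    inbox: list = field(default_factory=list)

    def receive(self, container_id, message):
        self.inbox.append((container_id, message))


@dataclass
class Proxy:
    # One proxy per container; it only relays messages to the manager.
    container_id: str
    manager: Manager

    def forward(self, message):
        self.manager.receive(self.container_id, message)


if __name__ == "__main__":
    manager = Manager()
    proxy = Proxy(container_id="container-0", manager=manager)
    proxy.forward(Message(sender="worker", body="status report"))
    print(len(manager.inbox))
```

In this model the manager sees every message tagged with the container it came from, which is what lets a single manager track the state of many containers.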

Component Function

Functions of each component in service flow 1
- Manage processes on Ascend devices under the PyTorch or MindSpore framework, that is, stop and restart training processes when a software or hardware fault occurs.
- Connect to the control plane of the Kubernetes cluster and manage training job status through that control plane.
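
The stop-and-restart behavior in the first item can be sketched as a small supervisor loop. This is a minimal sketch and not TaskD's implementation: the function `run_with_restart`, the `max_restarts` parameter, and the use of a non-zero exit code as the fault signal are assumptions for illustration.

```python
import subprocess
import sys


def run_with_restart(cmd, max_restarts=3):
    """Run cmd and restart it after a non-zero exit, up to max_restarts times.

    Returns (last_exit_code, restarts_used).
    """
    restarts = 0
    while True:
        proc = subprocess.run(cmd)
        # Stop on success, or when the restart budget is exhausted.
        if proc.returncode == 0 or restarts >= max_restarts:
            return proc.returncode, restarts
        restarts += 1


if __name__ == "__main__":
    # Hypothetical stand-in for a training process that always fails.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    code, used = run_with_restart(failing, max_restarts=2)
    print(code, used)
```

A real recovery flow would also need to decide whether the fault is recoverable before restarting. The sketch restarts unconditionally to keep the loop visible.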

Functions of each component in service flow 2
- Provide lightweight profiling of training jobs and collect profiling data through the cluster control plane.
- Provide link failover and switchover, and online stress testing.

Upstream and Downstream Dependencies

Dependency description in service flow 1
- The MindCluster cluster scheduling components write information such as the device status and training status to a ConfigMap through Kubernetes and map it into the container. The ConfigMap is named reset-config-<Job name>.
- The MindCluster cluster scheduling components write the training status detection instruction to a ConfigMap through Kubernetes and map it into the container.
- TaskD Manager obtains the device status and training job status of the current training container from the ConfigMap.
- TaskD Manager connects to the Kubernetes cluster control plane and performs training management through that control plane.
Figure 2 Upstream and downstream dependencies in service flow 1
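
The ConfigMap naming rule and the way a process inside the container could read the mapped data can be sketched as follows. This is a hedged example and not TaskD code: the helper names are hypothetical, the mount directory is whatever the job specification chooses, and the keys and value format inside the ConfigMap are not described here, so the sketch returns raw text only.

```python
from pathlib import Path


def reset_configmap_name(job_name):
    """Build the ConfigMap name from the documented pattern reset-config-<Job name>."""
    return f"reset-config-{job_name}"


def read_mounted_configmap(mount_dir):
    """Return {key: text} for each key file in a ConfigMap volume mount."""
    result = {}
    for entry in sorted(Path(mount_dir).iterdir()):
        # Kubernetes keeps hidden bookkeeping entries such as ..data in the
        # mount directory; they are not ConfigMap keys, so skip them.
        if entry.name.startswith("."):
            continue
        if entry.is_file():
            result[entry.name] = entry.read_text(encoding="utf-8")
    return result
```

Because Kubernetes refreshes a mounted ConfigMap when it changes, a reader like this would typically be called repeatedly rather than once at startup.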

Dependency description in service flow 2
- TaskD Worker obtains the instruction for enabling training detection of the current job through ConfigMap.
- TaskD Manager obtains the instruction for enabling training detection of the current job through gRPC.
Figure 3 Upstream and downstream dependencies in service flow 2
