taskd.python.toolkit.recover_module.recover_manager.DLRecoverManager (for Internal Use Only)

Description

The DLRecoverManager class provides APIs for process-level recovery and process-level online recovery. The client is imported into client code as a Python package.

The APIs provided by the DLRecoverManager class may raise exceptions, which the API caller must catch and handle.
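
Because any of these APIs may raise, a caller can route each call through a small guard. The following is a minimal sketch and not part of the toolkit: `call_guarded` and `flaky_register` are hypothetical names used only for illustration, and the source does not list specific exception types, so the sketch catches `Exception` broadly. A real caller would pass a bound method such as `manager.register` instead of the stand-in.

```python
import logging
from typing import Any, Callable, Optional, Tuple

logger = logging.getLogger("recover_client")


def call_guarded(
    api: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Optional[Any], Optional[Exception]]:
    """Call api and return (result, None) on success or (None, error) on failure."""
    try:
        return api(*args, **kwargs), None
    except Exception as exc:  # assumption: exception types are not documented
        logger.error("%s failed: %s", getattr(api, "__name__", repr(api)), exc)
        return None, exc


# Hypothetical stand-in for a DLRecoverManager method that fails.
def flaky_register(request: dict) -> None:
    raise ConnectionError("server unreachable")


result, error = call_guarded(flaky_register, {"jobId": "job-1", "role": "worker"})
```

Returning the error instead of re-raising lets the training loop decide whether to retry, fall back, or abort.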

__init__(self, info: pb.ClientInfo, server_addr: str)

This API constructs a DLRecoverManager instance for subsequent communication with the server.

Table 1 Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| info | pb.ClientInfo | info.jobId: job ID of the string type. info.role: client role of the string type. |
| server_addr | string | Server IP address |

register(self, request: pb.ClientInfo) -> pb.Status

This API is used to register the client. The server initializes the job specified by request before recovery.

Table 2 Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| request | pb.ClientInfo | request.jobId: job ID of the string type. request.role: client role of the string type. |

Table 3 Return Value

| Type | Description |
| --- | --- |
| pb.Status | Status.info: return information of the string type. Status.code: return code of the integer type. The value 0 indicates success, and other values indicate failure. For details, see Return Codes. |
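
Since a non-zero Status.code indicates failure, callers may want to turn it into an exception at the call site. The sketch below shows one way to do that. It is an illustration only: the `Status` dataclass is a stand-in for pb.Status that carries just the two documented fields, and `RecoverApiError` and `ensure_ok` are hypothetical names, not toolkit APIs.

```python
from dataclasses import dataclass


@dataclass
class Status:
    """Hypothetical stand-in for pb.Status with the two documented fields."""
    code: int
    info: str


class RecoverApiError(RuntimeError):
    """Hypothetical error type raised for a non-zero return code."""

    def __init__(self, status: Status) -> None:
        super().__init__(f"return code {status.code}: {status.info}")
        self.status = status


def ensure_ok(status: Status) -> Status:
    """Return status unchanged when code is 0, otherwise raise."""
    if status.code != 0:
        raise RecoverApiError(status)
    return status


ok = ensure_ok(Status(code=0, info="registered"))
```

With the real class, the call would look like `ensure_ok(manager.register(request))`, assuming the returned object exposes `code` and `info` as described in Table 3.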

start_subscribe(self, frame: str = "pytorch")

This API establishes a persistent gRPC connection between the client and the server. The server uses this connection for one-way communication to the client. For example, when a fault occurs, the server sends training suspension information and global fault rank information to the client.

Table 4 Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| frame | string | AI framework used by a job |

init_clusterd(self)

The client uses this API to initialize the ClusterD server status so that subsequent jobs can be registered and connections can be established.
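
The APIs on this page can be combined into a start-up sequence. The sketch below is one possible ordering, inferred from the description of init_clusterd (initialize first, then register, then subscribe); the source does not prescribe it. `FakeRecoverManager`, `ClientInfo`, `Status`, and `bring_up` are hypothetical stand-ins that mirror only the documented method names and fields, and the job ID, role, and address values are placeholders. Whether start_subscribe blocks the caller is not stated in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClientInfo:
    """Hypothetical stand-in for pb.ClientInfo."""
    jobId: str
    role: str


@dataclass
class Status:
    """Hypothetical stand-in for pb.Status."""
    code: int
    info: str


class FakeRecoverManager:
    """Stand-in that records calls; mirrors the documented method names only."""

    def __init__(self, info: ClientInfo, server_addr: str) -> None:
        self.info = info
        self.server_addr = server_addr
        self.calls: List[str] = []

    def init_clusterd(self) -> None:
        self.calls.append("init_clusterd")

    def register(self, request: ClientInfo) -> Status:
        self.calls.append("register")
        return Status(code=0, info="ok")

    def start_subscribe(self, frame: str = "pytorch") -> None:
        self.calls.append(f"start_subscribe:{frame}")


def bring_up(manager, info: ClientInfo, frame: str = "pytorch") -> Status:
    """Initialize ClusterD state, register the job, then open the subscription."""
    manager.init_clusterd()
    status = manager.register(info)
    if status.code != 0:
        # Do not subscribe for a job the server failed to register.
        return status
    manager.start_subscribe(frame)
    return status


info = ClientInfo(jobId="job-1", role="worker")  # placeholder values
manager = FakeRecoverManager(info, "127.0.0.1")  # placeholder address
status = bring_up(manager, info)
```

Because `bring_up` only relies on the three method names, the same function could be handed a real DLRecoverManager instance, provided the ordering assumption holds for the deployment.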