mindx_elastic.recover_manager.DLRecoverManager (for Internal Use Only)

The DLRecoverManager class provides APIs for process-level recovery and process-level online recovery. The client is imported into user code as a Python package.

The APIs provided by the DLRecoverManager class may raise exceptions, which the caller must catch and handle.
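Because every API may raise, one option is to route calls through a small guard. The following is a minimal sketch, not part of mindx_elastic: call_guarded, the logger name, and _failing_api are hypothetical, and a broad Exception is caught only because the specific exception types are not listed here.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("recover_client")  # hypothetical logger name


def call_guarded(api: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[Any]:
    """Invoke an API and handle any exception it raises.

    Returns the API's result, or None if the call raised.
    """
    try:
        return api(*args, **kwargs)
    except Exception as exc:  # specific exception types are not documented
        logger.error("%s failed: %s", getattr(api, "__name__", repr(api)), exc)
        return None


# Stand-in for a manager method that fails. A real caller would pass a bound
# method such as manager.register together with its arguments.
def _failing_api() -> None:
    raise ConnectionError("server unreachable")


result = call_guarded(_failing_api)  # the error is logged and None is returned
```

Returning None keeps the sketch short. A caller that needs to distinguish failure modes would probably re-raise or map the exception to its own error type instead.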

__init__(self, info: pb.ClientInfo, server_addr: str, secure_conn: bool = True, cert_path: str = "")

This API constructs a DLRecoverManager instance for subsequent communication with the server.

Table 1 Parameters

Parameter    Type           Description
info         pb.ClientInfo  info.ip: client IP of the string type (reserved).
                            info.port: client port of the string type (reserved).
                            info.taskId: task ID of the string type.
                            info.role: client role of the string type.
server_addr  String         Server IP address.
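As an illustration of how the documented fields fit together, the sketch below assembles the constructor arguments. It uses a dataclass as a stand-in for pb.ClientInfo so that it runs without the package installed. The helper constructor_args, the sample values, and leaving the reserved fields empty are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:
    """Stand-in for pb.ClientInfo with the four documented string fields."""
    ip: str = ""      # reserved
    port: str = ""    # reserved
    taskId: str = ""
    role: str = ""


def constructor_args(task_id: str, role: str, server_addr: str) -> dict:
    """Collect the documented constructor arguments as keyword arguments."""
    info = ClientInfo(taskId=task_id, role=role)
    return {"info": info, "server_addr": server_addr}


# Hypothetical values; valid role names and address formats are not listed here.
args = constructor_args("task-001", "worker", "127.0.0.1")
# With the real package installed, these would be passed as
# DLRecoverManager(**args), leaving secure_conn and cert_path at the
# defaults shown in the signature.
```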

register(self, request: pb.ClientInfo) -> pb.Status

This API is used to register the client. The server initializes the task specified by request before recovery.

Table 2 Parameters

Parameter  Type           Description
request    pb.ClientInfo  request.ip: client IP of the string type (reserved).
                          request.port: client port of the string type (reserved).
                          request.taskId: task ID of the string type.
                          request.role: client role of the string type.

Table 3 Return value

Type    Description
Status  Status.info: return information of the string type.
        Status.code: return code of the integer type. The value 0 indicates success, and other values indicate failure. For details, see Return Codes.
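A caller will typically want to turn a non-zero Status.code into an error. The sketch below shows one way to do that. Status here is a dataclass stand-in for pb.Status, and RecoverApiError and ensure_ok are hypothetical names that are not part of the package.

```python
from dataclasses import dataclass


@dataclass
class Status:
    """Stand-in for pb.Status with the two documented fields."""
    info: str
    code: int


class RecoverApiError(RuntimeError):
    """Hypothetical exception raised for a non-zero return code."""


def ensure_ok(status: Status) -> Status:
    """Return the status unchanged if code is 0, otherwise raise."""
    if status.code != 0:
        raise RecoverApiError(f"call failed with code {status.code}: {status.info}")
    return status


ok = ensure_ok(Status(info="registered", code=0))  # passes through unchanged
```

The helper reads only the info and code attributes, so the same check should apply to a real pb.Status object.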

start_subscribe(self)

This API establishes a persistent gRPC connection between the client and the server, over which the server sends one-way messages to the client. For example, when a fault occurs, the server sends training suspension information and global fault rank information to the client.

init_clusterd(self)

The client uses this API to initialize the ClusterD server status, ensuring that subsequent jobs can be registered and links can be established.
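Since init_clusterd prepares the server for later registration and link establishment, one plausible startup order is init_clusterd, then register, then start_subscribe. The sketch below encodes that order against a test double. The order of the last two calls is an assumption, as is whether start_subscribe returns immediately, and bring_up and FakeManager are hypothetical names.

```python
from types import SimpleNamespace


def bring_up(manager, client_info):
    """Initialize ClusterD, register the client, then open the subscription."""
    manager.init_clusterd()
    status = manager.register(client_info)
    if status.code != 0:  # 0 indicates success
        raise RuntimeError(f"register failed with code {status.code}: {status.info}")
    manager.start_subscribe()
    return status


class FakeManager:
    """Hypothetical test double that records the order of calls."""

    def __init__(self):
        self.calls = []

    def init_clusterd(self):
        self.calls.append("init_clusterd")

    def register(self, request):
        self.calls.append("register")
        return SimpleNamespace(info="ok", code=0)

    def start_subscribe(self):
        self.calls.append("start_subscribe")


fake = FakeManager()
# Sample field values are hypothetical; ip and port are reserved fields.
info = SimpleNamespace(ip="", port="", taskId="task-001", role="worker")
status = bring_up(fake, info)
```

With the real class, bring_up would receive a DLRecoverManager instance and a pb.ClientInfo message, and the calls should still be wrapped in exception handling because each API may raise.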