mindx_elastic.recover_manager.DLRecoverManager (for Internal Use Only)
The DLRecoverManager class provides APIs for process-level recovery and process-level online recovery. The client is imported into user code as a Python package.
The APIs provided by the DLRecoverManager class may raise exceptions, which the API caller must catch and handle.
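Because the exception types are not listed here, one cautious way to meet this requirement is to wrap each call in a small helper. The following is a minimal sketch only: `call_safely` is a hypothetical helper, not part of mindx_elastic, and catching the broad `Exception` class is an assumption made because no specific exception types are documented.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_safely(api: Callable[[], T], name: str) -> Optional[T]:
    """Run one API call; log and swallow any exception it raises.

    Hypothetical helper for illustration. Returns the call's result,
    or None if the call raised.
    """
    try:
        return api()
    except Exception as exc:  # assumption: specific exception types are not documented
        logger.error("%s failed: %s", name, exc)
        return None
```

With a constructed manager, a call might then look like `status = call_safely(lambda: manager.register(info), "register")`, followed by a check for `None` before the result is used.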
__init__(self, info: pb.ClientInfo, server_addr: str, secure_conn: bool = True, cert_path: str = "")
This API is used to construct a DLRecoverManager instance for subsequent communication with the server.
| Parameter | Type | Description |
|---|---|---|
| info | pb.ClientInfo | info.ip: client IP of the string type (reserved). info.port: client port of the string type (reserved). info.taskId: task ID of the string type. info.role: client role of the string type. |
| server_addr | String | Server IP address. |
register(self, request: pb.ClientInfo) -> pb.Status
This API is used to register the client. The server initializes the task specified by request before recovery.
| Parameter | Type | Description |
|---|---|---|
| request | pb.ClientInfo | request.ip: client IP of the string type (reserved). request.port: client port of the string type (reserved). request.taskId: task ID of the string type. request.role: client role of the string type. |
| Type | Description |
|---|---|
| Status | Status.info: return information of the string type. Status.code: return code of the integer type. The value 0 indicates success, and other values indicate failure. For details, see Return Codes. |
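The return-code convention above (0 for success, any other value for failure) can be checked with a short helper. This is a sketch under stated assumptions: `FakeStatus` is a stand-in for `pb.Status` that carries only the two documented fields, and `RegisterError` and `ensure_ok` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeStatus:
    """Stand-in for pb.Status (assumption: only the documented fields are modeled)."""
    code: int
    info: str


class RegisterError(RuntimeError):
    """Hypothetical error raised when a returned Status reports failure."""


def ensure_ok(status) -> None:
    """Raise RegisterError unless status.code is 0, the documented success value."""
    if status.code != 0:
        raise RegisterError(f"call failed with code {status.code}: {status.info}")
```

Raising on a non-zero code keeps the failure visible to the caller, who can then look up the specific value in Return Codes.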
start_subscribe(self)
This API is used to establish a persistent gRPC connection between the client and server, which the server uses for one-way communication to the client. For example, when a fault occurs, the server sends a training suspension notification and global fault rank information to the client.
init_clusterd(self)
The client uses this API to initialize the ClusterD server status to ensure that subsequent jobs can be registered and links can be established.
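The descriptions above suggest one plausible call order: initialize ClusterD, register the client, then open the subscription. The sketch below illustrates that order with a stub. The ordering is an inference from the text, not a confirmed requirement, and `StubRecoverManager`, `StubStatus`, and `bring_up` are hypothetical names that only mimic the documented method names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubStatus:
    """Stand-in for pb.Status with the documented code and info fields."""
    code: int
    info: str


@dataclass
class StubRecoverManager:
    """Stub that records calls; it does not talk to a real server."""
    calls: List[str] = field(default_factory=list)

    def init_clusterd(self) -> None:
        self.calls.append("init_clusterd")

    def register(self, request) -> StubStatus:
        self.calls.append("register")
        return StubStatus(code=0, info="ok")

    def start_subscribe(self) -> None:
        self.calls.append("start_subscribe")


def bring_up(manager, client_info) -> bool:
    """Assumed order: init_clusterd, then register, then start_subscribe."""
    manager.init_clusterd()
    status = manager.register(client_info)
    if status.code != 0:  # non-zero means failure per the return-code convention
        return False
    manager.start_subscribe()
    return True
```

Stopping before `start_subscribe` when registration fails is a design choice of this sketch, not documented behavior.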