taskd.python.toolkit.recover_module.recover_manager.DLRecoverManager (for Internal Use Only)
Description
The DLRecoverManager class provides APIs for process-level recovery and process-level online recovery. The client is imported into the client code as a Python package.
The APIs provided by the DLRecoverManager class may raise exceptions, which the API caller must catch and handle.
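Because the source does not list the concrete exception types, the following is only a minimal sketch of how a caller might guard these calls. The helper `call_recover_api` and the stand-in function `failing_register` are hypothetical names introduced for illustration; they are not part of the DLRecoverManager API.

```python
import logging

logger = logging.getLogger("recover_client")


def call_recover_api(api, *args, **kwargs):
    """Call an API method and return a (result, error) tuple.

    Hypothetical helper: the exception types are not documented,
    so this sketch catches Exception broadly and reports it.
    """
    try:
        return api(*args, **kwargs), None
    except Exception as exc:
        logger.error("recover API %s failed: %s", getattr(api, "__name__", api), exc)
        return None, exc


def failing_register(request):
    """Stand-in for a real API call, used only to exercise the wrapper."""
    raise ConnectionError("server unreachable")


result, error = call_recover_api(failing_register, {"jobId": "job-1"})
```

Returning the error instead of re-raising lets the caller decide whether to retry, abort, or fall back; a caller that prefers exceptions can simply use a plain `try`/`except` around each call.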
__init__(self, info: pb.ClientInfo, server_addr: str)
This API is used to construct DLRecoverManager for subsequent communication.
| Parameter | Type | Description |
|---|---|---|
| info | pb.ClientInfo | info.jobId: job ID of the string type. info.role: client role of the string type. |
| server_addr | string | Server IP address |
register(self, request: pb.ClientInfo) -> pb.Status
This API is used to register the client. The server initializes the job specified by request before recovery.
| Parameter | Type | Description |
|---|---|---|
| request | pb.ClientInfo | request.jobId: job ID of the string type. request.role: client role of the string type. |
| Type | Description |
|---|---|
| Status | Status.info: return information of the string type. Status.code: return code of the integer type. The value 0 indicates success, and other values indicate failure. For details, see Return Codes. |
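The return-code convention above (0 for success, any other value for failure) can be sketched as follows. Since the real `pb` module and server are not available here, `ClientInfo`, `Status`, and `StubRecoverManager` are hypothetical stand-ins that only mirror the documented field names and signatures; the role value, the address, and the non-zero code 1 are placeholders, not documented values.

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:  # hypothetical stand-in for pb.ClientInfo
    jobId: str
    role: str


@dataclass
class Status:  # hypothetical stand-in for pb.Status
    code: int
    info: str


class StubRecoverManager:
    """Stand-in that mimics the documented __init__ and register signatures."""

    def __init__(self, info: ClientInfo, server_addr: str):
        self.info = info
        self.server_addr = server_addr

    def register(self, request: ClientInfo) -> Status:
        # The real server initializes the job; this stub only checks the job ID.
        if not request.jobId:
            return Status(code=1, info="empty jobId")  # placeholder failure code
        return Status(code=0, info="registered")


def register_or_raise(manager, request) -> Status:
    """Register the client and raise if the return code is non-zero."""
    status = manager.register(request)
    if status.code != 0:
        raise RuntimeError(f"register failed: code={status.code}, info={status.info}")
    return status


info = ClientInfo(jobId="job-1", role="worker")   # placeholder values
manager = StubRecoverManager(info, "127.0.0.1")   # placeholder address
status = register_or_raise(manager, info)
```

Checking `Status.code` in one place keeps the failure path uniform; the meaning of each non-zero value is defined in Return Codes, not in this sketch.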
start_subscribe(self, frame: str = "pytorch")
This API is used to establish a persistent gRPC connection between the client and server. The server uses this connection for one-way communication to the client. For example, when a fault occurs, the server sends a training suspension notification and global fault rank information to the client.
| Parameter | Type | Description |
|---|---|---|
| frame | string | AI framework used by a job |
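To illustrate the one-way direction of this connection, the sketch below models the server push as a plain Python generator standing in for a gRPC server-streaming response. The message shape `ServerSignal`, its fields `action` and `fault_ranks`, and the action names are all assumptions made for illustration; the real protobuf messages are not described in this section.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class ServerSignal:  # hypothetical message shape, not the real protobuf
    action: str
    fault_ranks: List[int] = field(default_factory=list)


def fake_server_stream() -> Iterator[ServerSignal]:
    """Stand-in for the iterator a server-streaming gRPC call would return."""
    yield ServerSignal(action="keepalive")
    yield ServerSignal(action="pause_training", fault_ranks=[3, 7])


def consume_signals(stream: Iterable[ServerSignal]) -> List[int]:
    """Read signals pushed by the server and collect the reported fault ranks.

    The client only reads from the stream; it never sends on this channel.
    """
    faulty: List[int] = []
    for signal in stream:
        if signal.action == "pause_training":
            faulty.extend(signal.fault_ranks)
    return faulty


faulty_ranks = consume_signals(fake_server_stream())
```

In this model the client reacts to whatever the server pushes, which matches the description of the server notifying the client of a suspension and the global fault ranks.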
init_clusterd(self)
The client uses this API to initialize the ClusterD server status to ensure that subsequent jobs can be registered and links can be established.
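Since init_clusterd prepares the server for later registration and link establishment, a plausible call order is init_clusterd, then register, then start_subscribe. The sketch below records that order with a hypothetical stand-in class `RecordingManager`; the ordering is inferred from the descriptions in this section and is not stated explicitly as a requirement.

```python
class RecordingManager:
    """Hypothetical stand-in that records the order of API calls."""

    def __init__(self):
        self.calls = []

    def init_clusterd(self):
        self.calls.append("init_clusterd")

    def register(self, request):
        self.calls.append("register")

    def start_subscribe(self, frame: str = "pytorch"):
        self.calls.append(f"start_subscribe:{frame}")


def bring_up(manager, request, frame: str = "pytorch"):
    """Run the assumed start-up sequence on any object with the three methods."""
    manager.init_clusterd()         # initialize the ClusterD server status first
    manager.register(request)       # then register the client for the job
    manager.start_subscribe(frame)  # finally open the persistent connection


recorder = RecordingManager()
bring_up(recorder, {"jobId": "job-1", "role": "worker"})  # placeholder request
```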