SwitchNicTrack
Description
Receives a link failover or switchback request from the O&M platform and forwards the corresponding operation to the specified devices on the specified nodes of a training job. Call this API only after the training job is running and has completed training iterations successfully, which ensures that the job has been registered with ClusterD. This API performs a manual O&M operation. If link failover or switchback fails repeatedly, checkpoints are saved frequently, which may exhaust disk space.
Deliver the link failover or switchback command only after training iterations are running normally.
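Because every failed failover or switchback attempt can trigger another checkpoint save, a caller may want to gate manual requests on the job being ready and cap consecutive failures. The following is a minimal client-side sketch under stated assumptions: `send` is a hypothetical callable that wraps the SwitchNicTrack RPC and returns True on success, and `SwitchGuard`, `job_ready`, and `max_failures` are illustrative names, not part of the ClusterD API.

```python
from typing import Callable


class SwitchGuard:
    """Hypothetical client-side guard around manual SwitchNicTrack calls.

    Not part of ClusterD: it refuses to send while the job is not ready and
    stops sending after too many consecutive failures, because each failed
    failover/switchback may cause another checkpoint to be saved.
    """

    def __init__(self, send: Callable[[], bool], max_failures: int = 3) -> None:
        self._send = send                  # assumed RPC wrapper; True means success
        self._max_failures = max_failures  # illustrative cap, tune to available disk space
        self.consecutive_failures = 0

    def try_switch(self, job_ready: bool) -> bool:
        # Precondition from the API description: the job must be registered
        # with ClusterD and its training iterations must be running normally.
        if not job_ready:
            return False
        # Once the cap is reached, do not send again until reset() is called.
        if self.consecutive_failures >= self._max_failures:
            return False
        if self._send():
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        return False

    def reset(self) -> None:
        # Called by an operator after investigating the repeated failures.
        self.consecutive_failures = 0
```

Keeping the cap on the client side means a misbehaving link cannot be retried indefinitely by automation; an operator has to call `reset()` deliberately before further attempts.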
Prototype
rpc SwitchNicTrack(SwitchNics) returns (Status) {}
Input Parameters
| Parameter | Type (Defined by Protobuf) | Description |
|---|---|---|
| SwitchNics | message SwitchNics { string jobID; map<string, DeviceList> nicOps; } message DeviceList { repeated string dev; repeated bool op; } | SwitchNics.jobID: ID of the training job. SwitchNics.nicOps: user-issued link failover or switchback instructions, keyed by node name; each value is the DeviceList of devices to operate on for that node. DeviceList.dev: IDs of the devices on the node; the entry at index i pairs with the entry at index i of DeviceList.op. DeviceList.op: target link for each device in DeviceList.dev; true switches the device to the standby link (failover), and false switches it back to the active link (switchback). |
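The dev and op lists in each DeviceList are parallel: the operation at index i applies to the device at index i. The Python sketch below shows one way to assemble a request that preserves this invariant. It uses plain dataclasses that only mirror the message fields (they are not the generated protobuf classes), and `build_switch_nics` as well as the job ID, node names, and device IDs in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class DeviceList:
    # Mirrors the DeviceList message: dev[i] is operated on with op[i].
    dev: List[str] = field(default_factory=list)
    op: List[bool] = field(default_factory=list)


@dataclass
class SwitchNics:
    # Mirrors the SwitchNics message: node name -> devices and operations.
    jobID: str
    nicOps: Dict[str, DeviceList] = field(default_factory=dict)


def build_switch_nics(job_id: str, ops: Iterable[Tuple[str, str, bool]]) -> SwitchNics:
    """Group (node, device_id, use_standby) triples into one request.

    use_standby=True requests failover to the standby link;
    use_standby=False requests switchback to the active link.
    """
    req = SwitchNics(jobID=job_id)
    for node, device_id, use_standby in ops:
        devices = req.nicOps.setdefault(node, DeviceList())
        # Append to both lists together so dev and op stay index-aligned.
        devices.dev.append(device_id)
        devices.op.append(use_standby)
    return req


# Hypothetical example: fail over devices 0 and 1, switch device 3 back.
example = build_switch_nics(
    "job-demo",
    [("node-1", "0", True), ("node-1", "3", False), ("node-2", "1", True)],
)
```

With generated protobuf Python classes, indexing a message-valued map field creates the entry on first access, so the same loop can append directly to `request.nicOps[node].dev` and `request.nicOps[node].op`.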
Return Value
| Parameter | Type (Defined by Protobuf) | Description |
|---|---|---|
| Status | message Status { int32 code = 1; string info = 2; } | Status.code: return code. Status.info: return information. |
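This section does not list the possible values of Status.code. The sketch below assumes, purely for illustration, that 0 means the request was accepted and that any other value is an error described by Status.info; `check_status`, `SwitchNicTrackError`, and `ASSUMED_SUCCESS_CODE` are hypothetical names, and the real code values should be taken from the ClusterD return-code definitions.

```python
from dataclasses import dataclass


@dataclass
class Status:
    # Mirrors the Status message returned by SwitchNicTrack.
    code: int
    info: str


# ASSUMPTION: 0 is treated as success here; the actual code values are not
# given in this section and must be checked against ClusterD.
ASSUMED_SUCCESS_CODE = 0


class SwitchNicTrackError(RuntimeError):
    """Hypothetical exception raised when Status.code reports a failure."""

    def __init__(self, status: Status) -> None:
        super().__init__(f"SwitchNicTrack failed (code={status.code}): {status.info}")
        self.status = status


def check_status(status: Status) -> str:
    """Return Status.info on (assumed) success, otherwise raise."""
    if status.code != ASSUMED_SUCCESS_CODE:
        raise SwitchNicTrackError(status)
    return status.info
```

Surfacing Status.info in the exception message keeps the server's explanation visible to the operator, which matters for a manual operation whose repeated failure has side effects.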