SwitchNicTrack

Description

Receives link failover requests from the O&M platform and forwards the corresponding failover or switchback operation to the devices on the specified nodes of a training job. Call this API only after the training job has started and is iterating normally, which ensures that the job has been registered with ClusterD. This API represents a manual O&M operation. If link failover or switchback fails repeatedly, checkpoints are saved frequently, which can exhaust disk space.

Issue the link failover or switchback command only while training iterations are running normally.

Prototype

rpc SwitchNicTrack(SwitchNics) returns (Status) {}

Input Parameters

Parameter

Type (Defined by Protobuf)

Description

SwitchNics

message SwitchNics{
    string jobID;
    map<string, DeviceList> nicOps;
}

message DeviceList {
    repeated string dev;
    repeated bool op;
}

SwitchNics.jobID: job ID

SwitchNics.nicOps: devices and operations targeted by the user-issued link failover or switchback instruction. Each key is a node name, and each value is the DeviceList describing the devices to operate on that node.

DeviceList.dev: list of device IDs on the node. Each entry corresponds to the entry at the same position in DeviceList.op.

DeviceList.op: list of link operations to perform, one for each device ID in DeviceList.dev. true indicates the standby link (failover) and false indicates the active link (switchback).
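
As a rough illustration of how dev and op pair up, the sketch below mirrors the two messages with plain Python dataclasses instead of generated protobuf classes. The helper build_switch_nics, the job ID, the node name, and the device IDs are all hypothetical. The checks for an empty job ID or an empty device list are assumptions made for this example, not documented server-side rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DeviceList:
    """Plain-Python stand-in for the DeviceList message."""
    dev: List[str] = field(default_factory=list)
    op: List[bool] = field(default_factory=list)


@dataclass
class SwitchNics:
    """Plain-Python stand-in for the SwitchNics message."""
    jobID: str = ""
    nicOps: Dict[str, DeviceList] = field(default_factory=dict)


def build_switch_nics(job_id: str, ops: Dict[str, List[Tuple[str, bool]]]) -> SwitchNics:
    """Build a request from {node name: [(device ID, use_standby), ...]}.

    Hypothetical helper: keeping each device and its operation in one tuple
    guarantees that dev[i] and op[i] always describe the same device.
    """
    if not job_id:
        raise ValueError("jobID must not be empty")
    request = SwitchNics(jobID=job_id)
    for node, pairs in ops.items():
        if not pairs:
            raise ValueError(f"node {node!r} has no devices to operate on")
        request.nicOps[node] = DeviceList(
            dev=[device for device, _ in pairs],
            op=[use_standby for _, use_standby in pairs],
        )
    return request


# Made-up example values: device "0" moves to the standby link (true),
# device "1" returns to the active link (false).
example = build_switch_nics("job-123", {"node-1": [("0", True), ("1", False)]})
```

Building both lists from one list of pairs is a design choice for this sketch: it removes the chance of dev and op drifting out of step when a request is assembled by hand.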

Return Value

Parameter

Type (Defined by Protobuf)

Description

Status

message Status{
    int32 code = 1;
    string info = 2;
}

Status.code: return code

  • 0: instruction delivered successfully
  • Other values: failed to deliver the instruction

Status.info: return information
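
A caller might treat any non-zero code as a failure along the lines of the sketch below. It uses a plain Python stand-in for the Status message. The names check_status and SwitchNicError are hypothetical and not part of ClusterD. Note that a code of 0 reports only that the instruction was delivered.

```python
from typing import NamedTuple


class Status(NamedTuple):
    """Plain-Python stand-in for the Status message."""
    code: int = 0
    info: str = ""


class SwitchNicError(RuntimeError):
    """Hypothetical exception raised for a non-zero Status.code."""


def check_status(status: Status) -> None:
    """Raise if the instruction was not delivered (any code other than 0)."""
    if status.code != 0:
        raise SwitchNicError(
            f"SwitchNicTrack failed: code={status.code}, info={status.info!r}"
        )
```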