gRPC

Description

Receives public faults sent by a gRPC client and processes them so that they feed into the resumable training process.

  • If a parameter value in a gRPC request falls outside its defined value range, ClusterD discards the fault information.
  • Faults injected through the ConfigMap or gRPC interface are capped at 50,000 in total across all nodes. Once this threshold is exceeded, ClusterD discards any newly injected fault information.
  • To clear a public fault, send the recover event of that fault to ClusterD through the gRPC interface.
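
The last two rules above (the 50,000 cap and clearing by recover event) can be pictured with a small toy model. This is only a sketch of the documented behavior, not ClusterD's implementation: the `FaultStore` class, its `handle` method, the `"occur"` and `"recover"` assertion strings, and keying faults by `faultId` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

MAX_FAULTS = 50_000  # documented limit across all nodes


@dataclass
class FaultStore:
    """Toy model of the documented cap and recover rules (not ClusterD code)."""
    limit: int = MAX_FAULTS
    active: dict = field(default_factory=dict)  # faultId -> description

    def handle(self, fault_id: str, assertion: str, description: str = "") -> bool:
        """Return True if the event was accepted, False if it was discarded."""
        if assertion == "recover":  # assumed label for a recover event
            # A recover event clears the matching fault, if one is stored.
            self.active.pop(fault_id, None)
            return True
        if fault_id not in self.active and len(self.active) >= self.limit:
            # At the limit: newly injected fault information is discarded.
            return False
        self.active[fault_id] = description
        return True


store = FaultStore(limit=2)  # tiny limit so the behavior is easy to see
results = [
    store.handle("f1", "occur"),
    store.handle("f2", "occur"),
    store.handle("f3", "occur"),    # discarded: the store is full
    store.handle("f1", "recover"),  # clears f1 and frees a slot
    store.handle("f3", "occur"),    # accepted now
]
print(results)  # [True, True, False, True, True]
```

The practical consequence for a sender is that faults which are never followed by a recover event keep occupying the shared quota, so recover events should be sent as soon as a fault is resolved.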

Prototype

rpc SendPublicFault(PublicFaultRequest) returns (RespStatus){}

Input Parameters

Parameter: PublicFaultRequest

Type (Defined by Protobuf):

    message PublicFaultRequest {
        string id = 1;
        int64 timestamp = 2;
        string version = 3;
        string resource = 4;
        repeated Fault faults = 5;
    }

    message Fault {
        string faultId = 1;
        string faultType = 2;
        string faultCode = 3;
        int64 faultTime = 4;
        string assertion = 5;
        map<string, string> faultLocation = 6;
        repeated PubFaultInfo influence = 7;
        string description = 8;
    }

    message PubFaultInfo {
        string nodeName = 1;
        string nodeSN = 2;
        repeated int32 deviceIds = 3;
    }

Description:


PublicFaultRequest.id: unique ID of the message

PublicFaultRequest.timestamp: time when the message is sent

PublicFaultRequest.version: message version

PublicFaultRequest.resource: fault sender

PublicFaultRequest.faults: fault content

Fault.faultId: fault instance ID

Fault.faultType: fault type

Fault.faultCode: fault code

Fault.faultTime: fault occurrence time

Fault.assertion: fault status

Fault.faultLocation: fault locating information

Fault.influence: fault impact scope

Fault.description: fault description

PubFaultInfo.nodeName: node name

PubFaultInfo.nodeSN: node SN

PubFaultInfo.deviceIds: physical processor IDs

For more details, see ConfigMap.
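
To make the nesting of the three messages concrete, the following sketch assembles a request payload with plain Python dataclasses that mirror the field names above. It is illustrative only: a real client would use the classes that `protoc` generates from the project's `.proto` file, and the `build_request` helper, the millisecond timestamp unit, the version string, and every field value shown here are placeholders rather than documented values.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class PubFaultInfo:
    nodeName: str
    nodeSN: str = ""
    deviceIds: list = field(default_factory=list)


@dataclass
class Fault:
    faultId: str
    faultType: str
    faultCode: str
    faultTime: int
    assertion: str
    faultLocation: dict = field(default_factory=dict)
    influence: list = field(default_factory=list)
    description: str = ""


@dataclass
class PublicFaultRequest:
    id: str
    timestamp: int
    version: str
    resource: str
    faults: list = field(default_factory=list)


def build_request(resource, faults, version="1.0"):
    """Wrap faults in a request with a fresh message ID and send timestamp."""
    now_ms = int(time.time() * 1000)  # millisecond unit is an assumption
    return PublicFaultRequest(str(uuid.uuid4()), now_ms, version, resource, list(faults))


# All values below are placeholders, not documented value ranges.
fault = Fault(
    faultId="fault-0001",
    faultType="Storage",
    faultCode="000000001",
    faultTime=int(time.time() * 1000),
    assertion="occur",
    influence=[PubFaultInfo(nodeName="node-1", deviceIds=[0, 1])],
    description="example fault",
)
request = build_request("example-sender", [fault])
print(json.dumps(asdict(request), indent=2))
```

Note that one request can carry several `Fault` entries, and each fault can list several affected nodes in `influence`, each with its own `deviceIds`.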

Return Value

Return value: RespStatus

Type (Defined by Protobuf):

    message RespStatus {
        int32 code = 1;
        string info = 2;
    }

Description:

RespStatus.code: return code

  • 0: The fault is transmitted successfully.
  • Other values: The fault failed to be transmitted. 409 indicates incorrect request parameters; 410 indicates that messages are being sent more frequently than the upper limit allows.

RespStatus.info: return information
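
A client will usually want to react differently to each return code. The sketch below shows one possible mapping; the `Action` enum and the `next_action` function are hypothetical names, and the suggested reactions (correcting the request on 409, backing off on 410) are reasonable guesses based on the code meanings above rather than documented requirements.

```python
from enum import Enum


class Action(Enum):
    DONE = "done"
    FIX_REQUEST = "fix_request"
    RETRY_LATER = "retry_later"
    INVESTIGATE = "investigate"


def next_action(code: int) -> Action:
    """Map a RespStatus.code value to a suggested client reaction."""
    if code == 0:
        return Action.DONE
    if code == 409:
        # Incorrect parameters: resending the same request would fail again.
        return Action.FIX_REQUEST
    if code == 410:
        # Sending too frequently: wait before sending again.
        return Action.RETRY_LATER
    # Any other failure: consult RespStatus.info for details.
    return Action.INVESTIGATE


# 500 is an arbitrary value standing in for any other non-zero code.
for code in (0, 409, 410, 500):
    print(code, next_action(code).value)
```

Treating 409 as non-retryable avoids adding to the message rate that 410 guards against.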