GetFaultMsgSignal

Description

GetFaultMsgSignal is the fault query interface. It receives client requests to query cluster and job fault information.

This interface processes a maximum of 10 requests per second. Requests beyond that limit are added to a waiting queue. If more than 50 requests are waiting, new requests are rejected.
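The limits above can be illustrated with a small model. This is only a sketch of the documented behavior, not the server's implementation: the fixed one-second window, the FIFO draining order, and the names AdmissionModel, submit, and tick are all assumptions made for illustration.

```python
from collections import deque

RATE_PER_SECOND = 10  # documented: at most 10 requests handled per second
MAX_WAITING = 50      # documented: more than 50 waiting requests are rejected


class AdmissionModel:
    """Toy model of the documented limits (hypothetical, not the real server)."""

    def __init__(self):
        self.served_this_second = 0
        self.waiting = deque()

    def submit(self, request_id):
        """Classify one incoming request as 'served', 'queued', or 'rejected'."""
        if self.served_this_second < RATE_PER_SECOND:
            self.served_this_second += 1
            return "served"
        if len(self.waiting) < MAX_WAITING:
            self.waiting.append(request_id)
            return "queued"
        return "rejected"

    def tick(self):
        """Start a new one-second window, serving queued requests first (assumed FIFO)."""
        self.served_this_second = 0
        drained = []
        while self.waiting and self.served_this_second < RATE_PER_SECOND:
            drained.append(self.waiting.popleft())
            self.served_this_second += 1
        return drained
```

In this model, 61 requests arriving within one second result in 10 served, 50 queued, and 1 rejected. A rejected request corresponds to return code 429.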

Prototype

rpc GetFaultMsgSignal(ClientInfo) returns(FaultQueryResult){}

Input Parameters

Parameter

Type (Defined by Protobuf)

Description

ClientInfo

message ClientInfo {
  string jobId = 1;
  string role = 2;
}

ClientInfo.jobId: job ID. If jobId is empty, fault information for the entire cluster is returned. If jobId is not empty, it must contain 8 to 128 characters and cannot include Chinese characters.

ClientInfo.role: client role

NOTE:
  • If jobId is empty, all faults in the current cluster are queried.
  • If jobId is not empty, faults on the node to which the job belongs are queried.
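A client may want to check jobId before sending the request. The following is a minimal sketch of the constraints above. The function name validate_job_id is hypothetical, and two details are assumptions: length is counted in Unicode code points, and "Chinese character" is interpreted as the CJK Unified Ideographs block (U+4E00 to U+9FFF). The server's actual check may differ.

```python
def validate_job_id(job_id: str) -> bool:
    """Return True if job_id satisfies the documented ClientInfo.jobId constraints."""
    # An empty jobId is valid: it requests cluster-wide fault information.
    if job_id == "":
        return True
    # A non-empty jobId must contain 8 to 128 characters.
    if not 8 <= len(job_id) <= 128:
        return False
    # Assumption: "Chinese character" means the CJK Unified Ideographs block.
    return not any("\u4e00" <= ch <= "\u9fff" for ch in job_id)
```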

Return Value

Return Value

Type (Defined by Protobuf)

Description

FaultQueryResult

message FaultQueryResult {
  int32 code = 1;
  string info = 2;
  FaultMsgSignal faultSignal = 3;
}

code: return code of a query

  • 200: successful query
  • 429: rate limiting on the server
  • 500: server error
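One possible way for a client to act on these return codes is sketched below. It is not a documented client behavior: retrying with exponential backoff on 429 is an assumption, and send_query is a hypothetical callable standing in for the actual RPC call that returns the code, info, and faultSignal fields as a tuple.

```python
import time


def query_with_retry(send_query, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send_query and return faultSignal on 200, retrying on 429.

    send_query is a hypothetical callable returning (code, info, fault_signal).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        code, info, fault_signal = send_query()
        if code == 200:
            # Successful query: hand the fault information back to the caller.
            return fault_signal
        if code == 429 and attempt < max_attempts - 1:
            # Rate limited: wait 1x, 2x, 4x ... base_delay before retrying.
            sleep(base_delay * (2 ** attempt))
            continue
        # 500, an unknown code, or 429 on the final attempt.
        raise RuntimeError(f"fault query failed with code {code}: {info}")
```

Passing sleep as a parameter keeps the sketch easy to exercise without real delays.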

info: description of the query result

faultSignal: fault information structure

FaultMsgSignal.uuid: message ID

FaultMsgSignal.jobId: job ID. The value -1 indicates the cluster.

FaultMsgSignal.signalType: message type. fault indicates that a fault has occurred, and normal indicates that no fault has occurred or that a fault has been rectified.

FaultMsgSignal.nodeFaultInfo: node fault information

NodeFaultInfo.nodeName: name of the faulty node

NodeFaultInfo.nodeIP: node IP address

NodeFaultInfo.nodeSN: node SN

NodeFaultInfo.faultLevel: node fault level, which can be Healthy, SubHealthy, or UnHealthy. The value is the most severe level among the DeviceFaultInfo.faultLevel values of the devices on the node.

NodeFaultInfo.faultDevice: device fault information

DeviceFaultInfo.deviceId: device ID

DeviceFaultInfo.deviceType: device type, including Node, NPU, Storage, CPU, and Network.

DeviceFaultInfo.faultCodes: fault code list

DeviceFaultInfo.faultLevel: device fault level, which can be Healthy, SubHealthy, or UnHealthy, listed in order of increasing severity.
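The relationship between NodeFaultInfo.faultLevel and the device levels can be sketched as follows. The names SEVERITY and node_fault_level are hypothetical, and returning Healthy for a node with no reported devices is an assumption that the description does not cover.

```python
# Severity ranking taken from the documented order: Healthy < SubHealthy < UnHealthy.
SEVERITY = {"Healthy": 0, "SubHealthy": 1, "UnHealthy": 2}


def node_fault_level(device_levels):
    """Return the most severe level among the given DeviceFaultInfo.faultLevel values."""
    # Assumption: a node with no device entries is treated as Healthy.
    return max(device_levels, key=SEVERITY.__getitem__, default="Healthy")
```

For example, a node whose devices report Healthy and SubHealthy has the node level SubHealthy.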

DeviceFaultInfo.faultType: (reserved) fault subsystem type

DeviceFaultInfo.faultReason: (reserved) fault cause

DeviceFaultInfo.switchFaultInfos: UnifiedBus fault information list

DeviceFaultInfo.faultLevels: fault level list

SwitchFaultInfo.faultCode: UnifiedBus fault code

SwitchFaultInfo.switchChipId: ID of the faulty UnifiedBus chip

SwitchFaultInfo.switchPortId: ID of the faulty UnifiedBus port

SwitchFaultInfo.faultTime: time when a UnifiedBus fault occurs

SwitchFaultInfo.faultLevel: UnifiedBus fault level
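The nesting of the returned structure (FaultMsgSignal contains NodeFaultInfo entries, which contain DeviceFaultInfo entries) can be sketched with plain data classes. This is an illustration only: the Python field types are assumptions because the Protobuf definitions of these messages are not shown here, only a subset of the fields is modeled, and list_faults is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeviceFaultInfo:
    deviceId: str
    deviceType: str  # Node, NPU, Storage, CPU, or Network
    faultLevel: str  # Healthy, SubHealthy, or UnHealthy
    faultCodes: List[str] = field(default_factory=list)


@dataclass
class NodeFaultInfo:
    nodeName: str
    faultLevel: str
    faultDevice: List[DeviceFaultInfo] = field(default_factory=list)


@dataclass
class FaultMsgSignal:
    uuid: str
    jobId: str       # "-1" indicates the cluster
    signalType: str  # "fault" or "normal"
    nodeFaultInfo: List[NodeFaultInfo] = field(default_factory=list)


def list_faults(signal):
    """Flatten a fault signal into (nodeName, deviceType, deviceId, faultCode) tuples."""
    # A "normal" signal means no fault occurred or the fault was rectified.
    if signal.signalType != "fault":
        return []
    return [
        (node.nodeName, dev.deviceType, dev.deviceId, code)
        for node in signal.nodeFaultInfo
        for dev in node.faultDevice
        for code in dev.faultCodes
    ]
```

A flattened list of this kind is convenient for logging or alerting, where each fault code is reported together with the node and device it came from.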