SubscribeFaultMsgSignal

Description

Handles fault subscription requests from the client. The server allocates a message queue to each job and monitors the queue for pending messages. When messages are pending, the server pushes them to the client through the gRPC stream.

  • Before calling this API, call Register.
  • After subscribing to the fault information of a general computing job, the client receives only NodeD faults and Kubernetes node status exceptions.

Prototype

rpc SubscribeFaultMsgSignal(ClientInfo) returns (stream FaultMsgSignal){}

Input Parameters

Parameter: ClientInfo

Type (Defined by Protobuf):

message ClientInfo {
    string jobId = 1;
    string role = 2;
}

Description:

ClientInfo.jobId: job ID

ClientInfo.role: client role

NOTE:
  • If the input jobId is empty, the faults of all jobs in the cluster are obtained.
  • If the input jobId is not empty, the faults of the nodes where the specified job runs are obtained.
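
The jobId rule in the note can be sketched as follows. This is a minimal illustration, not the actual client code: the ClientInfo class below is a plain Python stand-in for the class that would be generated from the .proto file, and the helper name subscription_scope and the role value "example-role" are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClientInfo:
    """Stand-in for the protobuf ClientInfo message (assumption: a real
    client would use the class generated from the .proto file)."""
    jobId: str = ""
    role: str = ""


def subscription_scope(info: ClientInfo) -> str:
    """Describe which faults a subscription with this ClientInfo covers."""
    if info.jobId == "":
        # Empty jobId: faults of all jobs in the cluster.
        return "all jobs in the cluster"
    # Non-empty jobId: faults of the nodes used by that job.
    return f"nodes of job {info.jobId}"


# Hypothetical values, for illustration only.
cluster_wide = ClientInfo(jobId="", role="example-role")
single_job = ClientInfo(jobId="job-123", role="example-role")
print(subscription_scope(cluster_wide))
print(subscription_scope(single_job))
```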

Return Value

Return Value: stream

Type (Defined by Protobuf): gRPC stream

Description:

This API returns a gRPC stream. The concrete type of the returned stream object depends on the programming language used by the client.

The client can call the stream's Receive method (the actual name is determined by the client's programming language) to receive data pushed by the server.
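
In Python, for example, a server-streaming gRPC call returns an iterator, so the receive step becomes a for loop. The sketch below runs that loop against a stand-in generator rather than a live server. The names consume and fake_stream and the sample field values are hypothetical; a real client would iterate over the object returned by the generated stub's SubscribeFaultMsgSignal call instead.

```python
from typing import Callable, Iterable, Iterator


def consume(stream: Iterable[dict], handle: Callable[[dict], None]) -> int:
    """Read messages until the stream ends; return how many were handled."""
    count = 0
    for msg in stream:
        handle(msg)
        count += 1
    return count


def fake_stream() -> Iterator[dict]:
    """Stand-in for the server stream (assumption: messages are shown as
    dicts shaped like FaultMsgSignal, not as generated protobuf objects)."""
    yield {"uuid": "u-1", "jobId": "job-123", "signalType": "fault", "nodeFaultInfo": []}
    yield {"uuid": "u-2", "jobId": "job-123", "signalType": "normal", "nodeFaultInfo": []}


received = []
handled = consume(fake_stream(), received.append)
print(handled, [m["signalType"] for m in received])
```

Keeping the loop separate from the handler makes it possible to test the message handling without a running server.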

Data to Be Sent

Parameter: FaultMsgSignal

Type (Defined by Protobuf):

message FaultMsgSignal {
    string uuid = 1;
    string jobId = 2;
    string signalType = 3;
    repeated NodeFaultInfo nodeFaultInfo = 4;
}

message NodeFaultInfo {
    string nodeName = 1;
    string nodeIP = 2;
    string nodeSN = 3;
    string faultLevel = 4;
    repeated DeviceFaultInfo faultDevice = 5;
}

message DeviceFaultInfo {
    string deviceId = 1;
    string deviceType = 2;
    repeated string faultCodes = 3;
    string faultLevel = 4;
    repeated string faultType = 5;
    repeated string faultReason = 6;
    repeated SwitchFaultInfo switchFaultInfos = 7;
    repeated string faultLevels = 8;
}

message SwitchFaultInfo {
    string faultCode = 1;
    string switchChipId = 2;
    string switchPortId = 3;
    string faultTime = 4;
    string faultLevel = 5;
}

Description:

FaultMsgSignal.uuid: message ID

FaultMsgSignal.jobId: job ID

FaultMsgSignal.signalType: message type. The value fault indicates that a fault has occurred. The value normal indicates that no fault exists or that the fault has been rectified.

FaultMsgSignal.nodeFaultInfo: node fault information

NodeFaultInfo.nodeName: name of the faulty node

NodeFaultInfo.nodeIP: node IP address

NodeFaultInfo.nodeSN: node SN

NodeFaultInfo.faultLevel: fault level of the node, which can be Healthy, SubHealthy, or UnHealthy. The value is the most severe DeviceFaultInfo.faultLevel among the devices on the node.

NodeFaultInfo.faultDevice: device fault information

DeviceFaultInfo.deviceId: device ID. When a bus device fault or Kubernetes status exception occurs on the node, the value of deviceId is -1.

DeviceFaultInfo.deviceType: device type, including Node, NPU, Storage, CPU, and Network.

DeviceFaultInfo.faultCodes: fault code list

DeviceFaultInfo.faultLevel: fault level of the device, which can be Healthy, SubHealthy, or UnHealthy, listed in order of increasing severity.

DeviceFaultInfo.faultType: (reserved) fault subsystem type

DeviceFaultInfo.faultReason: (reserved) fault cause

DeviceFaultInfo.switchFaultInfos: UnifiedBus fault information

DeviceFaultInfo.faultLevels: fault level list

SwitchFaultInfo.faultCode: UnifiedBus fault code

SwitchFaultInfo.switchChipId: ID of the faulty UnifiedBus chip

SwitchFaultInfo.switchPortId: ID of the faulty UnifiedBus port

SwitchFaultInfo.faultTime: time when the UnifiedBus fault occurred

SwitchFaultInfo.faultLevel: UnifiedBus fault level
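
The field descriptions above can be tied together in a short sketch that interprets one received message: it checks signalType, derives the node level from the device levels, and walks the nested structure to list fault codes. This is an illustration under stated assumptions, not the server's logic. Messages are represented as plain dicts, the helper names (most_severe, has_fault, iter_fault_codes) are hypothetical, the fault codes CODE_A and CODE_B are placeholders, and treating an empty device list as Healthy is an assumption.

```python
from typing import Dict, Iterator, List, Tuple

# Severity order from the field descriptions: Healthy < SubHealthy < UnHealthy.
SEVERITY: Dict[str, int] = {"Healthy": 0, "SubHealthy": 1, "UnHealthy": 2}


def most_severe(levels: List[str]) -> str:
    """Return the most severe level; an empty list maps to Healthy (assumption)."""
    return max(levels, key=lambda level: SEVERITY[level], default="Healthy")


def has_fault(signal: dict) -> bool:
    """True when signalType is fault, False when it is normal."""
    signal_type = signal["signalType"]
    if signal_type not in ("fault", "normal"):
        # Other values are not described in this section.
        raise ValueError(f"unexpected signalType: {signal_type!r}")
    return signal_type == "fault"


def iter_fault_codes(signal: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (nodeName, deviceId, code) for device and UnifiedBus fault codes."""
    for node in signal.get("nodeFaultInfo", []):
        for device in node.get("faultDevice", []):
            for code in device.get("faultCodes", []):
                yield node["nodeName"], device["deviceId"], code
            for switch_fault in device.get("switchFaultInfos", []):
                yield node["nodeName"], device["deviceId"], switch_fault["faultCode"]


# Hypothetical message, for illustration only. deviceId is a string field,
# so the node-level marker appears as "-1".
sample = {
    "uuid": "u-1",
    "jobId": "job-123",
    "signalType": "fault",
    "nodeFaultInfo": [
        {
            "nodeName": "node-a",
            "faultLevel": "UnHealthy",
            "faultDevice": [
                {"deviceId": "0", "faultCodes": ["CODE_A"],
                 "faultLevel": "SubHealthy", "switchFaultInfos": []},
                {"deviceId": "-1", "faultCodes": [],
                 "faultLevel": "UnHealthy",
                 "switchFaultInfos": [{"faultCode": "CODE_B"}]},
            ],
        }
    ],
}

node = sample["nodeFaultInfo"][0]
device_levels = [d["faultLevel"] for d in node["faultDevice"]]
print(has_fault(sample))
print(most_severe(device_levels) == node["faultLevel"])
print(list(iter_fault_codes(sample)))
```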