Job Information

job-summary-<job name>

Table 1 Fields in job-summary-<job name> ConfigMap

Parameter

Description

Value

hccl.json

Communication information about the processor used by a job.

The value is an escaped string that can be parsed as JSON. The fields are described as follows:

  • status: whether RankTable has been generated.
    • initializing: Devices are being allocated to the job, and RankTable has not been generated yet.
    • complete: RankTable has been generated. The status changes to complete as soon as RankTable is generated, and fields such as server_list are displayed at the same time.
  • server_list: device allocation information.
    • device: NPU allocation information, including the NPU IP address and rank ID.
    • server_id: AI server ID, which is globally unique.
    • server_name: server name.
    • server_sn: server serial number (SN). Ensure that the server SN exists. If it does not exist, contact Huawei technical support.
  • server_count: number of servers used by the job.
  • version: version information.

String

job_id

Kubernetes ID of a job

String

operator

Operation on the job:

  • add: After a command to add a job is received, the value changes to add.
  • delete: After a command to delete a job is received, the value changes to delete.

String

deleteTime

Time when a job is deleted.

String

sharedTorIp

Information about the shared switch used by a job.

String

masterAddr

MASTER_ADDR value specified during PyTorch training.

String

total

Number of ConfigMaps.

Integer

time

Time when a job starts.

String

framework

Framework used by a job.

String

job_status

Job status:

  • Pending
  • Running
  • Complete
  • Failed

String

job_name

Job name.

String

cm_index

Index of the current ConfigMap.

String
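The fields in Table 1 can be read programmatically once the ConfigMap data has been retrieved. The following is a minimal sketch, not the product's own tooling: it assumes the ConfigMap data is already available as a Python dictionary of strings, and that hccl.json holds a JSON string with the fields listed above. The function name parse_job_summary, the sample values, and the key names inside each device entry (device_ip, rank_id) are hypothetical.

```python
import json


def parse_job_summary(data):
    """Summarize the data mapping of a job-summary-<job name> ConfigMap."""
    # hccl.json is stored as a string, so it must be decoded separately.
    hccl = json.loads(data["hccl.json"])
    summary = {
        "job_name": data.get("job_name"),
        "job_status": data.get("job_status"),
        # RankTable is usable only when status is "complete".
        "rank_table_ready": hccl.get("status") == "complete",
        "servers": [],
    }
    # server_list may be absent while status is "initializing".
    for server in hccl.get("server_list", []):
        summary["servers"].append({
            "server_id": server.get("server_id"),
            "server_name": server.get("server_name"),
            "device_count": len(server.get("device", [])),
        })
    return summary


# Hypothetical sample data, for illustration only.
sample = {
    "job_name": "demo-job",
    "job_status": "Running",
    "hccl.json": json.dumps({
        "status": "complete",
        "server_count": "1",
        "version": "1.0",
        "server_list": [{
            "server_id": "node-1-id",
            "server_name": "node-1",
            "server_sn": "SN0001",
            "device": [{"device_ip": "192.0.2.10", "rank_id": "0"}],
        }],
    }),
}
result = parse_job_summary(sample)
print(result)
```

Checking status before reading server_list avoids treating a job whose devices are still being allocated as a job with no servers.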

current-job-statistic

current-job-statistic records statistics about the current jobs in the cluster, and the information is stored in /var/log/mindx-dl/clusterd/event_job.log. Because of the capacity limit of a Kubernetes ConfigMap, statistics for a maximum of about 10,000 cluster jobs can be collected. When a log file reaches 20 MB, it is automatically dumped. A maximum of five dump logs can be saved, and each dump log is retained for a maximum of 40 days.

Parameter

Description

data

-

- ID

Job ID allocated by the Kubernetes cluster.

- customID

User-defined job ID. If the content is empty, the job ID is not displayed.

- cardNum

Number of cards used by a job. If the content is empty, the card quantity is not displayed.

- podFirstRunTime

Time when all pods of a job are running for the first time. If the content is empty, the time is not displayed.

- stopTime

Time when all pods of a job are complete or are forcibly deleted. If the content is empty, the time is not displayed.

- podLastRunTime

Last time when all pods of a job were restored to running. If the content is empty, the time is not displayed.

- podLastFaultTime

Last time when some or all pods of a job failed. If the content is empty, the time is not displayed.

- podFaultTimes

Number of times that pods are rescheduled due to job faults. If the number is 0, the rescheduling count is not displayed.

totalJob

Total number of jobs in the current cluster.
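The statistics fields above can be summarized in the same way. The following is a minimal sketch under stated assumptions: the data field is assumed to be a JSON string containing a list of job entries, and totalJob is assumed to be a number or numeric string. Neither encoding is confirmed by this section. The function name summarize_job_statistics and all sample values are hypothetical, and the time values are placeholders because the time format is not specified here.

```python
import json


def summarize_job_statistics(cm_data):
    """Summarize the data mapping of the current-job-statistic ConfigMap."""
    # Assumption: "data" is a JSON-encoded list of per-job entries.
    jobs = json.loads(cm_data.get("data", "[]"))
    rows = []
    for job in jobs:
        rows.append({
            "id": job.get("ID"),
            # Empty or missing optional fields are treated as "not displayed".
            "custom_id": job.get("customID") or None,
            "card_num": job.get("cardNum") or None,
            "fault_times": int(job.get("podFaultTimes") or 0),
            "stopped": bool(job.get("stopTime")),
        })
    total = int(cm_data.get("totalJob", len(rows)))
    return {"total_job": total, "jobs": rows}


# Hypothetical sample data, for illustration only.
sample_cm = {
    "totalJob": "2",
    "data": json.dumps([
        {"ID": "uid-1", "customID": "exp-42", "cardNum": "8",
         "podFirstRunTime": "<time>", "podLastFaultTime": "<time>",
         "podFaultTimes": "3"},
        {"ID": "uid-2", "cardNum": "16", "podFirstRunTime": "<time>",
         "stopTime": "<time>"},
    ]),
}
stats = summarize_job_statistics(sample_cm)
faulty = [j["id"] for j in stats["jobs"] if j["fault_times"] > 0]
print(stats["total_job"], faulty)
```

Treating an empty field the same as a missing one mirrors the table, where empty content and a podFaultTimes value of 0 are not displayed.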