Job Information
job-summary-<job name>
| Parameter | Description | Value |
|---|---|---|
| hccl.json | Communication information about the processor used by a job. The value can be escaped to the JSON format. The fields are described as follows: | String |
| job_id | Kubernetes ID of a job. | String |
| operator | | String |
| deleteTime | Time when a job is deleted. | String |
| sharedTorIp | Information about the shared switch used by a job. | String |
| masterAddr | MASTER_ADDR value specified during PyTorch training. | String |
| total | Number of ConfigMaps. | Integer |
| time | Time when a job starts. | String |
| framework | Framework used by a job. | String |
| job_status | Job status: | String |
| job_name | Job name. | String |
| cm_index | Index of the current ConfigMap. | String |
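As a minimal sketch of how these parameters might be consumed, the Python snippet below takes the key/value data of one job-summary ConfigMap as a plain dictionary, decodes the `hccl.json` string into a JSON object, and converts `total` to an integer. The function name `parse_job_summary`, the sample values, and the `placeholder_field` key inside `hccl.json` are all hypothetical; the real fields of `hccl.json` are not listed here, so the example does not assume any of them.

```python
import json


def parse_job_summary(data):
    """Return a copy of the ConfigMap data with hccl.json decoded and total as int.

    `data` is assumed to be a dict of the parameters in the table above,
    with string values (hypothetical input shape).
    """
    summary = dict(data)
    raw = summary.get("hccl.json")
    # hccl.json is stored as a string; decode it only when it is non-empty.
    if isinstance(raw, str) and raw.strip():
        summary["hccl.json"] = json.loads(raw)
    # total is documented as an integer; accept either "3" or 3.
    if "total" in summary:
        summary["total"] = int(summary["total"])
    return summary


# Made-up sample data, for illustration only.
sample = {
    "job_name": "demo-job",
    "framework": "pytorch",
    "total": "1",
    "cm_index": "0",
    "hccl.json": json.dumps({"placeholder_field": "placeholder_value"}),
}
parsed = parse_job_summary(sample)
print(parsed["hccl.json"]["placeholder_field"])
```

Keeping the decoding in one small function means callers can treat `hccl.json` as a nested object while the remaining parameters stay untouched strings.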
current-job-statistic
This file records statistics about the current jobs in the cluster and stores the information in /var/log/mindx-dl/clusterd/event_job.log. Because of the capacity limit of a Kubernetes ConfigMap, statistics can be collected for at most about 10,000 cluster jobs. When a log file reaches 20 MB, it is automatically dumped. At most five dump logs are saved, and each dump log is retained for at most 40 days.
| Parameter | Description |
|---|---|
| data | - |
| - ID | Job ID allocated by the Kubernetes cluster. |
| - customID | User-defined job ID. If the content is empty, the job ID is not displayed. |
| - cardNum | Number of cards used by a job. If the content is empty, the card quantity is not displayed. |
| - podFirstRunTime | Time when all pods of a job are running for the first time. If the content is empty, the time is not displayed. |
| - stopTime | Time when all pods of a job are complete or are forcibly deleted. If the content is empty, the time is not displayed. |
| - podLastRunTime | Last time when all pods of a job were restored to running. If the content is empty, the time is not displayed. |
| - podLastFaultTime | Last time when some or all pods of a job failed. If the content is empty, the time is not displayed. |
| - podFaultTimes | Number of times that pods are rescheduled due to job faults. If the number is 0, the rescheduling times are not displayed. |
| totalJob | Total number of jobs in the current cluster. |
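Because several of these fields are omitted when they are empty (and `podFaultTimes` is omitted when it is 0), a reader of this data needs defaults for missing keys. The Python sketch below illustrates one way to handle that. It assumes, without confirmation from this document, that the statistics are available as JSON text in which `data` is a list of job objects; the names `JobStatistic` and `parse_statistics` and the sample values are hypothetical. Time values are kept exactly as read, since their format is not specified in the table.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class JobStatistic:
    id: str
    custom_id: Optional[str] = None      # omitted when empty
    card_num: Optional[int] = None       # omitted when empty
    pod_first_run_time: Any = None       # time format not specified; kept as-is
    stop_time: Any = None
    pod_last_run_time: Any = None
    pod_last_fault_time: Any = None
    pod_fault_times: int = 0             # omitted when the count is 0


def parse_statistics(text):
    """Parse the assumed JSON layout into (list of JobStatistic, totalJob)."""
    payload = json.loads(text)
    jobs = []
    for entry in payload.get("data", []):
        card_num = entry.get("cardNum")
        jobs.append(JobStatistic(
            id=str(entry["ID"]),
            custom_id=entry.get("customID"),
            card_num=int(card_num) if card_num is not None else None,
            pod_first_run_time=entry.get("podFirstRunTime"),
            stop_time=entry.get("stopTime"),
            pod_last_run_time=entry.get("podLastRunTime"),
            pod_last_fault_time=entry.get("podLastFaultTime"),
            pod_fault_times=int(entry.get("podFaultTimes", 0)),
        ))
    # Fall back to the number of parsed entries if totalJob is absent.
    return jobs, int(payload.get("totalJob", len(jobs)))


# Made-up sample data, for illustration only.
sample_text = json.dumps({
    "data": [
        {"ID": "job-a", "cardNum": 8, "podFaultTimes": 2},
        {"ID": "job-b"},
    ],
    "totalJob": 2,
})
jobs, total = parse_statistics(sample_text)
print(total, [job.id for job in jobs])
```

Mapping an absent `podFaultTimes` to 0 and the other absent fields to `None` mirrors the "not displayed" behavior described in the table, so downstream code does not need to check for missing keys.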