Key Fields in acjob
Ascend Job (acjob) is a job type customized by MindCluster. You can start a training or inference job by configuring resource information through environment variables or files. The following table describes the acjob fields.
| Field Path | Type | Format | Description |
|---|---|---|---|
| apiVersion | String | - | Versioned schema of the object. The server converts it to the latest internal value and rejects unrecognized versions. For more information, see https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds. |
| kind | String | - | REST resource type that the object represents. The value is in CamelCase, is inferred from the endpoint, and cannot be updated. For more information, see https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources. |
| metadata | Object | - | Kubernetes metadata, including the namespace and labels. For more information, see https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata. |
| spec | Object | - | Specification of the desired AscendJob state. replicaSpecs is a mandatory field. |
| spec.replicaSpecs | Object | - | Mapping from ReplicaType to ReplicaSpec, which specifies the MS cluster configuration, for example, { "Scheduler": ReplicaSpec, "Worker": ReplicaSpec }. |
| spec.replicaSpecs.[ReplicaType] | Object | - | Replica description. |
| spec.replicaSpecs.[ReplicaType].replicas | Integer | int32 | Number of replicas required by the given template. The default value is 1. |
| spec.replicaSpecs.[ReplicaType].restartPolicy | String | - | Restart policy. The value can be Always, OnFailure, Never, or ExitCode. The default value is Never. |
| spec.replicaSpecs.[ReplicaType].template | Object | - | Kubernetes pod template. For more information, see https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-template-v1/. |
| spec.runPolicy | Object | - | Runtime policy of a distributed training job, such as resource cleanup and active duration. |
| spec.runPolicy.backoffLimit | Integer | int32 | (Optional) Number of retries before a job is marked as failed. |
| spec.runPolicy.activeDeadlineSeconds | Integer | int64 | Maximum duration (in seconds) for which a job can remain active. The value must be a positive integer. This field has no effect and will be removed in a later version. |
| spec.runPolicy.cleanPodPolicy | String | - | Policy for cleaning up pods after a job is complete. The default value is Running. This field has no effect and will be removed in a later version. |
| spec.runPolicy.ttlSecondsAfterFinished | Integer | int32 | Time to live (TTL), in seconds, of a job after it is complete. The default value is infinite. Actual deletion may be delayed. This field has no effect and will be removed in a later version. |
| spec.runPolicy.schedulingPolicy | Object | - | Scheduling policy, for example, gang scheduling. |
| spec.runPolicy.schedulingPolicy.minAvailable | Integer | int32 | Minimum number of available resources. |
| spec.runPolicy.schedulingPolicy.minResources | Object | - | Minimum resource set, keyed by resource name. Each value is an integer or a string. |
| spec.runPolicy.schedulingPolicy.priorityClass | String | - | Priority class name. |
| spec.runPolicy.schedulingPolicy.queue | String | - | Scheduling queue name. |
| spec.schedulerName | String | - | Scheduler to use when gang scheduling is enabled. Currently, only Volcano is supported. |
| spec.successPolicy | String | - | Criterion for marking an AscendJob as successful. Currently, this field has no effect: a job is considered successful only when all pods are successful. This field will be removed in a later version. |
| status | Object | - | Latest observed status of the AscendJob (read-only). conditions and replicaStatuses are mandatory fields. |
| status.completionTime | String | date-time | Job completion time (RFC3339 format, UTC). |
| status.conditions | Array | - | Array of job conditions. |
| status.conditions[type] | String | - | Job condition type, for example, Complete. |
| status.conditions[status] | String | - | Condition status. The value can be True, False, or Unknown. |
| status.conditions[lastTransitionTime] | String | date-time | Time when the condition status last changed. |
| status.conditions[lastUpdateTime] | String | date-time | Time when the condition was last updated. |
| status.conditions[message] | String | - | Condition description. |
| status.conditions[reason] | String | - | Reason for the last condition change. |
| status.lastReconcileTime | String | date-time | Time when the job was last reconciled (RFC3339 format, UTC). |
| status.replicaStatuses | Object | - | Mapping from the replica type to the replica status. |
| status.replicaStatuses.[ReplicaType].active | Integer | int32 | Number of running pods. |
| status.replicaStatuses.[ReplicaType].failed | Integer | int32 | Number of failed pods. |
| status.replicaStatuses.[ReplicaType].succeeded | Integer | int32 | Number of successful pods. |
| status.replicaStatuses.[ReplicaType].labelSelector | Object | - | Pod label selector, which defines how pods are filtered. |
| status.replicaStatuses.[ReplicaType].labelSelector.matchExpressions | Array | - | Label matching rules. Supported operators include In, NotIn, Exists, and DoesNotExist. |
| status.replicaStatuses.[ReplicaType].labelSelector.matchLabels | Object | - | Key-value pairs that pod labels must match. Each pair is equivalent to a matchExpressions rule that uses the In operator. |
| status.startTime | String | date-time | Job start time (RFC3339 format, UTC). |
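
The field layout and defaults in the table can be illustrated with a short Python sketch. It is not part of MindCluster: the helper names (`apply_defaults`, `pod_counts`, `has_condition`) are hypothetical, and the `apiVersion` value, job name, container name, image, and queue name in the sample manifest are placeholder assumptions. The sketch only fills in the documented defaults (`replicas` = 1, `restartPolicy` = Never), checks that the mandatory `spec.replicaSpecs` is present, and reads the `status` fields described above.

```python
import copy

# Allowed restart policies and defaults, taken from the field table.
RESTART_POLICIES = {"Always", "OnFailure", "Never", "ExitCode"}
DEFAULT_REPLICAS = 1
DEFAULT_RESTART_POLICY = "Never"


def apply_defaults(manifest):
    """Return a copy of the manifest with the documented defaults filled in."""
    job = copy.deepcopy(manifest)
    replica_specs = job.get("spec", {}).get("replicaSpecs")
    if not replica_specs:
        raise ValueError("spec.replicaSpecs is mandatory")
    for replica_type, spec in replica_specs.items():
        spec.setdefault("replicas", DEFAULT_REPLICAS)
        spec.setdefault("restartPolicy", DEFAULT_RESTART_POLICY)
        if spec["restartPolicy"] not in RESTART_POLICIES:
            raise ValueError(
                f"{replica_type}: unknown restartPolicy {spec['restartPolicy']!r}"
            )
    return job


def pod_counts(status):
    """Sum the active, failed, and succeeded pod counts over all replica types."""
    totals = {"active": 0, "failed": 0, "succeeded": 0}
    for replica_status in status.get("replicaStatuses", {}).values():
        for key in totals:
            totals[key] += replica_status.get(key, 0)
    return totals


def has_condition(status, condition_type):
    """Return True if a condition of the given type has status "True"."""
    return any(
        c.get("type") == condition_type and c.get("status") == "True"
        for c in status.get("conditions", [])
    )


# Placeholder pod template; the container name and image are assumptions.
pod_template = {
    "spec": {"containers": [{"name": "ascend", "image": "example.com/train:latest"}]}
}

# Hypothetical acjob manifest expressed as a Python dict.
manifest = {
    "apiVersion": "mindxdl.gitee.com/v1",  # assumed group/version, verify locally
    "kind": "AscendJob",
    "metadata": {"name": "demo-acjob", "namespace": "default"},
    "spec": {
        "schedulerName": "volcano",
        "runPolicy": {"schedulingPolicy": {"minAvailable": 3, "queue": "default"}},
        "replicaSpecs": {
            # replicas and restartPolicy omitted: the defaults apply.
            "Scheduler": {"template": pod_template},
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": pod_template,
            },
        },
    },
}

# Hypothetical status block, shaped like the status.* rows of the table.
status = {
    "conditions": [{"type": "Complete", "status": "True"}],
    "replicaStatuses": {"Scheduler": {"succeeded": 1}, "Worker": {"succeeded": 2}},
}
```

Calling `apply_defaults(manifest)` leaves the input untouched and returns a copy in which the Scheduler replica has `replicas` set to 1 and `restartPolicy` set to Never. `pod_counts(status)` and `has_condition(status, "Complete")` show how the per-replica counters and the conditions array might be read by a monitoring script.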