Key Fields in acjob

Ascend Job (acjob) is a custom job type defined by MindCluster. You can start a training or inference job by configuring resource information through environment variables or files. The following table describes the acjob fields.

Table 1 acjob field description

| Field Path | Type | Format | Description |
|---|---|---|---|
| apiVersion | String | - | Versioned schema of this representation of the object. The server converts recognized schemas to the latest internal value and may reject unrecognized values. For more information, see https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds. |
| kind | String | - | REST resource type that the object represents. The value is in camel case, can be inferred by the server from the endpoint that the client submits requests to, and cannot be updated. For more information, see https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources. |
| metadata | Object | - | Kubernetes metadata, including the namespace and labels. For more information, see https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata. |
| spec | Object | - | Desired state of the AscendJob. replicaSpecs is a mandatory field. |
| spec.replicaSpecs | Object | - | Mapping from ReplicaType to ReplicaSpec, which specifies the MS cluster configuration, for example, { "Scheduler": ReplicaSpec, "Worker": ReplicaSpec }. |
| spec.replicaSpecs.[ReplicaType] | Object | - | Description of the replicas of the given type. |
| spec.replicaSpecs.[ReplicaType].replicas | Integer | int32 | Number of replicas required by the given template. The default value is 1. |
| spec.replicaSpecs.[ReplicaType].restartPolicy | String | - | Restart policy. Valid values: Always, OnFailure, Never, and ExitCode. The default value is Never. |
| spec.replicaSpecs.[ReplicaType].template | Object | - | Kubernetes pod template. For more information, see https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-template-v1/. |
| spec.runPolicy | Object | - | Runtime policies of a distributed training job, such as pod cleanup and the active deadline. |
| spec.runPolicy.backoffLimit | Integer | int32 | Optional. Number of retries before the job is marked as failed. |
| spec.runPolicy.activeDeadlineSeconds | Integer | int64 | Maximum duration (in seconds) for which a job can remain active. The value must be a positive integer. This field has no effect and will be removed in a later version. |
| spec.runPolicy.cleanPodPolicy | String | - | Pod cleanup policy applied after a job is complete. The default value is Running. This field has no effect and will be removed in a later version. |
| spec.runPolicy.ttlSecondsAfterFinished | Integer | int32 | Time to live (TTL) of a job after it is complete. By default, the TTL is infinite. When a TTL is set, the actual deletion may be delayed. This field has no effect and will be removed in a later version. |
| spec.runPolicy.schedulingPolicy | Object | - | Scheduling policy, for example, gang scheduling. |
| spec.runPolicy.schedulingPolicy.minAvailable | Integer | int32 | Minimum number of available resources. |
| spec.runPolicy.schedulingPolicy.minResources | Object | - | Minimum resources, keyed by resource name. Each value is an integer or a string. |
| spec.runPolicy.schedulingPolicy.priorityClass | String | - | Priority class name. |
| spec.runPolicy.schedulingPolicy.queue | String | - | Scheduling queue name. |
| spec.schedulerName | String | - | Scheduler to use when gang scheduling is enabled. Currently, only Volcano is supported. |
| spec.successPolicy | String | - | Criterion for marking an AscendJob as successful. Currently, this field has no effect: a job is considered successful only when all of its pods are successful. This field will be removed in a later version. |
| status | Object | - | Latest observed status of the AscendJob (read-only). conditions and replicaStatuses are mandatory fields. |
| status.completionTime | String | date-time | Job completion time (RFC3339 format, UTC). |
| status.conditions | Array | - | Array of conditions for a job. |
| status.conditions[type] | String | - | Job condition type, for example, Complete. |
| status.conditions[status] | String | - | Condition status. Valid values: True, False, and Unknown. |
| status.conditions[lastTransitionTime] | String | date-time | Time when the condition last changed from one status to another. |
| status.conditions[lastUpdateTime] | String | date-time | Time when the condition was last updated. |
| status.conditions[message] | String | - | Human-readable description of the condition. |
| status.conditions[reason] | String | - | Reason for the last change of the condition. |
| status.lastReconcileTime | String | date-time | Time when a job was last reconciled (RFC3339 format, UTC). |
| status.replicaStatuses | Object | - | Mapping from the replica type to the replica status. |
| status.replicaStatuses.[ReplicaType].active | Integer | int32 | Number of running pods. |
| status.replicaStatuses.[ReplicaType].failed | Integer | int32 | Number of failed pods. |
| status.replicaStatuses.[ReplicaType].succeeded | Integer | int32 | Number of successful pods. |
| status.replicaStatuses.[ReplicaType].labelSelector | Object | - | Pod label selector, which defines how pods are filtered. |
| status.replicaStatuses.[ReplicaType].labelSelector.matchExpressions | Array | - | Label matching rules. Supported operators are In, NotIn, Exists, and DoesNotExist. |
| status.replicaStatuses.[ReplicaType].labelSelector.matchLabels | Object | - | Key-value pairs to match against pod labels. Each pair is equivalent to a matchExpressions rule that uses the In operator with that single value. |
| status.startTime | String | date-time | Job start time (RFC3339 format, UTC). |
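
To make the field descriptions above more concrete, the following Python sketch builds a minimal acjob manifest as a dictionary, fills in the documented defaults (replicas defaults to 1, restartPolicy defaults to Never), and checks the documented success rule (a job is successful only when all pods are successful). It is an illustration under stated assumptions, not the operator's implementation: the apiVersion string, the scheduler name string, the container image, and the helper names apply_replica_defaults and all_pods_succeeded are all hypothetical placeholders.

```python
from copy import deepcopy

# Restart policies listed in the field table.
RESTART_POLICIES = {"Always", "OnFailure", "Never", "ExitCode"}

# Hypothetical pod template; the image name is a placeholder.
pod_template = {
    "spec": {"containers": [{"name": "ascend", "image": "example/train:latest"}]}
}

example_job = {
    "apiVersion": "example.com/v1",  # placeholder, not the real group/version
    "kind": "AscendJob",
    "metadata": {"name": "demo-acjob", "namespace": "default"},
    "spec": {
        "schedulerName": "volcano",  # assumed spelling of the Volcano scheduler name
        "runPolicy": {"schedulingPolicy": {"minAvailable": 3, "queue": "default"}},
        "replicaSpecs": {
            # replicas and restartPolicy omitted here to show the defaults
            "Scheduler": {"template": pod_template},
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": pod_template,
            },
        },
    },
}


def apply_replica_defaults(job: dict) -> dict:
    """Return a copy of the manifest with the documented defaults filled in."""
    job = deepcopy(job)
    replica_specs = job.get("spec", {}).get("replicaSpecs")
    if not replica_specs:
        raise ValueError("spec.replicaSpecs is mandatory")
    for replica_type, spec in replica_specs.items():
        spec.setdefault("replicas", 1)             # documented default
        spec.setdefault("restartPolicy", "Never")  # documented default
        if spec["restartPolicy"] not in RESTART_POLICIES:
            raise ValueError(
                f"{replica_type}: invalid restartPolicy {spec['restartPolicy']!r}"
            )
    return job


def all_pods_succeeded(job: dict) -> bool:
    """True when every replica type reports at least as many succeeded pods
    as it requested replicas."""
    statuses = job.get("status", {}).get("replicaStatuses", {})
    for replica_type, spec in job["spec"]["replicaSpecs"].items():
        succeeded = statuses.get(replica_type, {}).get("succeeded", 0)
        if succeeded < spec.get("replicas", 1):
            return False
    return True


job = apply_replica_defaults(example_job)
print(job["spec"]["replicaSpecs"]["Scheduler"])
print(all_pods_succeeded(job))
```

Because status is read-only and written by the controller, a client would normally only read status.replicaStatuses, as all_pods_succeeded does here, and never set it in the manifest it submits.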