# Constants
GroupNameLabel represents the label key for group name, e.g.
JobCreated means the job has been accepted by the system, but one or more of the pods/services has not been started.
JobFailed means one or more sub-resources (e.g.
JobNameLabel represents the label key for the job name, the value is job name.
JobRestarting means one or more sub-resources (e.g.
JobRoleLabel represents the label key for the job role, e.g.
JobRunning means all sub-resources (e.g.
JobSucceeded means all sub-resources (e.g.
ReplicaIndexLabel represents the label key for the replica-index, e.g.
ReplicaTypeLabel represents the label key for the replica-type, e.g.
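As a rough illustration of how the label-key constants might be used together, the sketch below builds a label map for a single replica. The key strings are placeholders, because the real values are not shown in this section, and `replicaLabels` is a hypothetical helper rather than part of the package.

```go
package main

import (
	"fmt"
	"strconv"
)

// Placeholder label keys. The actual string values are defined by the
// package and are NOT shown in this section; these are assumptions.
const (
	GroupNameLabel    = "example/group-name"
	JobNameLabel      = "example/job-name"
	JobRoleLabel      = "example/job-role"
	ReplicaIndexLabel = "example/replica-index"
	ReplicaTypeLabel  = "example/replica-type"
)

// replicaLabels is a hypothetical helper that builds the labels for one
// replica of a job, keyed by the constants above.
func replicaLabels(group, job, replicaType string, index int) map[string]string {
	return map[string]string{
		GroupNameLabel:    group,
		JobNameLabel:      job,
		ReplicaTypeLabel:  replicaType,
		ReplicaIndexLabel: strconv.Itoa(index),
	}
}

func main() {
	labels := replicaLabels("example.org", "mnist", "worker", 0)
	fmt.Println(labels[JobNameLabel], labels[ReplicaTypeLabel], labels[ReplicaIndexLabel])
}
```

Keeping the keys in constants means selectors and label writers cannot drift apart through a typo in a string literal.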
RestartPolicyExitCode means that users define the exit codes themselves. The job operator checks these exit codes to determine the behavior when an error occurs:
- 1-127: permanent error, do not restart.
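The exit-code rule stated for RestartPolicyExitCode can be sketched as a small predicate. `isPermanentExitCode` is a hypothetical name, and the sketch covers only the 1-127 range given here; how other exit codes are handled is not shown in this section.

```go
package main

import "fmt"

// isPermanentExitCode reports whether an exit code falls in the 1-127 range,
// which is described as a permanent error that must not be restarted.
// Hypothetical helper; behavior for other codes is not covered here.
func isPermanentExitCode(code int32) bool {
	return code >= 1 && code <= 127
}

func main() {
	for _, code := range []int32{1, 127, 130} {
		fmt.Printf("exit code %d permanent: %t\n", code, isPermanentExitCode(code))
	}
}
```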
# Structs
JobCondition describes the state of the job at a certain point.
JobStatus represents the current observed state of the training Job.
ReplicaSpec is a description of the replica.
ReplicaStatus represents the current observed state of the replica.
RunPolicy encapsulates various runtime policies of the distributed training job, for example how to clean up resources and how long the job can stay active.
SchedulingPolicy encapsulates various scheduling policies of the distributed training job, for example `minAvailable` for gang-scheduling.
# Type aliases
CleanPodPolicy describes how to deal with pods when the job is finished.
JobConditionType defines the possible condition types recorded in a JobStatus.
ReplicaType represents the type of the replica.
RestartPolicy describes how the replicas should be restarted.
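A string-based type alias with typed constants is a common way to model such enumerations in Go. The sketch below does this for JobConditionType using the condition names listed under Constants. The string values and the `isTerminal` helper are assumptions, as is the choice to treat succeeded and failed as the only terminal conditions.

```go
package main

import "fmt"

// JobConditionType is modeled here as a string type; the underlying
// values below are assumed, not taken from the package.
type JobConditionType string

const (
	JobCreated    JobConditionType = "Created"
	JobRunning    JobConditionType = "Running"
	JobRestarting JobConditionType = "Restarting"
	JobSucceeded  JobConditionType = "Succeeded"
	JobFailed     JobConditionType = "Failed"
)

// isTerminal is a hypothetical helper that treats succeeded and failed
// as the conditions after which a job makes no further progress.
func isTerminal(t JobConditionType) bool {
	return t == JobSucceeded || t == JobFailed
}

func main() {
	for _, t := range []JobConditionType{JobCreated, JobRunning, JobRestarting, JobSucceeded, JobFailed} {
		fmt.Printf("%s terminal: %t\n", t, isTerminal(t))
	}
}
```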