package
0.38.1
Repository: https://github.com/determined-ai/determined.git
Documentation: pkg.go.dev

# Functions

DeleteJobResponseOf returns a response containing the specified error.
EmptyDeleteJobResponse returns a response with an empty error chan.
ErrJobNotFound returns a standard job error.
FromContainerExitCode converts an aproto.ExitCode to an ExitCode.
FromContainerFailureType converts an aproto.FailureType to a FailureType.
FromContainerID converts a cproto.ID to a ResourcesID.
FromContainerStarted converts an aproto.ContainerStarted message to ResourcesStarted.
FromContainerState converts a cproto.State to ResourcesState.
FromContainerStateChanged converts an aproto.ContainerStateChanged message to ResourcesStateChanged.
FromContainerStopped converts an aproto.ContainerStopped message to ResourcesStopped.
IsTransientSystemError checks if the error is caused by the system and shouldn't count against `max_restarts`.
IsUnrecoverableSystemError checks if the error is absolutely unrecoverable.
NewAllocationSubscription creates a new subscription.
NewProxyPortConfig converts expconf proxy configs into internal representation.
NewResourcesFailure returns a resources failure message wrapping the type, msg and exit code.
ResourcesError returns a resources stopped message wrapping the provided error.
SchedulingStateFromProto returns SchedulingState from proto representation.
StringFromResourcePoolTypeProto returns a string from the protobuf resource pool type.
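
The distinction drawn by IsTransientSystemError, between failures caused by the system and failures that count against `max_restarts`, can be sketched roughly as follows. This is a minimal illustration only: `transientError`, `taskError`, `isTransient`, and `countRestarts` are hypothetical names defined here, and the package's real signatures and classification rules are not shown in this index.

```go
package main

import (
	"errors"
	"fmt"
)

// transientError is a hypothetical marker type for system-caused failures.
type transientError struct{ msg string }

func (e transientError) Error() string { return e.msg }

// taskError is a hypothetical failure caused by the task itself.
type taskError struct{ msg string }

func (e taskError) Error() string { return e.msg }

// isTransient is a stand-in for a check like IsTransientSystemError.
// It reports whether err is, or wraps, a transientError.
func isTransient(err error) bool {
	var t transientError
	return errors.As(err, &t)
}

// countRestarts returns how many failures should count against
// a max_restarts budget, skipping transient system errors.
func countRestarts(failures []error) int {
	n := 0
	for _, err := range failures {
		if !isTransient(err) {
			n++
		}
	}
	return n
}

func main() {
	failures := []error{
		transientError{"agent disconnected"},
		taskError{"container exited with code 1"},
		fmt.Errorf("wrapped: %w", transientError{"master restarted"}),
	}
	fmt.Println(countRestarts(failures)) // prints 1
}
```

Using `errors.As` means a transient error is still recognized after it has been wrapped with `%w`.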

# Constants

AgentError denotes that the agent failed to launch the container.
AgentFailed denotes that the agent failed while the container was running.
Assigned state means that the resources have been assigned.
DecimalExp is a constant used by decimal.Decimal objects to denote their exponent.
InstanceNumberExceedsMaximum represents the reason for terminating instances because the instance number exceeds the maximum.
K8sExp is a constant used by decimal.Decimal objects to denote the exponent for Kubernetes labels as k8s labels are limited to 63 characters.
Pulling state means that the resources are pulling container images.
ResourcesAborted denotes the container was canceled before it was started.
ResourcesFailed denotes that the container ran but failed with a non-zero exit code.
ResourcesMissing denotes that the resources were missing when the master asked about them.
ResourcesTypeDockerContainer indicates the resources are a handle for a docker container.
ResourcesTypeEnvVar is the name of the env var indicating the resource type to a task.
ResourcesTypeK8sJob indicates the resources are a handle for a k8s pod.
ResourcesTypeSlurmJob indicates the resources are a handle for a slurm job.
RestoreError denotes a failure to restore a running allocation on master blip.
Running state means that the service on the resources is running.
SchedulingStateQueued denotes a queued job waiting to be scheduled.
SchedulingStateScheduled denotes a job that is scheduled for execution.
SchedulingStateScheduledBackfilled denotes a job that is scheduled for execution as a backfill.
SlurmProxyIfaceEnvVar is the env var for overriding the net iface used to proxy between the master and agents.
SlurmRendezvousIfaceEnvVar is the name of the env var for indicating the net iface on which to rendezvous (horovodrun will use the IPs of the nodes on this interface to launch).
Starting state means the service running on the resources is being started.
SuccessExitCode is the zero-value exit code (0).
TaskAborted denotes that the task was canceled before it was started.
TaskError denotes that the task failed without an associated exit code.
Terminated state means that the resources have exited or have been aborted.
TerminateLongDisconnectedInstances represents the reason for terminating long disconnected instances.
TerminateLongIdleInstances represents the reason for terminating long idle instances.
TerminateStoppedInstances represents the reason for terminating stopped instances.
Unknown state is a null value.
UnknownError denotes an internal error that did not map to a known failure type.
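
The resource state constants above (Unknown, Assigned, Pulling, Starting, Running, Terminated) read like stages of a lifecycle. The sketch below models them as an ordered set so that two states can be compared. The `State` type, the string values, and the ordering itself are assumptions made for illustration; the index does not state the underlying values or guarantee this order.

```go
package main

import "fmt"

// State is a hypothetical stand-in for ResourcesState.
type State string

// The string values below are assumed, not taken from the package.
const (
	Unknown    State = ""
	Assigned   State = "ASSIGNED"
	Pulling    State = "PULLING"
	Starting   State = "STARTING"
	Running    State = "RUNNING"
	Terminated State = "TERMINATED"
)

// rank encodes the assumed lifecycle order. Unknown is absent from the
// map and therefore has rank 0.
var rank = map[State]int{
	Assigned:   1,
	Pulling:    2,
	Starting:   3,
	Running:    4,
	Terminated: 5,
}

// Before reports whether a precedes b in the assumed lifecycle.
func (a State) Before(b State) bool { return rank[a] < rank[b] }

func main() {
	fmt.Println(Pulling.Before(Running))     // prints true
	fmt.Println(Terminated.Before(Starting)) // prints false
}
```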

# Variables

HeadAnchor is an internal anchor for the head of the job queue.
ScheduledStates provides a list of SchedulingState values that are considered scheduled.
TailAnchor is an internal anchor for the tail of the job queue.

# Structs

AgentSummary contains information about an agent for external display.
Task-related cluster level messages.
Task-related cluster level messages.
No description provided by the author
No description provided by the author
Incoming task actor messages; task actors must accept these messages.
No description provided by the author
DeleteJob instructs the RM to clean up all metadata associated with a job external to Determined.
DeleteJobResponse returns to the caller if the cleanup was successful or not.
FittingRequirements allow tasks to specify requirements for their placement.
Task-related cluster level messages.
InvalidResourcesRequestError is an unrecoverable validation error from the underlying RM.
Message protocol from the default resource manager to an agent actor.
Incoming task actor messages; task actors must accept these messages.
Incoming task actor messages; task actors must accept these messages.
Task-related cluster level messages.
Task-related cluster level messages.
RecoverJobPosition gets sent from the experiment or command actor to the resource pool.
Incoming task actor messages; task actors must accept these messages.
Incoming task actor messages; task actors must accept these messages.
ResourcesFailedError contains information about restored resources' failure.
Task-related cluster level messages.
ResourcesReleasedEvent notes when the RM has acknowledged resources are released.
Incoming task actor messages; task actors must accept these messages.
ResourcesStarted contains the information needed by tasks from container started.
ResourcesStateChanged notifies that the task actor container state has transitioned.
ResourcesStopped contains the information needed by tasks from container stopped.
ResourcesSubscription is a subscription for streaming ResourcesEvents.
ResourcesSummary provides a summary of the resources as known at the time the allocation is granted. For k8s, the allocation is granted before it is scheduled, so the summary contains little information and `agent_devices` is missing.
RMJobInfo packs information available only to the RM that updates frequently.
ScalingInfo describes the information that is needed for scaling.
No description provided by the author
No description provided by the author
No description provided by the author
Message protocol from the default resource manager to an agent actor.
TerminateDecision describes a terminating decision.
Task-related cluster level messages.
Task-related cluster level messages.
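
DeleteJobResponse is described above as reporting cleanup success or failure through an error channel. One plausible shape for that pattern is sketched below. The type and function names (`deleteResponse`, `responseOf`, `emptyResponse`) are hypothetical, and treating the "empty" response as one that delivers a nil error is an assumption, not something this index confirms.

```go
package main

import (
	"errors"
	"fmt"
)

// errCleanup is an example error used only in this sketch.
var errCleanup = errors.New("cleanup failed")

// deleteResponse is a hypothetical stand-in for DeleteJobResponse.
type deleteResponse struct {
	Err <-chan error
}

// responseOf returns a response whose channel yields err exactly once.
// The channel is buffered so the send does not block, and closed so
// later receives return immediately.
func responseOf(err error) deleteResponse {
	c := make(chan error, 1)
	c <- err
	close(c)
	return deleteResponse{Err: c}
}

// emptyResponse returns a response that reports success (a nil error).
func emptyResponse() deleteResponse { return responseOf(nil) }

func main() {
	fmt.Println(<-emptyResponse().Err)        // prints <nil>
	fmt.Println(<-responseOf(errCleanup).Err) // prints cleanup failed
}
```

A buffered channel lets the caller decide when, or whether, to wait for the result without blocking the producer.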

# Interfaces

Resources is an interface that provides functions for task actors to start tasks on assigned resources.
ResourcesEvent describes a change in status or state of an allocation's resources.

# Type aliases

AQueue is a map of jobID to RMJobInfo.
ExitCode is the process exit code of the container.
FailureType denotes the type of failure that resulted in the container stopping.
ResourceList is a wrapper for a list of resources.
ResourcesID is the ID of some set of resources.
ResourcesState is the state of some set of resources.
ResourcesType is the type of some set of resources.
ResourcesUnsubscribeFn closes a subscription.
SchedulingState denotes the scheduling state of a job; its values are ordered by progression.
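
A scheduling state ordered by progression, together with a lookup of which states count as scheduled, might look like the sketch below. The numeric values, the map-based lookup, and the helper names `isScheduled` and `furthest` are assumptions made for illustration and are not taken from the package.

```go
package main

import "fmt"

// schedulingState is a hypothetical stand-in for SchedulingState.
// Larger values are assumed to mean further progression.
type schedulingState int

const (
	stateQueued schedulingState = iota
	stateScheduledBackfilled
	stateScheduled
)

// scheduledStates mirrors the idea of ScheduledStates: the states
// that are considered scheduled.
var scheduledStates = map[schedulingState]bool{
	stateScheduled:           true,
	stateScheduledBackfilled: true,
}

// isScheduled reports whether s counts as scheduled.
func isScheduled(s schedulingState) bool { return scheduledStates[s] }

// furthest returns the most progressed of the given states,
// or stateQueued when none are given.
func furthest(states ...schedulingState) schedulingState {
	best := stateQueued
	for _, s := range states {
		if s > best {
			best = s
		}
	}
	return best
}

func main() {
	fmt.Println(isScheduled(stateQueued))              // prints false
	fmt.Println(isScheduled(stateScheduledBackfilled)) // prints true
	fmt.Println(furthest(stateQueued, stateScheduledBackfilled) == stateScheduledBackfilled) // prints true
}
```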