Categorygithub.com/twitter/scootschedulerserver

package

0.0.1-alpha

Repository: https://github.com/twitter/scoot.git

Documentation: pkg.go.dev

# README

Scoot Scheduler

Components:

Scheduler interface (scheduler.go) - scheduler interface
statefulScheduler (stateful_scheduler.go) - scheduler implementation
clusterState (cluster_state.go) - keeps track of the available nodes and nodes running tasks
jobState (job_state.go) - tracks the state of each job (running, waiting and completed tasks)
taskRunner (task_runner.go) - starts tasks running and collects the results
LoadBasedAlg (load_based_sched_alg.go) - the algorithm for computing the list of tasks to start/stop with each scheduler loop iteration.

Scheduling Algorithm

(load_based_scheduling_alg.go)

This scheduling algorithm computes the list of tasks to start and stop:

Given

a defined set of job classes with load % for each class, such that the load %s define the number of scoot workers targeted for running the class's tasks
a mapping of job requestors to job classes (using reg exp to match requestor values)
jobs with
- list of tasks waiting to start (in descending duration order)
we have the number of idle workers waiting to run a task

Overview
The algorithm selects the list of tasks that should be started on the idle workers. The objective of the algorithm is to maintain the target load %. When a given job class does not have enough tasks to use all the workers in that class's target load %, the unused workers will be used to run other class tasks (we call this loaning a worker).

It may be the case that long running tasks running on 'loaned' workers make it impossible for the algorithm to bring the task allocations back in line with the original target load %s. To address this, the algorithm has a 'rebalancing' feature that when turned on, will cancel the most recently started tasks for classes using loaned workers.

GetTasksToBeAssigned() (the top level entry to the algorithm):

determine if the system needs rebalancing (as per the rebalancing thresholds - see rebalancing below)
if the system needs rebalancing,
- compute the tasks that should be stopped,
compute the number tasks to start for each class
compute the list of tasks that should be started
return the list of tasks that should be started and list of tasks that should be stopped

Get number of tasks to start, and list of tasks to stop:

(rebalance happens first, but it is described below since we've never turned it on)

compute number of tasks to start for each class as per the load % 'entitlements'

(entitlementTasksToStart())

A class's entitlement is the number of workers * the class's load % - it is the number of workers that should be allocated to that tasks in that class. When the algorithm is computing tasks to start, it will try to meet each class' entitlement.

For each class with tasks waiting to start, it's unused entitlement is the class's entitlement - number of that class's tasks currently running (minimum unused entitlement is 0). In addition, if a class does not have tasks waiting to start, it's unused entitlement is 0.
Compute each class's unused entitlement %: the class's _unused entitlement / sum(all classes _unused entitlement_s)
compute the number of tasks to start for each class: min(number tasks waiting to start, _unused entitlement % * number of idle (unallocated) workers)

When a class's number of waiting tasks < the number to start, there will still be available workers after computing the number of tasks to start for each class.When this happens the algorithm repeats the steps listed above computing the additional number of tasks to start for each class. Each iteration will either allocate all available workers, all of at least one class's waiting tasks or a class's full entitlement. When all available workers are allocated or all class's waiting tasks or entitlements have been allocated the iteration stops and the entitlement computation is complete.

It may be the case that some classes are under-utilizing their entitlements, but other classes have more tasks than their entitlement waiting. When this happens the entitlement allocation will complete, but there will still be unallocated workers (the total number of tasks to start is still less than the number of idle workers) and tasks waiting to start. When this happens the algorithm proceeds to compute the number of workers to 'loan' to each class:

loaned workers

(workerLoanAllocation())

The loan part of the algorithm computes the loan distribution %s as follows:

normalize the original load %s for classes with waiting tasks.
compute the total workers that will be loaned (sum of current loaned workers + number of idle workers that would still be unassigned after the entitlement distribution above).
compute what the worker loan distribution would be if all of the currently loaned workers plus newly available workers were distributed as per the normalized loan %s
adjust the worker loan distribution by subtracting the number of currently loaned workers for each class. If classes are exceeding their loan distribution (from step 3) then their adjusted loan distribution is 0
compute class final loan %s as each class's adjusted loan distribution / sum of the adjusted loan distributions

Each class's loan amount is computed as the class final loan % (from step 5) * number of workers still available or number of waiting tasks whichever is smaller

When the loan amount for a class is larger than the number of waiting tasks in that class, there will still be available workers after processing each class. When this happens the algorithm repeats the loan computation for the still unallocated workers. The iteration finishes when all unallocated workers have been assigned to a class or when all the classes waiting tasks have been allocated to a worker.

Note the entitlement and loan computation does not assign a specific worker to a task, it simply computes the number of workers that can be used to start tasks in each class.

rebalance

(rebalanceClassTasks())

Each scheduling iteration naturally brings the running task allocations back to the original class entitlements, but it could be the case that long running tasks holding on to loaned workers slowly cause the number of running tasks for each class to be far from the target load percents. When this occurs, the algorithm re-balances the running workers back toward the original target %s by stopping tasks that have been started on loaned workers. It will select the most recently started tasks till the running task to class allocation meets the original entitlement targets.

The re-balancing is triggered by computing the max difference of the percent under/over class load for each class with waiting tasks. We call this the delta entitlement spread. When the delta entitlement spread has been over a threshold for a set period of time, the algorithm computes the list of tasks to stop and start to bring the classes to their entitled number of workers.

Example

example 1: Test_Class_Task_Start_Cnts(), scenario 1 - takes 2 iterations to allocate all workers based on entitlement

totalWorkers: 1000, 710 running tasks, 290 idle workers

class	load %	running tasks	waiting tasks	entitlement
c0	30%	200	290	300
c1	25%	300	230	250
c2	20%	0	150	200
c3	15%	100	150	150
c4	10%	110	90	100
c5	0%	0	328	0

	iter 1			iter 2			start
	290 idle workers		174 allocated	16 idle workers		16 allocated
class	entitled tasks	normalized %		entitled tasks	normalized %
c0	300-200=100	100/350=29%	.29*290=84	300-284=16	16/26=62%	.62*16=10	94
c1	250-300-> 0	0%	0	0	0%	0	0
c2	200-0=200	200/350=57%	.57*290=165 -> 150 tasks waiting	0	0%	0	150
c3	150-100=50	50/350=14%	.14*290=40	150-140=10	10/26=38%	.38*16=6	46

example 2: Test_Class_Task_Start_Cnts(), scenario 3 - entitlement plus loan

totalWorkers: 1000, 710 running tasks, 290 idle workers

class	load %	running tasks	waiting tasks	entitlement	loaned
c0	30%	200	10	300	0
c1	25%	300	230	250	50
c2	20%	0	0	200	0
c3	15%	100	50	150	0
c4	10%	110	90	100	10

	entitlement			loan		start
	290 idle workers		60 allocated	230 idle+60 loaned	230 allocated
class	entitled tasks	normalized %		normalized %
c0	300-200=100	100/150=67%	.67*290=194 -> 10 tasks waiting	0	0	10
c1	250-300-> 0	0%	0	25/35=72%	.72*290=207, 206-50=157	157
c2	0 waiting tasks	0%	0	0	0	0
c3	150-100=50	50/150=33%	.33*290=98-> 50 tasks waiting	0	0	50
c4	100-110-> 0	0%	0	10/35=28%	.29*290=83, 84-10=73	73

selecting tasks to start

Selecting tasks to start in a class uses a round robin approach selecting tasks from jobs with the least number of running tasks. This makes sure that all jobs in a given task are using an equitable number of the workers allocate to that class (one job with many tasks won't use up all the workers allocated to that class).

exposed scheduling parameters

Scheduler api for the algorithm:

GetClassLoadPercents - returns the map of className:%
SetClassLoadPercents - sets the load %s from a map of className:%
GetRequestorMap - returns the map of className:requestor_re, where requestor_re is the regular expression for matching requestor values in the job defs
SetRequestorMap - sets the map of className:requestor_re, where requestor_re is the regular expression for matching requestor values in the job defs
GetRebalanceMinimumDuration - return minimum number of minutes the system must be over the delta entitlement spread before triggering re-balancing
SetRebalanceMinimumDuration - set the minimum number of minutes the system must be over the delta entitlement spread before triggering re-balancing
GetRebalanceThreshold - return the threshold the delta entitlement spread must be over before triggering re-balancing
SetRebalanceThreshold - set the threshold the delta entitlement spread must be over before triggering re-balancing

# Functions

GetRequestorClass

GetRequestorClass find the requestorToClass entry for requestor keys in requestorToClassEntry are regular expressions if no match is found, return "" for the class name.

NewLoadBasedAlg

NewLoadBasedAlg allocate a new LoadBaseSchedAlg object.

NewMockScheduler

NewMockScheduler creates a new mock instance.

NewMockSchedulingAlgorithm

NewMockSchedulingAlgorithm creates a new mock instance.

NewStatefulScheduler

Create a New StatefulScheduler that implements the Scheduler interface cc.Cluster - cluster of worker nodes saga.SagaCoordinator - the Saga Coordinator to log to and recover from RunnerFactory - Function which converts a node to a Runner SchedulerConfig - additional configuration settings for the scheduler StatsReceiver - stats receiver to log statistics to specifying debugMode true, starts the scheduler up but does not start the update loop.

# Constants

DeadLetterTrailer

No description provided by the author

DefaultDefaultTaskTimeout

Nothing should run forever by default, use this timeout as a fallback.

DefaultMaxHistoryTags

No description provided by the author

DefaultMaxJobsPerRequestor

Number of jobs any single requestor can have (to prevent spamming, not for scheduler fairness).

DefaultMaxRequestorHistories

Max number of requestors to track tag history, and max number of tags per requestor to track.

DefaultMaxRequestors

Number of different requestors that can run jobs at any given time.

DefaultMaxTaskDurations

Max number of task IDs to track durations for.

DefaultSoftMaxSchedulableTasks

Set the maximum number of tasks we'd expect to queue to a nonzero value (it'll be overridden later).

DefaultTaskTimeoutOverhead

Allow extra time when waiting for a task response.

LongJobDuration

Threshold for jobs considered long running.

MaxPriority

The max job priority we respect (higher priority is untested and disabled).

RebalanceRequestedErrStr

Clients will check for this string to differentiate between scoot and user initiated actions.

TickRate

How often Scheduler step is called in loop.

UserRequestedErrStr

Clients will check for this string to differentiate between scoot and user initiated actions.

# Variables

DefaultLoadBasedSchedulerClassPercents

defaults for the LoadBasedScheduler algorithm: only one class and all jobs map to that class.

DefaultMinRebalanceTime

defaults for the LoadBasedScheduler algorithm: only one class and all jobs map to that class.

DefaultRequestorToClassMap

defaults for the LoadBasedScheduler algorithm: only one class and all jobs map to that class.

MaxTaskDuration

defaults for the LoadBasedScheduler algorithm: only one class and all jobs map to that class.

# Structs

LoadBasedAlg

LoadBasedAlg the scheduling algorithm computes the list of tasks to start and stop for the current iteration of the scheduler loop.

LoadBasedAlgConfig

No description provided by the author

MockScheduler

MockScheduler is a mock of Scheduler interface.

MockSchedulerMockRecorder

MockSchedulerMockRecorder is the mock recorder for MockScheduler.

MockSchedulingAlgorithm

MockSchedulingAlgorithm is a mock of SchedulingAlgorithm interface.

MockSchedulingAlgorithmMockRecorder

MockSchedulingAlgorithmMockRecorder is the mock recorder for MockSchedulingAlgorithm.

PersistedSettings

PersistedSettings the persisted scheduler settings structure for encoding/decoding as json.

SchedulerConfiguration

SchedulerConfiguration variables read at initialization MaxRetriesPerTask - the number of times to retry a failing task before marking it as completed.

# Interfaces

Persistor

Persistor interface for persisting scheduler settings and initializing the scheduler from its persisted settings.

Scheduler

No description provided by the author

SchedulingAlgorithm

SchedulingAlgorithm interface for the scheduling algorithm.

# Type aliases

ReadyFn

Cluster will use this function to determine if newly added nodes are ready to be used.

RunnerFactory

No description provided by the author

class	load %	running tasks	waiting tasks	entitlement
c0	30%	200	290	300
c1	25%	300	230	250
c2	20%	0	150	200
c3	15%	100	150	150
c4	10%	110	90	100
c5	0%	0	328	0

class	load %	running tasks	waiting tasks	entitlement	loaned
c0	30%	200	10	300	0
c1	25%	300	230	250	50
c2	20%	0	0	200	0
c3	15%	100	50	150	0
c4	10%	110	90	100	10

class	load %	running tasks	waiting tasks	entitlement
c0	30%	200	290	300
c1	25%	300	230	250
c2	20%	0	150	200
c3	15%	100	150	150
c4	10%	110	90	100
c5	0%	0	328	0

class	load %	running tasks	waiting tasks	entitlement	loaned
c0	30%	200	10	300	0
c1	25%	300	230	250	50
c2	20%	0	0	200	0
c3	15%	100	50	150	0
c4	10%	110	90	100	10

class	load %	running tasks	waiting tasks	entitlement
c0	30%	200	290	300
c1	25%	300	230	250
c2	20%	0	150	200
c3	15%	100	150	150
c4	10%	110	90	100
c5	0%	0	328	0

class	load %	running tasks	waiting tasks	entitlement	loaned
c0	30%	200	10	300	0
c1	25%	300	230	250	50
c2	20%	0	0	200	0
c3	15%	100	50	150	0
c4	10%	110	90	100	10