package
0.0.1-alpha
Repository: https://github.com/twitter/scoot.git
Documentation: pkg.go.dev

# README

Scoot Scheduler

Components:

  • Scheduler interface (scheduler.go) - scheduler interface
  • statefulScheduler (stateful_scheduler.go) - scheduler implementation
  • clusterState (cluster_state.go) - keeps track of the available nodes and nodes running tasks
  • jobState (job_state.go) - tracks the state of each job (running, waiting and completed tasks)
  • taskRunner (task_runner.go) - starts tasks running and collects the results
  • LoadBasedAlg (load_based_sched_alg.go) - the algorithm for computing the list of tasks to start/stop with each scheduler loop iteration.

Scheduling Algorithm

(load_based_scheduling_alg.go)

This scheduling algorithm computes the list of tasks to start and stop:

Given

  • a defined set of job classes with load % for each class, such that the load %s define the number of scoot workers targeted for running the class's tasks
  • a mapping of job requestors to job classes (using reg exp to match requestor values)
  • jobs with
    • list of tasks waiting to start (in descending duration order)
  • we have the number of idle workers waiting to run a task

Overview
The algorithm selects the list of tasks that should be started on the idle workers. The objective of the algorithm is to maintain the target load %. When a given job class does not have enough tasks to use all the workers in that class's target load %, the unused workers will be used to run other class tasks (we call this loaning a worker).

It may be the case that long running tasks running on 'loaned' workers make it impossible for the algorithm to bring the task allocations back in line with the original target load %s. To address this, the algorithm has a 'rebalancing' feature that when turned on, will cancel the most recently started tasks for classes using loaned workers.

GetTasksToBeAssigned() (the top level entry to the algorithm):

  • determine if the system needs rebalancing (as per the rebalancing thresholds - see rebalancing below)
  • if the system needs rebalancing,
    • compute the tasks that should be stopped,
  • compute the number tasks to start for each class
  • compute the list of tasks that should be started
  • return the list of tasks that should be started and list of tasks that should be stopped

Get number of tasks to start, and list of tasks to stop:

(rebalance happens first, but it is described below since we've never turned it on)

compute number of tasks to start for each class as per the load % 'entitlements'

(entitlementTasksToStart())

A class's entitlement is the number of workers * the class's load % - it is the number of workers that should be allocated to that tasks in that class. When the algorithm is computing tasks to start, it will try to meet each class' entitlement.

  1. For each class with tasks waiting to start, it's unused entitlement is the class's entitlement - number of that class's tasks currently running (minimum unused entitlement is 0). In addition, if a class does not have tasks waiting to start, it's unused entitlement is 0.
  2. Compute each class's unused entitlement %: the class's _unused entitlement / sum(all classes _unused entitlement_s)
  3. compute the number of tasks to start for each class: min(number tasks waiting to start, _unused entitlement % * number of idle (unallocated) workers)

When a class's number of waiting tasks < the number to start, there will still be available workers after computing the number of tasks to start for each class.When this happens the algorithm repeats the steps listed above computing the additional number of tasks to start for each class. Each iteration will either allocate all available workers, all of at least one class's waiting tasks or a class's full entitlement. When all available workers are allocated or all class's waiting tasks or entitlements have been allocated the iteration stops and the entitlement computation is complete.

It may be the case that some classes are under-utilizing their entitlements, but other classes have more tasks than their entitlement waiting. When this happens the entitlement allocation will complete, but there will still be unallocated workers (the total number of tasks to start is still less than the number of idle workers) and tasks waiting to start. When this happens the algorithm proceeds to compute the number of workers to 'loan' to each class:

loaned workers

(workerLoanAllocation())

The loan part of the algorithm computes the loan distribution %s as follows:

  1. normalize the original load %s for classes with waiting tasks.
  2. compute the total workers that will be loaned (sum of current loaned workers + number of idle workers that would still be unassigned after the entitlement distribution above).
  3. compute what the worker loan distribution would be if all of the currently loaned workers plus newly available workers were distributed as per the normalized loan %s
  4. adjust the worker loan distribution by subtracting the number of currently loaned workers for each class. If classes are exceeding their loan distribution (from step 3) then their adjusted loan distribution is 0
  5. compute class final loan %s as each class's adjusted loan distribution / sum of the adjusted loan distributions

Each class's loan amount is computed as the class final loan % (from step 5) * number of workers still available or number of waiting tasks whichever is smaller

When the loan amount for a class is larger than the number of waiting tasks in that class, there will still be available workers after processing each class. When this happens the algorithm repeats the loan computation for the still unallocated workers. The iteration finishes when all unallocated workers have been assigned to a class or when all the classes waiting tasks have been allocated to a worker.

Note the entitlement and loan computation does not assign a specific worker to a task, it simply computes the number of workers that can be used to start tasks in each class.

rebalance

(rebalanceClassTasks())

Each scheduling iteration naturally brings the running task allocations back to the original class entitlements, but it could be the case that long running tasks holding on to loaned workers slowly cause the number of running tasks for each class to be far from the target load percents. When this occurs, the algorithm re-balances the running workers back toward the original target %s by stopping tasks that have been started on loaned workers. It will select the most recently started tasks till the running task to class allocation meets the original entitlement targets.

The re-balancing is triggered by computing the max difference of the percent under/over class load for each class with waiting tasks. We call this the delta entitlement spread. When the delta entitlement spread has been over a threshold for a set period of time, the algorithm computes the list of tasks to stop and start to bring the classes to their entitled number of workers.

Example

example 1: Test_Class_Task_Start_Cnts(), scenario 1 - takes 2 iterations to allocate all workers based on entitlement

totalWorkers: 1000, 710 running tasks, 290 idle workers

classload %running taskswaiting tasksentitlement
c030%200290300
c125%300230250
c220%0150200
c315%100150150
c410%11090100
c50%03280
iter 1iter 2start
290 idle workers174 allocated16 idle workers16 allocated
classentitled tasksnormalized %entitled tasksnormalized %
c0300-200=100100/350=29%.29*290=84300-284=1616/26=62%.62*16=1094
c1250-300-> 00%000%00
c2200-0=200200/350=57%.57*290=165 -> 150 tasks waiting00%0150
c3150-100=5050/350=14%.14*290=40150-140=1010/26=38%.38*16=646

example 2: Test_Class_Task_Start_Cnts(), scenario 3 - entitlement plus loan

totalWorkers: 1000, 710 running tasks, 290 idle workers

classload %running taskswaiting tasksentitlementloaned
c030%200103000
c125%30023025050
c220%002000
c315%100501500
c410%1109010010
entitlementloanstart
290 idle workers60 allocated230 idle+60 loaned230 allocated
classentitled tasksnormalized %normalized %
c0300-200=100100/150=67%.67*290=194 -> 10 tasks waiting0010
c1250-300-> 00%025/35=72%.72*290=207, 206-50=157157
c20 waiting tasks0%0000
c3150-100=5050/150=33%.33*290=98-> 50 tasks waiting0050
c4100-110-> 00%010/35=28%.29*290=83, 84-10=7373

selecting tasks to start

Selecting tasks to start in a class uses a round robin approach selecting tasks from jobs with the least number of running tasks. This makes sure that all jobs in a given task are using an equitable number of the workers allocate to that class (one job with many tasks won't use up all the workers allocated to that class).

exposed scheduling parameters

Scheduler api for the algorithm:

  • GetClassLoadPercents - returns the map of className:%
  • SetClassLoadPercents - sets the load %s from a map of className:%
  • GetRequestorMap - returns the map of className:requestor_re, where requestor_re is the regular expression for matching requestor values in the job defs
  • SetRequestorMap - sets the map of className:requestor_re, where requestor_re is the regular expression for matching requestor values in the job defs
  • GetRebalanceMinimumDuration - return minimum number of minutes the system must be over the delta entitlement spread before triggering re-balancing
  • SetRebalanceMinimumDuration - set the minimum number of minutes the system must be over the delta entitlement spread before triggering re-balancing
  • GetRebalanceThreshold - return the threshold the delta entitlement spread must be over before triggering re-balancing
  • SetRebalanceThreshold - set the threshold the delta entitlement spread must be over before triggering re-balancing

# Functions

GetRequestorClass find the requestorToClass entry for requestor keys in requestorToClassEntry are regular expressions if no match is found, return "" for the class name.
NewLoadBasedAlg allocate a new LoadBaseSchedAlg object.
NewMockScheduler creates a new mock instance.
NewMockSchedulingAlgorithm creates a new mock instance.
Create a New StatefulScheduler that implements the Scheduler interface cc.Cluster - cluster of worker nodes saga.SagaCoordinator - the Saga Coordinator to log to and recover from RunnerFactory - Function which converts a node to a Runner SchedulerConfig - additional configuration settings for the scheduler StatsReceiver - stats receiver to log statistics to specifying debugMode true, starts the scheduler up but does not start the update loop.

# Constants

No description provided by the author
Nothing should run forever by default, use this timeout as a fallback.
No description provided by the author
Number of jobs any single requestor can have (to prevent spamming, not for scheduler fairness).
Max number of requestors to track tag history, and max number of tags per requestor to track.
Number of different requestors that can run jobs at any given time.
Max number of task IDs to track durations for.
Set the maximum number of tasks we'd expect to queue to a nonzero value (it'll be overridden later).
Allow extra time when waiting for a task response.
Threshold for jobs considered long running.
The max job priority we respect (higher priority is untested and disabled).
Clients will check for this string to differentiate between scoot and user initiated actions.
How often Scheduler step is called in loop.
Clients will check for this string to differentiate between scoot and user initiated actions.

# Variables

defaults for the LoadBasedScheduler algorithm: only one class and all jobs map to that class.
defaults for the LoadBasedScheduler algorithm: only one class and all jobs map to that class.
defaults for the LoadBasedScheduler algorithm: only one class and all jobs map to that class.
defaults for the LoadBasedScheduler algorithm: only one class and all jobs map to that class.

# Structs

LoadBasedAlg the scheduling algorithm computes the list of tasks to start and stop for the current iteration of the scheduler loop.
No description provided by the author
MockScheduler is a mock of Scheduler interface.
MockSchedulerMockRecorder is the mock recorder for MockScheduler.
MockSchedulingAlgorithm is a mock of SchedulingAlgorithm interface.
MockSchedulingAlgorithmMockRecorder is the mock recorder for MockSchedulingAlgorithm.
PersistedSettings the persisted scheduler settings structure for encoding/decoding as json.
SchedulerConfiguration variables read at initialization MaxRetriesPerTask - the number of times to retry a failing task before marking it as completed.

# Interfaces

Persistor interface for persisting scheduler settings and initializing the scheduler from its persisted settings.
No description provided by the author
SchedulingAlgorithm interface for the scheduling algorithm.

# Type aliases

Cluster will use this function to determine if newly added nodes are ready to be used.
No description provided by the author