# Scheduler Plugin: Architecture
The scheduler plugin has some unfortunately annoying design, largely resulting from edge cases that we can't properly handle otherwise. This document explains some of that weirdness, so that people new to this code have a little more context.
In some places, we assume that you're already familiar with the protocol between the scheduler
plugin and each `autoscaler-agent`. For more information, refer to the section on the protocol in
the repo-level architecture doc.
This document should be up-to-date. If it isn't, that's a mistake (open an issue!).
Table of contents:

- [File descriptions](#file-descriptions)
- [High-level overview](#high-level-overview)
- [Deep dive into resource management](#deep-dive-into-resource-management)
  - [Basics: `Reserved` and `Total`](#basics-reserved-and-total)
  - [Pressure and watermarks](#pressure-and-watermarks)
  - [Startup uncertainty: `Buffer`](#startup-uncertainty-buffer)
## File descriptions
- `ARCHITECTURE.md` — this file :)
- `config.go` — definition of the `config` type, plus entrypoints for setting up update
  watching/handling and config validation.
- `dumpstate.go` — HTTP server, types, and conversions for dumping all internal state
- `plugin.go` — scheduler plugin interface implementations, plus type definition for
  `AutoscaleEnforcer`, the type implementing the `framework.*Plugin` interfaces.
- `queue.go` — implementation of a metrics-based priority queue to select migration targets. Uses
  `container/heap` internally.
- `prommetrics.go` — prometheus metrics collectors.
- `run.go` — handling for `autoscaler-agent` requests, to a point. The nitty-gritty of resource
  handling relies on `trans.go`.
- `state.go` — definitions of `pluginState`, `nodeState`, `podState`. Also many functions to create
  and use them. Basically a catch-all file for everything that's not in `plugin.go`, `run.go`, or
  `trans.go`.
- `trans.go` — generic handling for resource requests and pod deletion. This is where the meat of
  the code to ensure we don't overcommit resources is.
- `watch.go` — setup to watch VM pod (and non-VM pod) deletions. Uses our `util.Watch`.
## High-level overview
The entrypoint for plugin initialization is through the `NewAutoscaleEnforcerPlugin` method in
`plugin.go`, which in turn:

- Fetches the scheduler config (and starts watching for changes) (see: `config.go`)
- Starts watching for pod events, among others (see: `watch.go`)
- Loads an initial state from the cluster's resources (by waiting for all the initial Pod start
  events to be handled)
- Spawns the HTTP server for handling `autoscaler-agent` requests (see: `run.go`)
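For context, out-of-tree scheduler plugins like this one are typically compiled into their own
scheduler binary and registered by name. The sketch below shows the usual wiring via
`app.NewSchedulerCommand`; the plugin name, the import path, and the assumption that
`NewAutoscaleEnforcerPlugin` matches the framework's plugin-factory signature are all placeholders
here, not a description of this repo's actual build.

```go
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	// Hypothetical import path for the package containing AutoscaleEnforcer.
	plugin "example.com/autoscaling/pkg/plugin"
)

func main() {
	// Register the plugin factory under a name that a scheduler profile can
	// enable. "AutoscaleEnforcer" is a placeholder; the real name is defined
	// alongside the plugin.
	cmd := app.NewSchedulerCommand(
		app.WithPlugin("AutoscaleEnforcer", plugin.NewAutoscaleEnforcerPlugin),
	)
	if err := cmd.Execute(); err != nil {
		os.Exit(1)
	}
}
```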
The plugins we implement are:
- Filter — preemptively discard nodes that don't have enough room for the pod
- PreFilter and PostFilter — used to count the total number of scheduling attempts and failures.
- Score — allows us to rank nodes based on available resources. It's called once for each pod-node pair, but we don't actually use the pod.
- Reserve — gives us a chance to approve (or deny) putting a pod on a node, setting aside the resources for it in the process.
For more information on scheduler plugins, see: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/.
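To make the relationship concrete, here's a sketch of the kind of compile-time assertions that
typically accompany such implementations, assuming the `AutoscaleEnforcer` type from `plugin.go`.
Whether these exact assertions exist is up to `plugin.go`; this is just an illustration of the
interface set listed above.

```go
package plugin

import "k8s.io/kubernetes/pkg/scheduler/framework"

// Compile-time assertions that AutoscaleEnforcer (defined in plugin.go)
// satisfies the scheduler framework interfaces listed above.
var (
	_ framework.PreFilterPlugin  = (*AutoscaleEnforcer)(nil)
	_ framework.PostFilterPlugin = (*AutoscaleEnforcer)(nil)
	_ framework.FilterPlugin     = (*AutoscaleEnforcer)(nil)
	_ framework.ScorePlugin      = (*AutoscaleEnforcer)(nil)
	_ framework.ReservePlugin    = (*AutoscaleEnforcer)(nil)
)
```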
We support both VM pods and non-VM pods, in order to accommodate mixed deployments. We expect that all other resource usage is within the bounds of the configured per-node "system" usage, so it's best to deploy as much as possible through the scheduler.
VM pods have an associated NeonVM `VirtualMachine` object, so we can fetch the resources from there.

For non-VM pods, we use the values from `resources.requests` for compatibility with other systems
(e.g., [cluster-autoscaler]). This can lead to overcommitting, but it isn't really worth being
strict about this. If any container in a pod has no value for one of its resources, the pod will be
rejected; the scheduler doesn't have enough information to make accurate decisions.
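For illustration, a minimal sketch of that rule for non-VM pods might look like the following
(a hypothetical helper, not the actual code): sum `resources.requests` across containers, and
reject the pod when a value is missing.

```go
package example

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podCPURequestMillis sums the CPU requests of all containers in the pod, in
// millicores, returning an error if any container leaves the request unset.
func podCPURequestMillis(pod *corev1.Pod) (int64, error) {
	var total int64
	for _, container := range pod.Spec.Containers {
		cpu, ok := container.Resources.Requests[corev1.ResourceCPU]
		if !ok {
			// Without a request, we can't make accurate decisions, so the
			// pod is rejected.
			return 0, fmt.Errorf(
				"container %q of pod %s/%s has no cpu request",
				container.Name, pod.Namespace, pod.Name,
			)
		}
		total += cpu.MilliValue()
	}
	return total, nil
}
```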
## Deep dive into resource management
Some basics:

- Different resources are handled independently. This makes the implementation of the scheduler
  simpler, at the cost of relaxing guarantees about always allocating multiples of compute units.
  This is why `autoscaler-agent`s are responsible for making sure their resource requests are a
  multiple of the configured compute unit (although we do check this; see the sketch just after
  this list).
- Resources are handled per-node. This may be obvious, but it's worth stating explicitly. Whenever
  we talk about handling resources, we're only looking at what's available on a single node.
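A minimal sketch of that compute-unit check (a hypothetical helper, not the real validation code),
assuming a compute unit described by a CPU/memory pair:

```go
// isComputeUnitMultiple reports whether the requested resources are a whole
// number of compute units, i.e. both CPU and memory are exact multiples of
// the unit and describe the same number of units.
func isComputeUnitMultiple(reqCPUMillis, reqMemBytes, unitCPUMillis, unitMemBytes int64) bool {
	if reqCPUMillis%unitCPUMillis != 0 || reqMemBytes%unitMemBytes != 0 {
		return false
	}
	return reqCPUMillis/unitCPUMillis == reqMemBytes/unitMemBytes
}
```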
With those out of the way, there are a few things to discuss. In `state.go`, the relevant
resource-related types are:
```go
type nodeState struct {
	pods map[util.NamespacedName]*podState
	cpu  nodeResourceState[vmapi.MilliCPU]
	mem  nodeResourceState[api.Bytes]
	// -- other fields omitted --
}

// Total resources from all pods - both VM and non-VM
type nodeResourceState[T any] struct {
	Total                T
	Watermark            T
	Reserved             T
	Buffer               T
	CapacityPressure     T
	PressureAccountedFor T
}

type podState struct {
	name util.NamespacedName
	// -- other fields omitted --
	cpu podResourceState[vmapi.MilliCPU]
	mem podResourceState[api.Bytes]
}

// Resources for a VM pod
type podResourceState[T any] struct {
	Reserved         T
	Buffer           T
	CapacityPressure T
	Min              T
	Max              T
}
```
### Basics: `Reserved` and `Total`
At a high level, `nodeResourceState.Reserved` provides an upper bound on the amount of each resource
that's currently allocated. `Total` is the total amount available, so `Reserved` is almost always
less than or equal to `Total`.
During normal operations, we have a strict bound on resource usage in order to keep
`Reserved ≤ Total`, but it isn't feasible to guarantee that in all circumstances. In particular,
this condition can be temporarily violated after startup.
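Conceptually, this bound is what the Filter plugin enforces when deciding whether a pod has room on
a node. A simplified sketch for a single resource, using the `nodeResourceState` type shown above
(a hypothetical helper, not the actual implementation):

```go
// cpuFits reports whether reserving the requested CPU keeps the node's
// Reserved within Total. The real checks also handle memory, watermarks,
// and the system-reserved resources.
func cpuFits(node nodeResourceState[vmapi.MilliCPU], requested vmapi.MilliCPU) bool {
	return node.Reserved+requested <= node.Total
}
```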
### Pressure and watermarks
In order to preemptively migrate away VMs before we run out of resources, we have a "watermark" for
each resource. When `Reserved > Watermark`, we start picking migration targets from the migration
queue (see: `updateMetricsAndCheckMustMigrate` in `run.go`). When `Reserved > Watermark`, we refer
to the amount above the watermark as the logical pressure on the resource.
It's possible, however, that we can't react fast enough and completely run out of resources (i.e.
`Reserved == Total`). In this case, any requests that go beyond the maximum reservable amount are
marked as capacity pressure (both in the node's `CapacityPressure` and the pod's). Roughly speaking,
`CapacityPressure` represents the amount of additional resources that will be consumed as soon as
they're available — we care about it because migration is slow, so we ideally don't want to wait to
start more migrations.
So we have two components of resource pressure:

- Capacity pressure — total requested resources we denied because they weren't available
- Logical pressure — the difference `Reserved - Watermark` (or zero, if `Reserved ≤ Watermark`)
When a VM migration is started, we mark its `Reserved` resources and `CapacityPressure` as
`PressureAccountedFor`. We continue migrating away VMs until those migrations account for all of the
resource pressure on the node.
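As a rough sketch of how these quantities combine for a single resource (hypothetical helpers using
the fields of `nodeResourceState` above; the real logic lives in `updateMetricsAndCheckMustMigrate`
in `run.go`):

```go
// logicalPressure is the amount by which Reserved exceeds the watermark.
func logicalPressure(n nodeResourceState[vmapi.MilliCPU]) vmapi.MilliCPU {
	if n.Reserved <= n.Watermark {
		return 0
	}
	return n.Reserved - n.Watermark
}

// totalPressure combines logical pressure with capacity pressure.
func totalPressure(n nodeResourceState[vmapi.MilliCPU]) vmapi.MilliCPU {
	return logicalPressure(n) + n.CapacityPressure
}

// needMoreMigrations reports whether in-progress migrations don't yet account
// for all of the pressure, i.e. whether we should keep picking migration
// targets from the queue.
func needMoreMigrations(n nodeResourceState[vmapi.MilliCPU]) bool {
	return n.PressureAccountedFor < totalPressure(n)
}
```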
In practice, this strategy means that we're probably over-correcting slightly when there's capacity pressure: when capacity pressure occurs, it's probably the result of a temporary, greater-than-usual increase, so we're likely to have started more migrations than we need in order to cover it. In future, mechanisms to improve this could be:
- Making `autoscaler-agent`s prefer more frequent, smaller increments in allocation, so that
  requests are less extreme and more likely to be sustained.
- Multiplying `CapacityPressure` by some fixed ratio (e.g. 0.5) when calculating the total pressure
  to reduce its impact — something less than one, but separately guaranteed to be != 0 if
  `CapacityPressure != 0`.
- Artificially slowing down pod `CapacityPressure`, so that it only contributes to the node's
  `CapacityPressure` when sustained.
In general, the idea is that moving slower and correcting later will prevent drastic adjustments.
### Startup uncertainty: `Buffer`
In order to stay useful after communication with the scheduler has failed, `autoscaler-agent`s will
continue to make scaling decisions without checking with the plugin. These scaling decisions are
bounded by the last resource permit approved by the scheduler, so that they can still reduce unused
resource usage (we don't want users getting billed extra because of our downtime!).
This presents a problem, however: how does a new scheduler know what the old scheduler last
permitted? Without that, we can't determine an accurate upper bound on resource usage — at least,
until the `autoscaler-agent` reconnects to us. It's actually quite difficult to know what the
previous scheduler last approved, so we don't try! Instead, we work with the uncertainty.
On startup, we assume all existing VM pods may scale — without notifying us — up to the VM's
configured maximum. So each VM pod gets `Reserved` equal to the VM's `<resource>.Max`. Alongside
that, we track `Buffer` — the expected difference between `Reserved` usage and actual usage: equal
to the VM's `<resource>.Max - <resource>.Use`.
As each `autoscaler-agent` reconnects, its first message contains its current resource usage, so
we're able to reduce `Reserved` appropriately and begin allowing other pods to be scheduled. When
this happens, we reset the pod's `Buffer` to zero.
Eventually, all `autoscaler-agent`s should reconnect, and the node's `Buffer` will be zero — meaning
that there's no longer any uncertainty about VM resource usage.
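An illustrative sketch of that bookkeeping for a single pod and resource, using the types from
`state.go` (hypothetical helpers, not the actual implementation):

```go
// initialPodCPU models the startup assumption: a pre-existing VM pod may
// scale up to its configured maximum without telling us, so we reserve the
// maximum and record the expected over-estimate as Buffer.
func initialPodCPU(vmMax, vmUse vmapi.MilliCPU) podResourceState[vmapi.MilliCPU] {
	return podResourceState[vmapi.MilliCPU]{
		Reserved: vmMax,
		Buffer:   vmMax - vmUse,
		Max:      vmMax,
		// Min and CapacityPressure omitted for brevity.
	}
}

// onAgentReconnectCPU models what happens when the pod's autoscaler-agent
// reconnects and reports its actual usage: the uncertainty for this pod is
// cleared, on both the pod and the node. (Assumes the reported usage doesn't
// exceed the startup reservation.)
func onAgentReconnectCPU(
	node *nodeResourceState[vmapi.MilliCPU],
	pod *podResourceState[vmapi.MilliCPU],
	reportedUse vmapi.MilliCPU,
) {
	node.Reserved -= pod.Reserved - reportedUse
	node.Buffer -= pod.Buffer
	pod.Reserved = reportedUse
	pod.Buffer = 0
}
```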
With `Buffer`, we have a more precise guarantee about resource usage:

> Assuming all `autoscaler-agent`s and the previous scheduler are well-behaved, each node will
> always have `Reserved - Buffer ≤ Total`.