How does Kubernetes' scheduler work?

The paragraph you quoted describes where we hope to be in the future (where the future is defined in units of months, not years). We're not there yet, but the scheduler does have a number of useful features already, enough for a simple deployment. In the rest of this reply, I'll explain how the scheduler works today.

The scheduler is not just an admission controller; for each pod that is created, it finds the "best" machine for that pod, and if no machine is suitable, the pod remains unscheduled until a machine becomes suitable.

The scheduler is configurable. It has two types of policies, FitPredicate (see master/pkg/scheduler/predicates.go) and PriorityFunction (see master/pkg/scheduler/priorities.go). I'll describe them.

Fit predicates are required rules. For example, the labels on the node must be compatible with the label selector on the pod (this rule is implemented in PodSelectorMatches() in predicates.go). As another example, the sum of the requested resources of the container(s) already running on the machine plus the requested resources of the new container(s) you are considering scheduling onto the machine must not be greater than the capacity of the machine (this rule is implemented in PodFitsResources() in predicates.go; note that "requested resources" is defined as pod.Spec.Containers[n].Resources.Limits, and if you request zero resources then you always fit). If any of the required rules is not satisfied for a particular (new pod, machine) pair, then the new pod is not scheduled on that machine. If, after checking all machines, the scheduler decides that the new pod cannot be scheduled onto any machine, then the pod remains in the Pending state until one of the machines can satisfy it.
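To make the fit-predicate idea concrete, here is a rough sketch in Go. It is not the code from predicates.go: the types (Resources, Pod, Node) and the functions selectorMatches, fitsResources, and feasibleNodes are simplified, hypothetical stand-ins that I made up to illustrate the two rules described above.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified resource amount (not the real API type).
type Resources struct {
	MilliCPU int64
	Memory   int64
}

// Pod is a hypothetical pod: a label selector plus the summed limits of its containers.
type Pod struct {
	Name         string
	NodeSelector map[string]string
	Limits       Resources
}

// Node is a hypothetical machine with labels, capacity, and the pods already on it.
type Node struct {
	Name     string
	Labels   map[string]string
	Capacity Resources
	Pods     []Pod
}

// FitPredicate is a required rule: it returns true if the pod may run on the node.
type FitPredicate func(pod Pod, node Node) bool

// selectorMatches requires every key/value in the pod's selector to be present on the node.
func selectorMatches(pod Pod, node Node) bool {
	for k, want := range pod.NodeSelector {
		if got, ok := node.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// fitsResources requires (already requested + newly requested) <= capacity.
// A pod that requests zero resources always fits.
func fitsResources(pod Pod, node Node) bool {
	if pod.Limits.MilliCPU == 0 && pod.Limits.Memory == 0 {
		return true
	}
	var cpu, mem int64
	for _, p := range node.Pods {
		cpu += p.Limits.MilliCPU
		mem += p.Limits.Memory
	}
	return cpu+pod.Limits.MilliCPU <= node.Capacity.MilliCPU &&
		mem+pod.Limits.Memory <= node.Capacity.Memory
}

// defaultPredicates is the assumed set of required rules for this sketch.
var defaultPredicates = []FitPredicate{selectorMatches, fitsResources}

// feasibleNodes returns the names of the nodes that satisfy every predicate.
// An empty result corresponds to the pod staying Pending.
func feasibleNodes(pod Pod, nodes []Node, preds []FitPredicate) []string {
	var out []string
	for _, n := range nodes {
		ok := true
		for _, pred := range preds {
			if !pred(pod, n) {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "a", Labels: map[string]string{"disk": "ssd"}, Capacity: Resources{MilliCPU: 1000, Memory: 1024}},
		{Name: "b", Labels: map[string]string{"disk": "hdd"}, Capacity: Resources{MilliCPU: 4000, Memory: 4096}},
	}
	pod := Pod{Name: "web", NodeSelector: map[string]string{"disk": "ssd"}, Limits: Resources{MilliCPU: 500, Memory: 256}}
	fmt.Println(feasibleNodes(pod, nodes, defaultPredicates))
}
```

Note that a single failing predicate is enough to rule a machine out; the predicates never rank machines, they only filter them.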

After checking all of the machines with respect to the fit predicates, the scheduler may find that multiple machines "fit" the pod. But of course, the pod can only be scheduled onto one machine. That's where priority functions come in. Basically, the scheduler ranks the machines that meet all of the fit predicates, and then chooses the best one. For example, it prefers the machine whose already-running pods consume the least resources (this is implemented in LeastRequestedPriority() in priorities.go). This policy spreads pods (and thus containers) out instead of packing lots onto one machine while leaving others empty.
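Here is a minimal sketch of the ranking step, again in Go. The NodeUsage type and the functions freeScore, leastRequestedScore, and pickBest are hypothetical names for illustration, and the 0 to 10 scale averaged over CPU and memory is my simplifying assumption rather than a copy of what LeastRequestedPriority() does in priorities.go.

```go
package main

import "fmt"

// NodeUsage is a hypothetical summary of one machine that already passed the fit predicates.
type NodeUsage struct {
	Name              string
	CapacityMilliCPU  int64
	RequestedMilliCPU int64
	CapacityMemory    int64
	RequestedMemory   int64
}

// freeScore maps the unused fraction of one resource onto 0..10 (integer division).
func freeScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// leastRequestedScore averages the CPU and memory scores; emptier machines score higher.
func leastRequestedScore(n NodeUsage) int64 {
	cpu := freeScore(n.RequestedMilliCPU, n.CapacityMilliCPU)
	mem := freeScore(n.RequestedMemory, n.CapacityMemory)
	return (cpu + mem) / 2
}

// pickBest returns the highest-scoring machine; ties go to the earlier one.
// The boolean is false when there are no candidate machines.
func pickBest(nodes []NodeUsage) (string, bool) {
	if len(nodes) == 0 {
		return "", false
	}
	best := nodes[0]
	bestScore := leastRequestedScore(best)
	for _, n := range nodes[1:] {
		if s := leastRequestedScore(n); s > bestScore {
			best, bestScore = n, s
		}
	}
	return best.Name, true
}

func main() {
	nodes := []NodeUsage{
		{Name: "busy", CapacityMilliCPU: 4000, RequestedMilliCPU: 3000, CapacityMemory: 8000, RequestedMemory: 4000},
		{Name: "idle", CapacityMilliCPU: 4000, RequestedMilliCPU: 1000, CapacityMemory: 8000, RequestedMemory: 2000},
	}
	name, ok := pickBest(nodes)
	fmt.Println(name, ok)
}
```

Because the emptiest machine wins each time, repeated scheduling tends to spread pods across machines rather than filling one up first.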

When I said that the scheduler is configurable, I meant that you can decide at compile time which fit predicates and priority functions you want Kubernetes to apply. Currently, it applies all of the ones you see in predicates.go and priorities.go.
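One way to picture "configurable at compile time" is a scheduler value whose lists of predicates and priority functions are fixed in source code. The sketch below is an assumption about the general shape, not the actual Kubernetes wiring: Scheduler, defaultScheduler, hasEnoughCPU, and mostFreeCPU are hypothetical names, and summing the priority scores is a simplification.

```go
package main

import "fmt"

// Pod and Node are deliberately tiny, hypothetical types for this sketch.
type Pod struct {
	Name     string
	MilliCPU int64
}

type Node struct {
	Name         string
	FreeMilliCPU int64
}

type FitPredicate func(Pod, Node) bool
type PriorityFunction func(Pod, Node) int

// Scheduler holds the policy that was chosen when the binary was built.
type Scheduler struct {
	Predicates []FitPredicate
	Priorities []PriorityFunction
}

// fits reports whether the node satisfies every required rule.
func (s Scheduler) fits(pod Pod, node Node) bool {
	for _, pred := range s.Predicates {
		if !pred(pod, node) {
			return false
		}
	}
	return true
}

// Schedule filters by the predicates, then picks the node with the highest summed score.
// It returns false when no node fits, which corresponds to the pod staying Pending.
func (s Scheduler) Schedule(pod Pod, nodes []Node) (string, bool) {
	bestName, bestScore, found := "", 0, false
	for _, n := range nodes {
		if !s.fits(pod, n) {
			continue
		}
		score := 0
		for _, prio := range s.Priorities {
			score += prio(pod, n)
		}
		if !found || score > bestScore {
			bestName, bestScore, found = n.Name, score, true
		}
	}
	return bestName, found
}

// hasEnoughCPU is an example required rule.
func hasEnoughCPU(p Pod, n Node) bool { return p.MilliCPU <= n.FreeMilliCPU }

// mostFreeCPU is an example ranking rule: more free CPU gives a higher score.
func mostFreeCPU(_ Pod, n Node) int { return int(n.FreeMilliCPU / 100) }

// defaultScheduler is the "compile-time configuration": edit these lists and rebuild.
func defaultScheduler() Scheduler {
	return Scheduler{
		Predicates: []FitPredicate{hasEnoughCPU},
		Priorities: []PriorityFunction{mostFreeCPU},
	}
}

func main() {
	nodes := []Node{{Name: "small", FreeMilliCPU: 500}, {Name: "large", FreeMilliCPU: 2000}}
	name, ok := defaultScheduler().Schedule(Pod{Name: "web", MilliCPU: 1000}, nodes)
	fmt.Println(name, ok)
}
```

Changing the policy in this sketch means editing defaultScheduler and recompiling, which is the sense of "configurable" I had in mind.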


We've done customizations that, for example, apply multilevel affinity and anti-affinity based on custom selectors. The scheduler isn't perfect, but it's pretty good for most service-level workloads, and it should get a lot better in the future. https://docs.openshift.org/latest/admin_guide/scheduler.html#use-cases describes one particular Kube scheduler config that provides this kind of affinity and anti-affinity.

Tags:

Kubernetes