docker swarm - how to balance already running containers in a swarm cluster?

Swarm doesn't do auto-balancing once containers are created. You can scale up/down once all your workers are up and it will distribute containers per your config requirements/roles/etc.


There are problems with new nodes getting "mugged" (flooded with tasks) as soon as they are added. We also avoid pre-emption of healthy tasks. Rebalancing is done over time, rather than by killing working processes. Pre-emption is being considered for the future.

As a workaround, scaling a service up and down should rebalance the tasks. You can also trigger a rolling update, as that will reschedule new tasks.
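A minimal sketch of the scale-up-then-down workaround. The function name `rebalance_by_scaling` and its arguments are made up for illustration; only `docker service scale` is a real command. It assumes you already know the service's current replica count, and how the scheduler places or removes tasks is up to Swarm.

```shell
#!/usr/bin/env bash

# rebalance_by_scaling SERVICE REPLICAS (hypothetical helper)
# Temporarily adds one replica so the scheduler places a new task,
# then returns the service to its original replica count.
rebalance_by_scaling() {
  local service="$1" replicas="$2"
  docker service scale "${service}=$((replicas + 1))"
  docker service scale "${service}=${replicas}"
}
```

Usage would look like `rebalance_by_scaling web 5` for a service named `web` that normally runs 5 replicas.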

Here's a bash script I use to rebalance:

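#!/usr/bin/env bash

set -e

# EXCLUDE_LIST is an extended regex of service names to skip. Its value is
# not shown here, so it is read from the environment; the default "^$"
# excludes nothing.
EXCLUDE_LIST="${EXCLUDE_LIST:-^$}"

# List service names only (no header row), drop the excluded ones,
# and force a rolling update of each remaining service.
for service in $(docker service ls --format '{{.Name}}' |
                 grep -Ev "$EXCLUDE_LIST"); do
  docker service update --force "$service"
done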

In docker-compose.yml, you can define:

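version: "3"

services:
  app:  # placeholder name; the original service name is not shown
    image: repository/user/app:latest
    networks:
      - net
    ports:
      - 80
    deploy:
      restart_policy:
        condition: any
      mode: replicated
      replicas: 5
      placement:
        constraints: [node.role == worker]
      update_config:
        delay: 2s

networks:
  net:  # assumed; the network definition is not shown in the original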

Remark: the constraint is node.role == worker

Using the --replicas flag means we don't care which nodes the tasks land on. If we want exactly one task on every node, we can use --mode=global instead.
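To make the difference concrete, here is a small sketch of the two ways of creating a service. The wrapper names `create_replicated` and `create_global` are invented for this example, as are any service and image names you pass in; `--replicas` and `--mode global` are the real `docker service create` flags.

```shell
#!/usr/bin/env bash

# create_replicated NAME IMAGE REPLICAS (hypothetical helper)
# Swarm decides which nodes the REPLICAS tasks run on.
create_replicated() {
  local name="$1" image="$2" replicas="$3"
  docker service create --name "$name" --replicas "$replicas" "$image"
}

# create_global NAME IMAGE (hypothetical helper)
# Swarm runs exactly one task on every node that matches the
# service's constraints, including nodes added later.
create_global() {
  local name="$1" image="$2"
  docker service create --name "$name" --mode global "$image"
}
```

For example, `create_replicated web nginx:alpine 5` asks for five tasks placed wherever the scheduler chooses, while `create_global agent nginx:alpine` asks for one task per node.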

In Docker 1.13 and higher, you can use the --force or -f flag with the docker service update command to force the service to redistribute its tasks across the available worker nodes.

Swarm currently (18.03) does not move or replace containers when new nodes are started, if services are in the default "replicated mode". This is by design. If I were to add a new node, I don't necessarily want a bunch of other containers stopped, and new ones created on my new node. Swarm only stops containers to "move" replicas when it has to (in replicated mode).

docker service update --force <servicename> will rebalance a service across all nodes that match its requirements and constraints.
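One way to check the effect is to count running tasks per node before and after the forced update. This is only a sketch: the function names `rebalance_service` and `tasks_per_node` are made up here, while the `--filter desired-state=running` and `--format '{{.Node}}'` options belong to `docker service ps`.

```shell
#!/usr/bin/env bash

# rebalance_service SERVICE (hypothetical helper)
# Forces a rolling update, which reschedules the service's tasks.
rebalance_service() {
  docker service update --force "$1"
}

# tasks_per_node SERVICE (hypothetical helper)
# Prints one "node count" line per node that runs tasks of SERVICE.
tasks_per_node() {
  docker service ps --filter desired-state=running --format '{{.Node}}' "$1" |
    awk '{count[$1]++} END {for (node in count) print node, count[node]}' |
    sort
}
```

Running `tasks_per_node web`, then `rebalance_service web`, then `tasks_per_node web` again shows whether the tasks ended up spread more evenly.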

Further advice: As with other container orchestrators, you need to leave spare capacity on your nodes to absorb the workloads of any service replicas that move during outages. Your spare capacity should match the level of redundancy you plan to support. If you want to survive 2 nodes failing at once, for instance, you need to keep a minimum percentage of resources free on all nodes so those workloads can shift to the remaining nodes.
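As a rough back-of-the-envelope illustration of that capacity rule, the sketch below computes the highest average utilization each node can run at if the surviving nodes must absorb everything. It assumes identical nodes and evenly spread load, which real clusters rarely match exactly; the function name `max_utilization_percent` is invented for this example.

```shell
#!/usr/bin/env bash

# max_utilization_percent NODES FAILURES (hypothetical helper)
# With NODES identical nodes and FAILURES of them lost at once, the
# remaining nodes must hold the whole load, so average utilization
# before the failure must stay at or below (NODES - FAILURES) / NODES.
# Integer division rounds down, which errs on the safe side.
max_utilization_percent() {
  local nodes="$1" failures="$2"
  if [ "$failures" -ge "$nodes" ]; then
    echo "failures must be smaller than nodes" >&2
    return 1
  fi
  echo $(( (nodes - failures) * 100 / nodes ))
}
```

For a 5-node cluster that should tolerate 2 simultaneous failures, `max_utilization_percent 5 2` gives 60, meaning roughly 40% of each node would need to stay free.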