GKE Cluster can't pull (ErrImagePull) from GCR Registry in same project (GitLab Kubernetes Integration): Why?

TL;DR: Clusters created by GitLab-Ci Kubernetes Integration will not be able to pull an image from a GCR Registry in the same project unless you modify the Node(s) permissions (scopes).

While you CAN manually modify the permissions on individual Node machine(s) to grant the Application Default Credentials (see: https://developers.google.com/identity/protocols/application-default-credentials) the proper scopes in real time, a node that is re-created at some point in the future WOULD NOT have your modified scopes, and things would break.

Instead of modifying the permissions manually, create a new Node pool that has the proper Scope(s) to access your required GCP services.

Here are some resources I used for reference:

  1. https://medium.com/google-cloud/updating-google-container-engine-vm-scopes-with-zero-downtime-50bff87e5f80
  2. https://adilsoncarvalho.com/changing-a-running-kubernetes-cluster-permissions-a-k-a-scopes-3e90a3b95636

Creating a properly scoped Node Pool generally looks like this:

gcloud container node-pools create [new pool name] \
 --cluster [cluster name] \
 --machine-type [your desired machine type] \
 --num-nodes [same-number-nodes] \
 --scopes [your new set of scopes]

If you aren't sure what the names of your required Scopes are, you can see a full list of Scopes AND Scope Aliases here: https://cloud.google.com/sdk/gcloud/reference/container/node-pools/create

For me, I used gke-default (same as my other cluster) and sql-admin. The reason is that I need to be able to access an SQL Database in Cloud SQL during part of my build, and I don't want to have to connect to a public IP to do that.
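As a concrete illustration of the template above, here is a minimal sketch with those two scope aliases filled in. The pool name, cluster name, machine type, and node count are made-up placeholders, not values from a real setup. The function only prints the command so it can be reviewed before anything is created.

```shell
#!/usr/bin/env bash
# Hypothetical placeholder values: substitute your own.
NEW_POOL="scoped-pool"
CLUSTER="my-gitlab-cluster"
MACHINE_TYPE="n1-standard-1"
NUM_NODES=3
# Scope aliases mentioned in the text: gke-default plus sql-admin.
SCOPES="gke-default,sql-admin"

# Print (not run) the node-pool create command so it can be reviewed first.
create_pool_cmd() {
  echo gcloud container node-pools create "$NEW_POOL" \
    --cluster "$CLUSTER" \
    --machine-type "$MACHINE_TYPE" \
    --num-nodes "$NUM_NODES" \
    --scopes "$SCOPES"
}

create_pool_cmd
```

To actually create the pool, remove the `echo`. Depending on your gcloud configuration you may also need to pass `--zone`.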

gke-default Scopes (for reference)

  1. https://www.googleapis.com/auth/devstorage.read_only (allows you to pull)
  2. https://www.googleapis.com/auth/logging.write
  3. https://www.googleapis.com/auth/monitoring
  4. https://www.googleapis.com/auth/service.management.readonly
  5. https://www.googleapis.com/auth/servicecontrol
  6. https://www.googleapis.com/auth/trace.append

Contrast the above with the more locked-down permissions from a GitLab-Ci created cluster (ONLY these two):

  1. https://www.googleapis.com/auth/logging.write
  2. https://www.googleapis.com/auth/monitoring

Obviously, configuring your cluster with ONLY the minimum permissions needed is the way to go here. Once you figure out what that is and create your new properly scoped Node Pool...

List your nodes with:

kubectl get nodes

The one you just created (most recent) has the new settings, while the older one belongs to the default GitLab pool that can't pull from GCR.

Then:

kubectl cordon [your-node-name-here]

After that you want to drain:

kubectl drain [your-node-name-here] --force

I ran into issues here: because I had a GitLab Runner installed, I couldn't drain the pods normally, due to the local data / daemon set that was used to control it.

For that reason, once I had cordoned my Node I just deleted the node with kubectl (not sure if this will cause problems, but it was fine for me). Once your node is deleted, you need to delete the 'default-pool' node pool created by GitLab.
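For what it's worth, `kubectl drain` has flags aimed at the drain problem described above: `--ignore-daemonsets` skips DaemonSet-managed pods, and `--delete-local-data` (renamed `--delete-emptydir-data` in newer kubectl releases) allows evicting pods that use emptyDir storage. The sketch below is untested against a GitLab Runner setup, so treat it as a starting point only. `drain_pool` is a hypothetical helper name; it relies on the `cloud.google.com/gke-nodepool` label that GKE puts on each node.

```shell
#!/usr/bin/env bash
# Cordon, then drain, every node in one GKE node pool.
# drain_pool is a hypothetical helper name, not part of kubectl or gcloud.
drain_pool() {
  local pool="$1" nodes node
  # GKE labels each node with the name of the pool it belongs to.
  nodes="$(kubectl get nodes -l "cloud.google.com/gke-nodepool=${pool}" \
    -o jsonpath='{.items[*].metadata.name}')"
  # Cordon everything first so evicted pods cannot land on another old node.
  for node in $nodes; do
    kubectl cordon "$node"
  done
  for node in $nodes; do
    # On newer kubectl, use --delete-emptydir-data instead of --delete-local-data.
    kubectl drain "$node" --ignore-daemonsets --delete-local-data --force
  done
}
```

Calling `drain_pool default-pool` would then cordon and drain the nodes of the pool GitLab created, leaving the new pool untouched.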

List your node-pools:

gcloud container node-pools list --cluster [CLUSTER_NAME]

See the old scopes created by GitLab:

gcloud container node-pools describe default-pool \
    --cluster [CLUSTER_NAME]

Check to see if you have the correct new scopes (that you just added):

gcloud container node-pools describe [NEW_POOL_NAME] \
    --cluster [CLUSTER_NAME]
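To avoid eyeballing the describe output, a small filter can check for a scope that permits image pulls. This is a minimal sketch: `can_pull_from_gcr` is a hypothetical helper name, and the set of scopes treated as sufficient (any read-capable `devstorage` scope, or `cloud-platform`) is an assumption about what covers GCR reads.

```shell
#!/usr/bin/env bash
# Reads text on stdin (for example, node-pool describe output) and succeeds
# if it mentions an OAuth scope that allows reading from Cloud Storage,
# which is where GCR keeps image layers.
# can_pull_from_gcr is a hypothetical helper name.
can_pull_from_gcr() {
  grep -Eq 'googleapis\.com/auth/(devstorage\.(read_only|read_write|full_control)|cloud-platform)'
}

# Usage (placeholders as in the commands above):
#   gcloud container node-pools describe [NEW_POOL_NAME] \
#       --cluster [CLUSTER_NAME] | can_pull_from_gcr && echo "can pull"
```

Run against the two scopes of a GitLab-created pool (logging.write and monitoring), the filter exits non-zero, which matches the ErrImagePull behaviour.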

If your new Node Pool has the right scopes, your deployments can run there and you can now delete the default pool with:

gcloud container node-pools delete default-pool \
    --cluster [CLUSTER_NAME] --zone [ZONE]

In my personal case, I am still trying to figure out how to allow access to the private network (i.e. get to Cloud SQL via private IP), but I can pull my images now, so I am halfway there.

I think that's it — hope it saved you a few minutes!


By default, the Nodes of a Cluster created by GitLab-Ci's Kubernetes Integration are created with minimal permissions (scopes) to Google Cloud services.

You can see this in the GCP console dashboard for your cluster: scroll down to the permissions section and look for "Storage".

This essentially means that the Node(s) running within your GitLab-Ci Kubernetes integration cluster WILL NOT have the default GCR Registry (read-only) permissions necessary to pull an image from a GCR Registry.

It also means (as far as I can tell) that even if you grant a Service Account proper access to the GCR Registry, it still will not work. I'm not totally sure I set my Service Account up properly, but I believe I did.

Great.

How to fix Permissions

Basically you have two options. The first is to create a Cluster yourself (i.e. outside of the GitLab Kubernetes Integration) and then re-connect your GitLab project to THAT Cluster by following the manual 'connect to an existing Cluster' directions that can be found here: https://gitlab.com/help/user/project/clusters/index#adding-an-existing-kubernetes-cluster

The second option is to modify your permissions, but that is more complicated.