stackdriver-metadata-agent-cluster-level gets OOMKilled

I was about to open a support ticket with GCP, but they have this notice:

Description: We are experiencing an issue with Fluentd crashlooping in Google Kubernetes Engine when the master version is 1.14 or 1.15 and gVisor is enabled. The fix is targeted for a release aiming to begin on 17 April 2020. We will provide more updates as the date gets closer. We will provide an update by Thursday, 2020-04-09 14:30 US/Pacific with current details. We apologize to all who are affected by the disruption.

Start time: April 2, 2020 at 10:58:24 AM GMT-7

End time:

Steps to reproduce: Fluentd crashloops in GKE clusters could lead to missing logs.

Workaround: Upgrade Google Kubernetes Engine cluster masters to version 1.16+.

Affected products: Other
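
If you want to follow GCP's suggested workaround instead, the master upgrade can be done from the CLI. A minimal sketch, assuming a cluster named my-cluster in zone us-central1-a (both placeholders) and that you pick a concrete 1.16 version from get-server-config:

# List the versions available to your cluster, then upgrade only the control plane
gcloud container get-server-config --zone us-central1-a
gcloud container clusters upgrade my-cluster --zone us-central1-a \
    --master --cluster-version <1.16-version-from-get-server-config>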


The issue is caused by the memory limit on the metadata-agent deployment being set too low: the Pod needs more memory than the limit allows, so it keeps getting OOMKilled.
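
To confirm this is what you are hitting, inspect the pod's last termination state; the grep patterns below are just one way to slice the output, and the pod name is cluster-specific:

# Find the metadata-agent pod, then check why its container last terminated
kubectl get pods -n kube-system | grep stackdriver-metadata-agent
kubectl describe pod -n kube-system <pod-name> | grep -A 3 'Last State'
# A reason of "OOMKilled" here confirms the memory limit is the problem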

There is a workaround for this issue until it is fixed.


You can override the base resources in the metadata-agent's ConfigMap with:

kubectl edit cm -n kube-system metadata-agent-config

Setting baseMemory: 50Mi should be enough; if it isn't, use a higher value such as 100Mi or 200Mi.

So the metadata-agent-config ConfigMap should look something like this:

apiVersion: v1
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseMemory: 50Mi
kind: ConfigMap
metadata:
  name: metadata-agent-config
  namespace: kube-system
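
If you'd rather not open an editor, the same change can be applied non-interactively; this is a sketch using kubectl patch with the full NannyConfiguration inlined as a merge patch:

kubectl patch configmap metadata-agent-config -n kube-system --type merge \
    -p '{"data":{"NannyConfiguration":"apiVersion: nannyconfig/v1alpha1\nkind: NannyConfiguration\nbaseMemory: 50Mi"}}'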

Note also that you need to restart the deployment, as the ConfigMap doesn't get picked up automatically; deleting the deployment is safe because the GKE addon manager recreates it:

kubectl delete deployment -n kube-system stackdriver-metadata-agent-cluster-level
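
Once the deployment comes back, you can check that the new limit took effect. The container index in the jsonpath below is an assumption; adjust it if the agent isn't the first container in the pod spec:

# Wait for the addon manager to recreate the deployment, then inspect its memory limit
kubectl get deployment -n kube-system stackdriver-metadata-agent-cluster-level
kubectl get deployment -n kube-system stackdriver-metadata-agent-cluster-level \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'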

For more details, see the addon-resizer documentation.
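
As background, the addon-resizer nanny sizes the container roughly as a base amount plus a per-node increment, so on larger clusters you could also raise the per-node term instead of (or in addition to) baseMemory. A sketch of such a NannyConfiguration; memoryPerNode is a field from the nannyconfig/v1alpha1 schema as I understand it, and the 4Mi figure is an illustrative assumption, not a recommended value:

apiVersion: nannyconfig/v1alpha1
kind: NannyConfiguration
baseMemory: 50Mi
memoryPerNode: 4Mi
# effective memory is roughly baseMemory + memoryPerNode * (number of nodes)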