Kubernetes CronJob Stops Scheduling Jobs

How Kubernetes Jobs handle failures

As per Jobs - Run to Completion - Handling Pod and Container Failures:

An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy = "Never". When a Pod fails, then the Job controller starts a new Pod.

Your jobTemplate uses restartPolicy: Never, so the Pod backoff failure policy applies:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. The back-off count is reset if no new failed Pods appear before the Job’s next status check.

The .spec.backoffLimit is not defined in your jobTemplate, so it's using the default (6).

Next, as per Job Termination and Cleanup:

By default, a Job will run uninterrupted unless a Pod fails, at which point the Job defers to the .spec.backoffLimit described above. Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to a number of seconds.

That is your case: if your containers fail to pull the image six consecutive times, the Job is considered failed.
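
To confirm that a Job was marked as failed because it hit the back-off limit, you can read the reason on its Failed condition. The following is a minimal sketch, assuming a Job named my-job-name (a hypothetical placeholder); when the limit has been exceeded, the reason is typically reported as BackoffLimitExceeded.

```shell
# Print why a Job was marked Failed (empty output if it has no Failed condition).
# $1 is the Job name; my-job-name in the usage line is a hypothetical placeholder.
job_failure_reason() {
  kubectl get job "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}'
}

# Usage: job_failure_reason my-job-name
```

If the output is empty, the Job has not been marked as failed, and kubectl describe job on the same name shows its events.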


CronJobs

As per Cron Job Limitations:

A cron job creates a job object about once per execution time of its schedule [...]. The Cronjob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of the Pods it represents.

This means that all pod/container failures should be handled by the Job Controller (i.e., adjusting the jobTemplate).

"Retrying" a Job:

You do not need to recreate a Cronjob if one of its Jobs fails. You only need to wait for the next scheduled run.

If you want to run a new Job before the next schedule, you can use the Cronjob template to create a Job manually with:

kubectl create job --from=cronjob/my-cronjob-name my-manually-job-name

What you should do:

If your containers constantly fail to download the images, you have the following options:

  • Explicitly set backoffLimit and tune it to a higher value.
  • Use restartPolicy: OnFailure in the Pod template, so the Pod stays on the node and only the failed container is re-run.
  • Consider using imagePullPolicy: IfNotPresent. If you are not retagging your images, there is no need to force a re-pull for every job start.
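
As a rough sketch of the first two options, the jobTemplate of an existing Cronjob can be patched in place rather than edited by hand. The function name, the Cronjob name my-cronjob-name, and the value 10 are assumptions for illustration only; pick a limit that suits your workload.

```shell
# Raise backoffLimit and switch restartPolicy to OnFailure on a CronJob's jobTemplate.
# $1 = CronJob name, $2 = new backoffLimit (both are example inputs, not recommendations).
tune_cronjob_retries() {
  kubectl patch cronjob "$1" --type merge -p \
    "{\"spec\":{\"jobTemplate\":{\"spec\":{\"backoffLimit\":$2,\"template\":{\"spec\":{\"restartPolicy\":\"OnFailure\"}}}}}}"
}

# Usage: tune_cronjob_retries my-cronjob-name 10
```

The patch only affects Jobs created after the change. imagePullPolicy is set per container, so it is usually simpler to change it in the manifest itself.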

Just to expand on Eduardo Baitello's answer I would also like to mention 2 more caveats:

  1. Eduardo mentioned Cronjob Limitations, but didn't expand on the Too many missed start time (> 100) issue. For this I've found that the most reliable solution is to delete the cronjob and recreate it. Alternatively, you can patch the cronjob to decrease its frequency, which tricks the scheduler into running it again, and then patch it back to its original schedule, but this is trickier. If the cronjob is affected, kubectl describe cronjob CRONJOB_NAME should list this as one of its events. It usually affects cronjobs that run at a high frequency.

  2. If you have a lot of Cronjobs/Jobs, you could be hitting this bug (#77465), which has been fixed in 1.14.7. It occurs if you have more than 500 Jobs within the entire cluster. This one is harder to find, but you can search the kube-scheduler logs for the message expected type *batchv1.JobList, got type *internalversion.List. Since the CronJob controller runs inside kube-controller-manager, that component's logs are worth checking as well.

You can print the logs for kube-scheduler using the following command:

kubectl -n kube-system logs -l component=kube-scheduler --tail 100
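
For the first caveat, the patch-and-restore workaround could look roughly like the sketch below. The function name, the Cronjob name, and both schedules are placeholders; substitute your own values and note the original schedule before changing it.

```shell
# Change a CronJob's schedule; $1 = CronJob name, $2 = cron expression.
set_cronjob_schedule() {
  kubectl patch cronjob "$1" --type merge -p "{\"spec\":{\"schedule\":\"$2\"}}"
}

# Usage (names and schedules are placeholders):
#   kubectl describe cronjob my-cronjob-name            # check the events for missed start times
#   set_cronjob_schedule my-cronjob-name "0 * * * *"    # temporarily less frequent
#   set_cronjob_schedule my-cronjob-name "*/5 * * * *"  # restore the original schedule
```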
