How to make upstart back off, rather than give up

Solution 1:

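The Upstart Cookbook recommends a post-stop delay (http://upstart.ubuntu.com/cookbook/#delay-respawn-of-a-job). Use the respawn stanza without arguments and it will keep trying forever, because with a 5-second delay between attempts the job never reaches the default limit of 10 respawns in 5 seconds: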

respawn
post-stop exec sleep 5

(I got this from this Ask Ubuntu question)

To add the exponential delay part, I'd try keeping the delay in an environment variable that the post-stop script doubles after each run. I think something like:

env SLEEP_TIME=1
post-stop script
    sleep $SLEEP_TIME
    NEW_SLEEP_TIME=`expr 2 \* $SLEEP_TIME`
    if [ $NEW_SLEEP_TIME -ge 60 ]; then
        NEW_SLEEP_TIME=60
    fi
    initctl set-env SLEEP_TIME=$NEW_SLEEP_TIME
end script
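
To see which delays the doubling-with-a-cap rule produces, the arithmetic can be pulled out into a plain shell function and run outside Upstart. This is only a sketch: next_sleep_time is a hypothetical helper name, not part of Upstart, and the 60-second default cap is taken from the script above.

```shell
#!/bin/sh
# next_sleep_time CURRENT [MAX]
# Hypothetical helper: prints the delay to store for the next respawn,
# i.e. double the current delay, capped at MAX (default 60 seconds).
next_sleep_time() {
    current=$1
    max=${2:-60}
    next=$((current * 2))
    if [ "$next" -ge "$max" ]; then
        next=$max
    fi
    echo "$next"
}

# Walk through eight consecutive crashes, starting from a 1-second delay.
delay=1
for crash in 1 2 3 4 5 6 7 8; do
    echo "crash $crash: sleep ${delay}s"
    delay=$(next_sleep_time "$delay")
done
```

The loop shows the delay growing 1, 2, 4, 8, 16, 32 and then staying at 60 from the seventh crash onward.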

** EDIT **

To apply the delay only when respawning, avoiding the delay on a real stop, use the following, which checks whether the current goal is "stop" or not:

env SLEEP_TIME=1
post-stop script
    goal=`initctl status $UPSTART_JOB | awk '{print $2}' | cut -d '/' -f 1`
    if [ "$goal" != "stop" ]; then
        sleep $SLEEP_TIME
        NEW_SLEEP_TIME=`expr 2 \* $SLEEP_TIME`
        if [ $NEW_SLEEP_TIME -ge 60 ]; then
            NEW_SLEEP_TIME=60
        fi
        initctl set-env SLEEP_TIME=$NEW_SLEEP_TIME
    fi
end script
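
The goal check relies on the goal/state pair being the second word of the initctl status output. As far as I know, instance jobs print the instance name in parentheses before that pair, which would break the awk '{print $2}' approach. A slightly more defensive sketch is below. goal_from_status is a hypothetical helper, and the sample status lines are assumptions about the output format rather than captured output.

```shell
#!/bin/sh
# goal_from_status LINE
# Hypothetical helper: given one line of `initctl status` output, print the
# goal ("start" or "stop"). It takes the first word after the job name that
# contains a "/" and returns the part before the slash.
goal_from_status() {
    echo "$1" | awk '{
        for (i = 2; i <= NF; i++) {
            if (index($i, "/") > 0) {
                split($i, parts, "/")
                print parts[1]
                exit
            }
        }
    }'
}

# Assumed sample lines; "myjob" and "eth0" are placeholder names.
goal_from_status "myjob start/running, process 1234"
goal_from_status "myjob stop/waiting"
goal_from_status "myjob (eth0) start/post-stop, process 99"
```

Inside a job file this would be called as goal=$(goal_from_status "$(initctl status "$UPSTART_JOB")").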

Solution 2:

As already mentioned, use respawn to trigger the respawn.

However, the Upstart Cookbook's section on respawn limit says that you need to specify respawn limit unlimited to get continual retry behaviour.

By default, Upstart keeps retrying only as long as the process doesn't respawn more than 10 times in 5 seconds; once that limit is exceeded, it gives up.

I would therefore suggest:

respawn
respawn limit unlimited
post-stop <script to back-off or constant delay>

Solution 3:

I ended up putting a start command in a cron job. If the service is already running, it has no effect. If it's not running, it starts the service.


Solution 4:

I have made an improvement to Roger's answer. Typically you want to back off when a problem in the underlying software makes it crash many times in a short period, but once the system has recovered you want to reset the backoff time. In Roger's version, once the service has crashed 7 times it always sleeps for 60 seconds, even after a single, isolated crash.

#The initial delay.
env INITIAL_SLEEP_TIME=1

#The current delay.
env CURRENT_SLEEP_TIME=1

#The maximum delay
env MAX_SLEEP_TIME=60

#The unix timestamp of the last crash.
env LAST_CRASH=0

#The number of seconds without any crash 
#to consider the service healthy and reset the backoff.
env HEALTHY_TRESHOLD=180

post-stop script
  exec >> /var/log/auth0.log 2>&1
  echo "`date`: stopped $UPSTART_JOB"
  goal=`initctl status $UPSTART_JOB | awk '{print $2}' | cut -d '/' -f 1`
  if [ "$goal" != "stop" ]; then
    CRASH_TIMESTAMP=$(date +%s)

    if [ $LAST_CRASH -ne 0 ]; then
      SECS_SINCE_LAST_CRASH=`expr $CRASH_TIMESTAMP - $LAST_CRASH`
      if [ $SECS_SINCE_LAST_CRASH -ge $HEALTHY_TRESHOLD ]; then
        echo "resetting backoff"
        CURRENT_SLEEP_TIME=$INITIAL_SLEEP_TIME
      fi
    fi

    echo "backoff for $CURRENT_SLEEP_TIME"
    sleep $CURRENT_SLEEP_TIME

    NEW_SLEEP_TIME=`expr 2 \* $CURRENT_SLEEP_TIME`
    if [ $NEW_SLEEP_TIME -ge $MAX_SLEEP_TIME ]; then
      NEW_SLEEP_TIME=$MAX_SLEEP_TIME
    fi

    initctl set-env CURRENT_SLEEP_TIME=$NEW_SLEEP_TIME
    initctl set-env LAST_CRASH=$CRASH_TIMESTAMP
  fi
end script
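
The reset rule can be checked on its own, outside Upstart, by isolating the decision about which delay to use for the current crash. This is a minimal sketch: pick_delay is a hypothetical helper, and its defaults (1-second initial delay, 180-second healthy threshold) mirror the env values above.

```shell
#!/bin/sh
# pick_delay CURRENT LAST_CRASH NOW [INITIAL] [THRESHOLD]
# Hypothetical helper: prints the delay to sleep for this crash.
# If a previous crash was recorded (LAST_CRASH != 0) and at least THRESHOLD
# seconds have passed since then, the backoff resets to INITIAL; otherwise
# the current delay is kept.
pick_delay() {
    current=$1
    last_crash=$2
    now=$3
    initial=${4:-1}
    threshold=${5:-180}
    if [ "$last_crash" -ne 0 ] && [ $((now - last_crash)) -ge "$threshold" ]; then
        echo "$initial"
    else
        echo "$current"
    fi
}

# A crash 200 seconds after the previous one resets the delay ...
pick_delay 60 1000 1200
# ... while a crash only 100 seconds later keeps the 60-second delay.
pick_delay 60 1000 1100
```

Keeping this decision separate from the sleep and the initctl set-env calls makes it easy to try other thresholds before editing the job file.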

Tags:

Ubuntu

Upstart