Proper way to use OnFailure in systemd

To perform some cleanup when the service fails, you can use ExecStopPost=, which runs after the service stops, whether it succeeded or failed.

In the code you run at ExecStopPost=, you can use one of $SERVICE_RESULT, $EXIT_CODE or $EXIT_STATUS to determine the failure condition and act accordingly. See the documentation on those environment variables to check which one is appropriate for you.

Then you can use Restart=on-failure so that systemd tries to restart your unit when it fails.

Putting it all together, this is what it would look like. Assuming that run_program will exit with status 2 whenever the files are corrupted (hopefully you can adapt this to other failure scenarios from the documentation above), this should work:

[Service]
ExecStart=/bin/run_program
ExecStopPost=/bin/sh -c 'if [ "$$EXIT_STATUS" = 2 ]; then rm /file/to/delete; fi'
Restart=on-failure

(NOTE: The double dollar sign $$ escapes the dollar sign for systemd, so the shell sees $EXIT_STATUS and expands that variable itself. Using a single dollar sign would also work, but then systemd would do the replacement instead and the shell would see [ "2" = 2 ], which arguably also works... Anyway, you can bypass most of that by putting all this logic into a shell script and calling it by its full path in ExecStopPost=. That would probably be better, and you could also easily add more commands to the script, such as logging the action taken to recover from the error condition.)
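As a rough sketch of that script approach (not a drop-in solution): the script below does the same check as the inline command and logs what it did. The script path /usr/local/bin/software-cleanup.sh, the function name cleanup_after_failure and the CORRUPT_FILE variable are made up for this example, and it still assumes that exit status 2 means corrupted files.

```shell
#!/bin/sh
# Hypothetical helper, e.g. saved as /usr/local/bin/software-cleanup.sh
# (path and name are assumptions for this example).
# systemd exports SERVICE_RESULT, EXIT_CODE and EXIT_STATUS to the
# commands it runs from ExecStopPost=.

# File to remove; defaults to the placeholder path used in the unit above.
CORRUPT_FILE="${CORRUPT_FILE:-/file/to/delete}"

cleanup_after_failure() {
    # Assumption from the answer: exit status 2 means the files are corrupted.
    if [ "${EXIT_STATUS:-}" = 2 ]; then
        # Anything written to stdout normally ends up in the journal for the unit.
        echo "exit status 2 (result: ${SERVICE_RESULT:-unknown}), removing $CORRUPT_FILE"
        rm -f -- "$CORRUPT_FILE"
    fi
}

cleanup_after_failure
```

In the unit you would then have something like ExecStopPost=/usr/local/bin/software-cleanup.sh (after making the script executable). No $$ escaping is needed there, since the script reads the variables from its own environment.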

Hopefully this will give you enough pointers to figure out how to configure this correctly given your particular situation!


NOTE: You probably want to use ExecStopPost= instead of OnFailure= here (see my other answer), but this is trying to address why your OnFailure= setup is not working.

OnFailure= might not be starting the unit because it's in the wrong section: it needs to be in the [Unit] section, not [Service].

You can try this instead:

# software.service
[Unit]
Description=Software
OnFailure=software-fail.service

[Service]
ExecStart=/bin/run_program

And:

# software-fail.service
[Unit]
Description=Delete corrupt files

[Service]
ExecStart=/bin/rm /file/to/delete
ExecStop=/bin/systemctl --user start software.service

I can make it work with this setup.

But note that using OnFailure= is not ideal here, since you can't really tell why the program failed, and chaining another start of it in ExecStop= by calling /bin/systemctl start directly is pretty hacky... The solution using ExecStopPost= and looking at the exit status is definitely superior.

If you define OnFailure= inside [Service], systemd (at least version 234 from Fedora 27) complains with:

software.service:6: Unknown lvalue 'OnFailure' in section 'Service'

Not sure whether you're seeing that in your logs... (Maybe this was added in recent systemd?) If you are, that should be a hint of what is going on there.