What exactly is "a stop job", as in "A stop job is running..."?

systemd operates internally in terms of a queue of "jobs". Each job (simplifying a little bit) is an action to take: stop, check, start, or restart a particular unit.

When (for example) you instruct systemd to start a service unit, it first works out a list of stop and start jobs for whatever units (service units, mount units, device units, and so forth) are necessary for achieving that goal, according to unit requirements and dependencies. It then orders those jobs according to unit ordering relationships, works out and (if possible) fixes up any self-contradictions, and (if that final step is successful) places the jobs in the queue.
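As a rough illustration of the "work out the jobs, then order them" step, here is a toy sketch in Python. It is not how systemd is implemented: the unit names are hypothetical, requirement and ordering relationships are merged into a single "must come after" table for brevity, and a contradictory ordering simply raises an error instead of being fixed up.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical units; each maps to the units that must be started before it.
# (Real systemd keeps requirement and ordering relationships separate.)
after = {
    "example.service": {"data.mount", "network.target"},
    "data.mount": {"disk.device"},
    "network.target": set(),
    "disk.device": set(),
}

def plan_start_jobs(target, after):
    """Return (job_type, unit) pairs needed to start `target`, in order."""
    # Collect the target plus everything it transitively depends on.
    needed, stack = set(), [target]
    while stack:
        unit = stack.pop()
        if unit not in needed:
            needed.add(unit)
            stack.extend(after.get(unit, ()))
    graph = {unit: after.get(unit, set()) & needed for unit in needed}
    # static_order() raises graphlib.CycleError on a self-contradictory ordering.
    return [("start", unit) for unit in TopologicalSorter(graph).static_order()]

queue = plan_start_jobs("example.service", after)
for job in queue:
    print(job)
```

The resulting queue always lists disk.device before data.mount, and example.service last; the relative position of network.target is not fixed because nothing orders it against the mount.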

Then it tries to perform the enqueued "jobs".

A stop job is running for Session 1 of user xy

The unit display name here is Session 1 of user xy. This will be (from the display name) a scope unit, not a service unit. This is the user-space login session abstraction that is maintained by systemd's logind program and its PAM plugins. It is (in essence and in theory) a grouping of all of the processes that the user is running as a "login session" somewhere.

The job that has been enqueued against it is stop. And it's probably taking a long time because the systemd people have erroneously conflated session hangup with session shutdown. They break the former to get the latter to work, and in response some people alter systemd to break the latter to get the former to work. The systemd people really should recognize that they are two different things.

In your login session, you have something that ignores SIGTERM or that takes a long time to terminate once it has seen SIGTERM. Ironically, the former is the long-standing behaviour of some job-control shells. The correct way to terminate login session leaders when they are these particular job-control shells is to tell them that the session has been hung up, whereupon they terminate all of their jobs (a different kind of job to the internal systemd job) and then terminate themselves.

What's actually happening is that systemd is waiting out the unit's stop timeout before it resorts to SIGKILL. This timeout is configurable per unit, of course, and can be set to never time out, which is why one can potentially see different behaviours.
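The "SIGTERM, wait, then SIGKILL" pattern can be sketched in a few lines of Python. This is only an illustration of the general technique on a POSIX system, not systemd's code: the child process, the helper name stop_with_timeout, and the one-second timeout are all made up for the example.

```python
import signal
import subprocess
import sys

# Child that ignores SIGTERM, standing in for a stubborn process in the session.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

def stop_with_timeout(proc, timeout_sec):
    """Send SIGTERM, wait up to timeout_sec, then fall back to SIGKILL."""
    proc.terminate()  # sends SIGTERM
    try:
        proc.wait(timeout=timeout_sec)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be ignored
        proc.wait()
        return "killed"

proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
proc.stdout.readline()  # wait until the child has started ignoring SIGTERM
result = stop_with_timeout(proc, timeout_sec=1.0)
proc.stdout.close()
print(result, proc.returncode == -signal.SIGKILL)
```

The time spent inside proc.wait() is the analogue of the delay during which the "A stop job is running" message is displayed.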

Further reading

  • Lennart Poettering (2015). systemd. systemd manual pages. Freedesktop.org.
  • Jonathan de Boyne Pollard (2016-06-01). systemd kills background processes after user logs out. 825394. Debian bug tracker.
  • Lennart Poettering (2015). systemd.kill. systemd manual pages. Freedesktop.org.
  • Lennart Poettering (2015). systemd.service. systemd manual pages. Freedesktop.org.
  • Why does bash ignore SIGTERM?
  • https://superuser.com/questions/1102242/

These messages are from systemd, which is an init system that starts and stops jobs. Jobs can be daemons, but can also be little tasks such as mounting and unmounting disks, deleting /tmp, or saving and restoring screen brightness across boots. Running systemctl list-units gives you the idea. Systemd uses "unit" and "job" to mean much the same thing.

When a job is being stopped, as with systemctl stop ..., the question is how long to wait for the job to complete before declaring failure and killing the job's processes with the SIGKILL signal. We really don't want to use SIGKILL unless we have to, as it doesn't give the process the opportunity to exit cleanly. For some processes a few seconds is ample time to wait before declaring failure; for other processes, like a database, there might be substantial network and disk I/O before the job can stop cleanly, and therefore we might give those units several minutes to shut down.

What you are seeing upon shutdown is the equivalent of systemctl stop $UNIT_NAME taking some time to run. There is a counter which shows the elapsed seconds and the maximum waiting time, after which SIGKILL will be issued and the shutdown will proceed regardless.

Unless there are good reasons to expect a long delay, this usually indicates some sort of malfunction. That might range from a DHCP server not responding to a Release, so that the Release action has to time out, to some error causing a daemon to never exit.