Why can't 64-bit Windows unwind user-kernel-user exceptions?

I'm the developer who wrote this Hotfix a loooooooong time ago as well as the blog post. The main reason is that the full register file isn't always captured when you transition into kernel space, for performance reasons.

If you make a normal syscall, the x64 Application Binary Interface (ABI) only requires you to preserve the non-volatile registers (similar to making a normal function call). However, correctly unwinding the exception requires all the registers, so a correct unwind across that transition isn't possible. Basically, this was a choice between performance in a critical scenario (i.e. one that potentially happens thousands of times per second) vs. 100% correct handling of a pathological scenario (a crash).
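To make the trade-off concrete, here is a toy model in Python. The register sets are the integer registers of the Microsoft x64 calling convention (XMM registers are left out for brevity); the function names `capture_on_syscall` and `missing_for_full_context` are made up for this illustration and are not how the kernel actually records its trap frame.

```python
# Integer registers under the Microsoft x64 calling convention.
NONVOLATILE = {"rbx", "rbp", "rdi", "rsi", "rsp", "r12", "r13", "r14", "r15"}
VOLATILE = {"rax", "rcx", "rdx", "r8", "r9", "r10", "r11"}
ALL_REGISTERS = NONVOLATILE | VOLATILE


def capture_on_syscall(registers):
    """Toy model (assumption): keep only what the ABI obliges a callee to preserve."""
    return {name: value for name, value in registers.items() if name in NONVOLATILE}


def missing_for_full_context(captured):
    """Return the registers a full context would need but the capture lacks."""
    return sorted(ALL_REGISTERS - set(captured))


if __name__ == "__main__":
    # Give every register an arbitrary value, then "enter the kernel".
    live = {name: index for index, name in enumerate(sorted(ALL_REGISTERS))}
    saved = capture_on_syscall(live)
    print("missing:", missing_for_full_context(saved))
```

Capturing the smaller set is cheaper on every syscall, but anything that later needs the complete register state has nothing to rebuild the volatile registers from.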

Bonus Reading

  • Overview of x64 Calling Conventions
  • x86 Software Conventions - Register Usage

A very good question.

I can give a hint of why "propagating" an exception across the kernel-user boundary is somewhat problematic.

Quoting your question:

Why can't 64-bit Windows unwind the stack during an exception, if the stack crosses the kernel boundary - when 32-bit Windows can?

The reason is very simple: there's no such thing as a "stack that crosses the kernel boundary". Calling a kernel-mode function is by no means comparable to a standard function call. It actually has nothing to do with the call stack. As you probably know, kernel-mode memory is simply inaccessible from user mode.

Invoking a kernel-mode function (aka a syscall) is implemented by triggering a software interrupt (or a similar mechanism). User-mode code puts some values into registers (identifying the requested kernel-mode service) and executes a CPU instruction (such as sysenter, or syscall on x64) which switches the CPU into kernel mode and passes control to the OS.

Kernel-mode code then handles the requested syscall. It runs on a separate kernel-mode stack (which has nothing to do with the user-mode stack). After the request has been handled, control is returned to user-mode code. Depending on the specific syscall, the user-mode return address may be the one that initiated the kernel-mode transition, or it may be a different address.
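As a rough illustration of the last two paragraphs, here is a toy model in Python. Everything in it is hypothetical (`SERVICE_TABLE`, `trap`, `svc_add`, the two list-based "stacks"); a real syscall is a CPU mode switch, not a function call, so this only shows the bookkeeping: a service number in a register, and kernel frames that never land on the user stack.

```python
user_stack = []    # frames of user-mode code
kernel_stack = []  # separate stack, used only while "in kernel mode"
snapshots = []     # what both stacks looked like while the service ran


def svc_add(args):
    """Hypothetical kernel service: adds two numbers."""
    kernel_stack.append("svc_add")
    snapshots.append((list(user_stack), list(kernel_stack)))
    result = args[0] + args[1]
    kernel_stack.pop()
    return result


# Hypothetical service table: service number -> kernel routine.
SERVICE_TABLE = {1: svc_add}


def trap(registers):
    """Stand-in for the syscall instruction: dispatch on the service number."""
    kernel_stack.append("dispatcher")
    handler = SERVICE_TABLE[registers["service"]]
    registers["result"] = handler(registers["args"])
    kernel_stack.pop()
    return registers


def user_code():
    user_stack.append("user_code")
    registers = trap({"service": 1, "args": (2, 3)})
    user_stack.pop()
    return registers["result"]
```

While `svc_add` runs, the user stack still holds only `user_code`; the dispatcher and the service live on the kernel stack alone.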

Sometimes you call a kernel-mode function that, "in the middle", has to invoke a user-mode call. This may look like a call stack consisting of user-kernel-user code, but it's just an emulation. In such a case the kernel-mode code transfers control to user-mode code that wraps your user-mode function. This wrapper code calls your function, and immediately upon its return triggers a transition back to kernel mode.

Now, if the user-mode code "invoked from kernel mode" raises an exception, this is what should happen:

  1. The user-mode wrapper code handles the SEH exception (i.e. stops its propagation, but doesn't perform the stack unwinding yet).
  2. It passes control to kernel mode (the OS), as in the normal program flow.
  3. Kernel-mode code responds appropriately and finishes the requested service. The processing may differ depending on whether there was a user-mode exception.
  4. On return to user mode, the kernel-mode code may indicate whether there was a nested exception. If there was, the stack is not restored to its original state (since no unwinding has happened yet).
  5. User-mode code checks whether there was such an exception. If there was, the call stack is forged to include the nested user-mode call, and the exception propagates.
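The five steps above can be sketched as a small Python model. All names here (`callback_wrapper`, `kernel_service`, `user_call`, `NestedUserException`) are invented for the illustration, and Python's own call stack stands in for what are really separate user and kernel stacks, so treat it as a picture of the control flow rather than of the actual Windows implementation.

```python
class NestedUserException(Exception):
    """Hypothetical marker: an exception re-raised after the kernel returned."""


def callback_wrapper(callback):
    # Step 1: the user-mode wrapper stops the exception from propagating.
    try:
        callback()
        return None
    except Exception as exc:
        # Step 2: control goes back to the "kernel" as in the normal flow,
        # carrying the captured exception along.
        return exc


def kernel_service(callback):
    pending = callback_wrapper(callback)
    # Step 3: the kernel finishes the service, possibly differently on failure.
    status = "completed" if pending is None else "completed_with_exception"
    # Step 4: on return to user mode, report whether a nested exception occurred.
    return status, pending


def user_call(callback):
    status, pending = kernel_service(callback)
    # Step 5: user-mode code checks for a nested exception and propagates it.
    if pending is not None:
        raise NestedUserException(status) from pending
    return status
```

For example, `user_call(lambda: None)` returns normally, while `user_call(lambda: 1 / 0)` raises `NestedUserException` with the original `ZeroDivisionError` attached as its cause. The exception never travels "through" the kernel; it is caught on one side and raised again on the other.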

So an exception that crosses the kernel-user boundary is an emulation. There's no such thing natively.