Why is there a Linux kernel policy to never break user space?

The reason is not a historical one but a practical one. There are many many many programs that run on top of the Linux kernel; if a kernel interface breaks those programs then everybody would need to upgrade those programs.

Now it's true that most programs do not in fact depend on kernel interfaces directly (the system calls), but only on interfaces of the C standard library (C wrappers around the system calls). Oh, but which standard library? Glibc? uClibc? Dietlibc? Bionic? Musl? etc.
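
To make the distinction concrete, here is a minimal sketch (using getpid() only because it is about the simplest call there is): the same kernel service can be reached either through the libc wrapper or through the raw system-call interface that the kernel promises to keep stable.

```c
/* Minimal sketch: the same kernel service reached two ways.
 * Build: gcc -o getpid_demo getpid_demo.c (Linux-specific: uses syscall(2)). */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>        /* getpid(): the C-library wrapper    */
#include <sys/syscall.h>   /* SYS_getpid: the raw syscall number */

int main(void)
{
    pid_t via_libc   = getpid();                    /* goes through the standard library */
    pid_t via_kernel = (pid_t)syscall(SYS_getpid);  /* talks to the kernel ABI directly  */

    /* Both must agree: the libc call is just a thin wrapper
     * around the stable kernel system call. */
    printf("via libc: %ld, via raw syscall: %ld\n", (long)via_libc, (long)via_kernel);
    return 0;
}
```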

But there are also many programs that implement OS-specific services and depend on kernel interfaces that are not exposed by the standard library. (On Linux, many of these are offered through /proc and /sys.)
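
As a rough sketch of what such programs look like (the paths are just common examples; any other kernel-exported file would do), there is no library API involved beyond plain file I/O, so the file's contents and location effectively *are* the interface:

```c
/* Sketch of a program that depends on kernel-exported files rather than on a
 * C-library API; the paths are illustrative of typical /proc and /sys use. */
#include <stdio.h>

static void dump_file(const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return;
    }
    if (fgets(buf, sizeof buf, f))     /* these files are plain text...        */
        printf("%s: %s", path, buf);   /* ...but their format IS the interface */
    fclose(f);
}

int main(void)
{
    dump_file("/proc/uptime");           /* non-hardware kernel service info      */
    dump_file("/sys/class/net/lo/mtu");  /* hardware/driver-style sysfs attribute */
    return 0;
}
```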

And then there are statically compiled binaries. If a kernel upgrade breaks one of these, the only solution would be to recompile them, and that assumes you even have the source: Linux does support proprietary software too.
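
To see why static binaries depend on the kernel alone, consider a trivial program like the sketch below: built with gcc's -static flag it embeds its own copy of the C library, so the only interface left between it and the system at run time is the kernel's syscall ABI.

```c
/* Trivial program to be linked statically:
 *     gcc -static -o hello hello.c
 * The resulting binary carries its own copy of the C library, so the only
 * interface it relies on at run time is the kernel's system-call ABI. */
#include <stdio.h>

int main(void)
{
    puts("hello from a static binary");
    return 0;
}
```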

Even when the source is available, gathering it all can be a pain. Especially when you're upgrading your kernel to fix a bug with your hardware. People often upgrade their kernel independently from the rest of their system because they need the hardware support. In the words of Linus Torvalds:

Breaking user programs simply isn't acceptable. (…) We know that people use old binaries for years and years, and that making a new release doesn't mean that you can just throw that out. You can trust us.

He also explains that one reason to make this a strong rule is to avoid dependency hell where you'd not only have to upgrade another program to get some newer kernel to work, but also have to upgrade yet another program, and another, and another, because everything depends on a certain version of everything.

It's somewhat ok to have a well-defined one-way dependency. It's sad, but inevitable sometimes. (…) What is NOT ok is to have a two-way dependency. If user-space HAL code depends on a new kernel, that's ok, although I suspect users would hope that it wouldn't be "kernel of the week", but more a "kernel of the last few months" thing.

But if you have a TWO-WAY dependency, you're screwed. That means that you have to upgrade in lock-step, and that just IS NOT ACCEPTABLE. It's horrible for the user, but even more importantly, it's horrible for developers, because it means that you can't say "a bug happened" and do things like try to narrow it down with bisection or similar.

In userspace, those mutual dependencies are usually resolved by keeping different library versions around; but you only get to run one kernel, so it has to support everything people might want to do with it.

Officially,

backward compatibility for [system calls declared stable] will be guaranteed for at least 2 years.

In practice though,

Most interfaces (like syscalls) are expected to never change and always be available.

What does change more often are interfaces that are only meant to be used by hardware-related programs, in /sys. (/proc, on the other hand, which since the introduction of /sys has been reserved for non-hardware-related services, pretty much never breaks in incompatible ways.)

In summary,

breaking user space would require fixes on the application level

and that's bad because there's only one kernel, which people want to upgrade independently of the rest of their system, but there are many many applications out there with complex interdependencies. It's easier to keep the kernel stable than to keep thousands of applications up-to-date on millions of different setups.


In any inter-dependent system there are basically two choices: abstraction and integration. (I am purposely not using technical terms.) With "abstraction", you're saying that when you make a call to an API, the result will always be the same, even if the code behind the API changes. For example, when we call fs.open() we don't care whether it's a network drive, an SSD, or a hard drive; we will always get an open file descriptor that we can do stuff with. With "integration", the goal is to provide the "best" way to do a thing, even if the way changes. For example, opening a file may be different for a network share than for a file on disk. Both approaches are used pretty extensively in the modern Linux desktop.
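
As a rough sketch of the abstraction side (written with the POSIX open()/read() calls that fs.open() stands in for here; the path is only a placeholder), the very same code works whether the file lives on an SSD, a spinning disk, or a network mount:

```c
/* Sketch: the caller neither knows nor cares what kind of storage backs the
 * path; the contract (a file descriptor you can read from) is the same. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/etc/hostname";   /* could be local disk, NFS, tmpfs... */
    char buf[128];

    int fd = open(path, O_RDONLY);        /* always yields a file descriptor */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = read(fd, buf, sizeof buf - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("read %zd bytes: %s", n, buf);
    }
    close(fd);
    return 0;
}
```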

From a developer's point of view it's a question of "works with any version" or "works with a specific version". A great example of this is OpenGL. Most games are set to work with a specific version of OpenGL; it doesn't matter if you're compiling from source. If the game was written to use OpenGL 1.1 and you're trying to get it to run on 3.x, you're not going to have a good time. On the other end of the spectrum, some calls are expected to work no matter what. For example, when I call fs.open() I don't want to care what kernel version I am on. I just want a file descriptor.

There are benefits to each way. Integration provides "newer" features at the cost of backwards compatibility, while abstraction provides stability over "newer" calls. It's important to note, though, that it's a matter of priority, not possibility.

From a communal standpoint, without a really really good reason, abstraction is always better in a complex system. For example, imagine if fs.open() worked differently depending on kernel version. Then a simple file-system interaction library would need to maintain several hundred different "open file" methods (or blocks, probably). When a new kernel version came out, you wouldn't be able to just "upgrade"; you would have to test every single piece of software you used. Kernel 6.2.2 (fake) might just break your text editor.
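
Here is a deliberately fake sketch of that nightmare: the version check and the open_v1()/open_v2() helpers are invented purely for illustration, but this is the shape every library would take if the kernel did not keep fs.open()-style calls stable.

```c
/* Deliberately fake sketch of version-dependent "open file" logic.
 * open_v1()/open_v2() and the 6.2.x special case are invented to show the
 * shape of the problem; no real kernel behaves this way. */
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

static int open_v1(const char *path) { (void)path; return -1; }  /* pretend pre-6.2 path */
static int open_v2(const char *path) { (void)path; return -1; }  /* pretend 6.2.x quirk  */

static int open_file_compat(const char *path)
{
    struct utsname u;
    uname(&u);                                /* ask which kernel we are running on */

    if (strncmp(u.release, "6.2.", 4) == 0)   /* special-case one release... */
        return open_v2(path);
    /* ...and another, and another: one branch per kernel ever shipped. */
    return open_v1(path);
}

int main(void)
{
    printf("open_file_compat: %d\n", open_file_compat("/etc/hostname"));
    return 0;
}
```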

For a real-world example, OS X tends not to care about breaking user space. They aim for "integration" over "abstraction" more frequently, and at every major OS update, things break. That's not to say one way is better than the other; it's a choice and a design decision.

Most importantly, the Linux ecosystem is filled with awesome open-source projects, where people or groups work on the project in their free time, or because the tool is useful. With that in mind, the second it stops being fun and starts being a pain, those developers will go somewhere else.

For example, I submitted a patch to BuildNotify.py. Not because I am altruistic, but because I use the tool, and I wanted a feature. It was easy, so here, have a patch. If it were complicated, or cumbersome, I would not use BuildNotify.py and I would find something else. If every time a kernel update came out my text editor broke, I would just use a different OS. My contributions to the community (however small) would not continue or exist, and so on.

So, the design decision was made to abstract system calls, so that when I do fs.open() it just works. That means maintaining fs.open long after fs.open2() gained popularity.
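
A real-world parallel (my analogy, not a claim about a literal fs.open2()): the kernel gained the newer openat(2) interface years ago and modern code tends to prefer it, yet plain open(2) keeps working, so old binaries never have to migrate.

```c
/* Sketch: old and new interfaces coexist. openat(2) is the newer, more
 * flexible call, but open(2) is still maintained so existing programs keep
 * working unchanged. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd_old = open("/etc/hostname", O_RDONLY);              /* classic interface */
    int fd_new = openat(AT_FDCWD, "/etc/hostname", O_RDONLY);  /* newer interface   */

    printf("open() fd: %d, openat() fd: %d\n", fd_old, fd_new);

    if (fd_old >= 0) close(fd_old);
    if (fd_new >= 0) close(fd_new);
    return 0;
}
```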

Historically, this is the goal of POSIX systems in general: "here are a set of calls and expected return values, you figure out the middle." Again, for portability reasons. Why Linus chooses to use that methodology is internal to his brain, and you would have to ask him to know exactly why. If it were me, however, I would choose abstraction over integration on a complex system.


It's a design decision and choice. Linus wants to be able to guarantee to user-space developers that, except in extremely rare and exceptional (e.g. security-related) circumstances, changes in the kernel will not break their applications.

The pros are that userspace devs won't find their code suddenly breaking on new kernels for arbitrary and capricious reasons.

The cons are that the kernel has to keep old code and old syscalls etc around forever (or, at least, long past their use-by dates).