What is "namespace cleanliness", and how does glibc achieve it?

First, note that the identifier read is not reserved by ISO C at all. A strictly conforming ISO C program can have an external variable or function called read. Yet POSIX has a function called read. So how can a POSIX platform provide read and still accept that C program? After all, fread and fgets probably use read; won't they break?

One way would be to split all the POSIX stuff into separate libraries: the user has to link -lio or whatever to get read and write and other functions (and then have fread and getc use some alternative read function, so they work even without -lio).

The approach in glibc is not to use symbols like read, but instead stay out of the way by using alternative names like __libc_read in a reserved namespace. The availability of read to POSIX programs is achieved by making read a weak alias for __libc_read. Programs which make an external reference to read, but do not define it, will reach the weak symbol read which aliases to __libc_read. Programs which define read will override the weak symbol, and their references to read will all go to that override.

The important part is that this has no effect on __libc_read. Moreover, the library itself, where it needs to use the read function, calls its internal __libc_read name that is unaffected by the program.
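To make the mechanism concrete, here is a minimal sketch of the library side, assuming GCC or Clang on an ELF platform (the weak and alias attributes are GNU extensions). The names __demo_read, demo_read and demo_getc are hypothetical stand-ins for __libc_read, read and a stdio function; this is not glibc's actual source.

```c
#include <stddef.h>

/* Library side. We are playing the role of the implementation here,
   which is why a name with a leading double underscore is allowed. */

/* Internal implementation under a reserved name. It pretends to read
   len bytes by filling the buffer with 'x'. */
size_t __demo_read(char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        buf[i] = 'x';
    return len;
}

/* Public name: a weak alias for the internal one. A program that defines
   its own demo_read replaces this symbol at link time, while __demo_read
   stays untouched. */
size_t demo_read(char *buf, size_t len)
    __attribute__((weak, alias("__demo_read")));

/* Another library function. It calls the internal name, so it keeps
   working even if the program overrides demo_read. */
int demo_getc(void)
{
    char c;
    return __demo_read(&c, 1) == 1 ? (unsigned char)c : -1;
}
```

A program that only references demo_read reaches __demo_read through the alias. A program that defines its own demo_read gets its own definition instead, and demo_getc does not notice.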

So all of this adds up to a kind of cleanliness. It's not a general form of namespace cleanliness feasible in a situation with many components, but it works in a two-party situation where our only requirement is to separate "the system library" and "the user application".

OK, first some basics about the C language as specified by the standard. In order that you can write C applications without concern that some of the identifiers you use might clash with external identifiers used in the implementation of the standard library or with macros, declarations, etc. used internally in the standard headers, the language standard splits up possible identifiers into namespaces reserved for the implementation and namespaces reserved for the application. The relevant text is:

7.1.3 Reserved identifiers

Each header declares or defines all identifiers listed in its associated subclause, and optionally declares or defines identifiers listed in its associated future library directions subclause and identifiers which are always reserved either for any use or for use as file scope identifiers.

- All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use.

- All identifiers that begin with an underscore are always reserved for use as identifiers with file scope in both the ordinary and tag name spaces.

- Each macro name in any of the following subclauses (including the future library directions) is reserved for use as specified if any of its associated headers is included; unless explicitly stated otherwise (see 7.1.4).

- All identifiers with external linkage in any of the following subclauses (including the future library directions) and errno are always reserved for use as identifiers with external linkage.

- Each identifier with file scope listed in any of the following subclauses (including the future library directions) is reserved for use as a macro name and as an identifier with file scope in the same name space if any of its associated headers is included.

No other identifiers are reserved. If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved identifier as a macro name, the behavior is undefined.

The parts that matter here are the first bullet point and the closing "No other identifiers are reserved." As examples, the identifier read is reserved for the application in all contexts ("no other..."), but the identifier __read is reserved for the implementation in all contexts (bullet point 1).
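As a rough illustration of the two underscore rules, here is a small sketch that classifies an identifier by its prefix alone. The function and enum names are made up for this example, and it deliberately ignores the other rules (names listed in the library subclauses such as printf or errno, and the future library directions), so treat it as a simplification rather than a full implementation of 7.1.3.

```c
#include <ctype.h>

enum reservation {
    NOT_RESERVED_BY_PREFIX = 0,  /* e.g. read, my_var */
    RESERVED_FILE_SCOPE    = 1,  /* e.g. _foo: reserved at file scope */
    RESERVED_ANY_USE       = 2   /* e.g. __read, _Exit: reserved everywhere */
};

/* Classify an identifier using only the two "begins with an underscore"
   rules quoted above. */
enum reservation classify_identifier(const char *id)
{
    if (id[0] != '_')
        return NOT_RESERVED_BY_PREFIX;
    if (id[1] == '_' || isupper((unsigned char)id[1]))
        return RESERVED_ANY_USE;
    return RESERVED_FILE_SCOPE;
}
```

Under these rules classify_identifier("read") yields NOT_RESERVED_BY_PREFIX and classify_identifier("__read") yields RESERVED_ANY_USE, which matches the two examples in the text.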

Now, POSIX defines a lot of interfaces that are not part of the standard C language, and libc implementations might have a good deal more not covered by any standards. That's okay so far, assuming the tooling (linker) handles it correctly. If the application doesn't include <unistd.h> (outside the scope of the language standard), it can safely use the identifier read for any purpose it wants, and nothing breaks even though libc contains an identifier named read.

The problem is that a libc for a unix-like system is also going to want to use the function read to implement parts of the base C language's standard library, like fgetc (and all the other stdio functions built on top of it). This is a problem, because now you can have a strictly conforming C program such as:

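#include <stdio.h>
#include <stdlib.h>
/* The application's own read; ISO C does not reserve this name. */
void read(void)
{
    abort();
}
int main(void)
{
    getchar();  /* must not end up calling the read defined above */
    return 0;
}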

and, if libc's stdio implementation is calling read as its backend, it will end up calling the application's function (not to mention, with the wrong signature, which could break/crash for other reasons), producing the wrong behavior for a simple, strictly conforming program.

The solution here is for libc to have an internal function named __read (or whatever other name in the reserved namespace you like) that can be called to implement stdio, and have the public read function call that (or, be a weak alias for it, which is a more efficient and more flexible mechanism to achieve the same thing with traditional unix linker semantics; note that there are some namespace issues more complex than read that can't be solved without weak aliases).
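For the simpler variant without weak aliases (the public function just calls the internal one), a sketch in plain ISO C might look like the following. All names are hypothetical, and the "backend" reads from a fixed string instead of a file descriptor so the example stays self-contained.

```c
#include <stddef.h>

/* Hypothetical libc internals; the names are illustrative only. */

static const char demo_input[] = "hi";  /* stand-in for the input stream */
static size_t demo_pos;

/* Internal backend under a reserved name. Copies up to len bytes of the
   remaining input into buf and returns the number of bytes copied. */
long __demo_backend_read(char *buf, size_t len)
{
    size_t n = 0;
    while (n < len && demo_input[demo_pos] != '\0')
        buf[n++] = demo_input[demo_pos++];
    return (long)n;
}

/* Public, non-reserved name: a thin wrapper around the backend. */
long demo_public_read(char *buf, size_t len)
{
    return __demo_backend_read(buf, len);
}

/* stdio-like function: it calls only the reserved name, so an
   application-defined demo_public_read cannot hijack it. */
int demo_getchar(void)
{
    char c;
    return __demo_backend_read(&c, 1) == 1 ? (unsigned char)c : -1;
}
```

The wrapper costs one extra call compared with an alias, which is part of why the alias approach is described as more efficient.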

Kaz and R.. have explained why a C library will, in general, need to have two names for functions such as read, that are called by both applications and other functions within the C library. One of those names will be the official, documented name (e.g. read) and one of them will have a prefix that makes it a name reserved for the implementation (e.g. __read).

The GNU C Library has three names for some of its functions: the official name (read) plus two different reserved names (e.g. both __read and __libc_read). This is not because of any requirements made by the C standard; it's a hack to squeeze a little extra performance out of some heavily-used internal code paths.

The compiled code of GNU libc, on disk, is split into several shared objects: libc.so.6, ld.so.1, libpthread.so.0, libm.so.6, libdl.so.2, etc. (exact names may vary depending on the underlying CPU and OS). The functions in each shared object often need to call other functions defined within the same shared object; less often, they need to call functions defined within a different shared object.

Function calls within a single shared object are more efficient if the callee's name is hidden, that is, only usable by callers within that same shared object. This is because globally visible names can be interposed. Suppose that both the main executable and a shared object define the name __read. Which one will be used? The ELF specification says that the definition in the main executable wins, and all calls to that name from anywhere must resolve to that definition. (The ELF specification is language-agnostic and does not make any use of the C standard's distinction between reserved and non-reserved identifiers.)

Interposition is implemented by sending all calls to globally visible symbols through the procedure linkage table, which involves an extra layer of indirection and a runtime-variable final destination. Calls to hidden symbols, on the other hand, can be made directly.

read is defined in libc.so.6. It is called by other functions within libc.so.6; it's also called by functions within other shared objects that are also part of GNU libc; and finally it's called by applications. So, it is given three names:

- __libc_read, a hidden name used by callers from within libc.so.6. (nm --dynamic /lib/libc.so.6 | grep read will not show this name.)
- __read, a visible reserved name, used by callers from within libpthread.so.0 and other components of glibc.
- read, a visible normal name, used by callers from applications.
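The three-name arrangement can be sketched with GNU attributes roughly as follows, assuming GCC or Clang on an ELF platform. The names are hypothetical stand-ins, and glibc's real source uses its own macros for this, so this only illustrates the idea. Visibility only matters once the code is built into a shared object; in an ordinary executable all three names simply behave the same.

```c
/* 1. Hidden implementation: callable only from inside the same shared
      object, so calls to it need no PLT indirection. */
__attribute__((visibility("hidden")))
long __demolib_read(long n)
{
    return n;   /* placeholder body for the sketch */
}

/* 2. Visible reserved name, meant for other components of the same
      library suite. */
long __demo3_read(long n) __attribute__((alias("__demolib_read")));

/* 3. Visible normal name for applications, weak so that a program may
      define its own function with this name. */
long demo3_read(long n) __attribute__((weak, alias("__demolib_read")));

/* A caller inside the same shared object uses the hidden name. */
long demolib_internal_caller(long n)
{
    return __demolib_read(n) * 2;
}
```

All three names refer to the same code; they differ only in who is allowed to see them and whether they can be overridden.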

Sometimes the hidden name has a __libc prefix and the visible implementation name has just two underscores; sometimes it's the other way around. This doesn't mean anything. It's because GNU libc has been under continuous development since the 1990s and its developers have changed their minds about internal conventions several times, but haven't always bothered to fix up all the old-style code to match the new convention (sometimes compatibility requirements mean we can't fix up the old code, even).
