Object files vs Library files and why?

OK, let's start at the beginning.

A programmer (you) creates some source files, .cpp and .h. The difference between those two kinds of file is just a convention:

  • .cpp are meant to be compiled
  • .h are meant to be included in other source files

but nothing (except the fear of creating something unmaintainable) forbids you from including .cpp files in other .cpp files.

In the early days of C (the ancestor of C++), .h files only contained declarations of functions, structures (without methods in C!) and constants. You could also have macros (#define), but apart from that, no code should be in a .h file.
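
As a sketch, such a C-style header (the names here are made up for illustration) contains only declarations and macros, never code:

/* geometry.h - declarations only, no code */
#define PI 3.14159265358979     /* a macro constant */

struct point {                  /* a structure - no methods in C! */
    double x;
    double y;
};

double distance(struct point a, struct point b);  /* function declaration */

The matching definition of distance lives in some .c (or .cpp) file that includes this header.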

In C++, you must also put the implementation of template classes in the .h file, because C++ uses templates and not generics like Java, so each instantiation of a template is a different class.
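
As a sketch, a hypothetical template like this must carry its implementation in the header, so that every including .cpp file can instantiate it:

// stack.h - the method bodies must be visible here
#include <vector>

template <typename T>
class Stack {
public:
    void push(const T& value) { items_.push_back(value); }
    T pop() { T v = items_.back(); items_.pop_back(); return v; }
private:
    std::vector<T> items_;
};

Stack<int> and Stack<double> are two distinct classes, each generated by the compiler in whatever compilation unit instantiates them.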

Now for the answer to your question:

Each .cpp file is a compilation unit. The compiler will:

  • in the preprocessing phase, process all #include and #define directives to (internally) generate the full source code
  • compile it to object format (generally .o or .obj)

This object format contains:

  • relocatable code (that is, addresses of code and variables are expressed relative to symbols, to be fixed up at link time)
  • exported symbols: the symbols that can be used from other compilation units (functions, classes, global variables)
  • imported symbols: the symbols used in that compilation unit and defined in other compilation units

Then (let's forget libraries for now) the linker will take all the compilation units together and resolve the symbols to create an executable file.
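
A minimal sketch of that whole pipeline, with made-up file names (the commands in the comments assume a Unix-like toolchain such as g++):

// greet.cpp - one compilation unit; compile with: g++ -c greet.cpp
#include <cstdio>
void greet() {              // exported symbol: greet()
    std::puts("hello");     // imported symbol: puts, defined in the C library
}

// main.cpp - another compilation unit; compile with: g++ -c main.cpp
void greet();               // declaration only: greet is an imported symbol here
int main() {
    greet();                // left unresolved by the compiler
    return 0;
}

// g++ greet.o main.o -o hello
// The linker matches main.o's imported greet() with greet.o's exported one.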

One step further with static libraries.

A static library (generally .a or .lib) is more or less a bunch of object files put together. It exists so that you do not have to list individually every object file you need (those whose exported symbols you use). Linking a library containing the object files you use and linking the object files themselves is exactly the same. Simply adding -lc, -lm or -lX11 is shorter than adding hundreds of .o files. But at least on Unix-like systems, a static library is an archive and you can extract the individual object files if you want to.
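
As a sketch, assuming a Unix-like toolchain (g++ and ar) and made-up file names:

// util1.cpp
int add(int a, int b) { return a + b; }

// util2.cpp
int mul(int a, int b) { return a * b; }

// g++ -c util1.cpp util2.cpp          compile to util1.o and util2.o
// ar rcs libutil.a util1.o util2.o    archive them into a static library
// g++ main.o -L. -lutil -o app        same effect as listing the .o files

The linker will only pull in the members of libutil.a whose exported symbols are actually needed.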

Dynamic libraries are completely different. A dynamic library should be seen as a special kind of executable file. They are generally built with the same linker that creates normal executables (but with different options). But instead of simply declaring an entry point (on Windows, a .dll file does declare an entry point that can be used for initializing the .dll), they declare a list of exported (and imported) symbols. At runtime, there are system calls that let you get the addresses of those symbols and use them almost normally. But in fact, when you call a routine in a dynamically loaded library, the code resides outside of what the loader initially loads from your own executable file. Generally, binding all the used symbols from a dynamic library happens either at load time, directly by the loader (on Unix-like systems), or through import libraries on Windows.
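
On POSIX systems, those runtime calls are dlopen and dlsym; here is a minimal sketch (error handling mostly omitted, and on older systems you may need to link with -ldl):

// dlopen_demo.cpp - load libm at runtime and look up the symbol "cos"
#include <dlfcn.h>
#include <cstdio>

int main() {
    void* handle = dlopen("libm.so.6", RTLD_LAZY);  // load the math library
    if (!handle) return 1;
    // get the address of the exported symbol and cast it to a function pointer
    double (*my_cos)(double) =
        reinterpret_cast<double (*)(double)>(dlsym(handle, "cos"));
    if (my_cos)
        std::printf("cos(0) = %f\n", my_cos(0.0));  // use it almost normally
    dlclose(handle);
    return 0;
}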

And now a look back at the include files. Neither good old K&R C nor the most recent C++ has a notion of a global module to import, as for example Java or C# do. In those languages, when you import a module, you get both the declarations of its exported symbols and an indication that it will be linked in later. But in C++ (and the same in C) you have to do it separately:

  • first, declare the functions or classes - done by including a .h file in your source, so that the compiler knows what they are
  • next, link the object module, static library or dynamic library to actually get access to the code

Historically, an object file gets linked either completely or not at all into an executable (nowadays there are exceptions, with function-level linking and whole-program optimization becoming more popular), so if one function of an object file is used, the executable receives all of them.

To keep executables small and free of dead code, the standard library is split into many small object files (typically in the order of hundreds). Having hundreds of small files is very undesirable for efficiency reasons: opening many files is inefficient, and every file has some slack (unused disk space at the end of the file). This is why object files get grouped into libraries, which are kind of like ZIP files with no compression. At link time, the whole library is read, and every object file from that library that resolves a symbol already known to be unresolved when the linker started reading the library (or that is needed by such an object file) is included in the output. This likely means that the whole library has to be in memory at once to recursively resolve dependencies. As memory was quite limited, the linker only loads one library at a time, so a library mentioned later on the linker's command line cannot use functions from a library mentioned earlier on it.
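
As a sketch of that ordering rule, with two made-up libraries where an object in libfoo.a calls a function defined in libbar.a:

// foo.cpp (archived into libfoo.a)
int bar_helper();                   // defined in libbar.a
int foo() { return bar_helper(); }

// bar.cpp (archived into libbar.a)
int bar_helper() { return 42; }

// main.cpp
int foo();
int main() { return foo(); }

// g++ main.o -L. -lfoo -lbar    works: bar_helper is still unresolved when libbar is read
// g++ main.o -L. -lbar -lfoo    fails on a classic single-pass linker: when libbar
//                               was read, nothing needed bar_helper yet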

To improve performance (loading a whole library takes some time, especially from slow media like floppy disks), libraries often contain an index that tells the linker which object files provide which symbols. Indexes are created by tools like ranlib or the library management tool (Borland's tlib has a switch to generate the index). As soon as there is an index, libraries are definitely more efficient to link than single object files, even if all object files are in the disk cache and loading files from the disk cache is free.

You are completely right that one can replace .o or .a files while keeping the header files, and change what the functions do (or how they do it). This is used by the LGPL license, which requires the author of a program that uses an LGPL-licensed library to give the user the possibility to replace that library with a patched, improved or alternative implementation. Shipping the object files of your own application (possibly grouped into library files) is enough to give the user the required freedom; there is no need to ship the source code (as with the GPL).

If two sets of libraries (or object files) can be used successfully with the same header files, they are said to be ABI compatible, where ABI means Application Binary Interface. This is narrower than just having two sets of libraries (or object files), each accompanied by its respective headers, with the guarantee that you can use each library with the headers for that specific library. That would be called API compatibility, where API means Application Programming Interface. As an example of the difference, look at the following three header files:

File 1:

typedef struct {
    int a;
    int __undocumented_member;
    int b;
} magic_data;
magic_data* calculate(int);

File 2:

struct __tag_magic_data {
    int a;
    int __padding;
    int b;
};
typedef struct __tag_magic_data magic_data;
magic_data* calculate(const int);

File 3:

typedef struct {
    int a;
    int b;
    int c;
} magic_data;
magic_data* do_calculate(int, void*);
#define calculate(x) do_calculate(x, 0)

The first two files are not identical, but they provide interchangeable definitions that (as far as I can tell) do not violate the "one definition rule", so a library providing File 1 as its header file can just as well be used with File 2 as the header file. On the other hand, File 3 provides a very similar interface to the programmer (which might be identical in everything the library author promises the user of the library), but code compiled with File 3 fails to link against a library designed to be used with File 1 or File 2, as the library designed for File 3 would not export calculate, but only do_calculate. Also, the structure has a different member layout, so using File 1 or File 2 instead of File 3 will not access b correctly. The libraries providing File 1 and File 2 are ABI compatible, but all three libraries are API compatible (assuming that c and the more capable function do_calculate do not count towards that API).
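
To see the difference in practice, consider this hypothetical caller (magic.h stands for whichever of the three files above is used; assume a 4-byte int and no unusual padding):

/* client.c - this source compiles against all three headers (API compatibility) */
#include "magic.h"

int read_b(void) {
    magic_data* d = calculate(7);
    return d->b;   /* offset 8 with File 1 or File 2, but offset 4 with File 3 */
}

The object file built with File 1 or File 2 references the symbol calculate and reads b at offset 8; the one built with File 3 references do_calculate (because calculate is a macro there) and reads b at offset 4, so it only links and runs correctly with File 3's library.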

For dynamic libraries (.dll, .so) things are completely different: they started appearing on systems where multiple (application) programs can be loaded at the same time (which is not the case on DOS, but is the case on Windows). It is wasteful to have the same implementation of a library function in memory multiple times, so it is loaded only once and multiple applications use it. For dynamic libraries, the code of the referenced function is not included in the executable file; just a reference to the function inside a dynamic library is included. (For Windows NE/PE, it is specified which DLL has to provide which function. For Unix .so files, only the function names and a set of libraries are specified.) The operating system contains a loader, a.k.a. dynamic linker, that resolves these references and loads dynamic libraries if they are not already in memory at the time a program is started.
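
As a sketch of the load-time variant on a Unix-like system (names made up):

// hello.cpp - becomes the dynamic library:
//   g++ -fPIC -shared hello.cpp -o libhello.so
#include <cstdio>
void hello() { std::puts("hello from the .so"); }

// main.cpp - the executable stores only a reference to hello(), not its code:
//   g++ main.cpp -L. -lhello -o app
void hello();
int main() { hello(); return 0; }

// At startup, the dynamic linker (e.g. ld.so) locates libhello.so
// (here e.g. via LD_LIBRARY_PATH=.) and resolves the reference.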
