Why do we need header files in C and C++?

31.03.18    C++ and C

Header files are an important part of programming in C, but they are also a clunky, error-prone, and redundant nuisance. How did they come to be? What purpose do they serve?

I have repeatedly heard people defend them on their technical merit. The usual claim is that header files separate the interface from the implementation. The function signatures are indeed the interface, but there is no reason in principle to repeat them in a separate file that has to be kept in sync with the implementation by hand. Other compiled languages with a similar compilation model (object files or something like them, allowing incremental compilation), such as Java, C#, Go, D and Rust, get along without header files. So why do C and C++ need them?

The C programming language is designed to be compiled top to bottom in a single stream. This follows from its BCPL and B heritage. Supposedly this makes writing compilers easier because you do not need to design a complex tree of data and metadata (something like an AST). Instead the compiler only needs to fully understand the statement it is currently analyzing. Everything it needs to remember is put into a set of simple tables which are filled as the compiler moves through the source. These tables can be significantly faster and smaller in memory than the equivalent trees.

By design, the compilation of a line never depends on table entries that are added later. The compiler moves sequentially through the source, drawing on the tables to analyze the current line, updating them, and then moving on. This simplifies implementing a C compiler because it puts the burden of ordering interdependent function and type definitions onto the programmer. Ease of implementation, speed and a low memory footprint were important and sensible goals during the design of C.
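As a minimal sketch of what that burden looks like (the names are made up): a type has to be completely defined before the first line that needs it, and it is the programmer's job to arrange the file accordingly.

// The definition of struct point must appear before any code
// that needs its layout; swapping the two blocks makes the
// compiler reject the file.
struct point { int x; int y; };

int manhattan(struct point p) {
    return p.x + p.y; // needs the complete definition above
}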

But C also has a much richer and more complicated type system than its predecessors. The compiler needs to keep track of types because the compilation result depends on them.

// Depending on which of these declarations you choose ...
unsigned short a, b;
float a; int b;
double a, b;

// ... the code generated for this one line differs dramatically
a *= b;

Moreover, the C compiler type-checks expressions, including function calls (in non-archaic versions of the language). For this to work, the compiler needs to have already analyzed the called function so that it appears in the tables. But that is only possible if the programmer can actually sort the functions topologically, i.e. there are no cyclic dependencies between functions. Yet C needs to be able to compile programs with cyclically dependent functions. To solve this problem C has prototypes: a prototype declares that a function exists and which types it uses without defining what it does.

// This is a prototype:
int foo(int a, char *s);

// Then foo can be used before it is defined
int bar(void) {
    return foo(3, "x");
}

// Only later is foo actually defined
int foo(int a, char *s) {
    // It could call bar here, so the two functions may depend on each other
    return s ? a : 0;
}

Interestingly, prototypes do not break the compiler's simple streaming nature. But how does it generate the call to foo in bar when it does not know where the function foo will end up in the binary? It does not! C compilers do not generate executable binaries directly. Instead they produce object code. Object code is like a list of templates for binary code: it is organized into named sections containing mostly compiled machine code, but the code is interspersed with reference "holes" that still have to be replaced with the addresses of the symbols they refer to.

The compiler just inserts a reference to an entity named foo into the object code at compile time without knowing where it is. The linker then collects all code and references from all object files, arranges the binary parts, thereby fixing the functions' addresses, and resolves the references by replacing them with the correct addresses. This is a phase separate from compilation: link time. Modern compilers mostly hide the existence of this phase from their users by automagically invoking the linker on temporarily generated object files after compilation.

The linker consumes object code files and produces actual executable binaries. It does not care when or how the object files were created. This enables separate compilation and the use of (external, pre-compiled, potentially closed-source) libraries. To use functions defined in a separate source file (for example a library), you need to put the prototypes of the functions you use before their first usage. The most convenient place for this is the top of the source file.
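A minimal sketch of what that looks like without header files (the file and function names are made up): the prototype of a function living in another translation unit is simply repeated by hand in every file that calls it.

// mathutil.c - compiled separately into mathutil.o; main.c never sees this file
int twice(int x) {
    return 2 * x;
}

// main.c - compiled on its own
int twice(int x);     // hand-written prototype of the external function

int main(void) {
    return twice(21); // compiles to a call with a "hole" the linker later fills in
}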

But actually repeating every prototype you use in every source file and keeping them all up to date would be a nightmare. Instead C employs a preprocessor, a transparent macro substitution layer which allows you to #include other files. It consumes a source file (most of the time pasting other files into it) and feeds one giant stream of C source code to the actual compiler. An included file can contain the prototypes; it can be reused by multiple source files and it can be shipped alongside a binary library.

This led to the convention of .h files which provide the prototypes of the corresponding .c file. These are "normal" C files because the compiler does not know or care about header files; it just sees a stream of C code. But the header files must not themselves generate binary code. They can only contain function prototypes, struct and union definitions, and preprocessor directives. Otherwise the oblivious compiler would generate the same code multiple times into multiple object files, and the linker would rightfully complain about duplicate definitions.
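Continuing the made-up example from above, the conventional layout looks roughly like this. The #ifndef/#define include guard is one of the preprocessor directives typically found in a header; it keeps the declarations from being pasted twice into the same translation unit.

// mathutil.h - declarations only, no code is generated from this file
#ifndef MATHUTIL_H
#define MATHUTIL_H

int twice(int x);

#endif

// mathutil.c - the one place the definition lives
#include "mathutil.h"

int twice(int x) {
    return 2 * x;
}

// main.c - every consumer just includes the header
#include "mathutil.h"

int main(void) {
    return twice(21);
}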

So now we know what header files are, how they came to be, and why we need them. But how do languages with a similar compilation model avoid the need for them? This again goes back to the table-filling stream nature C was designed with. It allowed the designers to make the C grammar depend on the entries of those tables: the C grammar is not context-free. The meaning of an identifier foo differs before and after the line:

typedef int *foo;

The meaning of the line:

foo x;

is undefined (on a grammar level) before that line but well-defined afterwards. You cannot just parse a C source file; you always need to semantically analyze it to some extent along the way. This includes parsing and analyzing all the files it depends on, recursively. Other languages allow the compiler to parse a source file first and then analyze only the bits it cares about. This enables an import mechanism: symbols from other sources can be imported without effectively compiling those source files. C and C++ cannot do this.

What does the following line of C++ do?

a * b;

Well, it computes a times b, right? (And that is not necessarily a no-op, because of C++ operator overloading.) But if a is a type (think struct a {};), then it declares b as a pointer to an a! This hints at the problems context-sensitive parsing brings with it; any external tooling (linters, documentation generators and so on) has to wrestle with them.
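A small, self-contained C++ sketch of the two readings (the names are made up); the same tokens form an expression in one function and a declaration in the other:

struct a {};                // from here on, "a" names a type

void as_expression() {
    int a = 6, b = 7;       // the local int a hides the type name
    a * b;                  // an expression: multiplies a and b, result discarded
}

void as_declaration() {
    a * b;                  // a declaration: b is a pointer to struct a
    (void)b;                // silence the unused-variable warning
}

int main() {
    as_expression();
    as_declaration();
}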

So, to sum up, C and C++ need header files to provide the function prototypes and type definitions that allow symbols from different compilation units to be used in a type-safe manner. Because the C/C++ grammar is not context-free, the source files themselves cannot simply be parsed and reused for this purpose. Instead, the preprocessor needed to assemble the header and source files before passing them to the actual compiler introduces yet another top-to-bottom-dependent streaming pass. Moreover, this stacking of streams on top of each other is the core reason compiling C++ is so slow.