Getting started with compiling code

When I was first getting into programming, I was mainly focused on interpreted languages like Python, JavaScript or R. But over time, I've learned more and more about compiled languages like C++, Rust and C#. One of the things that I found helped me the most when I got started with this was Harvard University's CS50 lectures on YouTube.

Here, I'll explain how compilation works in as beginner-friendly but concise a way as I can manage. Many others have done the same, but maybe my take will resonate with you. I also hope to build on this in future posts about build systems such as GNU Make or Bazel.

I assume that you have some basic understanding of programming (e.g. in Python) and of computer hardware, most importantly the central processing unit (CPU). I also assume that you have installed a C compiler such as clang, gcc or cl (only a couple of the cl steps are described, though).

Why do we need compilation?

Machine language

For computers to be useful to us, we need to be able to give them instructions. This ultimately means passing on low-level instructions to the central processing unit (CPU), e.g.:

  1. Copy value from (RAM) memory address A to CPU register X.
  2. Copy value from memory address B to CPU register Y.
  3. Add up the values in register X and register Y and store the result in register Z.
  4. Copy the value from register Z to memory address C.

(A register is essentially a memory location that the CPU has very fast and direct access to.)

These instructions are passed to the CPU as a series of 1s and 0s. This is called machine language and, in the end, that's all the CPU understands.
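To make the four steps concrete, here is a toy model of them written in C (the language we'll meet properly later in this post). It is only a sketch: ToyCpu, toy_add and the pretend ram array are names made up for illustration, and a real CPU has far more registers and instructions than this.

```c
#include <stddef.h>

/* Toy model only: three registers named after the steps above. */
typedef struct {
    int x;
    int y;
    int z;
} ToyCpu;

/* Runs the four steps against a pretend RAM array.
   a, b and c are indexes into ram, standing in for memory addresses. */
void toy_add(ToyCpu *cpu, int *ram, size_t a, size_t b, size_t c) {
    cpu->x = ram[a];           /* 1. memory address A -> register X */
    cpu->y = ram[b];           /* 2. memory address B -> register Y */
    cpu->z = cpu->x + cpu->y;  /* 3. X + Y -> register Z */
    ram[c] = cpu->z;           /* 4. register Z -> memory address C */
}
```

With ram holding {2, 3, 0}, calling toy_add(&cpu, ram, 0, 1, 2) leaves 5 in ram[2].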

Assembly language

Since it's hard for us to speak directly in machine language, assembly language (ASM) was developed at the end of the 1940s and is still in use today. ASM is really a set of low-level languages, which mainly consist of commands that correspond directly to CPU instructions but have more human-readable names. Using pseudocode, step 1 in our example (copy memory address A to register X) might look something like mov REGX, ADDRA in ASM, rather than, say, 01101111 00110011 00110011 (in reality you'd need a lot more 0s and 1s).

In order to translate from ASM to machine code, a piece of software called an assembler is used. This enables people to write programs in ASM and then run them through the assembler to get machine code ready to be executed by the computer.
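At its core, an assembler is a translator from names to numbers. The sketch below shows that idea as a lookup table in C. The opcode values and the names ToyOp, TOY_OPS and toy_assemble are invented for this example and don't match any real CPU; a real assembler also has to encode operands, addresses and much more.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical opcode table: these numbers are invented for illustration
   and do not belong to any real instruction set. */
typedef struct {
    const char *mnemonic;
    unsigned char opcode;
} ToyOp;

static const ToyOp TOY_OPS[] = {
    {"mov", 0x01},
    {"add", 0x02},
};

/* Returns the made-up opcode for a mnemonic, or -1 if it is unknown. */
int toy_assemble(const char *mnemonic) {
    size_t count = sizeof TOY_OPS / sizeof TOY_OPS[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(TOY_OPS[i].mnemonic, mnemonic) == 0) {
            return TOY_OPS[i].opcode;
        }
    }
    return -1;
}
```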

Limitations of ASM

Different CPUs accept different sets of instructions (instruction sets). This is clearly illustrated by a somewhat recent change to Apple Mac computers. Pre-2020 Macs came with Intel CPUs, which use one instruction set (x86), while newer Macs have Apple Silicon CPUs, which use another (ARM). Intel CPUs simply don't understand the same instructions as Apple Silicon CPUs do, and vice versa. Since ASM is so tightly tied to specific CPU instructions, you need to speak different assembly languages for x86 and ARM. This means that if all you had was ASM, for your program to run on both older and newer Macs you would need to write it twice - once for the Intel CPUs' instruction set and once for the Apple Silicon CPUs' instruction set.
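Jumping ahead a little to C, which we'll get to shortly: a compiler always knows which instruction set it is targeting, and you can ask it. The function below is a rough sketch with a made-up name (target_instruction_set). The macros it checks are predefined by the compilers themselves, and the #if lines are handled by the preprocessor, which is covered later in this post.

```c
/* Reports which instruction set this file is being compiled for.
   __x86_64__ and __aarch64__ are predefined by gcc and clang,
   _M_X64 and _M_ARM64 by cl. Other targets fall through to the last branch. */
const char *target_instruction_set(void) {
#if defined(__x86_64__) || defined(_M_X64)
    return "x86-64";
#elif defined(__aarch64__) || defined(_M_ARM64)
    return "ARM64";
#else
    return "something else";
#endif
}
```

The same source file gives a different answer depending on the CPU it is compiled for, without any change to the code.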

A more obvious limitation is that even though ASM is easier to write than raw machine code, it's still very tedious and error-prone.

High-level languages

To overcome the limitations of ASM, high-level programming languages like C and Fortran were developed. The term 'high-level' here is in contrast to ASM as a 'low-level' language.

With C, we can write a program "hello.c" that looks like this:

#include "stdio.h"

int main() {
    printf("Hello, world!\n");
}

If you're not already familiar with C, the code might seem a bit confusing. In short, what it says is:

  1. Take all of whatever is in the file "stdio.h" (we'll talk later about how/where this is found) and put it right where the #include "stdio.h" line is.
  2. There is a function main.
  3. main returns an integer (int). We don't have an explicit return statement, but for main specifically, 0 is then implicitly returned.
  4. When main is called, call the function printf with a single argument, namely the string "Hello, world!\n".

Even if this is new to you, it should be clear that C is on a much higher level of abstraction than ASM is. We no longer refer to any specific CPU instructions, and instead we talk about high-level concepts like functions and strings. This lets us write more compact code, reducing the time spent writing it and the risk of making mistakes, while also enabling us to organize our code and have a better overview of what it's doing. Moreover, what we write is no longer bound to any one CPU architecture. I should note that there are some important exceptions to this, but discussing them here would only be distracting.
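For comparison, here is the addition from the machine language section again, this time as a C function. It's a minimal sketch and the name add is just for illustration. Notice that no registers or memory addresses appear anywhere, since choosing those is left to the compiler.

```c
/* Adds two integers. Which registers and CPU instructions end up being
   used is decided by the compiler, not by us. */
int add(int a, int b) {
    return a + b;
}
```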

So, how do we get from high-level code, like our C code, to machine code? We've already seen how it's possible to translate ASM to machine code, so if we can translate from C code to ASM then the whole chain has been solved. That's where compilers come in.

Preprocessing

Before we talk about compiling code, we should discuss preprocessors since it's easy to get confused about how they relate to compilation.

A preprocessor does its work before the compiler (hence the 'pre') and is generally simpler than the compiler is. In the case of C, the preprocessor doesn't actually contain any logic related to C itself as a programming language. Instead, it performs actions based on preprocessor directives, which are lines starting with #.

In "hello.c", there's the line #include "stdio.h". When the preprocessor sees this, it searches for a file called "stdio.h" in the directory containing "hello.c" (because the name is in quotes rather than angle brackets), in any additional directories specified by you as the user, and in some default directories (e.g. /usr/include if you're on Linux). If the file is found, the preprocessor replaces the #include "stdio.h" line with all the contents of "stdio.h".

Try creating a local copy of "hello.c" and then, from a shell in the same directory, run:

# replace with clang if necessary
gcc -E hello.c > hello_preprocessed.c
# or `cl /P hello.c` with cl on Windows, which writes its output to hello.i

This runs the preprocessor against "hello.c" and redirects its output to "hello_preprocessed.c". If you look in the latter file, you'll find that the preprocessor added about 500 lines to our code (the exact number varies between systems) - the entire contents of "stdio.h", along with anything that "stdio.h" itself includes. You can confirm this by looking at the contents of the original file, e.g. on Linux /usr/include/stdio.h.

So, the preprocessor looks at lines starting with # and performs simple operations such as text replacement, in preparation for compilation.
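#include isn't the only directive. As a small sketch of the text replacement idea, the snippet below uses #define. GREETING, SQUARE, greeting and square_of_next are hypothetical names made up for this example. Saving it to a file and running that file through gcc -E should show the macro names replaced by their definitions before the compiler ever sees them.

```c
/* Every later occurrence of GREETING is replaced by the string literal. */
#define GREETING "Hello, world!\n"

/* Macros can take arguments, but this is still plain text replacement. */
#define SQUARE(x) ((x) * (x))

/* After preprocessing: return "Hello, world!\n"; */
const char *greeting(void) {
    return GREETING;
}

/* After preprocessing: return ((n + 1) * (n + 1)); */
int square_of_next(int n) {
    return SQUARE(n + 1);
}
```

The extra parentheses in SQUARE matter precisely because the replacement is textual: without them, SQUARE(n + 1) would expand to n + 1 * n + 1.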

Compilation

Translating code written in a high-level programming language to a low-level ASM language is called compilation, and is done by a compiler. Compilers do this in a series of steps: they analyze and parse the high-level code and then generate low-level output. We won't discuss those details here.

There are many different compilers supporting different high-level programming languages and ASM languages. A popular choice for C is the C compiler included in the GNU Compiler Collection (GCC) project. The project's command line tool is called gcc and is what we used to perform preprocessing before - the -E flag we passed it says to only run the preprocessor. Now, let's try actually compiling the file by passing gcc the flag -S, and the already preprocessed file.

# replace with clang if necessary
gcc -S hello_preprocessed.c
# or `cl /c /FA hello_preprocessed.c` with cl on Windows (/FA writes the ASM listing to hello_preprocessed.asm)

This produces a file "hello_preprocessed.s", which contains ASM code. If you open the file in a text editor, you should see something like this (the exact output depends on your compiler and CPU; this example comes from an Apple Silicon Mac):

    .section    __TEXT,__text,regular,pure_instructions
    .build_version macos, 14, 0 sdk_version 14, 4
    .globl  _main                           ; -- Begin function main
    .p2align    2
_main:                                  ; @main
    .cfi_startproc
; %bb.0:
    stp x29, x30, [sp, #-16]!           ; 16-byte Folded Spill
    .cfi_def_cfa_offset 16
    mov x29, sp
--- snip ---

The compiler has done its job! Now all that's left to get to machine code is assembling the ASM code.
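The ASM for "hello.c" is mostly bookkeeping around a single function call, so it can help to compile something with a bit more arithmetic in it. Here's a small sketch; the function name sum_to and the file name "sum.c" are assumptions of mine. Saving it as "sum.c" and running gcc -S sum.c should give you a "sum.s" where you can look for the loop and the addition, though the exact instructions depend on your compiler and CPU.

```c
/* Adds up the integers 1..n with a plain loop.
   There is no main here, which is fine: -S only compiles, it doesn't
   try to build a runnable program. */
int sum_to(int n) {
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}
```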

Assembling

Just like with compilers, there are many different assemblers. A popular choice, closely tied to GCC, is the GNU Assembler (GAS), whose command line tool is called as. If you can run gcc, you should also be able to run as.

To assemble "hello_preprocessed.s":

as hello_preprocessed.s -o hello.o

(The -o flag simply specifies the output file name; without it, the output is written to "a.out".)

Now, "hello.o" holds machine code, with CPU instructions corresponding to the ASM commands we saw before.
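Unlike the .s file, "hello.o" isn't text, so a text editor won't show anything meaningful. One way to peek at it is to print its first bytes as hexadecimal numbers. The helper below is a minimal sketch (print_hex_prefix is a made-up name, not a standard function), and which bytes you see depends on your operating system's object file format.

```c
#include <stdio.h>

/* Prints up to `count` bytes from the start of the file at `path` in hex.
   Returns the number of bytes printed, or -1 if the file can't be opened. */
int print_hex_prefix(const char *path, int count) {
    FILE *file = fopen(path, "rb");
    if (file == NULL) {
        return -1;
    }
    int printed = 0;
    int byte;
    while (printed < count && (byte = fgetc(file)) != EOF) {
        printf("%02x ", (unsigned int)byte);
        printed++;
    }
    printf("\n");
    fclose(file);
    return printed;
}
```

Calling print_hex_prefix("hello.o", 16) from a main function prints the first 16 bytes of the object file.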

Summary

We've looked at the following steps involved with going from code written in a high-level programming language (C) to machine code:

  1. Preprocessing - performing primarily text manipulation based on directives, lines starting with #.
  2. Compilation - converting high-level code to low-level ASM code.
  3. Assembling - converting ASM code to machine code.

We deliberately performed each step in isolation, using the gcc tool with certain flags and the as tool. Normally, you just run a single command that performs all of these steps, plus one more that we'll discuss next. That's much more convenient, but it's also part of why it's easy to get confused about what compilation actually is or isn't, and what steps are involved in going from source code to machine code. Being able to reason about these steps in isolation, and to run them separately, helps with troubleshooting when something in the chain goes wrong.

What's next?

Even after assembling the ASM code to produce machine code, we still don't have an executable, something that we can run. "hello.o" is just an object file, not an executable binary. Before we can actually get an executable program, we need to:

  1. Define the program's starting point, what code should be executed when the program is run - we want this to be the main function, which is customary but not a must.
  2. Link our machine code to preexisting machine code (a system binary) that defines what printf does.

To do this, we need to use a linker, such as ld. I hope to write about that in a future post.
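As a small preview of why linking is needed, consider the sketch below. The names answer and twice_answer are hypothetical. The compiler is happy to translate twice_answer knowing only the declaration of answer. It's the linker's job to connect the call to the actual definition, which in a real project would often sit in a different file or, as with printf, in a library.

```c
/* Declaration: promises that a function called answer exists somewhere. */
int answer(void);

/* This compiles using only the declaration above. */
int twice_answer(void) {
    return 2 * answer();
}

/* Definition: kept in the same file here so the sketch is self-contained,
   but it could just as well live in a separate object file. */
int answer(void) {
    return 21;
}
```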

In the meantime, as you may already know, if you want to perform all three steps we've discussed here and do linking, you can simply:

# or clang
gcc hello.c -o hello
# run the file
./hello

Additional links

  • CS50 2016 C lecture - this goes into more detail on compilation than they seem to do in more recent versions of this lecture.
  • Barry Brown on using Makefiles - this is a great practical introduction to how to automate compilation work, while also explaining preprocessing, compilation, assembling and linking.