Getting started with linking code

Nov. 18, 2024, 5:32 p.m.

In the previous post "Getting started with compiling code" I mentioned that after you've preprocessed (with e.g. gcc -E hello.c > hellop.c), compiled (gcc -S hellop.c) and assembled (as hellop.s) your source code to machine code, you need to link it together with any other machine code that it depends on.

Linking against preexisting libraries

Declaring functions

Let's again look at the sample code from the previous post.

#include "stdio.h"

int main() {
    printf("Hello, world!\n");
}

As I explained in that post, #include "stdio.h" is a preprocessor directive: it tells the preprocessor to replace the #include line with the full contents of "stdio.h" (found in e.g. /usr/include). But "stdio.h" is just a header file, meaning (generally) that it provides a bunch of declarations without definitions. For example, in my system's "/usr/include/stdio.h" there's the following line:

extern int printf (const char *__restrict __format, ...);

This declares the function printf, essentially making a promise that the end program will have access to a function called printf that takes a char pointer (the format string) plus any number of further arguments (the "...") and returns an integer. So, by adding #include "stdio.h" (which results in the preprocessor copying in the printf declaration) we're telling the compiler (at the gcc -S step) that it's fine for code in "hello.c" to make printf calls.

Note that the header file doesn't define printf's behavior. Unlike our "hello.c" file's main function, "stdio.h" has no printf function body enclosed in curly braces. Even though "stdio.h" is used to make the promise that printf will be available in the end program, actually fulfilling that promise is handled separately. After assembling your own code into hello.o, you need to use a linker to link that machine code with other machine code that does define printf's behavior.

Linking with ld

Just like with compilers, there are different linkers. The default one in Linux is ld (short for "load" or perhaps "link editor" according to different sources discussed in this SO thread). Using it directly for system functions like printf can be somewhat daunting, but here's the full CLI command that works on my machine:

ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/x86_64-linux-gnu/crt1.o /usr/lib/x86_64-linux-gnu/crti.o /usr/lib/x86_64-linux-gnu/crtn.o hello.o -o hello -lc

The part about selecting a particular dynamic linker and the extra ".o" files (e.g. "crt1.o") are details that we won't talk about here, but you can find more information in this SO thread and this LinuxQuestions thread.

What's more relevant to us is that we specify hello.o, -o hello and -lc. hello.o tells ld to statically link our assembled machine code into the end program, i.e. to put our machine code in the executable. -o hello says to call our program/executable "hello", without any file extension. -lc says to also link our program to a "c" library. This means that ld looks through a default list of directories, and any directories specified by you (which we'll get to later), trying to find a "libc.so" or "libc.a" file. If we were to add -lfoo, ld would additionally search for "libfoo.so" and "libfoo.a" files - that's just how the convention programmed into ld works.

Static and dynamic libraries

"*.a" files are "archive files" which are statically linked, just like our own "hello.o" machine code. In fact, "*.a" files are little more than bundles of "*.o" files. When we link in ".a" files, the relevant code is copied directly into our executable. Even if you were to delete all ".a" and ".o" files that you used when creating an executable like "hello", you would still be able to run it, since the code has already been copied.

"*.so" files are dynamically linked, meaning their machine code isn't actually copied to the executable. Instead, linking an executable like "hello" against a ".so" file means that whenever you run "hello", the dynamic linker will, similar to ld, look through a set of default and user-defined directories and load the library's code when the process is loaded into memory. This means that if the required shared object isn't available at that point, like if you delete the C library's ".so" file after linking (doing this is a very bad idea), your program will fail to run.

On Linux the default dynamic linker is ld-linux.so, still often referred to by the older name ld.so. The /lib64/ld-linux-x86-64.so.2 we specified in the ld command's -dynamic-linker argument is a specific version of ld-linux.so (more info available in the ld.so man page).

Static linking means your executable doesn't depend on any .so files to run, while with dynamic linking you might have to worry about potentially missing .so files. On the other hand, since you're copying all the code, static linking sometimes results in huge executable files and potentially many duplicates of the same machine code stored in different files, while dynamic linking means more lightweight files. When passing a flag like -lc to ld, dynamic libraries (a.k.a. shared objects; e.g. "/usr/lib/x86_64-linux-gnu/libc.so") are picked over static libraries ("/usr/lib/x86_64-linux-gnu/libc.a") by default. You have to ask for the static version explicitly, e.g. with ld's -static flag.

All this means that after running the previous ld command, we've got a "hello" executable that has its own copy of machine code from hello.o (and the other .o files, like crt1.o), and dependencies on ld-linux.so itself and the dynamic library libc.so. When I run the file, this triggers ld-linux.so to be loaded, which in turn searches for libc.so and loads it into memory, making the printf function available for use.

Inspecting executables

There are various tools for inspecting executables in Linux. One of the most useful ones when it comes to dynamic linking is ldd (man page), which lists the dynamic libraries (shared objects) that the executable depends on:

ldd hello
#   linux-vdso.so.1 (0x00007ffce5196000)
#   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007cdb70600000)
#   /lib64/ld-linux-x86-64.so.2 (0x00007cdb70a24000)

As expected, there are dependencies on "ld-linux-x86-64.so.2" and "libc.so.6" (technically, the "libc.so" file that ld finds is a linker script that points at "libc.so.6"; if you're interested, the details are discussed in this SO thread). The "linux-vdso.so.1" entry is a bit special: it isn't a file on disk but a small shared object that the kernel maps into every process, so it can safely be ignored, as explained in the vdso man page.

ldd lists all shared objects that an executable depends on, directly or indirectly. Say your program depends on "libfoo.so". "libfoo.so" might itself depend on "libqux.so". ldd would list both of these, which is often what you want. However, if you only want to list direct dependencies like "libfoo.so" (and exclude ld-linux.so and linux-vdso.so), you can use readelf --dynamic (or -d for short) to list the file's dynamic section, where shared object dependencies are represented by NEEDED lines:

readelf -d hello
# ---
# 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
# ---

(I got this from this SO thread)

Using the normal shortcut

It's very useful to understand the linking stage in-depth, as this really helps to troubleshoot linking issues. However, just like with the as command for converting from assembly to machine code, you rarely run ld directly. Instead, ld is used behind the scenes for you when you run gcc commands, which is why we get the same executable just by:

gcc hello.c -o hello

Here, preprocessing, compilation, assembling and linking are all done with a single command. Also note that gcc handles commonly required ld flags for you, like specifying that crt1.o should be linked in.

Creating your own libraries

Once you understand what libraries are and how to use them, creating your own is relatively straightforward.

Static libraries

Let's say we have the following files in the same directory:

// badadd.c

int bad_add(int val1, int val2) {
    return val1 + val2 + 7;
}

// badmultiply.c

int bad_multiply(int val1, int val2) {
    return val1 * val2 * 2;
}

We want to put bad_add and bad_multiply inside of a static archive (".a" file) so that users can call both functions after just linking against one file. This is done by first creating one object file (machine code) for each source code file, then bundling them into a single archive with the GNU utility program ar (man page):

gcc -c badadd.c
gcc -c badmultiply.c
ar rcs libbadmath.a badadd.o badmultiply.o

You can read in the man page about what the r, c and s ar flags indicate.

With the library file in place for the linking step, we also need a header file that declares the library's interface, for use when compiling code that calls our library's functions.

// badmath.h

int bad_add(int val1, int val2);
int bad_multiply(int val1, int val2);

We also want another file with a simple main function that will use our badmath library.

// domath.c
#include "stdio.h"

#include "badmath.h"

int main() {
    printf("1 + 2 = %d\n", bad_add(1, 2));
    printf("1 * 2 = %d\n", bad_multiply(1, 2));
}

When we compile "domath.c", we tell the preprocessor, through gcc, to also search the current directory (".", where the "badmath.h" file is located) for files to include. This is done with the -I (include) flag. Strictly speaking, the quoted form of #include already searches the directory of the including file first, so -I. only becomes necessary once the header lives somewhere else. We also need to tell ld, again through gcc, to look for libraries in the current directory, specified with the -L flag, and that we want to link against "libbadmath", with -lbadmath.

gcc -I. domath.c -L. -lbadmath -o domath
# alternatively, with separate steps:
# gcc -I. -c domath.c
# ld -dynamic-linker /lib64/ld-linux-x86-64.so.2 /usr/lib/x86_64-linux-gnu/crt1.o /usr/lib/x86_64-linux-gnu/crti.o /usr/lib/x86_64-linux-gnu/crtn.o domath.o -o domath -lc -L. -lbadmath

Note that the order of arguments matters. You should generally put dependents (like "domath.c") toward the left, and dependencies (libraries the dependents require) to their right. More details are available in this SO thread answer.

Dynamic libraries

The process for creating a dynamic library is very similar, except that you need to supply the flag -fPIC (Position Independent Code) when producing the object files, and instead of ar you run another gcc command, with -shared, to combine the object files into a library:

gcc -fPIC -c badadd.c
gcc -fPIC -c badmultiply.c
gcc -shared badadd.o badmultiply.o -o libbadmath.so

When we compile the program using the shared object, we do exactly the same as when using a static library.

gcc -I. domath.c -L. -lbadmath -o domath

However, if we run ldd on "domath" we can see that the dynamic linker, ld-linux.so, is unable to find "libbadmath.so":

ldd domath
# linux-vdso.so.1 (0x00007ffefc1ab000)
# libbadmath.so => not found
# libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000075f9d8000000)
# /lib64/ld-linux-x86-64.so.2 (0x000075f9d8413000)

This is expected, since the random directory we happen to be in isn't one of the directories that ld-linux.so has been preprogrammed to search in. You can solve this by relinking domath with an -rpath flag to ld, using the magical $ORIGIN variable which refers to the executable's location (as described in this SO thread). However, what if we want to specify this through the gcc command, like we did with -L and -l? gcc doesn't have shorthand flags for specifying these ld arguments, but there is the generic gcc flag -Wl,<option> which allows us to pass on arbitrary options to the linker:

gcc -I. domath.c -L. -lbadmath -Wl,-rpath='$ORIGIN' -o domath

This tells ld to record in the executable that the runtime linker ld-linux.so should look for libraries in the same directory that the executable itself is located in. Now we get:

ldd domath
# linux-vdso.so.1 (0x00007ffdcdbb9000)
# libbadmath.so => /home/datalowe/Documents/ex/libbadmath.so (0x000079303e862000)
# libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000079303e600000)
# /lib64/ld-linux-x86-64.so.2 (0x000079303e86e000)

And we can successfully run the program:

./domath
# 1 + 2 = 10
# 1 * 2 = 4