1 Intermediate C

1.1 Objectives

  • Recall C types with a focus on pointers.
  • Focus on more advanced features like void *s and function pointers.
  • Practice thinking about C programs as memory.

1.2 Types

1.2.1 Basic types

  • char - think “one byte”. Typically the same as signed char.
  • short int - think “two bytes”. Same as short.
  • int - think “four bytes”.
  • long int - think “potentially larger int”. Most commonly known solely as long.
  • float, double - floating point values (not used in the class).
  • void - the lack of a type. Variables cannot have this type.
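
For example, a sketch of a program that triggers this error (the line numbers in the compiler output below may differ):

int
main(void)
{
    void a; /* error: variables cannot have the type void */

    return 0;
}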

Program output:

inline_exec_tmp.c: In function main:
inline_exec_tmp.c:6:7: error: variable or field a declared void
    6 |  void a;
      |       ^
make[1]: *** [Makefile:33: inline_exec_tmp] Error 1
  • enum - an int or short int with some “named” values.
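
A sketch of how such an enum might be declared and printed:

#include <stdio.h>

enum days { MON, TUES, WED, THURS, FRI }; /* named values default to 0, 1, ... */

int
main(void)
{
    printf("MON = %d, TUES = %d, ..., FRI = %d\n", MON, TUES, FRI);

    return 0;
}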

Program output:

MON = 0, TUES = 1, ..., FRI = 4

You can choose the values explicitly (e.g. enum days { MON = 1, TUES = 2, WED = 3, THURS = 4, FRI = 5};).

Modifiers

  • unsigned - variables that cannot be negative. Given that variables have a fixed bit-width, they can use the extra bit (“negative” no longer needs to be tracked) to instead represent numbers roughly twice as large as their signed variants can.
  • signed - signed variables. You don’t see this modifier as much because char, int, long all default to signed.
  • long - Used to modify another type to make it larger in some cases. long int can represent larger numbers and is synonymous with long. long long int (or long long) is an even larger value!
  • static - this variable should not be accessible outside of the .c file in which it is defined.
  • const - an immutable value. We won’t focus much on this modifier.
  • volatile - this variable should be “read from memory” every time it is accessed. Confusing now, relevant later, but not a focus.

1.2.1.1 Examples
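
A sketch of declarations exercising these modifiers (the names are illustrative, and the program prints nothing):

static int file_local = 0;      /* accessible only within this .c file */
const double pi = 3.14159;      /* immutable */
volatile int device_status = 0; /* re-read from memory on every access */

int
main(void)
{
    unsigned int  positive = 4000000000u;     /* too large for a signed int */
    long long int big      = 1234567890123LL; /* larger than an int can hold */
    signed char   c        = -1;

    /* reference the variables to avoid unused-variable warnings */
    (void)file_local; (void)pi; (void)device_status;
    (void)positive; (void)big; (void)c;

    return 0;
}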


You might see all of these, but the most common primitives and their sizes are:
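
We can query them with sizeof, for example:

#include <stdio.h>

int
main(void)
{
    printf("char:\t%zu\n", sizeof(char));
    printf("short:\t%zu\n", sizeof(short));
    printf("int:\t%zu\n", sizeof(int));
    printf("long:\t%zu\n", sizeof(long));

    return 0;
}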

Program output:

char:   1
short:  2
int:    4
long:   8

1.2.2 Common POSIX types and values

  • stddef.h:

    • size_t, ssize_t - types for variables that correspond to sizes. These include the size of the memory request to malloc, the return value from sizeof, and the arguments and return values from read/write/… ssize_t is signed (allowing negative values, e.g. for error returns), while size_t is unsigned.
    • NULL - is just #define NULL ((void *)0)
  • limits.h:

    • INT_MAX, INT_MIN, UINT_MAX - maximum and minimum values for a signed integer, and the maximum value for an unsigned integer.
    • LONG_MAX, LONG_MIN, ULONG_MAX - minimum and maximum numerical values for longs and unsigned longs.
    • Same for short ints (SHRT_MAX, etc…) and chars (CHAR_MAX, etc…).

1.2.3 Format Strings

Many standard library calls take “format strings”. You’ve seen these in printf. The following format specifiers should be used:

  • %d - int
  • %ld - long int
  • %u - unsigned int
  • %c - char
  • %x - unsigned int printed as hexadecimal
  • %lx - long unsigned int printed as hexadecimal
  • %p - prints out any pointer value, void *
  • %s - prints out a string, char *

Format strings are also used in scanf functions to read and parse input.

You can control the spacing of the printouts using %NX where N is the minimum number of characters printed (padded with spaces), and X is one of the format specifiers above. For example, "%10ld" would print a long integer in a 10 character slot. \n and \t add newlines and tabs. If you need to print out a “\”, use \\.

1.2.3.1 Example
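
A sketch using these specifiers (the pointer value will differ from run to run):

#include <stdio.h>
#include <limits.h>

int
main(void)
{
    char *str = "hello world";

    printf("Integers: %d, %ld, %u, %c\n", INT_MAX, LONG_MAX, UINT_MAX, '*');
    printf("Hex and pointers: %lx, %p\n", LONG_MAX, (void *)str);
    printf("Strings: %s\n", str);

    return 0;
}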

Program output:

Integers: 2147483647, 9223372036854775807, 4294967295, *
Hex and pointers: 7fffffffffffffff, 0x561630eec149
Strings: hello world

1.2.4 Compound types

  • struct - A collection of different values. Example: objects with many different fields.
  • union - One of a set of values. Example: representing the data for one kind of food item among many possible kinds.

Unions are not very common, but are sometimes useful. The size of a union is the maximum size of its fields. In contrast, the size of a struct is the sum of the sizes of each of its fields (plus any padding added for alignment).

1.2.4.1 Example
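
A sketch with hypothetical food types (the fields here are illustrative):

#include <stdio.h>

union food {
    int  calories;  /* 4 bytes */
    char name[12];  /* 12 bytes */
};                      /* size = max(4, 12) = 12 */

struct hamburger {
    int type;
    int weight;
    int calories;
    int price;
};                      /* size = 4 + 4 + 4 + 4 = 16 */

int
main(void)
{
    /* the structure initialization shorthand discussed below */
    struct hamburger burger = {
        .type     = 1,
        .weight   = 250,
        .calories = 550,
        .price    = 8
    };
    (void)burger; /* avoid an unused-variable warning */

    printf("Size of union:  %zu\n", sizeof(union food));
    printf("Size of struct: %zu\n", sizeof(struct hamburger));

    return 0;
}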

Program output:

Size of union:  12
Size of struct: 16

We can see the effect of the union: The size is max(fields) rather than sum(fields). What other examples can you think of where you might want unions?

An aside on syntax: The structure initialization syntax in this example is simply a shorthand. The struct hamburger initialization above is equivalent to:
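
Continuing the hypothetical hamburger sketch from above:

struct hamburger burger;

burger.type     = 1;
burger.weight   = 250;
burger.calories = 550;
burger.price    = 8;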

Though since there are so many .s, this is a little confusing. We’d typically want to simplify this as:
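
One hypothetical way is a helper function that takes a pointer to the structure:

void
make_burger(struct hamburger *h, int type, int weight, int calories, int price)
{
    h->type     = type;
    h->weight   = weight;
    h->calories = calories;
    h->price    = price;
}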

More on -> in the next section.

1.2.5 Pointers & Arrays

Variables can have pointer types (e.g. int *a). Variables with pointer types are pointers and hold the address of a variable of the type to which they point, or NULL. NULL denotes that the pointer is not valid. You want to imagine that variables hold data, for example int a = 6:

a---+
| 6 |
+---+

A pointer, int *b = &a, should be imagined as:

b ---> a---+
       | 6 |
       +---+

Note that the “address of” operator, &, takes a variable and returns its address. If you print out the pointer, printf("%p", b), you’ll get the address (i.e. the arrow). To follow the arrow, you must dereference the pointer: *b == 6.
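
A sketch that updates a variable through a pointer to it:

#include <stdio.h>

void
increment(int *value)
{
    *value = *value + 1; /* dereference to read, then to write */
}

int
main(void)
{
    int value = 0;

    printf("value is initialized to %d\n", value);
    increment(&value);
    printf("value updated to %d\n", value);

    return 0;
}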

Program output:

value is initialized to 0
value updated to 1

Pointers are necessary as they enable us to build linked data-structures (linked-lists, binary trees, etc…). Languages such as Java assume that every single object variable is a pointer, and since all object variables are pointers, they don’t need special syntax for them.

Arrays are simple contiguous data items, all of the same type. int a[4] = {6, 7, 8, 9} should be imagined as:

a ---> +---+---+---+---+
       | 6 | 7 | 8 | 9 |
       +---+---+---+---+

When you access an array item, a[2] == 8, C is really treating a as a pointer, doing pointer arithmetic, and dereferencing to find the item at offset 2.
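
A sketch demonstrating this (the printed addresses will differ from run to run):

#include <stdio.h>

int
main(void)
{
    int a[4] = {6, 7, 8, 9};
    int n = 1;

    printf("0th index: %p == %p; %d == %d\n", (void *)a, (void *)&a[0], *a, a[0]);
    printf("nth index: %p == %p; %d == %d\n", (void *)(a + n), (void *)&a[n], *(a + n), a[n]);

    return 0;
}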

Program output:

0th index: 0x7ffde6818d10 == 0x7ffde6818d10; 6 == 6
nth index: 0x7ffde6818d14 == 0x7ffde6818d14; 7 == 7

Making this a little more clear, let’s understand how C accesses the nth item. Let’s make a pointer int *p = a + 1 (we’ll simplify and assume that n == 1 here); we should have this:

p ---------+
           |
           V
a ---> +---+---+---+---+
       | 6 | 7 | 8 | 9 |
       +---+---+---+---+

Thus if we dereference p, we access the 1st index, and access the value 7.
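
A sketch of this (again, addresses will differ):

#include <stdio.h>

int
main(void)
{
    int a[4] = {6, 7, 8, 9};
    int *p = a + 1; /* p points at the item at index 1 */

    printf("nth index: %p == %p; %d == %d\n", (void *)p, (void *)&a[1], *p, a[1]);

    return 0;
}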

Program output:

nth index: 0x7fffc37a0bd4 == 0x7fffc37a0bd4; 7 == 7

We can see that pointer arithmetic (i.e. doing addition/subtraction on pointers) does the same thing as array indexing plus a dereference. That is, *(a + 1) == a[1]. For the most part, arrays and pointers can be viewed as very similar, with only a few exceptions.

Pointer arithmetic should generally be avoided in favor of using the array syntax. One complication for pointer arithmetic is that it does not fit our intuition for addition:
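
A sketch that prints the address of each item in an int array and in a char array:

#include <stdio.h>

int
main(void)
{
    int  a[4];
    char b[4];
    int  i;

    for (i = 0; i < 4; i++) {
        /* a + i and b + i: the same "+ i", but different strides */
        printf("idx %d @ %p & %p\n", i, (void *)(a + i), (void *)(b + i));
    }

    return 0;
}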

Program output:

idx 0 @ 0x7ffd8d194ec0 & 0x7ffd8d194ed4
idx 1 @ 0x7ffd8d194ec4 & 0x7ffd8d194ed5
idx 2 @ 0x7ffd8d194ec8 & 0x7ffd8d194ed6
idx 3 @ 0x7ffd8d194ecc & 0x7ffd8d194ed7

Note that the pointer for the integer array (a) is being incremented by 4, while the character array (b) by 1. Focusing on the key part:

idx 0 @ ...0 & ...4
idx 1 @ ...4 & ...5
idx 2 @ ...8 & ...6
idx 3 @ ...c & ...7
           ^      ^
           |      |
Adds 4 ----+      |
Adds 1 -----------+

Thus, pointer arithmetic depends on the size of the types within the array. There are times when one wants to iterate through each byte of an array, even if the array contains larger values such as integers. For example, the memset and memcmp functions set each byte in a range of memory, and byte-wise compare two ranges of memory, respectively. In such cases, casts can be used to manipulate the pointer type (e.g. (char *)a enables a to be referenced with pointer arithmetic that iterates through bytes).

1.2.5.1 Example
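
A sketch of iterating through an array of structures (a hypothetical struct student, with the names from the output below):

#include <stdio.h>

struct student {
    char *name;
};

int
main(void)
{
    struct student students[] = {
        { .name = "Penny" },
        { .name = "Gabe" }
    };
    int i;

    for (i = 0; i < 2; i++) {
        printf("%s\n", students[i].name);
    }

    return 0;
}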

Program output:

Penny
Gabe

1.2.5.2 Generic Pointer Types

Generally, if you want to treat a pointer type as another, you need to use a cast. You rarely want to do this (see the memset example below for a case where you might). However, there is a need in C to have a “generic pointer type” that can be implicitly cast into any other pointer type. To get a sense of why this is, consider two simple examples:

  1. What should the type of NULL be?

NULL is used as a valid value for any pointer (of any type), thus NULL must have a generic type that can be used in the code as a value for any pointer. Thus, the type of NULL must be void *.

  2. malloc returns a pointer to newly-allocated memory. What should the type of the return value be?

C solves this with the void * pointer type. Recall that void is not a valid type for a variable, but a void * is different: it is a “generic pointer that cannot be dereferenced”. Note that dereferencing a void * pointer shouldn’t work as void is not a valid variable type (e.g. void *a; *a = 10; doesn’t make much sense because *a is type void).
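
For example, a sketch in which malloc’s void * return value is implicitly cast to an int * (this program prints nothing):

#include <stdlib.h>

int
main(void)
{
    int *arr = malloc(sizeof(int) * 4); /* no explicit cast required */

    if (arr == NULL) return -1;
    free(arr);

    return 0;
}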


Data-structures often aim to store data of any type (think: a linked list of anything). Thus, in C, you often see void *s to reference the data they store.

1.2.5.3 Relationship between Pointers, Arrays, and Arrows

Indexing into arrays (a[b]) and arrows (a->b) are technically redundant syntactic features (each can be written using pointer arithmetic and dereferences), but they are very convenient.

  • &a[b] is equivalent to a + b where a is a pointer and b is an index.
  • a[b] is equivalent to *(a + b) where a is a pointer and b is an index.
  • a->b is equivalent to (*a).b where a is a pointer to a variable with a structure type that has b as a field.
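
A small sketch demonstrating these equivalences:

#include <stdio.h>

struct pair {
    int a;
    int b;
};

int
main(void)
{
    struct pair  p   = { .a = 1, .b = 2 };
    struct pair *ptr = &p;
    int arr[2]       = { 3, 4 };

    /* each pair of printed values is computed two equivalent ways */
    printf("%d == %d\n", ptr->b, (*ptr).b);
    printf("%d == %d\n", arr[1], *(arr + 1));
    printf("%p == %p\n", (void *)&arr[1], (void *)(arr + 1));

    return 0;
}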

Generally, you should stick to the array and arrow syntax where possible, as it makes your intention much more clear when coding than pointer arithmetic and dereferences.

1.3 Memory Allocation

Dynamic memory allocation functions:

  • void *malloc(size_t size) - allocate size bytes of memory, or return NULL on error.
  • void *calloc(size_t nmemb, size_t size) - allocate nmemb * size bytes, and initialize them to 0.
  • void *realloc(void *ptr, size_t size) - pass in a previous allocation, and either grow/shrink it to size, or allocate a new chunk of memory and copy the data in ptr into it. Return the memory of size size, or NULL on error.
  • void free(void *ptr) - deallocate previously allocated memory (returned from any of the above).

A few things to keep in mind:

  1. If you want to allocate an array, then you have to do the math yourself for the array size. For example, int *arr = malloc(sizeof(int) * n); to allocate an array of ints with a length of n.
  2. malloc is not guaranteed to initialize its memory to 0. You must make sure that your array gets initialized. It is not uncommon to do a memset(arr, 0, sizeof(int) * n); to set the memory to 0.
  3. calloc is guaranteed to initialize all its allocated memory to 0.
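
A sketch of both approaches:

#include <stdlib.h>
#include <string.h>

int
main(void)
{
    int  n = 4;
    int *arr, *arr2;

    /* allocate an array of n ints, and zero it manually... */
    arr = malloc(sizeof(int) * n);
    if (arr == NULL) return -1;
    memset(arr, 0, sizeof(int) * n);

    /* ...or have calloc zero it for us */
    arr2 = calloc(n, sizeof(int));
    if (arr2 == NULL) return -1;

    free(arr);
    free(arr2);

    return 0;
}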

1.3.1 Common Errors

  • Allocation error. You must check the return value of memory allocation functions for NULL, and handle the error appropriately.
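
A sketch of checking for, and handling, the error:

#include <stdlib.h>
#include <stdio.h>

int
main(void)
{
    int *arr = malloc(sizeof(int) * 4);

    if (arr == NULL) {
        /* handle the error appropriately; here we report it and give up */
        perror("allocating array");
        return -1;
    }
    free(arr);

    return 0;
}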


  • Dangling pointer. If you maintain a pointer to a chunk of memory that you free, and then dereference that pointer, bad things can happen. The memory might have already been re-allocated due to another call to malloc, and is used for something completely different in your program. It is up to you as a programmer to avoid freeing memory until all references to it are dropped.
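
A sketch of this bug:

#include <stdlib.h>

int
main(void)
{
    int *a = malloc(sizeof(int));

    if (a == NULL) return -1;
    free(a);
    *a = 1; /* BUG: dereferencing memory we've already freed */

    return 0;
}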


  • Memory leaks. If you remove all references to memory, but don’t free it, then the memory will never get freed. This is a memory leak.
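
A sketch of leaking memory:

#include <stdlib.h>

int
main(void)
{
    int *a = malloc(sizeof(int));

    if (a == NULL) return -1;
    a = NULL; /* BUG: the last reference is gone; the allocation can never be freed */

    return 0;
}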


  • Double free. If you free the memory twice, bad things can happen. You could confuse the memory allocation logic, or you could accidentally free an allocation made after the first free was called.
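
A sketch of a double free, which produces the abort below:

#include <stdlib.h>

int
main(void)
{
    int *a = malloc(sizeof(int));

    if (a == NULL) return -1;
    free(a);
    free(a); /* BUG: freeing the same allocation twice */

    return 0;
}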

Program output:

free(): double free detected in tcache 2
make[1]: *** [Makefile:30: inline_exec] Aborted (core dumped)

valgrind will help you debug the last three of these issues, and later in the class, we’ll develop a library to help debug the first.

1.4 Function Pointers

It is not uncommon for programs to require some dynamic behavior. A data structure might need to do different types of comparisons between its data (depending on what data is being stored). This means that a generic data-structure (think: a hash-table) needs to somehow compare equality of the data objects it stores without knowing what they are! In Java, think about the equals method. Each object can define its own equals method to compare between objects of the same type.

In C, this is implemented using function pointers. These are pointers to functions of a particular type. They can be passed as arguments to functions, and can be dereferenced to call the function. Let’s look at an example from man 3 lsearch:

void *lfind(const void *key, const void *base, size_t *nmemb, size_t size,
            int(*compar)(const void *, const void *));

This is a function to find a value in an array. The implementation needs to understand how to compare a key with each of the items in the array. For this purpose, the compar function pointer is the last argument.
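
For example, a sketch of calling lfind with our own comparison function (lfind treats a 0 return from compar as a match):

#include <search.h>
#include <stdio.h>

static int
compare_ints(const void *a, const void *b)
{
    return *(const int *)a != *(const int *)b; /* 0 on a match */
}

int
main(void)
{
    int    arr[] = { 6, 7, 8, 9 };
    size_t nmemb = 4;
    int    key   = 8;
    int   *found;

    found = lfind(&key, arr, &nmemb, sizeof(int), compare_ints);
    if (found != NULL) printf("found %d\n", *found);

    return 0;
}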

1.4.1 Typedefs

Function pointer syntax is not natural in C. Thus, it is common to use a typedef to create the type name for a function pointer for a given API. The syntax for this is perhaps even less natural, but only needs to be done once.

typedef int (*compare_fn_t)(const void *, const void *);
void *lfind(const void *key, const void *base, size_t *nmemb, size_t size, compare_fn_t compar);

Now the function declaration is more clear, and the _fn_t postfix, by convention, denotes a function pointer.

1.5 Exercises

1.5.1 C is a Thin Language Layer on Top of Memory

We’re going to look at a set of variables as memory. When variables are created globally, they are simply allocated into subsequent addresses.

Program output:

Addresses:
a   @ 0x563690b89010
b   @ 0x563690b89014
c   @ 0x563690b89020
end @ 0x563690b89039
&end - &a = 41

Initial values:
a     = 1
b     = 2
c.c_a = 3
c.c_b = 0
c.c_c = 0x563690b89014

Print out the variables as raw memory


Post-`memset` values:

Question: Answer Q1-4 in the code, uncommenting and modifying where appropriate.

1.5.1.1 Takeaways

  1. Each variable in C (including fields in structs) wants to be aligned on a boundary equal to the variable’s type’s size. This means that a variable (b) with an integer type (sizeof(int) == 4) should always have an address that is a multiple of its size (&b % sizeof(b) == 0, so an int’s address is always divisible by 4, and a long’s by 8).
  2. The operation to figure out the size of all the variables, &end - &a, is crazy. We’re used to performing math operations on values of the same type, but not on pointers. This is only possible because C sees the variables as chunks of memory that happen to be laid out in memory, one after the other.
  3. The crazy increases with mem = &a, and our iteration through mem[i]. We’re able to completely ignore the types in C, and access memory directly!

Question: What would break if we changed char a; into int a;? C doesn’t let us do math on pointers of different types. If you fixed the compilation problems, would you still get the same output?

1.5.2 Quick-and-dirty Key-Value Store

Please read the man pages for lsearch and lfind. man pages can be pretty cryptic, and you are aiming to get some idea where to start with an implementation. A simplistic, and incomplete initial implementation:
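
Something like the following sketch (using lfind, with illustrative keys) produces output of this shape:

#include <search.h>
#include <stdio.h>

static int
compare(const void *a, const void *b)
{
    return *(const int *)a != *(const int *)b;
}

int
main(void)
{
    int    kvs[8] = { 1, 2, 3, 4 }; /* keys stored so far */
    size_t num = 4;
    int    queries[] = { 4, 2, 0 };
    int    i;

    for (i = 0; i < 3; i++) {
        int *item = lfind(&queries[i], kvs, &num, sizeof(int), compare);

        printf("%d: %d @ %p\n", i, item ? *item : 0, (void *)item);
    }

    return 0;
}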

Program output:

0: 4 @ 0x5620dc5c304c
1: 2 @ 0x5620dc5c3044
2: 0 @ (nil)

You want to implement a simple “key-value” store that is very similar in API to a hash-table (many key-value stores are implemented using hash-tables!).

Questions/Tasks:

  • Q1: What is the difference between lsearch and lfind? The manpages should help here (man 3 lsearch) (you can exit from a man page using ‘q’). What operations would you want to perform with each (and why)?
  • Q2: The current implementation doesn’t include values at all. It returns the keys instead of values. Expand the implementation to properly track values.
  • Q3: Encapsulate the key-value store behind the put and get functions.
  • Q4: Add testing into the main for the relevant conditions and failures in get and put.

2 C Memory Model, Data-structures, and APIs

C presents some unique opportunities for how to structure your code and data-structures, but also requires that you’re careful about how data is passed around a program.

2.1 Memory Allocation Options

In C, when creating a variable, you have the option of allocating it in one of multiple different types of memory:

  • In another existing structure.
  • In the heap using malloc or its sibling functions.
  • In global memory.
  • On the stack.

It might be surprising, but it is quite uncommon in programming languages to have this flexibility.

2.1.1 Internal Allocation

What does it look like to do internal allocation of one struct or array inside of another? See the following as an example.
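
A sketch of a struct and an array allocated inside another struct (hypothetical types):

#include <stdio.h>

struct point {
    int x, y;
};

struct polygon {
    struct point origin;    /* a struct allocated inside a struct */
    struct point points[4]; /* an array allocated inside a struct */
};

int
main(void)
{
    struct polygon p = { .origin = { .x = 1, .y = 2 } };

    p.points[0] = p.origin; /* access the internal allocations directly */
    printf("(%d, %d)\n", p.points[0].x, p.points[0].y);

    return 0;
}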


One of the more interesting uses of internal allocation is in linked data-structures (e.g. linked lists). It is common in languages like Java for linked data-structures to have “nodes” that reference the data being tracked.

Let’s see the same in C.
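
A sketch (with illustrative names) of a list whose node is inline-allocated in the student:

#include <stdio.h>

struct linked_list_node {
    void *data; /* generic: can point to data of any type */
    struct linked_list_node *next;
};

struct linked_list {
    struct linked_list_node *head;
};

struct student {
    char *name;
    struct linked_list_node list; /* the node is allocated inside the student */
};

void
list_add(struct linked_list *l, struct linked_list_node *n, void *data)
{
    n->data = data;
    n->next = l->head;
    l->head = n;
}

int
main(void)
{
    struct linked_list l = { .head = NULL };
    struct student     s = { .name = "gparmer" };

    list_add(&l, &s.list, &s);
    printf("student added to list!\n");

    return 0;
}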

Program output:

student added to list!

A few interesting things happened here:

  1. We’re using void * pointers within the linked_list_node so that the linked list can hold data of any pointer type. You can see that in our list implementation, there is nothing that is student-specific.
  2. The list is inline-allocated inside the student structure, which completely avoids the separate node allocation.

Most serious data-structure implementations enable this inlining of data-structure nodes even without requiring the data pointer in the node. We won’t cover that, but you can see a sample implementation.

2.1.2 Heap Allocation

We’ve already seen malloc, calloc, realloc, and free which are our interface to the heap. These provide the most flexibility, but require that you track the memory and free it appropriately.

2.1.3 Global Allocation

We have already seen global allocations frequently in examples so far.
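
A sketch that prints uninitialized global variables:

#include <stdio.h>

int   global_int;  /* no initializer... */
char *global_ptr;  /* ...for either of these */

int
main(void)
{
    printf("What is uninitialized global memory set to?\n");
    printf("Integer: %d\n", global_int);
    printf("Pointer: %p (as hex: %lx)\n", (void *)global_ptr, (unsigned long)global_ptr);

    return 0;
}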

Program output:

What is uninitialized global memory set to?
Integer: 0
Pointer: (nil) (as hex: 0)

Global variables are either initialized where they are defined, or are initialized to 0. Note that they are initialized to all 0s regardless of their type. This makes more sense than it sounds because pointers set to 0 are NULL (because, recall, NULL is just (void *)0 – see the “hex” output above), and because strings (see later) are terminated by \0 which is also 0! Thus, this policy initializes all numerical data to 0, all pointers to NULL, and all strings to "".

2.1.4 Stack Allocation

Variables can also be allocated on the stack. This effectively means that as you’re executing in a function, variables can be allocated within the context/memory of that function. An example that allocates a structure and an array on the stack:
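
A sketch consistent with the output below (the names match the stack diagrams in the following section):

#include <stdio.h>

#define ARR_SZ 12

/* find val in arr; return its index (or -1), and reset that entry to 0 */
int
find_and_reset(int *arr, int val)
{
    int i;

    for (i = 0; i < ARR_SZ; i++) {
        if (arr[i] == val) {
            arr[i] = 0;
            return i;
        }
    }

    return -1;
}

int
fib(int v)
{
    int fibs[ARR_SZ]; /* allocated on the stack, initialized on each call! */
    int ret;
    int i;

    fibs[0] = 0;
    fibs[1] = 1;
    for (i = 2; i < ARR_SZ; i++) {
        fibs[i] = fibs[i - 1] + fibs[i - 2];
    }
    ret = find_and_reset(fibs, v);

    return ret;
}

int
main(void)
{
    printf("Should find 8 @ 6: %d\n", fib(8));
    printf("Should find 8 @ 6: %d (-1 is wrong here)\n", fib(8));

    return 0;
}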

Program output:

Should find 8 @ 6: 6
Should find 8 @ 6: 6 (-1 is wrong here)

Key points to realize:

  • fibs in fib is allocated on the stack and is initialized in fib!
  • We are passing the array into find_and_reset which modifies the array directly as it is passed as a pointer.
  • The first time that fib is called, it creates fibs. After we return from fib, fibs goes away (strictly: its memory is reclaimed as part of returning from fib).
  • The second time we call fib, it effectively is a new execution of fib, thus a new allocation of fibs that is now initialized a second time.

2.1.4.1 Stack Usage Example

How do we think about this? The stack starts out in main:

stack          |
|              |
+-main---------+
|              | <--- main's local data (note: no local variables)
+--------------+

What we’re drawing here is a “stack frame” for the main function. This includes all of the local data allocated within the function, and (at least conceptually) also includes the arguments passed into the function (main has none). When it calls fib, a stack frame is allocated for fib, the argument is passed in, and fibs variables are allocated:

stack          |
|              |
+-main---------+
|              |
+-fib----------+
| arg: v       | <--- argument to fib
| fibs[ARR_SZ] | <-+- local variables, allocated here!
| ret          | <-'
+--------------+

fib calls find_and_reset:

stack          |
|              |
+-main---------+
|              |
+-fib----------+
| v            |
| fibs[ARR_SZ] |<--+
| ret          |   | pointer to fibs passed as argument, and used to update arr[v]
+find_and_reset+   |
| arg: arr     |---+
| arg: val     |
| i            |
+--------------+

Since find_and_reset updates the array in fib’s stack frame, when it returns, the array is still properly updated. Importantly, once we return to main, we have deallocated all of the variables from the previous call to fib

stack          |
|              |
+-main---------+
|              | <--- returning to main, deallocates the fibs array in fib
+--------------+

…thus the next call to fib allocates a new set of local variables including fibs.

stack          |
|              |
+-main---------+
|              |
+-fib----------+
| arg: v       |
| fibs[ARR_SZ] | <--- new version of fibs, re-initialized when fib is called
| ret          |
+--------------+

2.1.4.2 Common Errors in Stack Allocation

A few common errors with stack allocation include:

  • Uninitialized variables. You must always initialize all variables allocated on the stack, otherwise they can contain seemingly random values (see below).
  • Pointers to stack allocated variables after return. After a function returns, its stack allocated variables are no longer valid (they were deallocated upon return). Any pointers that remain to any of those variables are no longer pointing to valid memory!

We discuss these next.

Initializing stack-allocated variables. For global memory, we saw that variables, if not intentionally initialized, would be set to 0. This is not the case with stack allocated variables; the honest answer is “it depends”. If you compile your code without optimizations, stack allocated variables might happen to be 0, but with optimizations (or simply different surrounding code) they hold whatever values were previously in that memory. Yikes. Thus, we must assume that they are not automatically initialized (similar to the memory returned from malloc). An example:
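
A sketch that prints a stack-allocated array without initializing it:

#include <stdio.h>

int
main(void)
{
    int arr[12]; /* uninitialized! */
    int i;

    for (i = 0; i < 12; i++) {
        printf("%d: %d\n", i, arr[i]);
    }

    return 0;
}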

Program output:

0: 194
1: 0
2: 672536615
3: 32766
4: 672536614
5: 32766
6: -1041644995
7: 21946
8: 2138456808
9: 32645
10: -1041645072
11: 21946

Yikes. We’re getting random values in the array! Where do you think these values came from?

References to stack variables after function return.
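
A sketch with two functions that each leak the address of a local variable, one through a global and one through a return value (the warning and crash below come from the bar-style version):

#include <stdio.h>

int *ptr; /* a global used to (improperly) pass out a local's address */

void
foo(void)
{
    int a = 42;

    ptr = &a; /* pass the address of a local variable through a global */
}

int *
bar(void)
{
    int a = 42;

    return &a; /* Return the address of a local variable. */
}

int
main(void)
{
    foo();
    printf("Save address of local variable, and dereference it: %d\n", *ptr);
    printf("%d\n", *bar()); /* segmentation fault here! */

    return 0;
}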

Program output:

inline_exec_tmp.c: In function bar:
inline_exec_tmp.c:10:9: warning: function returns address of local variable [-Wreturn-local-addr]
   10 |  return &a;            /* Return the address of a local variable. */
      |         ^~
Save address of local variable, and dereference it: 42
make[1]: *** [Makefile:30: inline_exec] Segmentation fault (core dumped)

You can see a few interesting facts from the output.

  1. The value referenced by *ptr after foo is a random value, and dereferencing the return value from bar causes a segmentation fault.
  2. foo and bar contain logic that feels pretty identical. In either case, they are taking a local variable, and passing its address to main where it is used. foo passed it through a global variable, and bar simply returns the address. Despite this, one of them causes a segmentation fault, and the other seems to return a nonsensical value! When you try and use stack allocated variables after they have been freed (by their function returning), you get unpredictable results.
  3. The C compiler is aggressive about issuing warnings as it can tell that bad things are afoot. Warnings are your friend and you must do your development with them enabled.

Stack allocation is powerful and can be quite useful. However, you have to always be careful that stack allocated variables are never added into global data-structures, and are never returned.

2.1.5 Putting it all Together

Let’s look at an example that uses inlined, global, and stack allocation. We’ll avoid heap-based allocation to demonstrate the less normal memory allocation options in C.
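
A sketch of such a key-value store (hypothetical put/get functions; the pairs are embedded inline in a fixed-size global array, and main’s variables are on the stack):

#include <stdio.h>

#define NUM_ITEMS 16

struct kv_pair {
    int   key;
    char *value;
};

/* pairs inline-allocated in a global array: no malloc required */
static struct kv_pair store[NUM_ITEMS];

int
put(int key, char *value)
{
    int i;

    for (i = 0; i < NUM_ITEMS; i++) {
        if (store[i].value == NULL || store[i].key == key) {
            store[i].key   = key;
            store[i].value = value;
            return 0;
        }
    }

    return -1; /* no room left! */
}

char *
get(int key)
{
    int i;

    for (i = 0; i < NUM_ITEMS; i++) {
        if (store[i].value != NULL && store[i].key == key) return store[i].value;
    }

    return NULL;
}

int
main(void)
{
    put(42, "is the secret to everything"); /* the string itself is in global memory */
    printf("42 %s\n", get(42));

    return 0;
}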

Program output:

42 is the secret to everything

In this program you might notice that we’ve used no dynamic memory allocation at all! The cost of this is that we had to create a key-value store of only a fixed size (16 items, here).

2.1.6 Comparing C to Java

C enables a high-degree of control in which memory different variables should be placed. Java traditionally has simple rules for which memory is used for variables:

  • Primitive types are stored in objects and on the stack.
  • Objects are always allocated using new in the heap, and references to objects are always pointers.
  • The heap is managed by the garbage collector, thus avoiding the need for free.

Many aspects of this are absolutely fantastic:

  • The programmer doesn’t need to think about how long the memory for an object needs to stick around – the garbage collector instead manages deallocation.
  • The syntax for accessing fields in objects is uniform and simple: always use a . to access a field (e.g. obj.a) – all object references are pointers, so instead of having -> everywhere, just replace it with the uniform ..

However, there are significant downsides when the goal is to write systems code:

  • Some systems don’t have a heap! Many embedded (small) systems don’t support dynamic allocation.
  • Garbage collection (GC) isn’t free! It has some overhead for some applications, can result in larger memory consumption (as garbage is waiting to be collected), and can cause delays in processing when GC happens. That said, modern GC is pretty amazing and does a pretty good job at minimizing all of these factors.
  • Many data-structures that might be allocated globally or on the stack, instead must be allocated on the heap, which is slower. Similarly, many objects might want other objects to be part of their allocation, but that isn’t possible as each must be a separate heap allocation, which adds overhead.

2.2 Strings

Strings don’t have a dedicated type in C – there is no String type. Strings are

  1. arrays of chars,
  2. with a null-terminator (\0) character in the last array position to denote the termination of the string.

In memory, a simple string has this representation:

char *str = "hi!"

str ---> +---+---+---+---+
         | h | i | ! |\0 |
         +---+---+---+---+

Note that \0 is a single character even though it looks like two. We can see this:
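
For example, a sketch spelling out each character:

#include <stdio.h>

int
main(void)
{
    char str[4] = { 'h', 'i', '!', '\0' }; /* \0 is a single character */

    printf("%s\n", str);

    return 0;
}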

Program output:

hi!

So strings aren’t really all that special in C. They’re just arrays with a special terminating character, and a little bit of syntax support to construct them (i.e. "this syntax"). So what’s there to know about strings in C?

2.2.1 string.h Functions

Working with strings is not C’s strong point. However, you have to handle strings in every language, and C provides a number of functions to do so. You can read about each of these functions with man 3 <function>.

2.2.1.1 Core String Operations

  • strlen - How many characters are in a string (not including the null-terminator)?
  • strcmp - Compare two strings, return 0 if they are the same, or a negative or positive value depending on which is lexicographically less than the other. Similar in purpose to equals in Java.
  • strcpy - Copy into a string the contents of another string.
  • strcat - Concatenate, or append onto the end of a string, another string.
  • strdup - Duplicate a string by mallocing a new string, and copying the string into it (you have to free the string later!).
  • snprintf - You’re familiar with printf, but snprintf enables you to “print” into a string! This gives you a lot of flexibility in easily constructing strings. A downside is that you don’t really know how big the resulting string is, so the n in the name is the maximum length of the string you’re creating.
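
A sketch using snprintf to build the string in the output below:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    char buf[64];

    snprintf(buf, 64, "%s and also %d", "hello world", 42);
    printf("\"%s\" has length %zu\n", buf, strlen(buf));

    return 0;
}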

Program output:

"hello world and also 42" has length 23

Many of these functions raise an important question: What happens if one of the strings is not large enough to hold the data being put into it? If you want to strcat a long string into a small character array, what happens? This leads us to a simple fact…

It is easy to use the string functions incorrectly. Many of these functions also have a strnX variant where X is the operation (strnlen, strncmp, etc…). These are safer variants of the string functions. The key insight here is that if a string is derived from a user, it might not actually be a proper string! It might not, for example, have a null-terminator – uh oh! In that case, many of the above functions will keep on iterating past the end of the buffer.
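
A sketch of this (the behavior is undefined, so the output will vary; here printf reads past the array, and strcpy writes past the destination buffer):

#include <stdio.h>
#include <string.h>

int
main(void)
{
    char not_a_string[3] = { 'z', 'z', 'z' }; /* no null-terminator! */
    char buf[8];

    printf("%s\n", not_a_string); /* printf iterates past the end of the array */
    strcpy(buf, not_a_string);    /* strcpy can copy well past the end of buf */

    return 0;
}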

Program output:

zzz
*** stack smashing detected ***: terminated
make[1]: *** [Makefile:30: inline_exec] Aborted (core dumped)

2.2.1.2 Parsing Strings

While many of the previous functions have to do with creating and modifying strings, computers frequently need to “parse”, or try to understand the different parts of, a string. Some examples:

  • The code we write in .c or .java files is just a long string, and the programming language needs to parse the string to understand what commands you’re trying to issue to the computer.
  • The shell must parse your commands to determine which actions to perform.
  • Webpages are simply a collection of html code (along with other assets), and a browser needs to parse it to determine what to display.
  • The markdown text for this lecture is parsed by pandoc to generate a pdf and html.

Some of the core functions that help us parse strings include:

  • strtol - Pull an integer out of a string. Converts the first part of a string into a long int, and also returns an endptr which points to the character in the string where the conversion into a number stopped. If it cannot find an integer, it will return 0, and endptr is set to point to the start of the string.
  • strtok - Iterate through a string, and find the first instance of one of a number of specific characters, and return the string leading up to that character. This is called multiple times to iterate through the string, each time extracting the substring up to the specific characters. See the example in the man page for strtok for an example.
  • strstr - Try and find a string (the “needle”) in another string (the “haystack”), and return a pointer to it (or NULL if you don’t find it). As such, it finds a string in another string (thus strstr).
  • sscanf - A versatile function that enables a format string (e.g. as used in printf) to specify the format of the string, and extract out digits and substrings.

Some examples:
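
A sketch using strtol, sscanf, and strtok on the same input string:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char  input[] = "1 42 the secret to everything";
    char *end;
    char  word[16];
    long  num;
    int   val;
    char *tok;

    num = strtol(input, &end, 10);
    printf("extracted %ld, remaining string: \"%s\"\n", num, end);
    num = strtol(end, &end, 10);
    printf("extracted %ld, remaining string: \"%s\"\n", num, end);

    sscanf(input, "1 %d the %15s", &val, word);
    printf("%d and \"%s\"\n", val, word);

    printf("Using strtok to parse through a string finding substrings separated by 'h' or 't':\n");
    /* note: strtok modifies the string it parses */
    for (tok = strtok(input, "ht"); tok != NULL; tok = strtok(NULL, "ht")) {
        printf("[%s]\n", tok);
    }

    return 0;
}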

Program output:

extracted 1, remaining string: " 42 the secret to everything"
extracted 42, remaining string: " the secret to everything"
42 and "secret"
Using strtok to parse through a string finding substrings separated by 'h' or 't':
[1 42 ]
[e secre]
[ ]
[o every]
[ing]

2.2.2 Bonus: Explicit Strings

When you use an explicit string (e.g. "imma string") in your code, you’re actually asking C to allocate the string in global memory. This has some strange side-effects:
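
A sketch (the values are illustrative; your addresses will differ):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
    char *str   = "blah";
    char *clone = "blah";         /* the same value: the compiler reuses the literal */
    char *heap  = strdup("blah"); /* a copy in heap memory */

    printf("%s%s%s\n", str, clone, heap);
    printf("%p == %p != %p\n", (void *)str, (void *)clone, (void *)heap);
    free(heap);

    return 0;
}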

Program output:

blahblahblah
0x56097bfb9004 == 0x56097bfb9004 != 0x56097bfbb011

The C compiler and linker are smart enough to see that if you have already used a string with a specific value (in this case the “blah” literal), they will avoid allocating a copy of that string, and will just reuse the previous value. Generally, it doesn’t make much sense to look at the address of strings, and certainly you should not compare them. You can see in this example how you must compare strings for equality using strncmp (or strcmp), and not by comparing pointers.

2.3 API Design and Concerns

When programming in C, you’ll see quite a few APIs, most documented in man pages. It takes some practice in reading man pages to get what you need from them. One of the things that helps the most is to understand a few common patterns and requirements that you find in these APIs, and in C programming in general.

2.3.1 Return Values

Functions often need to return multiple values. C does not provide a means to return more than one value, thus it is forced to use pointers. To understand this, let’s look at the multiple ways that pointers can be used as function arguments.

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

/*
 * `arg` is used to pass an argument that happens to be an array here. In contrast,
 * `ret` is a *second return value*. This function will set the value that `ret` points
 * to -- which happens to be `retval` in the `main` stack frame -- to the value we are
 * getting from the array.
 */
int
get(int *arg, int offset, int *ret)
{
    if (arg == NULL) return -1;

    *ret = arg[offset];

    return 0;
}

/*
 * Again, the array is passed in as the first argument, but this time it is used to
 * store the new value.
 */
int
set(int *ret_val, int offset, int value)
{
    if (ret_val == NULL) return -1;

    ret_val[offset] = value;

    return 0;
}

/*
 * `arrdup`'s job is to duplicate an array by allocating and populating
 * a new array. It will return `0` or not `0` on success/failure. Thus
 * the new array must be returned using pointer arguments. `ret_allocated`
 * is a pointer to a pointer to an array in the calling function, and it
 * is used to return the new array.
 */
int
arrdup(int **ret_allocated, int *args, size_t args_size)
{
    size_t i;
    int *newarr;

    if (ret_allocated == NULL) return -1;

    newarr = calloc(args_size, sizeof(int)); /* 1 below */
    if (newarr == NULL) return -1;

    for (i = 0; i < args_size; i++) {
        newarr[i] = args[i];
    }
    *ret_allocated = newarr; /* 2 and 3 below */

    return 0;
}
/*
 * Lets draw this one. The stack setup when we call `arrdup`:
 *
 * |               |
 * +-main----------+
 * | arr           |<---+
 * | dup           |<-+ |
 * | ...           |  | |
 * +-arrdup--------+  | |
 * | ret_allocated |--+ |
 * | args          |----+
 * | ...           |
 * +---------------+
 *
 * `ret_allocated` points to `dup` in `main`.
 *
 *
 *    3. *ret_allocated = newarr
 *                          ^
 *                          |
 *            ,-------------'
 *           |
 * |         |     |
 * +-main----|-----+
 * | arr     |     |
 * | dup ---'  <------+
 * | ...           |  | -- 2. *ret_allocated
 * +-arrdup--------+  |
 * | ret_allocated ---+
 * | args          |
 * | newarr --------------> 1. calloc(...)
 * +---------------+
 *
 * 1. `arrdup` calls `calloc` to allocate on the heap
 * 2. Dereferencing `ret_allocated` gives us access to `dup`
 * 3. thus we can set dup equal to the new heap memory
 *
 * This effectively enables us to return the new memory into the
 * `dup` variable in main.
 */



int
main(void)
{
    int arr[] = {0, 1, 2, 3};
    int *dup;
    int retval;

    assert(get(arr, 2, &retval) == 0 && retval == 2);
    assert(set(arr, 2, 4) == 0);
    assert(get(arr, 2, &retval) == 0 && retval == 4);
    assert(arrdup(&dup, arr, 4) == 0);
    assert(get(dup, 2, &retval) == 0 && retval == 4);
    free(dup);

    printf("no errors!");

    return 0;
}

Program output:

no errors!

2.3.2 Errors

The first question is how we can detect that some error occurred within a function we have called? We’ll separate functions into two different classes:

  1. Functions that return pointers. Functions that return pointers (e.g. that have declarations of the form type *fn(...)) are often relatively straightforward. If they return NULL, then an error occurred; otherwise the returned pointer can be used. malloc is an example here.
  2. Functions that return integers. Integers are used as a relatively flexible indication of the output of a function. It is common for a -1 to indicate an error. Sometimes any negative value indicates an error, each negative value designating that a different error occurred. Non-negative values indicate success. If a function wishes to return a binary success or failure, you’ll see that many APIs (counter-intuitively) return 0 for success, and -1 for failure.

It is common that you want more information than the return value can give you. You want more information about why the failure happened so that you can debug more easily. The errno variable is UNIX’s solution to this (see its man page), and can be referenced if you include errno.h. errno is simply an integer where specific values represent specific errors. You can view these values by looking them up in the source, or you can ask the errno program. For example:

$ errno -l
EPERM 1 Operation not permitted
ENOENT 2 No such file or directory
ESRCH 3 No such process
EINTR 4 Interrupted system call
EIO 5 Input/output error
...

You can look up specific values:

$ errno 28
ENOSPC 28 No space left on device

You can imagine how that error might occur. What if you’re trying to add files on disk, and the disk runs out of room!?

If you want your program to print a useful error when you encounter such a problem, you can use the perror function (see its man page for documentation) to print out an error, or strerror (via string.h) if you just want a string corresponding to the error.
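
A sketch that fails a huge allocation and reports the error (the message text comes from the system):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>

int
main(void)
{
    char *mem;

    printf("Lets get greedy: allocate %ld bytes!\n", LONG_MAX);
    mem = malloc(LONG_MAX);
    if (mem == NULL) {
        printf("Error: errno value %d and description: %s\n", errno, strerror(errno));
        perror("Error allocating memory");
        return -1;
    }

    return 0;
}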

Program output:

Lets get greedy: allocate 9223372036854775807 bytes!
Error: errno value 12 and description: Cannot allocate memory
Error allocating memory: Cannot allocate memory
make[1]: *** [Makefile:30: inline_exec] Error 255

(Note: when you return from a program with a non-zero value, it designates that your program had an error. This is why we see the make error when it runs your program.)

To understand the return value of UNIX library functions:

  • look at the function return values to understand the type (and identify if it is returning an int or a pointer)
SYNOPSIS
       #include <stdlib.h>

       void *malloc(size_t size);
       ...
  • Read through the description of the function(s).
  • Read through the RETURN VALUE and ERRORS sections of the man page.
RETURN VALUE
       The malloc() and calloc() functions return a pointer to the allocated memory, which is  suitably  aligned  for
       any  built-in type.  On error, these functions return NULL.  NULL may also be returned by a successful call to
       malloc() with a size of zero, or by a successful call to calloc() with nmemb or size equal to zero.
...
ERRORS
       calloc(), malloc(), realloc(), and reallocarray() can fail with the following error:

       ENOMEM Out  of  memory.
       ...

This explains why we got the error we did.

2.3.3 Memory Ownership

One of the last, but most challenging aspects of APIs in C is that of memory ownership. The big question is: when a pointer is passed into a function, or returned from a function, who is responsible for freeing the memory? This is due to a combination of factors, mainly:

  1. C requires that memory is explicitly freed, so someone has to do it, and it should be freed only once, and
  2. pointers are passed around freely and frequently in C, so somehow the function caller and callee have to understand who “owns” each of those pointers.

It is easiest to understand this issue through the concept of ownership: simply put, the owner of a piece of memory is the one that should either free it, or pass it to another part of the code that becomes the owner. In contrast, a caller can pass a pointer to memory into a function allowing it to borrow that memory, but the function does not free it, and after it returns it should not further access the memory. There are four general patterns:

  • A caller passes a pointer into a function, and passes the ownership of that data to the function. This is common for data-structures whose job is to take a pointer of data to store, and at that point, they own the memory. Think: if we enqueue data into a queue.

    Examples: The key-value store’s put function owns the passed in data (assuming it takes a void *).

  • A function returns a pointer to the function caller and passes the ownership to the caller. The caller must later free the data. This is also common in data-structures when we wish to retrieve the data. Think: if we dequeue data from a queue.

    Examples: strdup creates a new string, expecting the caller to free it.

  • A caller passes a pointer into a function, but only allows the function to borrow the data. Thus the caller still owns the memory (thus is still responsible to free the data) after the function returns, and the function should not maintain any references to the data. Think: most of the string functions that take a string as an argument, perform some operation on it, and return expecting the caller to still free the string.

    Examples: Most other functions we’ve seen borrow pointers, perform operations, and then don’t maintain references to them.

  • A function returns a pointer to the function caller that enables the caller to borrow the data. This requires a difficult constraint: the caller can access the data, but must not maintain a pointer to it after the function or API (that still owns the data) frees it.

    Examples: The key-value store’s get function can let the caller borrow the data: the store still owns (and will eventually free) the memory.

The memory ownership constraints are an agreement between the calling function, and a function being called.

2.4 Exercises

2.4.1 Stack Allocation

An old interview question:

How can you write a function that determines if the execution stack grows upwards (from lower addresses to higher), or downwards?

Write this function!

2.4.2 Understanding Memory Ownership

Let’s look at a simple key-value store that needs to learn to be more careful about memory. Above each function, we specify the ownership of pointers being passed – either passing ownership, or borrowing the memory. The current implementations do not adhere to these specifications.

#include <string.h>
#include <stdlib.h>
#include <stdio.h>

/*
 * Lets just use a single key/value as a proxy for an entire kv/store.
 * You can assume that the stored values are strings, so `strdup` can
 * allocate the memory for and copy the values. You absolutely will
 * have to use `strdup` in some of the functions.
 */
static int   kv_key;
static char *kv_value;

/**
 * The returned value should not be maintained in the data-structure
 * and should be `free`d by the caller.
 */
char *
get_pass_ownership(int key)
{
    if (key != kv_key) return NULL;

    return kv_value;
}

/**
 * Pointers to the returned value are maintained in this data-structure
 * and it will be `free`d by the data-structure, not by the caller
 */
char *
get_borrow(int key)
{
    if (key != kv_key) return NULL;

    return kv_value;
}

/**
 * Pointers to the `value` passed as the second argument are maintained by
 * the caller, thus the `value` is borrowed here. The `value` will be `free`d
 * by the caller.
 */
void
set_borrow(int key, char *value)
{
    /* What do we do with `kv_value`? Do we `strdup` anything? */
    kv_key   = key;
    kv_value = value;

    return;
}

/**
 * Pointers to the `value` passed as the second argument are not maintained by
 * the caller, thus the `value` should be `free`d by this data-structure.
 */
void
set_pass_ownership(int key, char *value)
{
    /* What do we do with `kv_value`? Do we `strdup` anything? */
    kv_key   = key;
    kv_value = value;

    return;
}

int
main(void)
{
    /* The values we pass in. */
    char *v_p2p = strdup("value1"); /* calls `malloc` ! */
    char *v_p2b = strdup("value2");
    char *v_b2p = strdup("value3");
    char *v_b2b = strdup("value4");

    /* The return values */
    char *r_p2p, *r_p2b, *r_b2p, *r_b2b;

    /* p2p: passing ownership on set, passing ownership on get */
    set_pass_ownership(0, v_p2p);
    r_p2p = get_pass_ownership(0);
    /* The question: should we `free(v_p2p)`?, `free(r_p2p)`? */

    /* p2b: passing ownership on set, borrowing memory for get */
    set_pass_ownership(0, v_p2b);
    r_p2b = get_borrow(0);
    /* The question: should we `free(v_p2b)`?, `free(r_p2b)`? */

    /* b2p: borrowing ownership on set, passing ownership on get */
    set_borrow(0, v_b2p);
    r_b2p = get_pass_ownership(0);
    /* The question: should we `free(v_b2p)`?, `free(r_b2p)`? */

    /* b2b: borrowing ownership on set, borrowing on get */
    set_borrow(0, v_b2b);
    r_b2b = get_borrow(0);
    /* The question: should we `free(v_b2b)`?, `free(r_b2b)`? */

    if (kv_value) free(kv_value);

    printf("Looks like success!...but wait till we valgrind; then ;-(\n");

    return 0;
}

Program output:

Looks like success!...but wait till we valgrind; then ;-(

The above code is hopelessly broken. Run it in valgrind to see.

Tasks:

  • In the above code, implement the malloc/free/strdup operations that are necessary both in the key-value implementation, and in the client (main) to make both the caller and callee abide by the memory ownership constraints.
  • In which cases can stack allocation of the values be used in main? Why?
  • In which cases can stack allocation of the values in the key-value store (i.e. in get/set) be used? Why?

3 Processes

Programming is hard. Really hard. When you write a program, you have to consider complex logic and the design of esoteric data-structures, all while desperately trying to avoid errors. It is an exercise that challenges all of our logical, creative, and risk management facilities. As such, programming is the act of methodically conquering complexity with our ability to abstract. We walk on the knives-edge of our own abstractions where we can barely, just barely, make progress and engineer amazing systems.

Imagine if when programming and debugging, we had to consider the actions and faults of all other programs running in the system? If your program crashed because one of your colleagues’ program had a bug, how could you make progress?

Luckily, we stand on the abstractions of those that came before us. A key abstraction provided by systems is that of isolation - that one program cannot arbitrarily interfere with the execution of another. This isolation is a core provision of Operating Systems (OSes), and a core abstraction they provide is the process. At the highest-level, each process can only corrupt its own memory, and a process that goes into an infinite loop cannot prevent another process from executing.

A process has a number of properties including:

  • It contains the set of memory that is only accessible to that process. This memory includes the code, global data, stack, and dynamically allocated memory in the heap. Different processes have disjoint sets of memory, thus no matter how one process alters its own memory, it won’t interfere with the data-structures of another.
  • A number of descriptors identified by integers that enable the process to access the “resources” of the surrounding system including the file system, networking, and communication with other processes. When we printf, we interact with the descriptor to output to the terminal. The OS prevents processes from accessing and changing resources they shouldn’t have access to.
  • A set of unique identifiers including a process identifier (pid).
  • The “current working directory” for the process, analogous to pwd.
  • An owning user (e.g. gparmer) on whose behalf the process executes.
  • Each process has a parent - the process that created it, and that has the ability and responsibility to oversee it (like a normal, human parent).
  • Arguments to the process often passed on the command line.
  • Environment variables that give the process information about the system’s configuration.
A process, and many of its properties.

Throughout the class we’ll uncover more and more of these properties, but in this lecture, we’ll focus on the lifecycle of a process, and its relationship to its parent. Processes are the abstraction that provide isolation between users, and between computations, thus it is the core of the security of the system.

3.1 History: UNIX, POSIX, and Linux

UNIX is an old system, having its origins in 1969 at Bell Labs when it was created by Ken Thompson and Dennis Ritchie. Time has shown it is an “oldie but goodie” – as most popular OSes are derived from it (OSX, Linux, Android, …). In the OS class, you’ll dive into an implementation of UNIX very close to what it was in those early years! The original paper is striking in how uninteresting it is to most modern readers. UNIX is so profound that it has defined what is “normal” in most OSes.

UNIX Philosophy: A core philosophy of UNIX is that applications should not be written as huge monolithic collections of millions of lines of code, and should instead be composed of smaller programs, each focusing on “doing one thing well”. Programs focus on inputting text, and outputting text. Pipelines of processes take a stream of text and process it, process after process, to get an output. The final program is a composition of these processes, composed together using pipelines.

In the early years, many different UNIX variants were competing for market-share along-side the likes of DOS, Windows, OS2, etc… This led to differences between the UNIX variants which prevented programmers from effectively writing programs that would work across the variants. In response to this, in the late 1980s, POSIX standardized many aspects of UNIX including command-line programs and the standard library APIs. man pages are documentation for what is often part of the POSIX specification. I’ll use the term UNIX throughout the class, but often mean POSIX. You can think of Linux as an implementation of POSIX and as a variant of UNIX.

UNIX has been taken into many different directions. Android and iOS layer a mobile runtime on top of UNIX; OSX and Ubuntu layer modern graphics and system management on top of UNIX, and even Windows Subsystem for Linux (WSL) enables you to run a POSIX environment inside of Windows. In many ways, UNIX has won. However, it has won by being adapted to different domains – you might be a little hard-pressed looking at the APIs for some of those systems to understand that it is UNIX under the hood. Regardless, in this class, we’ll be diving into UNIX and its APIs to better understand the core technology underlying most systems we use.

The core UNIX philosophy endures, but has changed shape. In addition to the pervasive deployment of UNIX, Python is popular as it has good support to compose together disparate services and libraries, and applications are frequently compositions of various REST/CRUD webservice APIs, and json is the unified language in which data is shared.

3.2 Process: Life and Death

When we construct a pipeline of processes…

$ ps aux | grep gparmer

we’re executing two processes that separately run the code of the programs ps and grep. The shell is the parent process that creates the ps and grep processes. It ensures that the output descriptor of the ps process sends its data to the input descriptor of the grep process.

What are the functions involved in the creation, termination, and coordination between processes?

  • fork - Create a new, child, process from the state of the calling process.
  • getpid, getppid - get the unique identifier (the Process ID, or pid) for the process, or for its parent.
  • exit - This is the function to terminate a process.
  • wait - A parent awaits a child exiting.
  • exec - Execute a new program using the current process.

Let’s be a little more specific. A program is an executable file. We see them in the file system (for example, see ls /bin), and create them when we compile our programs. When you compile your C programs, the generated file that you can execute, is a program. A process is an executing program with the previously mentioned characteristics (disjoint memory, descriptors to resources, etc…).

First, let’s look at fork:
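
A sketch of calling fork and printing the pids involved (the specific pid values below will differ from run to run):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    pid_t pid;

    printf("pre-fork:\n     parent pid %d\n", getpid());
    fflush(stdout); /* flush buffered output so the child doesn't re-print it */

    pid = fork();

    /* this runs twice: once in the parent, and once in the child */
    printf("post-fork:\n    returned %d,\n    current pid %d\n    parent pid %d\n",
           pid, getpid(), getppid());

    return 0;
}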

Program output:

pre-fork:
     parent pid 308364
post-fork:
    returned 308365,
    current pid 308364
    parent pid 308355
post-fork:
    returned 0,
    current pid 308365
    parent pid 308364

We can see the fork creates a new process that continues running the same code, but has a separate set of memory. A few observations:

  1. In this case, you can see that the pid variable is different for the two processes. In fact, after the fork call, all of memory is copied into the new process and is disjoint from that of the parent.
  2. The getpid function returns the current process’ process identifier.
  3. fork returns two values - one in the parent in which it returns the process id of the newly created child, and one in the child in which it returns 0. This return value is one of the main indicators of whether the current process is the parent, or the child.

So much complexity for such a simple function prototype!

An analogy for fork. Each fork creates a copy of the last.
Little known fact: fork is the basis for life.

exit and wait. In contrast to fork that creates a new process, exit is the function that is used to terminate the current process. exit’s integer argument is the “exit status” – 0 is equivalent to EXIT_SUCCESS, and denotes a successful execution of the program, while any non-zero value (EXIT_FAILURE is defined as 1) denotes execution failure. The return value from main is also this exit status. Where is the exit status used?

Parent processes have the responsibility to manage their children. This starts with the wait function. A parent that calls wait will “wait” for the child to exit. Then, wait enables a parent to retrieve the exit status of the child.

The relationship between wait (in the parent), and exit or return from main (in the child) is a little hard to understand. A figure always helps…

The relationship between a waiting parent, and a child that exits or returns from main.

Adopting orphaned children. What happens if a parent exits before a child? Who is allowed to wait and get the child’s status? Intuition might tell you that it is the grandparent but that is not the case. Most programs are not written to understand that wait might return a pid other than ones they’ve created. Instead, if a parent terminates, the init process – the process that is the ancestor of all processes – adopts the child, and waits for it. We’ll dive into init later in the class.

Interpreting exit status. Despite wait’s status return value being a simple int, the bits of that integer mean very specific things. See the man page for wait for details, but we can use a set of functions (really “macros”) to interpret the status value:

  • WIFEXITED(status) will return 0 if the process didn’t exit (e.g. if it faulted instead), and non-0 otherwise.
  • WEXITSTATUS(status) will get the intuitive integer value that was passed to exit (or returned from main), but assumes that this value is only 8 bits large, max (thus has a maximum value of 255).

3.2.1 Process Example
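
A sketch that forks four children, each exiting with a different status (WEXITSTATUS yields only the low 8 bits, so we cast to recover the negative values):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int
main(void)
{
    int i;

    for (i = 0; i < 4; i++) {
        if (fork() == 0) { /* in the child */
            printf("Child %d exiting with %d\n", getpid(), -i);
            exit(-i); /* exit before creating children of our own */
        }
    }

    for (i = 0; i < 4; i++) { /* in the parent: wait for each child */
        int   status;
        pid_t child = wait(&status);

        if (WIFEXITED(status)) {
            printf("Child %d exited with exit status %d\n",
                   child, (signed char)WEXITSTATUS(status));
        }
    }

    return 0;
}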

Program output:

Child 308385 exiting with -1
Child 308384 exiting with 0
Child 308386 exiting with -2
Child 308387 exiting with -3
Child 308384 exited with exit status 0
Child 308385 exited with exit status -1
Child 308386 exited with exit status -2
Child 308387 exited with exit status -3

3.2.2 Non-determinism Everywhere: Concurrency

The output of this program can actually vary, somewhat arbitrarily, in the following way: any of the output lines can be reordered with respect to any other, except that the parent’s print out for a specific child must come after it.

For example, all of the following are possible:

Child 4579 exiting with 0
Child 4580 exiting with -1
Child 4581 exiting with -2
Child 4582 exiting with -3
Child 4579 exited with exit status 0
Child 4580 exited with exit status -1
Child 4581 exited with exit status -2
Child 4582 exited with exit status -3

Child 4579 exiting with 0
Child 4579 exited with exit status 0
Child 4580 exiting with -1
Child 4580 exited with exit status -1
Child 4581 exiting with -2
Child 4581 exited with exit status -2
Child 4582 exiting with -3
Child 4582 exited with exit status -3

Child 4582 exiting with -3
Child 4582 exited with exit status -3
Child 4581 exiting with -2
Child 4581 exited with exit status -2
Child 4580 exiting with -1
Child 4580 exited with exit status -1
Child 4579 exiting with 0
Child 4579 exited with exit status 0

This non-determinism is a product of the isolation that is provided by processes. The OS switches back and forth between processes frequently (up to thousands of times per second!) so that if one goes into an infinite loop, others will still make progress. But this also means that the OS can choose to run any of the processes that are trying to execute at any point in time! We cannot predict the order of execution, completion, or wait notification. This non-deterministic execution is called concurrency. You’ll want to keep this in mind as you continue to learn the process APIs, and when we talk about IPC, later.

3.2.3 Taking Actions on exit

Let’s dive into exit a little bit.

Believe it or not, this is identical to

…because when main returns, it calls exit.
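
The two snippets referenced above aren’t shown; the equivalence is between something like the following pair (a minimal reconstruction):

#include <stdlib.h>

int
main(void)
{
    return EXIT_SUCCESS; /* returning from main... */
}

…and:

#include <stdlib.h>

int
main(void)
{
    exit(EXIT_SUCCESS); /* ...is identical to calling exit directly */
}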

Investigating main return → exit via gdb. You can see this by diving into the return.c program with gdb -tui, breakpointing before the return (e.g. b 5), and single-stepping through the program. You’ll want to layout asm to drop into “assembly mode”; single-step through the assembly, and use stepi or si to step one instruction at a time. You can see that it ends up calling __GI_exit. __GI_* functions are glibc internal functions, so we see that libc is actually calling main, and when it returns, it is then going through its logic for exit.

Though exit is a relatively simple API, it has a few interesting features.

  • on_exit which takes a function pointer to be called upon returning from main or calling exit. The function receives the status, and an argument passed to on_exit (see the sketch after this list). So we can see that calling exit does not immediately terminate the process!
  • atexit takes a simpler function pointer (that doesn’t receive any arguments) to call at exit.
  • _exit terminates a process without calling any of the function pointers set up by on_exit or atexit.
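
A minimal sketch of on_exit usage, consistent with the output below (an assumption, not the original source; on_exit is a glibc extension):

#include <stdio.h>
#include <stdlib.h>

/* called on exit: receives the exit status and the argument
 * that was registered with on_exit */
void
cleanup(int status, void *data)
{
    printf("post-exit: status %d, data %ld\n", status, (long)data);
}

int
main(void)
{
    on_exit(cleanup, (void *)1);
    printf("pre-exit\n");

    return 0; /* calls exit, which invokes cleanup */
}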

Program output:

pre-exit
post-exit: status 0, data 1

3.2.4 fork Questions

  1. How many processes (not including the initial one) are created in the following?

  2. What are the possible outputs for:

3.3 Executing a Program

Recall, a program is an executable that you store in your file-system. These are the output of your compilation, the “binaries”. So fork lets us make a new process from an existing process, both of which are executing within a single program. But how do we execute a new program?

The exec family of system calls will do the following:

  • Stop executing in the current process.
  • Reclaim all memory within the current process.
  • Load the target program into the process.
  • Start executing in the target program (i.e. starting normally, resulting in main execution).

The core insight is that the same process continues execution; it simply continues execution in the new program. This is a little unnatural, but has a few handy side-effects:

  • The execution of the new program inherits the process identifier (pid_t) and the parent/child relationships of the process.
  • Comparably, the descriptors are inherited into the new program’s execution.
  • The environment variables (see section below) are passed to the new program’s execution.
  • Many other process properties are comparably inherited. With the exception of the process memory, you can assume, by default, that process properties are inherited across an exec.
exec replaces the previous program executing in this process with the execed program. Not shown here: maintaining the descriptors, current working directory, etc… of the program previously executing in this process.

3.3.1 exec APIs

There are a few ways to execute a program:

  • execl
  • execlp
  • execle
  • execv
  • execvp

The naming scheme is quite annoying and hard to remember, but the man page has a decent summary. The trailing characters correspond to specific operations that differ in how the command-line arguments are passed to the main (l means pass the arguments to this exec call to the program, while v means pass the arguments as an array of the arguments into the exec call), how the program’s path is specified (by default, an “absolute path” starting with / must be used, but in a p variant, the binary is looked up using comparable logic to your shell), and how environment variables are passed to the program (the e variants take an explicit environment array). For now, we’ll simply use execvp, and cover the rest in subsequent sections.
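
As a sketch of execvp usage (this is not the original example, which appears to exec a program that prints hello world! and returns 42; here we exec echo, so the parent would report status 0):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int
main(void)
{
    pid_t pid = fork();
    int status;

    if (pid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (pid == 0) {
        /* v: a NULL-terminated argument array; p: "echo" is looked up in the PATH */
        char *args[] = { "echo", "hello world!", NULL };

        execvp(args[0], args);
        perror("execvp"); /* only reached on error: exec does not return! */
        exit(EXIT_FAILURE);
    }
    waitpid(pid, &status, 0);
    if (WIFEXITED(status)) printf("Parent: status %d\n", WEXITSTATUS(status));

    return 0;
}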

Program output:

hello world!
Parent: status 42

3.4 Command Line Arguments

I think that we likely have a decent intuition about what the command-line arguments are:

$ ls /bin /sbin

The ls program takes two arguments, /bin and /sbin. How does ls access those arguments?

Let’s look at a chain of programs that exec each other. The first program (that you see here) is called inline_exec_tmp, and the programs 03/args?.c are subsequently executed.

A chain of processes execing each other, and passing arguments. We only print them out in the 3rd program, args2.bin.

Program output:

First program, arg 1: ./inline_exec_tmp
Inside ./03/args2.bin
arg 0: ./03/args2.bin
arg 1: hello
arg 2: world

args1.c is

…and args2.c is

So we see the following.

  • It is convention that the first argument to a program is always the program itself. The shell will always ensure that this is the case (NB: the shell execs your programs).
  • The rest of the arguments are passed as separate entries in the array of arguments.
  • The v variants of exec require the NULL termination of the argument array, something that is easy to mess up!

Parsing through the command-line arguments can be a little annoying, and getopt can help.
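
For example, a small program that accepts a -v flag and a -f <file> option via getopt (the specific flags here are made up for illustration):

#include <stdio.h>
#include <unistd.h> /* getopt, optarg */

int
main(int argc, char *argv[])
{
    int opt;
    char *file = NULL;
    int verbose = 0;

    /* "vf:": accept -v (no argument) and -f <file> (with an argument) */
    while ((opt = getopt(argc, argv, "vf:")) != -1) {
        switch (opt) {
        case 'v': verbose = 1; break;
        case 'f': file = optarg; break;
        default:
            fprintf(stderr, "usage: %s [-v] [-f file]\n", argv[0]);
            return 1;
        }
    }
    printf("verbose: %d, file: %s\n", verbose, file ? file : "(none)");

    return 0;
}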

3.5 Environment Variables

Environment variables are UNIX’s means of providing configuration information to any process that might want it. They are a key-value store9 that maps an environment variable to its value (both are strings).

Environment variables are used to make configuration information accessible to programs. They are used instead of command-line arguments when:

  • You don’t want the user worrying about passing the variable to a program. For example, you don’t want the user to have to pass their username along to every program as an argument.
  • You don’t know which programs are going to use the configuration data, but any of them could. You don’t want to pass a bunch of command-line variables into each program, and expect them to pass them along to each child process.

Example common environment variables include:

  • PATH - a :-separated list of file system paths to use to look for programs when attempting to execute a program.
  • HOME - the current user’s home directory (e.g. /home/gparmer).
  • USER - the username (e.g. gparmer).
  • TEMP - a directory that you can use to store temporary files.

Many programs set up and use their own environment variables. Note that environment variables are pretty pervasively used – simple libraries exist to access them from python, node.js, rust, java, etc…

You can easily access environment variables from the command line:

$ echo $HOME
/home/gparmer
$ export BESTPUP=penny
$ echo $BESTPUP
penny

Any program executed from that shell will be able to access the “BESTPUP” environment variable. The env command dumps out all current environment variables.

3.5.1 Environment Variable APIs

So how do we access the environment variable key-value store in C? The core functions for working with environment variables include:

  • getenv - Get an environment variable’s value.
  • setenv - Set an environment variable’s value (used by the shell to set up children’s variables).
  • clearenv - Reset the entire environment.
  • environ array - This is the array of environment variables you’ll see in the man pages. You don’t want to access/modify this directly, but you can imagine it is used to back all of the previous calls.
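
A sketch consistent with the output below (an assumption, not the original source):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    char *user = getenv("USER");
    char *home = getenv("HOME");

    /* getenv returns NULL if the variable isn't set */
    if (user == NULL || home == NULL) return EXIT_FAILURE;
    printf("I am %s, and I live in %s\n", user, home);

    return 0;
}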

Program output:

I am ycombinator, and I live in /home/ycombinator

You can see all of the environmental variables available by default with:

$ env
SHELL=/bin/bash
DESKTOP_SESSION=ubuntu
EDITOR=emacs -nw
PWD=/home/gparmer/repos/gwu-cs-sysprog/22/lectures
LOGNAME=gparmer
HOME=/home/gparmer
USERNAME=gparmer
USER=gparmer
PATH=/home/gparmer/.local/bin::/home/gparmer/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/racket/bin/
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
...

Program output:

user: ycombinator
user (forked child): penny
Environment variable USER -> penny

03/envtest.c is

A common use of environment variables is the “home” directory in your shell. How is this implemented?

$ cd ~

The ~ means “my home directory”. To understand which directory is a user’s home directory, you can getenv("HOME")!

3.5.2 An Aside: Creating Processes with posix_spawn

fork and exec are not the only functions to execute a program. posix_spawn also enables the creation of a new process that executes a given program. posix_spawn performs three high-level actions:

  1. fork to create a new process,
  2. a set of “file actions” that update and modify the files/descriptors for the new process, which are specified in an array argument to posix_spawn, and
  3. exec to execute a program and pass arguments/environmental variables.

It is strictly more limited in what it can do than fork and exec, but this is often a feature, not a bug. fork is really hard to use well, and can be quite confusing. It is considered by some to be a flawed API. Thus the focus of posix_spawn on specifically executing a new program can be quite useful to simplify programs.
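
A minimal sketch of posix_spawnp (the p variant looks the program up in the PATH; the echo command here is just for illustration):

#include <spawn.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

extern char **environ;

int
main(void)
{
    pid_t pid;
    char *args[] = { "echo", "hello from posix_spawn", NULL };
    int ret, status;

    /* NULL file actions and attributes: a plain fork + exec */
    ret = posix_spawnp(&pid, args[0], NULL, NULL, args, environ);
    if (ret != 0) {
        /* posix_spawn returns an error number rather than setting errno */
        fprintf(stderr, "posix_spawnp: %s\n", strerror(ret));
        return 1;
    }
    waitpid(pid, &status, 0);

    return 0;
}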

3.6 Representations of Processes

We’ve seen how to create and manage processes and how to execute programs. Amazingly, modern systems have pretty spectacular facilities for introspection into executing processes. Introspection facilities generally let you dive into something as it runs. The most immediate example of this is gdb, or any debugger, that lets you dive into an implementation. Looking at pause.c:
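
pause.c isn’t reproduced here; a sketch of what it plausibly contains, based on the output below (an assumption, not the original source):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

const char global_ro[] = "read-only"; /* lands in the read-only data section */
int global = 0;                       /* lands in the read-write data section */

int
main(void)
{
    int stack_var;
    int *heap = malloc(sizeof(int));

    printf("pid %d\n", getpid());
    printf("global (RO):\t%p\n", (void *)global_ro);
    printf("global:\t\t%p\n", (void *)&global);
    printf("stack:\t\t%p\n", (void *)&stack_var);
    printf("heap:\t\t%p\n", (void *)heap);
    printf("function:\t%p\n", (void *)main);
    fflush(stdout);

    pause(); /* block awaiting a signal, so we can inspect /proc while it runs */

    return 0;
}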

Program output:

pid 9310
global (RO):    0x55be6d40b008
global:         0x55be6d40d014
stack:          0x7ffc7abafc7c
heap:           0x55be6da162a0
function:       0x55be6d40a1c9

Let’s take the process identifier, and dive into the process! The “proc filesystem” in Linux is a part of the file system devoted to representing processes. There is a subdirectory in it for each process in the system. Let’s check out process 9310.

$ cd /proc/9310/
$ ls
arch_status  cgroup      coredump_filter  exe      io         maps       mountstats  oom_adj        patch_state  sched      smaps         statm    timers
attr         clear_refs  cpuset           fd       limits     mem        net         oom_score      personality  schedstat  smaps_rollup  status   timerslack_ns
autogroup    cmdline     cwd              fdinfo   loginuid   mountinfo  ns          oom_score_adj  projid_map   sessionid  stack         syscall  uid_map
auxv         comm        environ          gid_map  map_files  mounts     numa_maps   pagemap        root         setgroups  stat          task     wchan
$ cat maps
55be6d409000-55be6d40a000 r--p 00000000 08:02 1315893                    /home/ycombinator/repos/gwu-cs-sysprog/22/lectures/03/pause.bin
55be6d40a000-55be6d40b000 r-xp 00001000 08:02 1315893                    /home/ycombinator/repos/gwu-cs-sysprog/22/lectures/03/pause.bin
55be6d40b000-55be6d40c000 r--p 00002000 08:02 1315893                    /home/ycombinator/repos/gwu-cs-sysprog/22/lectures/03/pause.bin
55be6d40c000-55be6d40d000 r--p 00002000 08:02 1315893                    /home/ycombinator/repos/gwu-cs-sysprog/22/lectures/03/pause.bin
55be6d40d000-55be6d40e000 rw-p 00003000 08:02 1315893                    /home/ycombinator/repos/gwu-cs-sysprog/22/lectures/03/pause.bin
55be6da16000-55be6da37000 rw-p 00000000 00:00 0                          [heap]
7ff4a127f000-7ff4a12a4000 r--p 00000000 08:02 11797912                   /lib/x86_64-linux-gnu/libc-2.31.so
7ff4a12a4000-7ff4a141c000 r-xp 00025000 08:02 11797912                   /lib/x86_64-linux-gnu/libc-2.31.so
7ff4a141c000-7ff4a1466000 r--p 0019d000 08:02 11797912                   /lib/x86_64-linux-gnu/libc-2.31.so
7ff4a1466000-7ff4a1467000 ---p 001e7000 08:02 11797912                   /lib/x86_64-linux-gnu/libc-2.31.so
7ff4a1467000-7ff4a146a000 r--p 001e7000 08:02 11797912                   /lib/x86_64-linux-gnu/libc-2.31.so
7ff4a146a000-7ff4a146d000 rw-p 001ea000 08:02 11797912                   /lib/x86_64-linux-gnu/libc-2.31.so
7ff4a146d000-7ff4a1473000 rw-p 00000000 00:00 0
7ff4a1495000-7ff4a1496000 r--p 00000000 08:02 11797896                   /lib/x86_64-linux-gnu/ld-2.31.so
7ff4a1496000-7ff4a14b9000 r-xp 00001000 08:02 11797896                   /lib/x86_64-linux-gnu/ld-2.31.so
7ff4a14b9000-7ff4a14c1000 r--p 00024000 08:02 11797896                   /lib/x86_64-linux-gnu/ld-2.31.so
7ff4a14c2000-7ff4a14c3000 r--p 0002c000 08:02 11797896                   /lib/x86_64-linux-gnu/ld-2.31.so
7ff4a14c3000-7ff4a14c4000 rw-p 0002d000 08:02 11797896                   /lib/x86_64-linux-gnu/ld-2.31.so
7ff4a14c4000-7ff4a14c5000 rw-p 00000000 00:00 0
7ffc7ab90000-7ffc7abb1000 rw-p 00000000 00:00 0                          [stack]
7ffc7abe1000-7ffc7abe4000 r--p 00000000 00:00 0                          [vvar]
7ffc7abe4000-7ffc7abe5000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

There’s a lot going on here. The most important parts are the ranges of addresses on the left, which tell us where we can find the pause.bin program’s code, read-only global data, global read-writable data, heap, and stack! We also see that the number of maps in a very simple process is surprisingly large. Let me manually focus in on a few parts of this.

...
55be6d40a000-55be6d40b000 r-xp ... /03/pause.bin <-- &main      (0x55be6d40a1c9)
55be6d40b000-55be6d40c000 r--p ... /03/pause.bin <-- &global_ro (0x55be6d40b008)
...
55be6d40d000-55be6d40e000 rw-p ... /03/pause.bin <-- &global    (0x55be6d40d014)
...
55be6da16000-55be6da37000 rw-p ... [heap]        <-- heap       (0x55be6da162a0)
...
7ff4a14c4000-7ff4a14c5000 rw-p ...
7ffc7ab90000-7ffc7abb1000 rw-p ... [stack]       <-- stack      (0x7ffc7abafc7c)
...

We can see that each of the variables we’re accessing in the program are at addresses that correspond to the ranges of memory in the maps. Even more, the /proc/9310/mem file contains the actual memory for the process! This is really amazing: we can watch the memory, even as it changes, as a process actually executes. This is how debuggers work!

As you can see, processes are data too!

3.7 Process Exercises

3.7.1 Exercise 1: fibonacci with fork

Implement forkonacci! This will compute the nth Fibonacci number, using fork for the recursion, and a combination of exit to “return” and wait to retrieve the returned value, instead of function calls. The normal recursive implementation:
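
For reference, the standard recursive implementation looks something like:

unsigned int
fib(unsigned int n)
{
    if (n <= 1) return n;

    return fib(n - 1) + fib(n - 2);
}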

This will work decently for n <= 12, but not for n > 12. Why10?

3.7.2 Exercise 2: Shell Support for Environment Variables

How do you think that the shell-based support for environment variables (export BESTPUP=penny) is implemented? Specifically, if we do the following…

$ export BESTPUP=penny
$ ./03/envtest.bin BESTPUP  // this just prints the environment variable
Environment variable BESTPUP -> penny

…which process is using which APIs? Put another way, how is export implemented?

3.7.3 Exercise 3: Process Detective

The /proc filesystem is a treasure trove of information. You can dive into the process with pid N’s information in /proc/N/. You’ll only have access to the information for your processes. So how do you find the pid of your processes? You can find the pid of all of the processes that belong to you using ps aux | grep gparmer | awk '{print $2}'11, replacing gparmer with your username.

Choose one of those, and go into its directory in /proc. If it doesn’t work, you might have chosen a process that has terminated since you found its ID. Try again.

What do you think the following files contain:

  • cmdline
  • environ
  • status

Some of the files are binary, which means you might need to use xxd (a hexadecimal binary view program) to look at them. In xxd, pay attention to the right column.

4 Process Descriptors and Signals

We’ve discussed the APIs for process manipulation and program execution, but now we’re going to start discussing how processes can manipulate the resources they have access to. The key UNIX mechanisms that processes must use effectively are

  • the current working directory (reported by pwd),
  • the ability to manipulate and use their descriptors – often called “file descriptors” though they are more general than only interfacing with files, and
  • controlling and leveraging the exceptional control flow provided by signals.

4.1 Current Working Directory

Each process has a “working directory” which is used as the base for any relative paths in the file system. All file system paths are either absolute paths – in which case they begin with / and specify the path from the “root” – or relative paths. Relative paths are quite frequently used when we interact with the shell. Every time you type cd blah/, you’re saying “please change the current working directory to blah/, which is a directory in the current working directory”. Similarly, cat penny_the_best_pup.md looks up the file penny_the_best_pup.md in the current working directory.

This is backed by a simple API:

  • getcwd which lets you get the current process’ working directory, and
  • chdir which enables the process to change its working directory (see the sketch below).
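
A sketch consistent with the output below (the directory names in the output come from the environment it was run in):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    char buf[4096];

    if (getcwd(buf, sizeof(buf)) == NULL) return 1;
    printf("Current directory: %s\n", buf);

    if (chdir("..") == -1) return 1; /* move up one level */

    if (getcwd(buf, sizeof(buf)) == NULL) return 1;
    printf("New current dir: %s\n", buf);

    return 0;
}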

Program output:

Current directory: /home/ycombinator/repos/sp/lectures
New current dir: /home/ycombinator/repos/sp

The cd command in shells is actually not a program, and is instead a shell-internal function. You can confirm this using the which program, which tells you the location of a program:

$ which ls
/bin/ls
$ which pwd
/bin/pwd
$ which cd
$

cd isn’t a program, so which can’t tell us where it is!

4.2 Process Descriptors and Pipes

Each process has a set of descriptors that are each associated with some resource in the system. Each descriptor is an integer that is passed into a family of APIs that perform operations on their corresponding resources. For example, most processes have at least a set of three descriptors that are each attached to a resource:

  • STDIN_FILENO = 0 (defined in unistd.h) - This is the standard input, or the main way the process gets input from the system. As such, the resource is often the terminal if the user directly provides input, or sometimes the output of a previous stage in a command-line pipeline.
  • STDOUT_FILENO = 1 - This is the standard output, or the main way the process sends its output to the system. When we call printf, it will, by default, output using this descriptor.
  • STDERR_FILENO = 2 - This is the standard error, or the main way that the process outputs error messages. perror (and other error-centric APIs) output to the standard error.

Each of these descriptors is associated with a potentially infinite sequence of bytes, or channel12. When we type at the shell, we’re providing a channel of data that is sent to the standard input of the active process. When a process prints, it sends a sequence of characters to the standard output, and because programs can loop, that stream can be infinite! Channels, however, can terminate if they run out of data. This could happen, for example, if a channel reading through a file reaches the end of the file, or if the user hits cntl-d on their terminal to signify “I don’t have any more data”.

4.2.1 File Descriptor Operations

File descriptors are analogous to pointers. The descriptors effectively point to channel resources. Some of the core operations on file descriptors include:

  • read - Pull a sequence of bytes from the channel into a buffer in memory. This is one of the core means of reading data from files, pipes, and other resources.
  • write - Send a sequence of bytes from a buffer in memory into a channel. At its core, even printf ends up being a write to the standard out descriptor.
  • dup/dup2 - Take a file descriptor as an argument, and create a new descriptor that is a duplicate of the existing descriptor. Note, this does not duplicate the channel itself, just the descriptor. The analogy is that dup just copies the pointer, not the resource it points to.
  • close - Deallocate a file descriptor. This is like removing a pointer – it doesn’t necessarily remove the backing channel, for example, if the descriptor was previously duped.

Let’s see some of these in action.
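
A sketch consistent with the output below (an assumption, not the original source):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char *msg = "hello world\n";
    int new_fd;

    /* send the same bytes to standard output and standard error */
    write(STDOUT_FILENO, msg, strlen(msg));
    write(STDERR_FILENO, msg, strlen(msg));

    /* dup returns the lowest unused descriptor: 3 */
    new_fd = dup(STDOUT_FILENO);
    printf("in: %d, out: %d, err: %d, new: %d\n",
           STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO, new_fd);
    fflush(stdout);

    /* after closing stdin, descriptor 0 is the lowest free one */
    close(STDIN_FILENO);
    printf("New descriptor @ %d\n", dup(new_fd));

    return 0;
}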

Program output:

hello world
hello world
in: 0, out: 1, err: 2, new: 3
New descriptor @ 0

You can run this, and redirect the standard error to a file to see that writing to standard error is a different operation than writing to standard output. For example: $ prog 2> errors.txt will redirect file descriptor 2 (stderr) to the file.

Let’s focus in a little bit on read and write. First, it is notable that the buffer they take as an argument (along with its length) is simply an array of bytes. It can be a string, or it could be the bytes that are part of an encoded video. Put another way, by default, channels are just sequences of bytes. It is up to our program to interpret those bytes properly. Second, we need to understand that the return value for read/write has four main, interesting conditions:
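
Paraphrasing the relevant man pages, the four conditions are: the call transfers the full len bytes; it transfers fewer than len bytes (a “partial” read or write); read returns 0 because the channel has terminated; or the call returns -1 with errno set (notably EINTR, discussed with signals later). A sketch of a read loop that handles them:

#include <errno.h>
#include <unistd.h>

ssize_t
read_all(int fd, char *buf, size_t len)
{
    size_t amnt = 0;

    while (amnt < len) {
        ssize_t ret = read(fd, buf + amnt, len - amnt);

        if (ret == 0) break;              /* the channel terminated */
        if (ret == -1) {
            if (errno == EINTR) continue; /* interrupted by a signal: retry */
            return -1;                    /* an actual error */
        }
        amnt += ret;                      /* a full or partial read */
    }

    return (ssize_t)amnt;
}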

Program output:

more complicated than you'd think...

It is common to have a convention on how channel data is structured. UNIX pipelines encourage channels to be plain text, so that each program can read from their standard input, do some processing that can involve filtering out data or transforming it, and send the result to the standard out. That standard output is sent to the standard input of the next process in the pipeline. An example in which we print out each unique program that is executing on the system:

$ ps aux | tr -s ' ' | cut -d ' ' -f 11 | sort | uniq

None of the programs in the pipeline is specifically designed to print out each unique process, yet we are able to compose them together in a pipeline to accomplish the goal.

4.2.2 Pipes

We’ve seen that the descriptors 0-2 are automatically set up for our processes, and we can use/manipulate them. But how do we create resources? There are many different resources in a UNIX system, but three of the main ones:

  1. files and other file-system objects (e.g. directories),
  2. sockets that are used to communicate over the network, and
  3. communication facilities like pipes that are used to send data and coordinate between processes.

Each of these has very different APIs for creating the resources. We’ll discuss files later, and will now focus on pipes.

A pipe (a UNIX resource, not the | character in a command-line pipeline) is a channel referenced by two different file descriptors13, one that we can write to, and one that we can read from. The sequence of bytes written to the pipe will be correspondingly read out.

The pipe and pipe2 functions create a pipe resource, and return the two file descriptors that reference the readable and writable ends of the pipe. Each pipe can only hold a finite capacity. Thus if a process tries to write more than that capacity into a pipe, it won’t complete that write until data is read from the pipe.
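
A sketch consistent with the output below (an assumption, not the original source):

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int fds[2]; /* fds[0] is the read end, fds[1] the write end */
    char in[16] = "penny is best!";
    char out[16];

    if (pipe(fds) == -1) return 1;

    /* within a single process: write into one end, read from the other */
    if (write(fds[1], in, sizeof(in)) < 0) return 1;
    if (read(fds[0], out, sizeof(out)) < 0) return 1;
    assert(memcmp(in, out, sizeof(in)) == 0);
    printf("Successfully memcpyed through a pipe!\n");

    return 0;
}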

Program output:

Successfully memcpyed through a pipe!

Pipes are really quite a bit more useful when they are used to coordinate between separate processes. Thus, let’s combine them with fork:
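
A sketch consistent with the output below (the message size is a guess):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int
main(void)
{
    int fds[2];
    pid_t pid;

    if (pipe(fds) == -1) exit(EXIT_FAILURE);

    pid = fork();
    if (pid == 0) { /* child: the writer */
        char msg[256];

        close(fds[0]); /* close the read end we don't use */
        memset(msg, 'z', sizeof(msg));
        write(fds[1], msg, sizeof(msg));
        printf("Child sent whole message!\n");
        exit(EXIT_SUCCESS);
    }
    /* parent: the reader */
    char buf[256];
    ssize_t amnt = 0;

    close(fds[1]); /* close the write end we don't use */
    while (amnt < (ssize_t)sizeof(buf)) {
        ssize_t ret = read(fds[0], buf + amnt, sizeof(buf) - amnt);
        if (ret <= 0) break;
        amnt += ret;
    }
    printf("Parent got the message!\n");
    wait(NULL);

    return 0;
}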

Program output:

Parent got the message!
Child sent whole message!

The concurrency of the system enables separate processes to be active at the same time, thus for the write and read to be transferring data through the pipe at the same time. This simplifies our code as we don’t need to worry about sending chunks of our data.

Note that we’re closing the end of the pipe that we aren’t using in the corresponding processes. Though the file descriptors are identical in each process following fork, each process does have a separate set of those descriptors. Thus closing in one doesn’t impact the other. Remember, processes provide isolation!

4.2.3 The Shell

We can start to understand part of how a shell might be implemented now!

Setting up pipes. Let’s start with the more obvious: for each | in a command, the shell will create a new pipe. It is a little less obvious to understand how the standard output for one process is hooked up through a pipe to the standard input of the next process. To do this, the shell does the following procedure:

  1. Create a pipe.
  2. fork the processes (a fork for each process in a pipeline).
  3. In the upstream process close(STDOUT_FILENO), and dup2 the writable file descriptor in the pipe into STDOUT_FILENO.
  4. In the downstream process close(STDIN_FILENO), and dup2 the readable file descriptor in the pipe into STDIN_FILENO.

Due to this careful usage of close to get rid of the old standard in/out, and dup or dup2 to methodically replace it with the pipe, we can see how the shell sets up the processes in a pipeline!

Let’s go over an example of setting up the file descriptors for a child process. This does not set up the pipe-based communication between two children, so it is not sufficient for a shell; but it is well on the way. Pipes contain arbitrary streams of bytes, not just characters. This example will

  1. set up the input and output of two processes to communicate over a pipe, and
  2. send and receive binary data between processes.
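
A sketch consistent with the output below (42 printed in decimal and then hexadecimal; an assumption, not the original source):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int
main(void)
{
    int fds[2];
    pid_t pid;

    if (pipe(fds) == -1) exit(EXIT_FAILURE);

    pid = fork();
    if (pid == 0) {
        /* child: make stdout the write end of the pipe */
        close(STDOUT_FILENO);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]); /* close the now-redundant pipe descriptors */

        int num = 42;
        /* send binary data (not text!) through what stdout now references */
        write(STDOUT_FILENO, &num, sizeof(num));
        exit(EXIT_SUCCESS);
    }
    /* parent: make stdin the read end of the pipe */
    close(STDIN_FILENO);
    dup2(fds[0], STDIN_FILENO);
    close(fds[0]);
    close(fds[1]);

    int num;
    read(STDIN_FILENO, &num, sizeof(num));
    printf("%d + %x\n", num, num);
    wait(NULL);

    return 0;
}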

Program output:

42 + 2a

Closing pipes. Reading from a pipe will return that there is no more data on the pipe (i.e. return 0) only if all write-ends of the pipe are closed. This makes sense because we think of a pipe as a potentially infinite stream of bytes, thus the only way the system can know that there are no more bytes to be read is if the write end of the pipe cannot receive more data, i.e. if it is closed.

This seems simple, in principle, but when implementing a shell, you use dup to make multiple copies of the write file descriptor. In this case, the shell must be very careful to close its own copies because if any write end of a pipe is open, the reader will not realize when there is no more data left. If you implement a shell, and it seems like commands are hanging, and not exiting, this is likely why.

A graphical sequence for the above code. Each image is a snapshot of the system after an operation is performed. The red portions of each figure show what is changed in each operation. The parent creates a pipe, forks a child, which inherits a copy of all of the file descriptors (including the pipe), the parent and child close their stdin and stdout, respectively, the pipe’s descriptors are duped into those now vacant descriptors, and the pipe descriptors are closed. Now all processes are in a state where they can use their stdin/stdout/stderr descriptors as normal (scanf, printf), but the standard output from the child will get piped to the standard input of the parent. Shells do a similar set of operations, but in which a child has their standard output piped to the standard input of another child.

Question. If we wanted to have a parent, shell process set up two children connected by |, what would the final picture (in the image of the pictures above) look like?

4.3 Signals

We’re used to sequential execution in our processes. Each instruction executes after the previous in the “instruction stream”. However, systems also require exceptional execution patterns. What happens when you access memory that doesn’t exist (e.g. *(int *)NULL); when you divide by 0; when a user types cntl-c; when a process terminates; or when you make a call to access a descriptor which somehow is not accessible outside of the call14?

UNIX’s signals provide asynchronous execution in a process. Regardless of the previous instruction a process was executing, when a signal activates, a signal handler function will immediately activate. Let’s see what this looks like:
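
A sketch consistent with the output below (an assumption, not the original source; note that calling printf in a handler is technically non-reentrant, as discussed later in this section):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

void
sig_fault_handler(int signo, siginfo_t *info, void *context)
{
    (void)signo; (void)context;
    /* info->si_addr is the faulting address */
    printf("My downfall is the forbidden fruit at address %p.\n", info->si_addr);
    exit(EXIT_FAILURE); /* returning would simply re-trigger the fault */
}

int
main(void)
{
    sigset_t masked;
    struct sigaction si;

    sigemptyset(&masked);
    sigaddset(&masked, SIGSEGV);
    si = (struct sigaction) {
        .sa_sigaction = sig_fault_handler,
        .sa_mask      = masked,
        .sa_flags     = SA_SIGINFO
    };
    if (sigaction(SIGSEGV, &si, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    printf("Lets live dangerously\n");
    fflush(stdout);
    *(int *)NULL = 42; /* the forbidden fruit: a segmentation fault */

    return 0;
}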

Program output:

Lets live dangerously
My downfall is the forbidden fruit at address (nil).
make[1]: *** [Makefile:30: inline_exec] Error 1

This is kind-of crazy. We can actually execute in the signal handler when we access invalid memory! We can write code to execute in response to a segmentation fault. This is how Java prints out a backtrace – something we will do in C soon. The concepts in the above code:

  • sigset_t is a bitmap with a bit per signal. It is used to program the signal mask: while a specific signal’s handler is executing, which signals can occur during its execution? If a signal’s bit is set to 1, its delivery will wait until the currently executing handler completes. However, if a signal’s bit is set to 0, then signals can nest: you can run a signal handler while you’re in the middle of executing another signal handler. sigemptyset initializes all bits to 0, while sigaddset sets the bit corresponding to a signal number (SIGSEGV in this case).
  • sig_fault_handler is our signal handler. It is the function that is called asynchronously when the corresponding event happens. In the case of SIGSEGV, here, the signal handler is called whenever we access invalid memory (e.g. segmentation fault).
  • sigaction is the function that enables us to set up a signal handler for a specific signal.

There are signals for quite a few events. A list of all of the signals is listed in man 7 signal and glibc has decent documentation. Notable signals include:

  • SIGCHLD - A notification that a child process has terminated, thus is awaiting wait
  • SIGINT - The user typed cntl-c
  • SIGSTOP & SIGCONT - Stop a child executing, and later continue its execution
  • SIGTSTP - The user typed cntl-z (…or other causes)
  • SIGPIPE - write to a pipe with no reader
  • SIGSEGV - Invalid memory access (segmentation fault)
  • SIGTERM & SIGKILL - Terminate a process; SIGTERM can be caught, but SIGKILL is forceful, non-optional termination
  • SIGHUP - The terminal attached to the shell that created us, itself terminated
  • SIGALRM - A notification that time has passed
  • SIGUSR1 & SIGUSR2 - User-defined signal handlers you can use for your own machinations

Each signal has a default behavior that triggers if you do not define a handler. These are to ignore the signal, to terminate the process, to stop the process executing, or to continue its execution.

4.3.1 Tracking Time with Signals

Let’s see ualarm in action:
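
A sketch consistent with the output below (an assumption, not the original source):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

void
alarm_handler(int signo)
{
    /* nothing to do: the signal's arrival itself wakes up pause */
    (void)signo;
}

int
main(void)
{
    struct sigaction si;
    int ms;

    memset(&si, 0, sizeof(si));
    sigemptyset(&si.sa_mask);
    si.sa_handler = alarm_handler;

    if (sigaction(SIGALRM, &si, NULL) == -1) exit(EXIT_FAILURE);

    /* fire a SIGALRM every 1000 microseconds, i.e. every millisecond */
    ualarm(1000, 1000);
    for (ms = 0; ms < 10; ms++) {
        pause(); /* block until the next SIGALRM */
        printf("%d milliseconds passed!\n", ms);
    }

    return 0;
}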

Program output:

0 milliseconds passed!
1 milliseconds passed!
2 milliseconds passed!
3 milliseconds passed!
4 milliseconds passed!
5 milliseconds passed!
6 milliseconds passed!
7 milliseconds passed!
8 milliseconds passed!
9 milliseconds passed!

Question: Track and explain the control flow through this program.

SIGKILL and SIGSTOP are unique in that they cannot be disabled, and handlers for them cannot be defined. They enable non-optional control of a child by the parent.

4.3.2 Parent/Child Coordination with Signals

Here is another example of coordination between parent and child processes. We can use signals to get a notification that a child has exited! Additionally, we can send the SIGTERM signal to terminate a process (this is used to implement the kill command line program – see man 1 kill).

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h> /* kill, pause */
#include <assert.h>
#include <sys/wait.h>

void
sig_handler(int signal_number, siginfo_t *info, void *context)
{
    switch(signal_number) {
    case SIGCHLD: {
        /* see documentation on `siginfo_t` in `man sigaction` */
        printf("%d: Child process %d has exited.\n", getpid(), info->si_pid);
        fflush(stdout);
        break;
    }
    case SIGTERM: {
        printf("%d: We've been asked to terminate. Exit!\n", getpid());
        fflush(stdout);
        exit(EXIT_SUCCESS);
        break;
    }}

    return;
}

void
setup_signal(int signo, void (*fn)(int, siginfo_t *, void *))
{
    sigset_t masked;
    struct sigaction siginfo;
    int ret;

    sigemptyset(&masked);
    sigaddset(&masked, signo);
    siginfo = (struct sigaction) {
        .sa_sigaction = fn,
        .sa_mask      = masked,
        .sa_flags     = SA_RESTART | SA_SIGINFO
    };

    if (sigaction(signo, &siginfo, NULL) == -1) {
        perror("sigaction error");
        exit(EXIT_FAILURE);
    }
}

int
main(void)
{
    pid_t pid;
    int status;

    setup_signal(SIGCHLD, sig_handler);
    setup_signal(SIGTERM, sig_handler);

    /*
     * The signal information is inherited across a fork,
     * and is set the same for the parent and the child.
     */
    pid = fork();
    if (pid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }

    if (pid == 0) {
        pause(); /* stop execution, wake upon signal */

        exit(EXIT_SUCCESS);
    }

    printf("%d: Parent asking child (%d) to terminate\n", getpid(), pid);
    kill(pid, SIGTERM); /* send the child the TERM signal */

    /* Wait for the sigchild notification of child termination! */
    pause();

    /* this should return immediately because we waited for SIGCHLD! */
    assert(pid == wait(&status));
    assert(WIFEXITED(status));

    return 0;
}

Program output:

308644: We've been asked to terminate. Exit!
308643: Parent asking child (308644) to terminate
308643: Child process 308644 has exited.

Note: You want to run this a few times on your system to see the output. The auto-execution scripts of the lectures might cause wonky effects here due to concurrency.

We now see a couple of new features:

  • The SIGCHLD signal is activated in the parent when a child process exits.
  • We can use the kill function to send a signal to another process owned by the same user (e.g. gparmer).
  • The pause call says to stop execution (to pause) until a signal is triggered.
A high-level overview of the control flow between parent and child in this code.

A couple of additional important functions:

  • raise will trigger a signal in the current process (it is effectively a kill(getpid(), ...)).
  • ualarm will set a recurring SIGALRM signal. This can be quite useful if your program needs to keep track of time in some way.

4.3.3 The Dark Side of Signals

Signals are dangerous mechanisms in some situations. It can be difficult to use them properly and to avoid bugs. Signals complicate data-structures, as only functionality that is re-entrant should be used in signal handlers, and they complicate the logic around all system calls as they can interrupt slow system calls.

“Slow” system calls. Many library calls can block. These are calls that cannot necessarily complete immediately as they might require concurrent events to complete. For example, the wait call has to block and wait for the child to exit; and read has to block if there is no data available (yet) on a pipe.

But what happens if a signal is sent to a process while it is blocked? By default, the system call will wake up and return immediately, even though the call’s functionality has not been provided. For example, wait will return even though a child didn’t exit, and read returns despite not reading data.

So how do you tell the difference between the blocking function returning properly, or returning because it was interrupted by a signal? The answer is, of course, in the man pages: the function will return an error (-1), and errno is set to EINTR. Given this, we see the problem with this design: now the programmer must add logic for every single system call that can block to check this condition. Yikes15.

Luckily, UNIX provides a means to disable the interruption of blocking calls by setting the SA_RESTART flag in the sa_flags field of the sigaction struct passed to the sigaction call. Note that the code above already sets this as I consider it a default requirement if you’re setting up signal handlers.

Let’s see the explicit interaction between the slow call wait and the signal handler:

Program output:

Alarm signal handler
Wait call interrupted by signal!
Child has exited.

Re-entrant computations. Signal handlers execute by interrupting the currently executing instructions, regardless of what computations they are performing. Because we don’t really know anything about what was executing when a signal handler started, we have an issue. What if an action in the signal handler in some way conflicts with the action it interrupted? Conceptually, printf will copy data into a buffer before calling write to output it to standard output. What if, half-way through generating the data in printf, we execute a signal handler, and call printf…which happens to overwrite all of the hard work of the interrupted printf! Any function that has these issues is called non-reentrant. Yikes. An example:
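
A sketch consistent with the output below (the handler explicitly clobbers errno to stand in for any library call in a handler that modifies it):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void
sig_handler(int signo)
{
    (void)signo;
    /* stands in for any call in a handler that modifies errno */
    errno = 0;
}

int
main(void)
{
    struct sigaction si;

    memset(&si, 0, sizeof(si));
    si.sa_handler = sig_handler;
    sigaction(SIGUSR1, &si, NULL);

    close(42);      /* fails: sets errno to EBADF */
    raise(SIGUSR1); /* the handler runs, overwriting errno */
    printf("errno should be \"Bad file descriptor\", but has value \"%s\"\n",
           strerror(errno));

    return 0;
}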

Program output:

errno should be "Bad file descriptor", but has value "Success"

The set of functions you can call in a signal handler (i.e. that are re-entrant) are listed in the manual page: man 7 signal-safety. Notably these do not include the likes of

  • printf/snprintf
  • malloc
  • exit (though _exit is fine)
  • functions that set errno!

It is hard to do much in a program of any complexity without snprintf (called by printf), malloc, or errno. A very common modern way to handle this situation is to create a pipe into which the signal handlers write a notification (e.g. the signal number), while the main flow of execution reads from the pipe. This enables the main execution in our programs to handle these notifications. However, this is only really possible and feasible when we get to poll later, and can block waiting for any of a number of file descriptors, including this pipe.

Note that in most cases, you won’t see a bug due to using non-re-entrant functions in your signal handlers. These bugs are heavily non-deterministic, and are dependent on exactly what instructions were interrupted when the signal activated. This may feel like a good thing: buggy behavior is very rare! But the reality is the opposite: rare, non-deterministic bugs become very, very hard to debug and test. Worse, these bugs that pass your testing are more likely to happen in customers’ systems that might have different concurrency patterns. So it is nearly impossible to debug/test for these bugs, and they are more likely to cause problems in your customers’ environments. Again, yikes.

5 I/O: Programming Files and Directories

We’ve seen how to use pipes for communication between processes, and how descriptors are used to interact with system resources. In this module, we’ll dive into how to access the file system and the files and directories on your system!

5.1 Accessing File Contents

The filesystem is one of the core abstractions on UNIX systems and indeed, on modern OSes. Each file is identified by a path through directories. Interestingly, an early goal of UNIX was that “everything is a file”. This is a strange statement as it raises the question “what wouldn’t normally be a file?” Some examples:

  • We know of processes as executable instances of programs so certainly they cannot be files, right? We’ve seen that they are represented by files in /proc/*! These files are used to provide “task monitor” type functionality (showing running processes) and gdb.
  • Devices are the physical parts of our computers like screens, keyboards, USB devices, hard-drives, and networks. They are all represented by files in /dev/*! You can actually cat a file directly to your disk16!
  • Want random values? /dev/random. Want the complete absence of anything? /dev/null17. Want to update your power settings? There are files in /sys/* for that.

When everything is a file, it can all be manipulated and accessed using exactly the same functions and APIs as the normal files in the system. This means that all of the shell programs we’ve seen can be used not just to operate on files, but also on processes or devices!

So what does it look like to access files? The relevant functions include:

  • int open(path, flags) - Open a file, identified by its path, and return a file descriptor that can be used to access that file. The flags must include one of the following: O_RDONLY, O_WRONLY, or O_RDWR – if you want to only read from the file, only write to the file, or have the ability to both read and write to the file. Another interesting flag value is O_CREAT which will create the file if it doesn’t already exist. Whenever you see “flags”, you should think of them as a set of bits, and each of the options as a single bit. Thus, when passing in flags, you can use bitwise operators to pass in multiple options, for example open("penny_is_best.gif", O_RDWR | O_CREAT, 0700) will open the file, creating it if it doesn’t already exist, and enable reading and writing to the file. Note that when you pass in O_CREAT, you should pass in the third argument, the mode. For now, just always pass in 0700!
  • read & write - We’ve seen these before when we saw pipes! They are the generic functions for getting data from, and pushing data to descriptors.
  • close - Remember that we can close any descriptor to release it (the free for descriptors, if you will), including those that reference files.
  • int stat(path, struct stat *info) - Get information about a file identified by path into the info structure. The structure is documented in the man page, but it includes, for example, the file size. fstat enables you to get stat info from an existing file descriptor.
  • creat(path, mode) - Create a new file at the path with the mode (specified identically to open). Later, we’ll learn about mode when we discuss security, for now, you can use a permissive mode of 0700.
  • unlink(path) - Try and remove a file! This is called, for example, by the rm program.
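
A sketch consistent with the output below (reading prufrock.txt is an assumption based on that output):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int
main(void)
{
    int fd;
    struct stat info;
    char tweet[281];
    ssize_t amnt;

    fd = open("prufrock.txt", O_RDONLY);
    if (fd == -1) return 1;

    if (fstat(fd, &info) == -1) return 1;
    printf("We want to say %ld things...but the world doesn't value artistry.\n",
           (long)info.st_size);
    printf("Lets cut down to 280 characters and tweet like it is 1915...\n\n");

    amnt = read(fd, tweet, 280);
    if (amnt == -1) return 1;
    tweet[amnt] = '\0'; /* terminate the bytes we read to print as a string */
    printf("%s\n", tweet);
    close(fd);

    return 0;
}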

Program output:

We want to say 6044 things...but the world doesn't value artistry.
Lets cut down to 280 characters and tweet like it is 1915...

 The Love Song of J. Alfred Prufrock
 T.S. Elliot

Sio credesse che mia risposta fosse
A persona che mai tornasse al mondo,
Questa fiamma staria senza piu scosse.
Ma percioche giammai di questo fondo
Non torno vivo alcun, siodo il vero,
Senza tema dinfamia ti rispondo

It is really important that quite a few interesting functions operate on descriptors, which might reference pipes, files, the terminal, or other resources. This means that processes can operate on descriptors and not care what resources back them. This enables shells to implement pipelines and the redirection of output to files, all without the processes knowing! It enables forked processes to inherit these resources from their parent. In this way, descriptors and the functions to operate on them are a polymorphic foundation (in the sense of object-oriented polymorphism) to accessing the system.

We see something a little strange with files as compared to pipes. Files have a finite size, and the data remains in the file after being read – unlike with pipes where data is removed from the pipe when read. For files, subsequent reads and writes must progress through the contents of the file, but we must also be able to go back and read previously read data. Let’s understand what happens when we read or write from a file – we’ll focus only on read to start, but both behave similarly. Each file descriptor18 tracks an offset into the file at which the next read will access data. Let’s go through an example of a file that contains the alphabet:

abcdefghijklmnopqrstuvwxyz

A freshly opened file returns a file descriptor (fd) where offset = 0.

abcdefghijklmnopqrstuvwxyz
^
|
+-- offset

A read(fd, buf, 10) that returns 10 (saying that 10 bytes or characters, a through j, were successfully read) advances offset by 10.

abcdefghijklmnopqrstuvwxyz
          ^
          |
          +-- offset

Thus, an additional read will start reading from the file at k. The offset, then, enables subsequent reads to iterate through a file’s contents. write uses the offset identically.

There are many cases in which we might want to modify the offset. For example, databases want to jump around a file to find exactly the interesting records, playing audio and video requires us to skip around a file finding the appropriate frames. Thus, we have the function:

  • off_t lseek(int fd, off_t update, int whence) - This function lets you modify the offset of the file associated with fd, by update. The last argument can be set to SEEK_SET to set the fd’s offset to update, SEEK_CUR to add update to the current offset (thus making a relative update), or SEEK_END to set the offset to the file’s end plus update. It returns the resulting offset of the fd.
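
A sketch consistent with the output below (continuing the “tweeting” example; an assumption, not the original source):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("prufrock.txt", O_RDONLY); /* assumed input file */
    char buf[281];
    ssize_t amnt;

    if (fd == -1) return 1;

    printf("What lies beyond the tweet? Tweet 2/N:\n\n");
    /* skip the 280 characters we already "tweeted" */
    if (lseek(fd, 280, SEEK_SET) == -1) return 1;

    amnt = read(fd, buf, 280);
    if (amnt == -1) return 1;
    buf[amnt] = '\0';
    printf("%s\n", buf);
    close(fd);

    return 0;
}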

Program output:

What lies beyond the tweet? Tweet 2/N:

.

Let us go then, you and I,
When the evening is spread out against the sky
Like a patient etherized upon a table;
Let us go, through certain half-deserted streets,
The muttering retreats
Of restless nights in one-night cheap hotels
And sawdust restaurants with oyster-shells:
St

5.2 Mapping Files

It is very common to access files with read and write. However, they require many interactions with the file through subsequent calls to these functions. Thus, UNIX provides a second means to access files: directly “map” a file into a process, enabling it to appear directly in memory, and thus be accessible using normal memory accesses. This is cool because your changes in memory update the file directly!

  • void *mmap(addr, len, prot, flags, fd, offset) - Wow, that’s a lot of arguments. Believe it or not, it was the call in Linux that had the most arguments for a long time! Let’s go through these:

    • fd - this is the file descriptor we want to map into memory.
    • addr - this is the address in memory at which you want to map. Normally, this is NULL, which means “please map it wherever you want”. The return value is the address at which the file is mapped, or MAP_FAILED ((void *)-1) if there was an error (see the man page and errno values).
    • len - How much of the file do we want to map in? This is where stat is quite useful to find the entire file’s size if you want to map it all in.
    • offset - If we aren’t mapping the entire file, we can start mapping at this offset.
    • prot - This enables you to choose some properties for the mapping. prot values include PROT_READ, and PROT_WRITE which ask the system to map the memory in readable, or writable (you can choose both with PROT_READ | PROT_WRITE). Note that the type of accesses here must not request accesses not previously requested in open – for example, if we passed O_RDONLY in to open, mmap will return an error if we ask for PROT_WRITE.
    • flags - For now, lets focus on using only MAP_PRIVATE.
  • munmap - Unmap a previously mapped file.
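
A sketch consistent with the output below (an assumption, not the original source):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("prufrock.txt", O_RDONLY); /* assumed input file */
    struct stat info;
    char *addr;

    if (fd == -1) return 1;
    if (fstat(fd, &info) == -1) return 1;

    /* map the entire file, read-only, into our address space */
    addr = mmap(NULL, info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) return 1;

    /* the file's bytes are now just memory we can read */
    write(STDOUT_FILENO, addr, info.st_size);
    munmap(addr, info.st_size);
    close(fd);

    return 0;
}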

Program output:

 The Love Song of J. Alfred Prufrock
 T.S. Elliot

Sio credesse che mia risposta fosse
A persona che mai tornasse al mondo,
Questa fiamma staria senza piu scosse.
Ma percioche giammai di questo fondo
Non torno vivo alcun, siodo il vero,
Senza tema dinfamia ti rispondo.

Let us go then, you and I,
When the evening is spread out against the sky
Like a patient etherized upon a table;
Let us go, through certain half-deserted streets,
The muttering retreats
Of restless nights in one-night cheap hotels
And sawdust restaurants with oyster-shells:
St

Note that you cannot mmap a pipe or terminal, only files. This kinda makes sense: since a pipe holds transient data that is only there until read out of the pipe, it doesn’t make sense for it to be at a fixed address in memory – it would require an infinite amount of memory as more and more data is written into the pipe!

Memory allocation with mmap. Interestingly, mmap can also be used to allocate memory. Specifically, if we pass in fd = -1, prot = PROT_READ | PROT_WRITE, and flags = MAP_ANONYMOUS | MAP_PRIVATE, mmap will simply return newly allocated memory instead of a file! Somewhere deep in malloc’s implementation, it calls mmap to allocate the memory it eventually returns to the application!

5.3 Stream-based I/O

There is actually a third way to access files beyond “raw” read/write and mmap – streams! Streams are an abstraction on top of the “raw” descriptor accesses using read and write.

Streams are identified by the type FILE *, and all the APIs for operating on streams take FILEs instead of file descriptors. In most ways they behave very similarly to normal descriptor accesses – you can fopen files to open a stream to a file, and fread and fwrite to read data from, and write data to, the file. As with a process’ descriptors, we have default streams:

  • stdout, stdin, stderr which correspond to STDOUT_FILENO, STDIN_FILENO, and STDERR_FILENO, respectively.

Some of the stream functions:

  • FILE *fopen(path, mode) - similar to open, but returning a stream. The access rights we’re asking for (reading or writing to the file), are specified in a more intuitive mode string. “r” to read, “r+” for read/write, etc…
  • fclose - similar to close, but for FILE *s.
  • fread/fwrite - Stream equivalents of read/write. Additionally, streams provide the feof function that tells us if we are at the end of a file.
  • printf/fprintf - Wait, what? Yes, printf is actually used to send our string to the stdout stream. fprintf enables us to write the string we’re creating out to a specific stream, not just the stdout!
  • scanf/fscanf - Similar, scanf inputs from the stdin, and fscanf inputs from a specific stream.
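
A sketch of stream-based reading (the original example appears to limit how much it prints; this sketch just echoes the file):

#include <stdio.h>

int
main(void)
{
    FILE *f = fopen("prufrock.txt", "r"); /* assumed input file */
    char buf[256];
    size_t amnt;

    if (f == NULL) return 1;

    /* read chunks until feof reports the end of the file */
    while (!feof(f)) {
        amnt = fread(buf, 1, sizeof(buf), f);
        fwrite(buf, 1, amnt, stdout);
    }
    fclose(f);

    return 0;
}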

Program output:

 The Love Song of J. Alfred Prufrock
 T.S. Elliot

Sio credesse che mia risposta fosse
A persona che mai tornasse al mondo,
Questa fiamma staria senza piu scosse.
Ma percioche giammai di questo fondo
Non torno vivo alcun, siodo il vero,
Senza tema dinfamia ti rispondo.

Let us go then, you and I,
When the evening is spread out against the sky
Like a patient etherized upon a table;
Let us go, through certain half-deserted streets,
The muttering retreats
Of restless nights in one-night cheap hotels
And sawdust restaurants with oyster-shells:
St

Streams as buffered I/O. So it doesn’t seem like we’re getting much of use out of these streams compared to raw descriptors. Streams are an optimization over read/write as they buffer data. Buffering essentially means that your fwrites actually write data into a “buffer” in memory, and only when either you write out a \n or when the buffer gets large enough does a write get called to send the data to the system. Why do this? By buffering data in memory rather than making a write for each fwrite, we’re saving any overhead associated with each write (to be discussed in a couple of weeks). However, this means that just because we called printf, doesn’t mean it will output immediately! Yikes. Let’s see this.

Program output:

_exit exits from the process immediately, and without invoking any atexit functions. Flushing the stdout stream is done in one such function, thus we don’t see the output!

Program output:

hello world
hello world

Well that’s not good. The buffered data is copied into each process, then it is output when a \n is encountered.

Program output:

world hello 

Both printf and the write are to standard output, but printf is to a stream. The buffered printf output is not written out immediately, thus ordering can get messed up. Yikes.

Last example of the complexities of streams. What happens if you get a segfault!?

Program output:

make[1]: *** [Makefile:30: inline_exec] Segmentation fault (core dumped)

Even though we fault after the printf, we don’t see the printf’s output! We now know why: the printf wrote into a buffer, and didn’t yet write to the system. Thus the segmentation fault, which terminates the process, happens before the buffer is actually written!

Flushing the stream buffers. It is imperative that streams give us some means to force the buffer to be output! Thus, streams provide a means of flushing the stream’s buffers, and sending them out to the system (e.g. using write).

  • fflush - flush out the buffer associated with a specific stream.

Fixing the previous examples:
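
For instance, a sketch of the ordering fix:

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    printf("hello world \n");
    fflush(stdout); /* force the stream's buffered bytes out via write */
    write(STDOUT_FILENO, "hello world\n", 12);

    return 0;
}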

Program output:

hello world 
hello world

Program output:

hello world

A few other functions for streams that can be useful:

  • fileno - Get the file descriptor number associated with a stream. This can be useful if you need to use an API that takes a descriptor, rather than a FILE * stream.
  • getline - Read out a line (delimited by a \n) from a stream. Since reading input a line at a time is a pretty common thing, this is a useful function.
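
A sketch of getline usage (printing every Nth line is a guess at how the “abridged poetry” below was produced):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    FILE *f = fopen("prufrock.txt", "r"); /* assumed input file */
    char *line = NULL;
    size_t len = 0;
    int i = 0;

    if (f == NULL) return 1;

    printf("The new genre: abridged, tweet-worthy poetry...\n\n");
    /* getline allocates (and resizes) the line buffer for us */
    while (getline(&line, &len, f) != -1) {
        if (i % 16 == 0) printf("%s", line);
        i++;
    }
    free(line); /* we're responsible for freeing getline's buffer */
    fclose(f);

    return 0;
}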

Program output:

The new genre: abridged, tweet-worthy poetry...

 The Love Song of J. Alfred Prufrock
Of restless nights in one-night cheap hotels
Let fall upon its back the soot that falls from chimneys,
And for a hundred visions and revisions,
Disturb the universe?
Then how should I begin
Of lonely men in shirt-sleeves, leaning out of windows? ...
And I have seen the eternal Footman hold my coat, and snicker,

Am an attendant lord, one that will do

5.4 Accessing Directories

Files aren’t the only thing we care to access in the system – what about directories! We want to be able to create directories, delete them, and be able to read their contents. These core operations are what underlies the implementation of the ls program, or, similarly, what happens when you double-click on a directory in a graphical file explorer!

  • opendir, closedir - Open (and close) a directory, and return a directory stream, a DIR *, to it, which is backed by a descriptor. Yes, another descriptor!
  • readdir - This takes a DIR * to a directory, and returns a structure that includes the “current” file or directory in the directory. This includes the name of the returned object in d_name, and type (in the d_type field) which is one of:

    • DT_BLK - This is a block device.
    • DT_CHR - This is a character device.
    • DT_DIR - This is a directory.
    • DT_FIFO - This is a named pipe (FIFO).
    • DT_LNK - This is a symbolic link.
    • DT_REG - This is a regular file.
    • DT_SOCK - This is a UNIX domain socket.
    • DT_UNKNOWN - The file type could not be determined.
    The two main ones we care about are directories and files. Subsequent calls to readdir will return the “next” file or directory.
  • scandir & glob - Two functions that enable you to get directory contents based on some simple search patterns and logic. scandir lets you pass in a directory stream, and a couple of functions that are used to sort directory contents output, and filter it (enabling you to say that you don’t want to look at some directory contents). glob, on the other hand, lets you search for “wildcards” designated by * to select all files/directories that match the string.

Program output:

 Directory: .
 Directory: test_directory
      File: 02_exercises.md (size: 2978)
      File: prufrock.txt (size: 6044)
      File: 01_lecture.md (size: 21580)
 Directory: ..

To make changes in the file system hierarchy, we need an additional set of functions.

  • mkdir(path, mode) - This function is used in, for example, the mkdir program. Don’t confuse the program you use in the shell, with the function you can call from C.
  • rmdir(path) - Deletes a directory that is empty. This means that you have to unlink and rmdir the directory’s contents first.
  • rename(from, to) - Change the name of a file or directory from from to to. You can imagine this is used by the mv command line program.
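
A sketch consistent with the output below (the directory and file names are made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int
main(void)
{
    int fd;

    if (mkdir("testdir", 0700) == -1) return 1;               /* 1 */
    fd = creat("testdir/testfile", 0700);                     /* 2 */
    if (fd == -1) return 1;
    close(fd);
    if (rename("testdir", "testdir_renamed") == -1) return 1; /* 3 */
    if (unlink("testdir_renamed/testfile") == -1) return 1;   /* 4 */
    if (rmdir("testdir_renamed") == -1) return 1;             /* 5 */

    printf("If there were no errors, we\n"
           "    1. created a directory, \n"
           "    2. a file in the directory, \n"
           "    3. change the directory name,\n"
           "    4. removed the file, and\n"
           "    5. removed the directory\n");

    return 0;
}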

Program output:

If there were no errors, we
    1. created a directory, 
    2. a file in the directory, 
    3. change the directory name,
    4. removed the file, and
    5. removed the directory

5.5 Exercises

Write a tree clone, with disk usage information. Though it isn’t installed by default, tree outputs the filesystem at a specific directory:

$ tree .
.
├── 01_lecture.md
├── 02_exercises.md
├── prufrock.txt
└── test_directory
    └── crickets.txt

5.5.1 Task 1: Markdown Tree

We’ll simplify the output to print out markdown lists with either D or F for directories or files. If we run the program in the 05 directory:

$ ./t
- 01_lecture.md (20759)
- 02_exercises.md (1683)
- prufrock.txt (6044)
- test_directory
    - crickets.txt (26)

Some starter code that focuses on printing out info for a single directory.

Task: Change this code to print out the hierarchy of files and directories, rather than just the contents of a single directory.

#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* helper functions */
void indent(int n);
int ignoredir(char *dir);
long file_size(char *dir, char *file);

/*
 * Return the size of the directory (sum of all files inside).
 * The easiest implementation here is recursive where this is called
 * for each directory.
 */
size_t
print_dir(char *dir, int depth)
{
    DIR *d = opendir(dir);
    struct dirent *entry;

    assert(d);
    errno = 0;

    /* Go through each dir/file in this directory! */
    while ((entry = readdir(d)) != NULL) {
        /* we'll ignore . and .. */
        if (ignoredir(entry->d_name)) continue;

        /* print out the proper indentation */
        indent(depth);
        if (entry->d_type == DT_DIR) {
            printf("- D %s\n", entry->d_name);
        } else if (entry->d_type == DT_REG) {
            printf("- F %s (%ld)\n", entry->d_name, file_size(dir, entry->d_name));
        }
        /* we'll ignore everything that isn't a file or dir */
    }
    if (errno != 0) {
        perror("Reading directory");
        exit(EXIT_FAILURE);
    }
    closedir(d);

    return 0;
}

int
main(void)
{
    print_dir("./", 0);

    return 0;
}

/* Indent `n` levels */
void
indent(int n)
{
    for (int i = 0; i < n; i++) printf("    ");
}

/* Should we ignore this directory? */
int
ignoredir(char *dir)
{
    return strcmp(dir, ".") == 0 || strcmp(dir, "..") == 0;
}

long
file_size(char *dir, char *file)
{
    struct stat finfo;
    char buf[512];
    int ret;

    memset(buf, 0, 512); /* zero out the buffer to add '\0's */
    snprintf(buf, 512, "%s/%s", dir, file);

    ret = stat(buf, &finfo);
    assert(ret == 0);

    return finfo.st_size;
}

Program output:

- D 14
- D 02
- D 03
- D tools
- D 05
- D 01
- D 09
- D figures
- D 10
- F inline_exec_tmp (60152)
- D 13
- D 11
- D 08
- D 07
- D templates
- F lectures.html (732776)
- F README.md (35)
- F title.md (320)
- F output_tmp.dat (0)
- F LICENSE (1522)
- D slides
- F Makefile (1908)
- D 99
- D 12
- D 04
- D 06
- F inline_exec_tmp.c (1658)
- F aggregate.md (162984)
- F .gitignore (520)
- D .git

5.5.2 Task 2: File and Directory Sizes

Task: Add output for the size of files, and the size of each directory – the sum of the size of all contained directories and files.

Again, if we run the program in the 05 directory:

$ ./t
- 01_lecture.md (20759)
- 02_exercises.md (1683)
- prufrock.txt (6044)
- test_directory
    - crickets.txt (26)
    (26)
(28512)

The test_directory size is (26), while the ./ directory has size (28512). Your specific sizes might differ.

6 I/O: Inter-Process Communication

We’ve seen that processes often coordinate through pipes, signals, and wait. This is necessary when coordinating between parents and children in shells (both on the terminal, and in GUIs). These are useful when children have a common parent (thus can share pipes), and where the parent knows that the children want to communicate before they are execed. Modern systems require a lot more flexibility than this. Generally, we want a process to be able to decide who to coordinate with as it executes, given its own goals. For example, the shell – when it executes a child – doesn’t know if that child wants to have a GUI, and thus needs to talk to the window server.

We require more dynamic, more flexible forms of coordination between processes. Generally, we call this coordination Inter-Process Communication or IPC. When do we generally want more flexible forms of IPC than pipes, signals, and wait?

  1. If the parent doesn’t know which other processes a child will want IPC with.
  2. If the parent didn’t create the other process with which a child wants IPC.
  3. The process with which we want IPC belongs to a different user.
  4. The process with which we want IPC has access to special resources in the system (e.g. graphics cards, other hardware, or filesystem data).

Even applications that are implemented as many processes often require IPC between those processes.

6.1 IPC Example: Modern Browsers

Browsers all require a lot of IPC. Each tab is a separate process, and there is a single browser management process that manages the GUI and other browser resources. This is motivated by security concerns: a tab that fails or is compromised by an attacker cannot access the data of another tab! In the early years (the first 15-20 years) of browsers, this wasn’t the case, and you had to be careful about opening your banking webpage next to webpages that were suspect. The browser management process communicates with each of the child tab processes to send them user input (mouse clicks, keyboard input), and receive a bitmap to display on screen. Let’s see this structure:

$ ps aux | grep firefox
... 8029 ... /usr/lib/firefox/firefox -new-window
... 8151 ... /usr/lib/firefox/firefox -contentproc -childID 1 -isForBrowser ... 8029 ...
...

I have an embarrassing number of tabs open, so I’ve taken the liberty of manually filtering this output. We can see that the process with pid 8029 is the parent, browser management process, and the process 8151 is a child managing a tab (note, you can see processes arranged as a process tree using ps axjf). There are many other child processes managing tabs.

Let’s try to figure out how they communicate with each other. First, let’s look at the file descriptors of a tab’s process (the child).

$ ls -l /proc/8151/fd/
0 -> /dev/null
1 -> socket:[100774]
...
11 -> /memfd:mozilla-ipc
4 -> socket:[107464]
5 -> socket:[107467]
7 -> pipe:[110438]

We see that there is no standard input (/dev/null is “nothing”), and that there are a few means of IPC:

  • pipe - We know these!
  • memfd - A Linux-specific way of sharing a range of memory between the processes. Stores to the memory can be seen by all sharing processes.
  • socket - Another pipe-like channel in which bytes written into the socket can be read out on the other side. A key difference is that socket connections between processes can be created without requiring the inheritance of descriptors via fork.

Let’s look at the browser management process and see what’s going on. Notably, we want to see how the tab processes communicate with it. Thus, let’s look at the descriptors for the parent:

$ ls -l /proc/8029/fd/ | awk '{print $9 $10 $11}'
110 -> /home/ycombinator/.mozilla/.../https+++mail.google.com...data.sqlite
171 -> /home/ycombinator/.mozilla/.../https+++www.youtube.com....data.sqlite
123 -> /home/ycombinator/.mozilla/.../formhistory.sqlite
130 -> /home/ycombinator/.mozilla/.../bookmarks.sqlite
58 -> /home/ycombinator/.mozilla/.../cookies.sqlite
3 -> /dev/dri/renderD128
43 -> /dev/dri/card0
...
100 -> /memfd:mozilla-ipc
60 -> socket:[107464]
64 -> socket:[107467]
82 -> pipe:[110438]

This is heavily filtered. The first, highlighted section shows that the browser management process uses a database (sqlite) to access a webpage’s data (see gmail and youtube), store form histories, bookmarks, and cookies. It also uses the dri device, which is part of the Direct Rendering Infrastructure that enables it to communicate with the X Window Server and display what we see on screen. Finally, we see it uses the same means of IPC as the client, and if we carefully look at the [ids] of the sockets and pipes, we can see that they match those in the child tab process! The tab has multiple IPC channels open with the parent, and can access them with file descriptors.

6.2 Goals of IPC Mechanisms

There are multiple IPC mechanisms in the system, and they all represent different trade-offs. They might be good for some things, and bad at others.

  • Transparent IPC. Do applications need to be changed at all to perform IPC? If not, they are transparently leveraging IPC. Shell-driven pipes have a huge benefit that they enable transparent IPC!
  • Named IPC. If we want multiple processes to communicate that can’t rely on a common parent to coordinate that communication, they need a means to find and “name” the IPC mechanism. Named IPC is in conflict with transparent IPC, as we likely need to use a specific API to access the IPC mechanism.
  • Channel-based IPC. Often we want to send messages between processes such that a sent message, when received, is removed from the channel – once read, it won’t be read again. This enables processes to receive some request, do some processing on it, then move on to the next request.
  • Multi-client communication. We often want to create a process that can provide services to multiple other “client” processes. Clients request service, and the “server” process receives these requests, and provides replies.

Let’s assess pipes in this taxonomy:

  • Transparent IPC. - Pipes are pervasive for a reason! They enable composition of programs via pipelines despite the programs not even knowing they are in the pipeline!
  • Named IPC. - Nope. The parent has to set it all up – lots of dup, close, and pipe.
  • Channel-based IPC. - Data written to a pipe is removed by the reader.
  • Multi-client communication. - Pipes are really there to pass a stream of data from one process to the other.

6.3 Files & Shared Memory

If the goal is to send data from one process to another, one option is found in the filesystem (FS): can’t we just use files to share? Toward this, we saw in the section on FS I/O that we can open a file, and read and write (or fread/fwrite) from it from multiple processes. This will certainly get data from one process to the other. However, it has a number of shortcomings:

  1. If you want to pass a potentially infinite stream of data between processes (think pipes), then we’d end up with an infinite file. That translates to your disk running out of space very quickly, with little benefit. Take-away: files are for a finite amount of data, not for an infinite stream of data.
  2. Given this, processes must assume that the file has a specific format, and they have no indication of whether the other process has accessed/read data in the file. An example of a specific format: maybe the file is treated as a single string of a constant length.
  3. There isn’t any indication about when other processes have modified the file, thus the processes might overwrite each other’s data19.

To emphasize these problems, let’s try to implement a channel in a file to send data between processes. We just want to send a simple string repeatedly from one process to the other.
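
The program is omitted here. A minimal sketch of such an attempt follows; the file name channel_file, the message, and the use of pread (which reads at a given offset without consuming anything) are assumptions of this sketch. Note that nothing synchronizes the reader and the writer:

#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int
main(void)
{
    char data[32];
    int fd = open("channel_file", O_CREAT | O_RDWR | O_TRUNC, 0644);

    if (fd < 0) exit(EXIT_FAILURE);
    if (fork() == 0) { /* child: the "sender" writes a message into the file */
        write(fd, "penny!", 7);
        exit(EXIT_SUCCESS);
    }
    /* parent: the "receiver" reads the "channel" twice; nothing removes read data,
     * and nothing guarantees the child has written yet! */
    memset(data, 0, sizeof(data));
    pread(fd, data, sizeof(data) - 1, 0);
    printf("msg1: %s\n", data);
    memset(data, 0, sizeof(data));
    pread(fd, data, sizeof(data) - 1, 0);
    printf("msg2: %s\n", data);

    wait(NULL);
    close(fd);
    unlink("channel_file");

    return 0;
}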

Program output:

msg1: 
msg2: 

You can see that there are some problems here. If we run it many times, we can see that sometimes we don’t see any messages, sometimes we only see the first, sometimes we only see the second, and other times some combination of these. Thus, files are not useful for channel-based communication between multiple processes.

On the other hand, files have a very useful property that we’d like to use in a good solution to IPC: they have a location in the FS that the communicating processes can both use to find the file, thus avoiding the need for a shared parent.

  • Transparent IPC. - We have to open the file explicitly.
  • Named IPC. - Each file has a path in the filesystem.
  • Channel-based IPC. - When we read out of a file, the data remains, making it hard to know what we’ve read, and what is yet to be written. A stream of messages placed into a file will also exhaust your disk!
  • Multi-client communication. - It isn’t clear how multiple clients can convey separate requests, and can receive separate replies from a server.

An aside: you can use mmap to map a file into the address space, and if you map it MAP_SHARED (instead of MAP_PRIVATE), then the memory will be shared between processes. When one process does a store to the memory, the other process will see that store immediately! We can use this to pass data between processes, but we still have many of the problems above. How do we avoid conflicting modifications to the memory, get notifications that modifications have been made, and make sure that the data is formatted in an organized (finite) manner?
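
A minimal sketch of this MAP_SHARED behavior, assuming a scratch file named shared_data (here the wait provides the synchronization that, in general, we’d be missing):

#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int
main(void)
{
    int fd = open("shared_data", O_CREAT | O_RDWR, 0644);
    char *mem;

    assert(fd >= 0);
    assert(ftruncate(fd, 4096) == 0); /* the file needs a size before we map it */
    mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(mem != MAP_FAILED);

    if (fork() == 0) {
        strcpy(mem, "hello from the child!"); /* a store to the shared memory... */
        exit(EXIT_SUCCESS);
    }
    wait(NULL); /* ...that the parent sees after the child runs */
    printf("parent reads: %s\n", mem);

    munmap(mem, 4096);
    close(fd);
    unlink("shared_data");

    return 0;
}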

6.4 Named Pipes

The core problem with files is that they aren’t channels that remove existing data when it is “consumed” (read). But they have the significant up-side that they have a “name” in the filesystem that multiple otherwise independent processes can use to access the communication medium.

Named pipes or FIFOs are like pipes in that one process can write into the FIFO, and another process can read from it. Thus, unlike files, they have the desirable property of channels in which data read from the channel is consumed. However, like files (and unlike pipes) they have a “name” in the filesystem – FIFOs appear in the filesystem alongside files. The stat function will let you know that a file is a FIFO if the st_mode is S_IFIFO. Since these pipes appear in the filesystem, they are called named pipes. Two processes that wish to communicate need only both know where in the filesystem they agree to find the named pipe.

Let’s see an example of using named pipes:
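
The program is omitted here. A minimal sketch that could produce output like the below, assuming a FIFO named office_hours:

#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int
main(void)
{
    char msg[32];
    int fd, ret;

    assert(mkfifo("office_hours", 0777) == 0);

    if (fork() == 0) { /* child: a student writes into the fifo... */
        fd = open("office_hours", O_WRONLY);
        assert(fd >= 0);
        write(fd, "halp!", 6);
        close(fd);
        exit(EXIT_SUCCESS);
    }
    /* ...while the parent (the instructor) reads from it */
    fd = open("office_hours", O_RDONLY);
    assert(fd >= 0);
    ret = read(fd, msg, sizeof(msg) - 1);
    assert(ret >= 0);
    msg[ret] = '\0';
    printf("What I hear at office hours is \"%s\".\n", msg);

    close(fd);
    wait(NULL);
    unlink("office_hours");

    return 0;
}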

Program output:

What I hear at office hours is "halp!".

There is one very awkward named pipe behavior. A process attempting to open a named pipe will block (the open will not return) until the FIFO has been opened separately as both readable and writable. This is awkward because how would a process know when another process might want to communicate with it? If a process gets IPC requests from many others (think the browser manager), then it doesn’t want to block awaiting a new communication; it wants to service the requests of other processes.
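
As a small aside, one way to side-step the blocking open is the O_NONBLOCK flag: opening a FIFO for reading with O_NONBLOCK returns immediately, even when no writer is present. A minimal sketch (the FIFO name is made up):

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(void)
{
    int fd;

    mkfifo("demo_fifo", 0777);
    /* without O_NONBLOCK, this open would block until a writer appeared */
    fd = open("demo_fifo", O_RDONLY | O_NONBLOCK);
    printf("open returned descriptor %d without blocking\n", fd);

    close(fd);
    unlink("demo_fifo");

    return 0;
}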

Regardless, we do see named pipes as a cool option: named pipes enable us to use the filesystem to identify the pipe used for IPC. This enables communicating processes without shared parents to leverage IPC, and enables pipes to live up to the UNIX motto: everything is a file.

6.4.1 Challenges with Named Pipes for Multi-Client Communication

Let’s check out an example that demonstrates how using named pipes for communication between a single process and multiple clients has a number of challenges.

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <assert.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h> /* for kill() and SIGTERM */

/*
 * Receive requests, and send them immediately as a response.
 * You can imagine that interesting computation could be done
 * to formulate a response.
 */
void
instructor(void)
{
    int req, resp, ret;
    char msg[32];

    req = open("requests", O_RDONLY);
    assert(req != -1);
    resp = open("responses", O_WRONLY);
    assert(resp != -1);

    while (1) {
        ret = read(req, msg, 32);
        if (ret == 0) break;
        assert(ret != -1);
        ret = write(resp, msg, ret);
        if (ret == 0) break;
        assert(ret != -1);
    }

    close(req);
    close(resp);
}

/*
 * Make a "request" with our pid, and get a response,
 * hopefully also our pid.
 */
void
student(void)
{
    int req, resp, ret;
    char msg[32];

    req = open("requests", O_WRONLY);
    assert(req != -1);
    resp = open("responses", O_RDONLY);
    assert(resp != -1);

    ret = snprintf(msg, 32, "%d", getpid());
    ret = write(req, msg, ret);
    assert(ret != -1);
    ret = read(resp, msg, 32);
    assert(ret != -1);

    printf("%d: %s\n", getpid(), msg);

    close(req);
    close(resp);
}

void
close_fifos(void)
{
    unlink("requests");
    unlink("responses");
}

int
main(void)
{
    int ret, i;
    pid_t pids[3];

    /* clients write to this, server reads */
    ret = mkfifo("requests", 0777);
    assert(ret == 0);
    /* server sends replies to this, clients read */
    ret = mkfifo("responses", 0777);
    assert(ret == 0);

    /* create 1 instructor that is lecturing */
    if ((pids[0] = fork()) == 0) {
        instructor();
        return 0;
    }
    /* Create 2 students "listening" to the lecture */
    for (i = 0; i < 2; i++) {
        if ((pids[i + 1] = fork()) == 0) {
            student();
            return 0;
        }
    }

    atexit(close_fifos);
    sleep(1);
    for (i = 0; i < 3; i++) kill(pids[i], SIGTERM);
    while (wait(NULL) != -1);

    return 0;
}

Program output:

308994: 308994
308995: 308995

If executed many times, you see the expected result:

167907: 167907
167908: 167908

…but also strange results:

167941: 167940167941

If this is executed many times, we see a few properties of named pipes.

  1. There isn’t really a way to predict which student sees which data – this is determined by the concurrency of the system.
  2. Sometimes a student is “starved” of data completely.
  3. The instructor doesn’t have any control over which student should receive which data.
Using two named pipes, one per communication direction, to communicate between a server (right) and clients (left). The color of the arrows designates the client for whom the communication is relevant. Note that on the way back to the clients, we “lose” track of which client each reply is for. This simply isn’t tracked with named pipes.

Named pipes summary. These solve an important problem: how can we have multiple processes find a “pipe” to use for communication even if they don’t have a parent to coordinate that communication? They use a filesystem path/name to identify the pipe. However, they are not suitable for a single process (a server) to communicate with multiple clients, as they don’t enable the communication with each client to be separated in the server.

  • Transparent IPC. - We have to use the mkfifo API explicitly.
  • Named IPC. - The named pipe is represented as a file in the filesystem.
  • Channel-based IPC. - Read data is removed from the pipe.
  • Multi-client communication. - A server cannot tell the difference between clients, nor can it send responses to specific clients.

6.5 UNIX Domain Sockets

Sockets are the mechanism provided by UNIX to communicate over the network! However, they can also be used to communicate between processes on your system through UNIX domain sockets.

A few key concepts for domain sockets:

  1. They are presented in the filesystem as a file, similar to named pipes. This enables them to be accessed by multiple communicating processes that don’t necessarily share a parent to coordinate that communication.
  2. Each new client that attempts to connect to a server will be represented as a separate descriptor in the server, thus enabling the server to separate its communication with one client from the others. The goal is to create a descriptor for each pair of communicating processes.
  3. Each descriptor to a domain socket is bi-directional – it can be read and written to, thus making communication back and forth quite a bit easier.

6.5.1 Domain sockets for Multi-Client Communication

Let’s look at an example where we want a server to receive a client’s requests as strings, and to reply with those same strings. This isn’t useful, per se, but demonstrates how this communication can happen. Notably, we want to enable the server to communicate with different clients!

  • A server receives IPC requests from clients. Note that this is similar to how a server on the internet serves webpages to multiple clients (and, indeed, the code is similar!).
  • The server’s functions include socket, bind, and listen. socket creates a domain socket file descriptor, bind associates that descriptor with a path in the filesystem, and listen tells the system that we intend to accept client connections on it.
  • The server creates a separate descriptor for each client using accept.
  • The client’s functions include socket and connect.

Most of these functions are complex and have tons of options. They have been distilled into the following functions in 06/domain_sockets.h:

  • int domain_socket_server_create(const char *file_name) - Create a descriptor to the “server” end of the IPC channel identified by file_name in the filesystem, similar to the named pipes before.
  • int domain_socket_client_create(const char *file_name) - Create a descriptor to the “client” end of the IPC channel identified by a file name. One constraint is that the server must create the domain socket first, or else this call (the connect) will fail.

The server’s descriptor is not meant to be used for direct communication (i.e. it should not be used for read and write). Instead, it is used to create new descriptors, one per client! With a descriptor per client, we have the fortunate ability to communicate explicitly with each client, without the named-pipe problem of messages getting mixed together.

A sequence of using domain sockets to communicate between multiple clients and a server. (a) The single domain socket that clients can attempt to connect to. (b) The server using accept to create a new descriptor and channel for communication with the red client (thus the “red” channel). (c) Subsequent communication with that client is explicitly through that channel, so the server can send specifically to that client. (d) The server creates a separate channel for communication with the blue client.

6.5.1.1 Setting up Domain Sockets

Both functions take a single argument – the domain socket’s name/path in the filesystem – and return a descriptor to the socket. For the most part, you can just use these functions in your code directly by using 06/domain_sockets.h.

#include "06/domain_sockets.h"

#include <errno.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <stdlib.h> /* for free(), exit(), and on_exit() */

void
unlink_domain_socket(int status, void *filename)
{
    unlink(filename);
    free(filename);
}

#define MAX_BUF_SZ 128

void
server(int num_clients, char *filename)
{
    char buf[MAX_BUF_SZ];
    int new_client, amnt, i, socket_desc;

    socket_desc = domain_socket_server_create(filename);
    if (socket_desc < 0) exit(EXIT_FAILURE); /* should do proper cleanup */
    on_exit(unlink_domain_socket, strdup(filename));

    /*
     * Service `num_clients` number of clients, one at a time.
     * For many servers, this might be an infinite loop.
     */
    for (i = 0; i < num_clients; i++) {
        /*
         * We use this new descriptor to communicate with *this* client.
         * This is the key function that enables us to create per-client
         * descriptors. It only returns when a client is ready to communicate.
         */
        if ((new_client = accept(socket_desc, NULL, NULL)) == -1) exit(EXIT_FAILURE);
        printf("Server: New client connected with new file descriptor %d.\n", new_client);
        fflush(stdout);

        amnt = read(new_client, buf, MAX_BUF_SZ - 1);
        if (amnt == -1) exit(EXIT_FAILURE);
        buf[amnt] = '\0'; /* ensure null termination of the string */
        printf("Server received message (sz %d): \"%s\". Replying!\n", amnt, buf);
        fflush(stdout);

        /* send the client a reply */
        if (write(new_client, buf, amnt) < 0) exit(EXIT_FAILURE);
        /* Done with communication with this client */
        close(new_client);
    }
    close(socket_desc);

    exit(EXIT_SUCCESS);
}

void
client(char *filename)
{
    char msg[MAX_BUF_SZ];
    int  amnt = 0, socket_desc;

    socket_desc = domain_socket_client_create(filename);
    if (socket_desc < 0) exit(EXIT_FAILURE);
    printf("1. Client %d connected to server.\n", getpid());
    fflush(stdout);

    snprintf(msg, MAX_BUF_SZ - 1, "Citizen %d: Penny for Pawsident!", getpid());
    amnt = write(socket_desc, msg, strlen(msg) + 1);
    if (amnt < 0) exit(EXIT_FAILURE);
    printf("2. Client %d request sent message to server.\n", getpid());
    fflush(stdout);

    if (read(socket_desc, msg, amnt) < 0) exit(EXIT_FAILURE);
    msg[amnt] = '\0';
    printf("3. Client %d reply received from server: %s\n", getpid(), msg);
    fflush(stdout);

    close(socket_desc);

    exit(EXIT_SUCCESS);
}

int
main(void)
{
    char *channel_name = "pennys_channel";
    int nclients = 2;
    int i;

    if (fork() == 0) server(nclients, channel_name);
    /* wait for the server to create the domain socket */
    sleep(1);
    for (i = 0; i < nclients; i++) {
        if (fork() == 0) client(channel_name);
    }
    /* wait for all of the children */
    while (wait(NULL) != -1);

    return 0;
}

Program output:

Server: New client connected with new file descriptor 4.
1. Client 309015 connected to server.
2. Client 309015 request sent message to server.
1. Client 309016 connected to server.
2. Client 309016 request sent message to server.
Server received message (sz 37): "Citizen 309015: Penny for Pawsident!". Replying!
Server: New client connected with new file descriptor 4.
3. Client 309015 reply received from server: Citizen 309015: Penny for Pawsident!
Server received message (sz 37): "Citizen 309016: Penny for Pawsident!". Replying!
3. Client 309016 reply received from server: Citizen 309016: Penny for Pawsident!

The server’s call to accept is the key difference of domain sockets from named pipes. It enables us to have per-client descriptors we can use to separately communicate (via read/write) to each client.

  • Transparent IPC. - We have to explicitly use the socket APIs.
  • Named IPC. - Uses a file name to name the domain socket.
  • Channel-based IPC. - When data is read, it is removed from the socket.
  • Multi-client communication. - The server separates each client’s communication into separate descriptors.

Aside: Sockets are the main way to communicate over the network (i.e. to chat with the Internet). The APIs you’d use to create network sockets are the same; it simply requires setting up the socket, and binding it, in a network-specific manner.

6.6 IPC Exercises

Find the programs (domain_socket_server.c and domain_socket_client.c):

Both require the 06/domain_sockets.h header file. You must run the server first to create the domain socket “file”. If you run the server and it complains with “server domain socket creation”, then you might need to rm the domain socket file on the command line first. It already exists from a previous run of the server, so the server cannot create it again!

Your job is to try to figure out what in the hey these do! Answer the following questions:

  • Describe what the server does to handle each client request.
  • How many clients can the server handle at a time?
  • Is the number of clients limited by the server waiting on client/child processes? Recall from our discussion of background tasks in a shell that wait is key to the system’s concurrency/behavior.
  • Describe what the client does. What data does it send to the server?
  • Describe the behavior when you use the client with the server.

The programs can be compiled directly with gcc (e.g. gcc domain_socket_server.c -o server; gcc domain_socket_client.c -o client). Use these programs on the command line to send data from the client to server.

  • How can you use the client and server programs to send data between them?
  • How can you make multiple clients connect to the server?

7 Reinforcing Ideas: Assorted Exercises and Event Notification

In the following, if you aren’t positive of the answer, please run the program! Note that we’re omitting error checking in these programs to keep them terse. Remember that in your programs, you must check and react to all errors. This will require you to use the man pages to look up the required #includes. If the output doesn’t match a high-level intuition, how would you modify the program to match the intuition? What are all of the potential output of the following programs? Why?

7.0.1 fork and Stream Behavior

7.0.2 Wait Behavior

  • What if we change wait_param to equal WNOHANG?
  • What if we change output_w_write to 1?
  • Why do we need to reap children? Why would this actually matter?

7.0.3 read Behavior

  • There are multiple bugs here. Find them and squish ’em.

7.0.4 Signals

What is the following doing?

  • What are the various aspects of the output? What do the numbers mean? What do the strings mean?
  • Explain what is happening before the “Moving on…”.
  • Explain what is happening after the “Moving on…”.
  • There is a lot of information there (lines being output) that isn’t useful to the programmer. Make the output more useful by printing only the parts of the output that a programmer would find useful.

7.0.5 Communication with Multiple Clients

Communicating with multiple clients is hard. Domain sockets help, but there are still challenges around blocking on reads. What if one client is very “slow”, and we block waiting for them?

#include "06/domain_sockets.h"

#include <errno.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <assert.h>
#include <stdlib.h> /* for exit() */

void
panic(char *msg)
{
    perror(msg);
    exit(EXIT_FAILURE);
}

void
client(char *filename, int slowdown)
{
    int socket_desc;
    char b;

    socket_desc = domain_socket_client_create(filename);
    if (socket_desc < 0) {
        perror("domain socket client create");
        exit(EXIT_FAILURE);
    }

    /* delay after creating connection, but before communicating */
    sleep(slowdown);
    if (write(socket_desc, ".", 1) == -1) panic("client write");
    if (read(socket_desc, &b, 1) == -1)   panic("client read");
    printf("c: %c\n", b);

    close(socket_desc);
    exit(EXIT_SUCCESS);
}

void
client_slow(char *filename)
{
    client(filename, 3);
}

void
client_fast(char *filename)
{
    client(filename, 1);
}

int
main(void)
{
    char *ds = "domain_socket";
    int socket_desc, i;

    socket_desc = domain_socket_server_create(ds);
    if (socket_desc < 0) {
        /* remove the previous domain socket file if it exists */
        unlink(ds);
        socket_desc = domain_socket_server_create(ds);
        if (socket_desc < 0) panic("server domain socket creation");
    }

    /* TODO: change this order. What changes? */
    if (fork() == 0) client_slow(ds);
    if (fork() == 0) client_fast(ds);

    /* handle two clients, one after the other */
    for (i = 0; i < 2; i++) {
        int new_client;
        char b;

        new_client = accept(socket_desc, NULL, NULL);
        if (new_client == -1) panic("server accept");

        /* read from, then write to the client! */
        if (read(new_client, &b, 1) == -1)   panic("server read");
        if (write(new_client, "*", 1) == -1) panic("server write");
        close(new_client);
    }

    close(socket_desc);
    /* reap all children */
    while (wait(NULL) != -1) ;

    return 0;
}
  • See the TODO in the main. When you change the order, what do you see?
  • Is there generally a problem when we use both fast and slow clients?

7.1 Revisiting IPC with Multiple Clients

7.1.1 System Services

The systemctl command enables you to understand many of the services on the system (of which there are many: systemctl --list-units | wc -l yields 244 on my system).

$ systemctl --list-units
...
  cups.service           loaded active     running   CUPS Scheduler
...
  gdm.service            loaded active     running   GNOME Display Manager
...
  ssh.service            loaded active     running   OpenBSD Secure Shell server
...
  NetworkManager.service loaded active     running   Network Manager
...
  openvpn.service        loaded active     exited    OpenVPN service
...

I’ve pulled out a few selections that are easier to relate to:

  • CUPS which is the service used to receive print jobs.
  • GDM which manages your graphical logins on Ubuntu.
  • OpenSSH which manages your remote logins through ssh.
  • NetworkManager that you interact with when you select which wifi hotspot to use.
  • OpenVPN which handles your VPN logins to the school network. You can see that the service is currently “exited” as I’m not running VPN now.

Each of these services communicates with many clients using domain sockets.

7.1.2 Understanding Descriptor Events with poll

Each of these services is a process that any client can send requests to. We’ve seen that domain sockets can help us to talk to many different clients, as each is represented with a separate file descriptor. How does the service process know which of the file descriptors has information available on it? Imagine the following case:

  1. a client, A, connects to a service.
  2. a client, B, connects to the same service.
  3. B immediately sends a request, while A goes into a sleep(100) (or, more realistically, simply does some expensive computation).

If the server issues a read on the descriptor for A, it will block for 100 seconds! Worse, it won’t service B despite it making a request immediately. Why are we waiting for a slow client when there is a fast client with data already available? Yikes.

We’d really like a facility that can tell us which descriptors have data and are ready to be read from, and which are ready to be written to! Luckily, UNIX comes to the rescue with its event notification APIs. These are APIs that let us understand when a file descriptor has an event (i.e. a client writes to it) and is now readable or writable. These include three functions: poll, select, and (the more modern) epoll. Let’s look at how to use poll!

7.1.3 Event Loops and poll

Let’s look at some pseudocode for using poll.
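
The pseudocode is omitted here; roughly, the idea is the following (a sketch of the structure, not compilable code):

/* A sketch of a poll-based event loop (pseudocode). */
while (1) {
    /* block until at least one descriptor has an event (or we time out) */
    poll(fds, nfds, timeout);

    for (each fd in fds with an event) {
        if (fd is the socket we accept clients on) accept the client and add its descriptor to fds;
        else if (fd is readable) read the request, process it, and reply;
        else if (fd was closed or errored) close it and remove it from fds;
    }
}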

This yields what is called an event loop – we loop, each time processing a single event. You see event loops in most GUIs (where each event is a key/mouse press), and in web programming where javascript callbacks are executed by the browser’s event loop. Importantly, we only process descriptors that have events, thus can avoid blocking on descriptors that don’t have available data! This solves our previous problem: a server won’t block awaiting communication with a client that is delayed, or never writes to a channel, instead only reading/writing to descriptors that are ready.

7.1.4 poll API

  • int poll(struct pollfd *fds, nfds_t nfds, int timeout) - We pass in an array of struct pollfds of length nfds, with each entry corresponding to a single file descriptor we want to get information about. The timeout is in milliseconds, and enables poll to return after that amount of time even if none of the file descriptors has an event. A negative timeout is interpreted as “infinite”, while 0 means that poll will return immediately with any current events.

Let’s check out the struct pollfd:
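
On Linux, poll.h defines it (modulo comments) as:

struct pollfd {
    int   fd;      /* the file descriptor to watch */
    short events;  /* the events we're interested in */
    short revents; /* output: the events that occurred */
};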

When we make the poll call, we populate the fd and events fields with the file descriptors we’re interested in, and which events we’re interested in retrieving. events is a “bitmap”, which means each bit in the value denotes a different type of event, and we bitwise-or event types together when writing them into events. These event types include:

  • POLLIN - is there data available to be read? If the file descriptor is the domain socket that we use for accept, then this POLLIN means that a new client request is ready to accept.
  • POLLOUT - can data be written without blocking?

We can almost always set events = POLLIN | POLLOUT as we want to wait for both.

When poll returns, we determine which events happened by looking at the contents of the revents field. In addition to POLLIN and POLLOUT which tell us if data is ready to be read or written, we have the following:

  • POLLHUP - The other side of the pipe closed its descriptor! Subsequent reads to the descriptor will return 0. We can likely close our descriptor.
  • POLLERR - Again, there was some sort of a problem with the descriptor, and we should likely close it, terminating communication.

7.1.5 Example poll Code

Let’s put this all together. For simplicity, in this example, we’ll assume that we can always write to a descriptor without blocking. This isn’t generally true if you’re writing large amounts of data, and in “real code” you’d also want to handle POLLOUT appropriately.

#include "06/domain_sockets.h"

#include <errno.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <assert.h>
#include <poll.h>
#include <stdlib.h> /* for exit() */

void
panic(char *msg)
{
    perror(msg);
    exit(EXIT_FAILURE);
}

void
client(char *filename)
{
    int i, socket_desc;

    sleep(1); /* await the domain socket creation by the server */
    socket_desc = domain_socket_client_create(filename);
    if (socket_desc < 0) {
        perror("domain socket client create");
        exit(EXIT_FAILURE);
    }

    for (i = 0; i < 5; i++) {
        char b;
        if (write(socket_desc, ".", 1) == -1) panic("client write");
        if (read(socket_desc, &b, 1) == -1)   panic("client read");
        printf("c %d: %c\n", getpid(), b);
    }

    close(socket_desc);
    exit(EXIT_SUCCESS);
}

/* we can track max 16 fds */
#define MAX_FDS 16

void
server(char *filename)
{
    int socket_desc, num_fds = 0;
    struct pollfd poll_fds[MAX_FDS];

    /*** Initialize the domain socket and its pollfd ***/
    socket_desc = domain_socket_server_create(filename);
    if (socket_desc < 0) {
        unlink(filename); /* remove the previous domain socket file if it exists */
        socket_desc = domain_socket_server_create(filename);
        if (socket_desc < 0) panic("server domain socket creation");
    }

    /* Initialize all pollfd structs to 0 */
    memset(poll_fds, 0, sizeof(struct pollfd) * MAX_FDS);
    poll_fds[0] = (struct pollfd) {
        .fd     = socket_desc,
        .events = POLLIN,
    };
    num_fds++;

    /*** The event loop ***/
    while (1) {
        int ret, new_client, i;

        /*** Poll; if we don't get a client for a second, exit ***/
        ret = poll(poll_fds, num_fds, 1000);
        if (ret == -1) panic("poll error");
        /*
         * If we timeout, break out of the loop.
         * This isn't what you'd normally do as servers stick around!
         */
        if (ret == 0) break;

        /*** Accept file descriptor has a new client connecting! ***/
        if (poll_fds[0].revents & POLLIN) {
            if ((new_client = accept(socket_desc, NULL, NULL)) == -1) panic("server accept");
            /* add a new file descriptor; we can track at most MAX_FDS of them */
            assert(num_fds < MAX_FDS);
            poll_fds[num_fds] = (struct pollfd) {
                .fd = new_client,
                .events = POLLIN
            };
            num_fds++;
            poll_fds[0].revents = 0;
            printf("server: created client connection %d\n", new_client);
        }
        /*** Communicate with clients! ***/
        for (i = 1; i < num_fds; i++) {
            if (poll_fds[i].revents & (POLLHUP | POLLERR)) {
                printf("server: closing client connection %d\n", poll_fds[i].fd);
                poll_fds[i].revents = 0;
                close(poll_fds[i].fd);
                /* replace the fd to fill the gap */
                poll_fds[i] = poll_fds[num_fds - 1];
                num_fds--;
                /* make sure to check the fd we used to fill the gap */
                i--;

                continue;
            }
            if (poll_fds[i].revents & POLLIN) {
                char b;

                poll_fds[i].revents = 0;
                /* our server is simply sending a '*' for each input character */
                if (read(poll_fds[i].fd, &b, 1) == -1) panic("server read");
                if (write(poll_fds[i].fd, "*", 1) == -1) panic("server write");
            }
        }
    }
    close(socket_desc);

    exit(EXIT_SUCCESS);
}

int
main(void)
{
    char *ds = "domain_socket";

    if (fork() == 0) client(ds);
    if (fork() == 0) client(ds);
    server(ds);

    return 0;
}

8 Libraries

Up till now, we’ve been focusing on UNIX programming APIs, and how to interact with various aspects of the system. Part of this discussion has been how various programs can be orchestrated to cooperate. Pipelines enable larger functionalities to be composed out of multiple programs in which the output of one goes to the input of the next. System servers provide services to clients that use IPC to harness their functionality. This enables servers that control system resources (like wifi) to let many clients have limited and controlled access to those resources.

The next few chapters dive under the hood of processes. We’ll investigate

  1. how programs are represented and how to understand their format,
  2. how programs are organized and can share code, and
  3. how they interact with the system.

How do we think about our system’s programs? We’re used to writing our own programs, and like to think of the code we write being relatively self-contained. In contrast, quite a bit of system programming is about providing functionality that can be used by any program on the system. What if you wanted to implement the world’s best key-value store, and wanted to enable anyone to use your implementation!?

We think of these shared functionalities as falling into one of two categories:

  1. Libraries that represent bodies of code that can be pulled into our applications. You’re already quite used to these! libc is the core library that provides all of the functions that we use when we include many of the headers in the class, for example, #include <stdlib.h>, #include <stdio.h>, or #include <string.h>.
  2. Services which track different aspects of the system, and provide information to our programs. They tend to be programs that are always running on your system, and your programs can communicate with them to harness their functionality. These include services that receive print requests, that service ssh sessions, that provide your vpn connection, and that display your graphical user interface, and many other functions. We’ve already seen many of the IPC and communication foundations that make these tick.

In this chapter, we’ll focus on libraries as a foundation for sharing software in systems.

8.1 Libraries - Goals and Overview

Before we start, consider:

  • When you #include a ton of files, do you think that all of the related functionality is compiled into your program? That would mean that every program has the functionality compiled into it.
  • If so, how much memory and disk space do you think that will take?
  • Do you think there are other options? What might they be?

Visually, this is what it would look like to compile with “libraries” that are defined as normal *.o files.

We know how to generate *.o files, and to combine them together as shown here. If some other people’s code that we depend on (e.g. string.h functionality) is provided as *.o files, we could compile with it as such. The questions above should let you think about the trade-offs with this approach.

The goals of libraries are to:

  • Enable programs to leverage shared implementations of specific functionality (e.g. the contents of string.h).
  • Provide a uniform set of programming APIs to programs.
  • Attempt to save memory by making all programs in the system as small as possible by not taking memory for all of the library code in each program.
  • Enable the system to upgrade both libraries and programs (for example, if a security compromise is found).

Libraries are collections of functionality that can be used by many different programs. They are code that expands the functionality of programs. This differs from services, which are separate programs. Libraries have two core components that mirror what we understand about C:

  1. header files that share the types of the library API functions and data, and
  2. the code to implement those APIs.

Let’s set up some visual nomenclature.

We’ll talk about the files that are part of your normal programs (the *.c and *.o on the left), those that are part of the library, and files to represent the library.

There are two main ways library code is integrated into programs:

  • Static libraries. These are added into your program when you compile it.
  • Shared or dynamic libraries. These are integrated into your program, dynamically, when you run the program.

We’ll discuss each of these in subsequent sections.

8.2 Libraries - Header Files

We can see most of the header files in /usr/include, within which you’ll find some familiar files:

$ ls /usr/include/std*
/usr/include/stdc-predef.h  /usr/include/stdint.h  /usr/include/stdio_ext.h  /usr/include/stdio.h  /usr/include/stdlib.h

Yes, this means that you have the source code for much of the standard library at your fingertips! But how does gcc know where to find all of these files when we use #include <...>? The following gcc incantation gives us a good idea:

$ gcc -xc -E -v -
...
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/9/include
 /usr/local/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.

These are all of the “include paths” that are searched, by default, when we do an #include <>.

Aside: In Makefiles, we often want to tell the compiler to look for header files in our own directories. Thus, we update the include paths by adding multiple -I<dir> options to the gcc command line. We can see how this works:

$ gcc -I. -xc -E -v -
...
#include <...> search starts here:
.
/usr/lib/gcc/x86_64-linux-gnu/9/include
...

The include paths are searched in order, so you can see that when we explicitly tell gcc to include a directory (-I. here), we’re asking it to check in the current directory (.) first.

8.3 Linking Undefined Symbols to References Across Objects

Before we proceed to figure out how multiple *.o objects are combined into an executable, let’s learn what it means for one object to depend on another for some functionality. To understand this, we have to understand how object files (*.o) and binaries think about symbols and linking them together.

When your program uses a library’s API, the library’s code is linked into your program. Linking is the act of taking any symbols that are referenced, but not defined, in your object (*.o) files, and finding the definitions of those symbols in the objects that provide them. A symbol is a function or global variable in your program. Each symbol has a representation in your object files so that it can be referenced across objects. We have to understand the linking operation to dive into how libraries can be added into, and thus accessible from, your program. Let’s peek into how our objects think about the world using a number of programs that can introspect on the objects, including nm, objdump, and readelf.

As an example, let’s look at your ptrie implementation. We know that each of the tests (in tests/0?-*.c) depends on your ptrie implementation in ptrie.c. What does that look like?

$ nm tests/01-add.o
                 U _GLOBAL_OFFSET_TABLE_
000000000000022a T main
                 U printf
                 U ptrie_add
                 U ptrie_allocate
                 U ptrie_free
                 U __stack_chk_fail
0000000000000000 t sunit_execute
0000000000000000 d test
0000000000000184 T test_add
0000000000000144 T test_alloc

The left column is the address of the symbol, and the character in the middle column gives us some information about the symbol:

  • T - this symbol is part of the code of the *.o file, and is visible outside of the object (i.e. it isn’t static). Capital letters mean that the symbol is visible outside the object which simply means it can be linked with another object.
  • t - this symbol is part of the code, but is not visible outside of the object (it is static).
  • D - a global variable that is visible.
  • d - a global variable that is not visible.
  • U - an undefined symbol that must be provided by another object file.

For other symbol types, see man nm.

We can see what we’d expect: the test defines its own functions (e.g. main, test_add), but requires the ptrie_* functions and printf to be provided by another object file. Now let’s check out the ptrie objects.

$ nm ptrie.o
...
                 U calloc
                 U free
...
                 U printf
00000000000005ae T ptrie_add
00000000000003fc t ptrie_add_internal
0000000000000016 T ptrie_allocate
0000000000000606 T ptrie_autocomplete
00000000000000f8 T ptrie_free
...
                 U putchar
                 U puts
                 U strcmp
                 U strdup
                 U strlen

We can see that the ptrie.o object depends on other objects for all of the functions we’re using from stdio.h, malloc.h, and string.h; provides all of the functions that are part of the public API (e.g. ptrie_add); and defines some other symbols (e.g. ptrie_add_internal) that cannot be linked to by other object files.

After we link the ptrie object into the test (with gcc ... tests/01_add.o ptrie.o -o tests/01_add.test), we see:

$ nm tests/01_add.test | grep alloc
                 U calloc@@GLIBC_2.2.5
0000000000001566 T ptrie_allocate
00000000000013ad T test_alloc

Now we can see that there are no longer any undefined references (U) to ptrie_allocate, so the test object has now found that symbol within the ptrie.o object. Additionally, we can see that some symbols (e.g. calloc here) are still undefined and have some mysterious @@... information associated with them. You might guess that this somehow tells us that the function should be provided by the standard C library (glibc on Ubuntu).

Let’s see another example visually using the libexample/* files:

The program on the left depends on the functionality of the object on the right, the library. We can see from the nm output that the left has undefined symbols for bar which is defined in the library. On the right, we can see the symbols defined in the object (with T, saying it is part of the text, or code, i.e. a function). Both are compiled together into an executable binary, prog_naive which shows that the reference from prog.o is resolved or linked with the definition in example.o.

Where are we? OK, let’s summarize so far. Objects can have undefined symbol references that get linked to the symbols when combined with the objects in which they are defined. How is this linking implemented?

Question: What does a file format mean? We talk about the format of webpages being html, images being png, and documents being docx or pdf. What do these formats mean? What is the format of object files and executable binaries?

8.3.1 ELF Object Format

First, we have to understand something that is a little amazing: all of our objects and binaries have a defined file format called the Executable and Linkable Format (ELF)20.

A representation of the ELF binary program format by Ange_Albertini.

We can confirm that these files are ELF with

$ xxd ptrie.o | head -n 1
00000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............

xxd dumps out a file in hexadecimal format, and head just lets us filter out only the first line (i.e. at the head). The first characters in most formats often denote the type of file, in this case an ELF file.

This reveals a cool truth: programs are data, plain and simple. They happen to be data of a very specific format, ELF. nm, readelf, and objdump are all programs that simply understand how to dive into the ELF format, parse, and display the information.

As programs are simply data encapsulated in the ELF format, we can start to understand what it means to link two objects together.

An example demonstrating what it means to link two objects together. In this case, we can see the ELF object tells us where in the prog.o binary the reference to bar is, and gcc patches up prog.o as part of linking to reference bar in example.o.

An in-depth explanation follows. Now we get to an interesting part of the ELF objects:

$ readelf -r tests/01_add.o | grep ptrie
000000000151  001500000004 R_X86_64_PLT32    0000000000000000 ptrie_allocate - 4
000000000179  001600000004 R_X86_64_PLT32    0000000000000000 ptrie_free - 4
000000000191  001500000004 R_X86_64_PLT32    0000000000000000 ptrie_allocate - 4
0000000001dd  001800000004 R_X86_64_PLT32    0000000000000000 ptrie_add - 4
00000000021f  001600000004 R_X86_64_PLT32    0000000000000000 ptrie_free - 4

The -r flag here outputs the “relocation” entries of the ELF object.

This output says that there are a number of references to the ptrie_* functions in the code, and enumerates each of them. You can imagine that these are simply all of the places in the code that these functions are called. The first column gives us the offset into the ELF object where the reference to the function is made – in this case where it is called. For example, at offset 0x151 into the ELF object, we have a call to ptrie_allocate. This is important because tests/01_add.o does not know where the ptrie_allocate function is yet. It will only know that when it links with the ptrie implementation!

Let’s check out what this looks like in the ELF object’s code. For this, we’ll use objdump’s ability to dump out the “assembly code” for the object file. It will also try to print out the C code that corresponds to the assembly, which is pretty cool.

$ objdump -S tests/01_add.o
...
sunit_ret_t
test_alloc(void)
{
 144:   f3 0f 1e fa             endbr64
 148:   55                      push   %rbp
 149:   48 89 e5                mov    %rsp,%rbp
 14c:   48 83 ec 10             sub    $0x10,%rsp
    struct ptrie *pt;

    pt = ptrie_allocate();
 150:   e8 00 00 00 00          callq  155 <test_alloc+0x11>
 155:   48 89 45 f0             mov    %rax,-0x10(%rbp)
...

I’ve printed out the addresses in the object around offset 0x151 because the first reference to ptrie_allocate is made there (see the readelf output above). We can see the call instruction that should invoke ptrie_allocate. However, you can see that the address the binary has for that function is 00 00 00 00, or NULL! This is what it means for a function to be undefined (recall: the nm output for the symbol is U).

Now we can understand what a linker’s job is:

What does a Linker do? A linker takes all of the undefined references to symbols from the relocation records, and rewrites the binary to populate those references with the address of that symbol (here updating 00 00 00 00 to the actual address of ptrie_allocate).

In our Makefiles, we use gcc both for compiling *.c files into objects and for linking objects together into a binary.

8.3.2 Linking Summary

Let’s back up and collect our thoughts.

  • Our programs are collections of a number of ELF object files.
  • We’ve seen that some symbols in an object file are undefined if they are functions or variables that are not defined in that object’s .c file.
  • Though their types are defined in header files, their implementation must be found somewhere.
  • Each reference to an undefined symbol has a relocation entry that enables the linker to update the reference once it knows the ELF object that provides it.
  • When another ELF object exports the symbol that is undefined elsewhere, the objects can be linked to turn calls to undefined functions into actual function calls, and references to undefined global variables into actual pointers to memory.

This is the foundation for creating large programs out of small pieces, and to enable some of those pieces to be shared between programs.

We’ll also see that libraries enable an opportunity to save memory. For example,

  • static libraries enable only the object files that are needed by a program to be compiled into it, and
  • dynamic libraries enable the code for a library to be shared between all processes in the system that execute it.

8.4 Static Libraries

A static library is simply a collection of ELF object (*.o) files created with gcc -c. They are collected together into a static library file that has the name lib*.a. You can think of these files as a .zip or .tar file containing the ELF objects.

Static libraries have the form lib*.a and are created with the ar (archive) program from a collection of objects. An example from the ptrie library (expanded from its Makefile) that creates libptrie.a:

$ ar -crs libptrie.a *.o

So a static library is just a collection of ELF objects created by our compilation process. If we are writing a program that wants to use a library, we first make sure to include its header file(s). Then, when compiling, we tell the compiler which library we want to link with by using a few compiler flags:

$ gcc -o tests/01_add.test 01_add.o -L. -lptrie

The last two flags are the relevant ones:

  • -L. says “look for static libraries in the current directory (i.e. .)” – other directories can be included, and a few default directories are used, and
  • -lptrie says “please link me with the ptrie library”, which should be found in one of the given directories, in a file called libptrie.a.

Note that the linker already searches a few directories for libraries (i.e. default -L paths):

$ gcc -print-search-dirs | grep libraries | sed 's/libraries: =//' | tr -s ":" '\n'
/usr/lib/gcc/x86_64-linux-gnu/9/
/usr/lib/gcc/x86_64-linux-gnu/9/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/9/
/usr/lib/gcc/x86_64-linux-gnu/9/../../../../x86_64-linux-gnu/lib/x86_64-linux-gnu/
/usr/lib/gcc/x86_64-linux-gnu/9/../../../../x86_64-linux-gnu/lib/../lib/
/usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/9/
/usr/lib/gcc/x86_64-linux-gnu/9/../../../x86_64-linux-gnu/
/usr/lib/gcc/x86_64-linux-gnu/9/../../../../lib/
/lib/x86_64-linux-gnu/9/
/lib/x86_64-linux-gnu/
/lib/../lib/
/usr/lib/x86_64-linux-gnu/9/
/usr/lib/x86_64-linux-gnu/
/usr/lib/../lib/
/usr/lib/gcc/x86_64-linux-gnu/9/../../../../x86_64-linux-gnu/lib/
/usr/lib/gcc/x86_64-linux-gnu/9/../../../
/lib/
/usr/lib/

Many of these paths are in directories that any user can access, which is how the functionality of these libraries can be accessed by any program wishing to use them. Since we compile our ptrie library as a static library, you’ve already seen one example of these in use.

How static libraries are created, and interact with our programs. The .a files include all library objects, but only those that are needed by the program are compiled into it.

8.4.1 Saving Memory with Static Libraries

Static libraries do provide some facilities for trying to shrink the amount of memory required for library code. If this was all done naively, then all of the object files in a static library could get loaded into a program with which it is linked. This means that for N programs that link with a library whose objects take X bytes, we’ll have N × X bytes devoted to storing programs on disk, and running the programs in memory (if they all run at the same time). Some static libraries can be quite large.

Instead, static libraries are smarter. If a static library contains multiple .o files, only those object files that define symbols that the program leaves undefined are linked into the program. This means that when designing a static library, you often want to break it into multiple .o files along the lines of the different functionalities that are separately used by different programs.

A graphical depiction of how library memory gets added into a program using static libraries. The program is only dependent on a single object within the library, so we only need to compile that object into the resulting executable program.

Some projects take this quite far. For example, musl libc is a libc replacement, and it separates almost every single function into a separate object file (i.e. in a separate .c file!) so that only the exact functions that are called are linked into the program.

8.5 Shared/Dynamic Libraries

Shared or dynamic libraries (for brevity, I’ll call them only “dynamic libraries”, but both terms are commonly used) are linked into a program at runtime when the program starts executing as part of exec.

Recall that even executable binaries might still have undefined references to symbols. For example, see calloc in the example below:

$ nm tests/01_add.test | grep calloc
                 U calloc@@GLIBC_2.2.5

Though we can execute the program tests/01_add.test, it has references to functions that don’t exist in the program! How can we possibly execute a program that has undefined functions; won’t the calls to calloc here be NULL pointer dereferences?

To understand how dynamic linking works, let’s look at the output of a program that tells us about dynamic library dependencies, ldd.

$ ldd tests/01_add.test
        linux-vdso.so.1 (0x00007ffff3734000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb502adf000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fb502cfa000)

ldd simply parses through the ELF program and determines which dynamic libraries it requires. For now, we’ll ignore the linux-vdso, but the other two entries are interesting. We can see that the C standard library, libc, is being provided by /lib/x86_64-linux-gnu/libc.so.6. These are both dynamic libraries, as are most *.so.?* and *.so files. If we check out that object, we see that it provides calloc.

$ objdump -T /lib/x86_64-linux-gnu/libc.so.6  | grep calloc
000000000009ec90 g    DF .text  0000000000000395  GLIBC_2.2.5 __libc_calloc
000000000009ec90  w   DF .text  0000000000000395  GLIBC_2.2.5 calloc

So this library provides the symbols we require (i.e. the calloc function)!

8.5.1 Compiling and Executing with Dynamic Libraries

If we want our program to use a dynamic library, we have to compile it quite similarly to when we wanted to use static libraries:

$ gcc -o dyn_prog *.o -L. -lptrie

So everything is somewhat normal here; we’re saying “look for the library in this directory”, and “link me with the ptrie library”. To create the dynamic library:

$ gcc -Wall -Wextra -fpic -I. -c -o ptrie.o ptrie.c
$ gcc -shared -o libptrie.so ptrie.o

The first line is the normal way to create an object file from C, but includes a new flag, -fpic. This tells the compiler to generate “Position Independent Code”, or PIC, which is code that can, seemingly magically, be executed when it is loaded at any address! The dynamic library cannot assume which addresses it will be loaded at, as many dynamic libraries might be loaded into an executable, thus the PIC requirement.

The second line creates the dynamic (shared) library. By convention, all dynamic libraries end with .so which you can think of as “shared object”. This is the line that is dynamic-library equivalent to the ar command for static libraries.

Now we have a binary that has been told where the library is; let’s execute it!

$ ./dyn_prog
./dyn_prog: error while loading shared libraries: libptrie.so: cannot open shared object file: No such file or directory

Whaaaaaa? If we dive in a little bit, we can see:

$ nm ./dyn_prog | grep ptrie_alloc
                 U ptrie_alloc

But I thought that we wanted all symbols to be defined when we create a binary? Why is the library’s ptrie_alloc not linked into the program? We can see that not all symbols are defined in a program when we are using dynamic libraries: they are linked only when we run the program!

We now can see the main practical difference between static and dynamic libraries:

  • Static libraries are compiled into the program when you compile with the -lptrie directive.
  • Dynamic libraries are not compiled into the program, and are instead linked into the program when it is executed (i.e. at the time of exec).

We can start to see why the program won’t execute when we look at its dynamic library dependencies:

$ ldd ./dyn_prog
        linux-vdso.so.1 (0x00007ffdf3ca0000)
        libptrie.so => not found
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff9921e8000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ff9923f8000)

It isn’t finding libptrie.so!

To execute the program, we have to properly set an environment variable that will tell the program, when it is executed, where to look for the dynamic library:

$ LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH ./dyn_prog
...success...

This is some strange shell syntax. LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH essentially says “the environment variable called LD_LIBRARY_PATH should be updated to prepend the directory ./ onto the front of it”. This is conceptually similar to something like lib_path = "./:" + lib_path in a language that supports string concatenation. When ./dyn_prog is executed, the updated LD_LIBRARY_PATH environment variable is visible in that program using the normal getenv. So there is some part of the initialization of ./dyn_prog that looks at this environment variable to figure out where to look for the dynamic library.
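
We can confirm that the variable is visible from within a program – a small sketch using getenv:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	/* The same environment variable that ld consults is visible to us */
	char *path = getenv("LD_LIBRARY_PATH");

	printf("LD_LIBRARY_PATH=%s\n", path == NULL ? "(unset)" : path);

	return 0;
}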

To confirm why we can now execute the program, we can again look at the library dependencies:

$ LD_LIBRARY_PATH=./:$LD_LIBRARY_PATH ldd prog_dynamic
        linux-vdso.so.1 (0x00007ffc939f8000)
        libptrie.so => ./libptrie.so (0x00007f70d316f000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f70d2f66000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f70d3236000)

If we expand the LD_LIBRARY_PATH to include the current directory, the program can find the library.

If we want to see the default paths that are searched for dynamic libraries:

$ ldconfig -v 2>/dev/null | grep -v ^$'\t'
/usr/local/lib:
/lib/x86_64-linux-gnu:
/lib32:
/libx32:
/lib:

In fact, ldconfig -v will also print out all of the potential default dynamic libraries we can link with.

Using dynamic libraries to compile a program. Additional flags are required when compiling the library files (for PIC), and we must tell the system that the library is dynamic when compiling using the -shared flag. But even then, the executable doesn’t have the library functionality linked into it. Only when we execute the program does ld dynamically link in the library. For ld to know where to look for the library, we also have to tell it additional paths in which to look using an environment variable.

8.5.2 exec with Dynamic Linking

How does the libc.so library get linked into our program? Diving in a little bit further, we see that ld is our program’s “interpreter”:

$ readelf --program-headers tests/01_add.test
Elf file type is DYN (Shared object file)
Entry point 0x1180
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000000008e8 0x00000000000008e8  R      0x1000
  LOAD           0x0000000000001000 0x0000000000001000 0x0000000000001000
                 0x0000000000000e55 0x0000000000000e55  R E    0x1000
...

A lot is going on here, but we see the reference to ld-linux-x86-64.so.2 that we saw previously in the ldd output. We can now see that the library (that I’ll simply call ld) is a “program interpreter” (see “Requesting program interpreter: /lib64/ld-linux-x86-64.so.2”) for our program. This is the core of understanding how our program can execute while having undefined references to functions provided by another library.

When we call exec, we believe that the program we execute takes over the current process and starts executing. This is not true! Instead, if a program interpreter is defined for our program (as we see: it is), the interpreter program’s memory is loaded alongside our program, and then the interpreter is executed instead of our program! It is given access to our program’s ELF object, and ld’s job is to finish linking and loading our program. We had to add ./ to the LD_LIBRARY_PATH to make the program execute because ld reads that environment variable when it is trying to link and load your program!

8.5.3 Diving into Dynamic Linking at exec

But wait. Why does exec end up loading ld, which then loads our program? Why load ld to link our program instead of just running our program? Because ld also loads and links all of the dynamic libraries (.sos) that our program depends on!!!

We can confirm that ld is loaded into our program by executing it in gdb, blocking it on a breakpoint, and outputting its memory maps (cat /proc/262648/maps on my system):

555555554000-555555555000 r--p 00000000 08:02 527420                     /home/gparmer/repos/gwu-cs-sysprog/22/hw_solns/02/tests/01_add.test
555555555000-555555556000 r-xp 00001000 08:02 527420                     /home/gparmer/repos/gwu-cs-sysprog/22/hw_solns/02/tests/01_add.test
555555556000-555555557000 r--p 00002000 08:02 527420                     /home/gparmer/repos/gwu-cs-sysprog/22/hw_solns/02/tests/01_add.test
555555557000-555555558000 r--p 00002000 08:02 527420                     /home/gparmer/repos/gwu-cs-sysprog/22/hw_solns/02/tests/01_add.test
555555558000-555555559000 rw-p 00003000 08:02 527420                     /home/gparmer/repos/gwu-cs-sysprog/22/hw_solns/02/tests/01_add.test
7ffff7dbe000-7ffff7de3000 r--p 00000000 08:02 2235639                    /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7de3000-7ffff7f5b000 r-xp 00025000 08:02 2235639                    /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7f5b000-7ffff7fa5000 r--p 0019d000 08:02 2235639                    /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7fa5000-7ffff7fa6000 ---p 001e7000 08:02 2235639                    /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7fa6000-7ffff7fa9000 r--p 001e7000 08:02 2235639                    /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7fa9000-7ffff7fac000 rw-p 001ea000 08:02 2235639                    /usr/lib/x86_64-linux-gnu/libc-2.31.so
7ffff7fac000-7ffff7fb2000 rw-p 00000000 00:00 0
7ffff7fc9000-7ffff7fcd000 r--p 00000000 00:00 0                          [vvar]
7ffff7fcd000-7ffff7fcf000 r-xp 00000000 00:00 0                          [vdso]
7ffff7fcf000-7ffff7fd0000 r--p 00000000 08:02 2235423                    /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7fd0000-7ffff7ff3000 r-xp 00001000 08:02 2235423                    /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7ff3000-7ffff7ffb000 r--p 00024000 08:02 2235423                    /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7ffc000-7ffff7ffd000 r--p 0002c000 08:02 2235423                    /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7ffd000-7ffff7ffe000 rw-p 0002d000 08:02 2235423                    /usr/lib/x86_64-linux-gnu/ld-2.31.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

Recall that this dumps all of the memory segments of the process.

There are two important observations:

  • /usr/lib/x86_64-linux-gnu/libc-2.31.so is present in the process, so it somehow got linked and loaded into the process.
  • /usr/lib/x86_64-linux-gnu/ld-2.31.so is also present in the process; indeed it was the first code to execute in the process!

8.5.4 Dynamic Loading Summary

Let’s summarize the algorithm for executing a dynamically loaded program.

  1. exec is called to execute a new program.
  2. The program’s ELF file is parsed, and if a “program interpreter” is set, the interpreter is instead executed.
  3. In most cases, the interpreter is ld, and it is loaded into the process along with the target program.
  4. Execution is started at ld, and it:

    1. Loads all of the dynamic libraries required by the program, looking at the LD_LIBRARY_PATH environment variable to help look for them.
    2. Links the program such that all of its references to library symbols (i.e. calloc above) are set to point to library functions.
    3. Executes the initialization procedures (and eventually main) of our program.

This is all quite complex. Why would we ever want all of this complexity over the simplicity of static libraries?

8.5.5 Saving Memory with Dynamic Libraries

First, let’s look at how much memory a library takes up.

$ ls -lh /usr/lib/x86_64-linux-gnu/libc-2.31.so
-rwxr-xr-x 1 root root 2.0M Dec 16  2020 /usr/lib/x86_64-linux-gnu/libc-2.31.so

So libc takes around 2MiB. Other libraries take quite a bit more:

  • Graphics card drivers (e.g. libradeonsi_dri.so is 23 MiB)
  • Database clients (e.g. libmysqlclient.so is 7.2 MiB)
  • Languages (e.g. firefox’s javascript engine, libmozjs.so is 11 MiB, and webkit’s libjavascriptcoregtk.so is 26 MiB)

If N programs must load in multiple libraries (totaling, say, X MiB), then N × X MiB is consumed. For context, my system is currently running 269 programs (ps aux | wc -l), so this memory can add up!

Dynamic libraries enable us to do a lot better. The contents of a library’s memory are the same regardless of the process it is present in, and an operating system has a neat trick: it can make the same memory appear in multiple processes if it is identical and cannot be modified. As such, dynamic libraries typically only require memory for each library once in the system (only X), as opposed to once for each process.

A graphical depiction of how library memory gets added into a program using dynamic libraries. The libraries are only linked into the processes when they are executed. When that happens, the same memory for the library is mmaped into each process, thus the system only requires memory for a single copy of the library.

8.6 Static vs. Dynamic Library Memory Usage Summary

When a program is linked with a static library, the result is relatively small as only the objects in the library that are necessary are linked into the program. When executed as a process, the memory consumed will only include the necessary objects of the library. In contrast, a program that uses a dynamic library links in the entire library at runtime into a process. Thus, for a single process, a static library approach is almost always going to be more memory-efficient.

However, in a system with many executing processes, paying the memory cost per-library, rather than per-process, results in significantly less memory consumption. Requiring only a single copy of a larger dynamic library, shared between all processes, uses less memory than multiple copies of the smaller, per-program compilations of the library.

9 Organizing Software with Dynamic Libraries: Exercises and Programming

Let’s dive a little deeper into how software is structured on our system, and the trade-offs that different libraries make in the system.

9.1 Exercise: Understanding Library Memory Consumption

Let’s investigate the libexample from the figures in last week’s lectures. See the 08/libexample/ directory for examples of using normal linking (make naive), static library linking (make static), and dynamic library linking (make dynamic). The source includes:

  • prog and prog2 that are programs that use libraries.
  • example.c is a single file that creates the first library which generates libexample.a and libexample.so.
  • example2.c and example2b.c are the files that create the second library which generates libexample2.a and libexample2.so.

Each of the library objects uses around 750KB of memory. To help you set the LD_LIBRARY_PATH to execute the dynamic libraries, you can run the dynamic library examples using

$ ./run_dyn.sh ./prog_dynamic

  1. Use nm and ldd to understand which libraries, and which object files within those libraries, each of the programs requires.
  2. Use ls -lh to see the size of each of the resulting executable programs. Explain the differences in sizes.

9.2 Library Trade-offs

  • Question: What types of systems might want to always use only static libraries?
  • Question: What types of systems might want to always use only dynamic libraries?
  • Question: OSX uses dynamically linked libraries for a few common and frequently used libraries (like libc), and static libraries for the rest. Why?

9.3 Programming Dynamic Libraries

We saw how libraries enable the sharing of functionality between programs. Static libraries are convenient, and only required objects are linked with your program at compile time. In contrast, dynamic libraries are linked at runtime before the main of your program executes.

9.3.1 Plugins and Dynamic Library APIs

In addition to enabling shared library code, dynamic libraries also enable explicit means to load libraries programmatically. Namely, UNIX provides an API that enables your process to load a dynamic library into itself. This is kinda crazy: a running process with a set of code and data can add a library with more code and data into itself dynamically. This effectively means that your program can change its own code as it executes by adding and removing dynamic libraries. Being able to load library code into your process as your program executes (at runtime) is an instance of self-modifying code, and it enables a few very cool features including:

  • dynamic update of services in which new versions of key computations in a service can be dynamically loaded in (and old versions eventually removed),
  • plugin architectures that enable interesting functionalities to be loaded into a process dynamically – some of the most popular image and video processing programs (e.g. ffmpeg) handle different formats with separate dynamic libraries, thus enabling the processing of any image or video by simply loading a new library, and
  • languages, such as python, dynamically load libraries (many or most are written in C) into a process when the python code imports them. Since the imports are part of the code, the corresponding dynamic libraries must be loaded by the python runtime dynamically. This enables your dynamic (slow) python code to load and leverage fast C libraries.

Example programmatic uses of dynamic libraries.

This API is provided by the linker (ld.so) that loaded the ELF object for the currently executing process. What does the API that enables this dynamic loading look like?

  • handle = dlopen(file_path, flags) - Open and load in a dynamic library. This returns a handle to the library that you should pass into the latter functions. You can read about the flags in man dlopen. Unless otherwise noted, I’d recommend using RTLD_NOW | RTLD_LOCAL. Returns NULL if it cannot find the dynamic library.
  • dlclose(handle) - Close and unload a library that was previously loaded.
  • symb_ptr = dlsym(handle, symbol) - pass in a symbol which is a string holding a function or variable name in the dynamic library. It should return a pointer to that symbol, which you can then call (if it is a function pointer), or dereference if it is a variable.
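
The program that produced the output below isn’t shown; here is a minimal sketch of what it likely looks like, assuming a hypothetical ./libexample.so that exports a function answer returning 42:

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	void *handle;
	int (*answer)(void);

	/* Load the (hypothetical) library; link with -ldl on older systems */
	handle = dlopen("./libexample.so", RTLD_NOW | RTLD_LOCAL);
	if (handle == NULL) {
		fprintf(stderr, "dlopen: %s\n", dlerror());
		exit(EXIT_FAILURE);
	}
	/* Look the symbol up by name, and call it through a function pointer */
	answer = (int (*)(void))dlsym(handle, "answer");
	printf("%d\n", answer());
	dlclose(handle);

	return 0;
}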

Program output:

42

As dlsym returns a pointer to the symbol passed in (as a string), it means that you end up seeing a lot of function pointers when you’re using plugin infrastructures. See (a) in the image below.

We can also use dlsym to ask for a symbol in any of the installed libraries. To do so, we pass in the “special pseudo-handles” RTLD_DEFAULT or RTLD_NEXT.
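
A sketch of what the program producing the output below might look like (the retrieved symbol is libc’s write):

#define _GNU_SOURCE /* for RTLD_DEFAULT */
#include <dlfcn.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char *msg = "Look at me! Diving into your dynamic libraries...pulling out the symbols\n";
	ssize_t (*w)(int, const void *, size_t);

	/* Ask for `write` in any of the already-loaded libraries (here, libc) */
	w = (ssize_t (*)(int, const void *, size_t))dlsym(RTLD_DEFAULT, "write");
	w(STDOUT_FILENO, msg, strlen(msg));

	return 0;
}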

Program output:

Look at me! Diving into your dynamic libraries...pulling out the symbols

This example doesn’t have much of a point, as we could directly call write rather than going through the symbol we retrieved from dlsym. See (b) in the image below.

Three programmatic means to interact with dynamic libraries.

9.3.2 Interposing Libraries on Library APIs

One powerful mechanism that dynamic libraries enable is interposing on library function calls. Your program might believe that it calls malloc when it invokes the function “malloc”, but you can orchestrate it so that it instead calls a malloc in your library! You can then, if you choose, invoke the normal library calls, which is why this is known as interposing on the library calls. But how?

The environment variable LD_PRELOAD can be used to specify a dynamic library to load into the process to interpose on any library calls. A trivial use of this might be to log (i.e. write out to a file) all calls to a set of library functions. We’ll use it to add intelligence on the memory allocation paths.

When we use LD_PRELOAD=lib, the symbols in the dynamic library that we specify in lib have priority over those of other libraries. Thus, if we include a malloc function definition, when the application calls malloc, it will call our malloc! In the image above, (c) demonstrates using LD_PRELOAD.

In malloclog, we can see a library that interposes on malloc and free, and prints information to standard error. This includes the following:
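
A minimal sketch of the malloc side (the actual malloclog source may differ in its details):

#define _GNU_SOURCE /* for RTLD_NEXT */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Compiled into a shared object, e.g. gcc -shared -fpic -o libmalloclog.so ... */
void *
malloc(size_t sz)
{
	/* Find the *next* malloc in link order: libc's, not ours */
	static void *(*real_malloc)(size_t) = NULL;
	void *mem;

	if (real_malloc == NULL) {
		real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
	}
	mem = real_malloc(sz);
	/* Careful: a real implementation must guard against recursion here,
	 * as fprintf may itself allocate memory. */
	fprintf(stderr, "malloc(%zu) -> %p\n", sz, mem);

	return mem;
}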

We do the same for free, and should do the same also for calloc and other memory allocation functions.

Question: Why can’t we just call malloc directly instead of doing this complex thing with dlsym?

We can use it on a simple program that calls malloc and then free on the allocated memory. To test this out, in malloclog:

$ make
$ LD_PRELOAD=./libmalloclog.so ./main.bin
malloc(4) -> 0x55fe6cd492a0
free(0x55fe6cd492a0)
...

This shows us that dynamic libraries are programmable and that we can use them in interesting ways to coordinate between libraries, provide new functionality, and interpose on existing functionality. For example, many of the largest companies use LD_PRELOAD to swap out the malloc implementation for one that is higher performance.

Questions:

  • What else can you think to do with the ability to interpose on any library functions?
  • How might we implement some of the valgrind functionality using the ability to interpose on library functions?

Note that programs that use the LD_PRELOAD environment variable are more difficult to run in gdb. The ldpreload_w_gdb.sh script demonstrates how to use both technologies together.

10 System Calls and Memory Management

We’ve talked about the many resources that are provided by the system. These include files, pipes, and domain sockets. But we’ve been vague about what the system is. Though we’ll leave the details for a later course, let’s dive into what constitutes the “system” a little bit.

10.1 System Calls

We’ve claimed many times that the system provides pipes, domain sockets, files, the file system, etc… What is “the system”? First, let’s start from what we know: is this system just a dynamic library? To some extent this makes sense.

Question:

  • The dynamic libraries we’ve seen provide many of the functions we’ve used, so why couldn’t they provide all of the functionality we require!? For example, why can’t read, open, etc… all be provided solely by dynamic libraries?

A few observations:

  1. Dynamic libraries are loaded into a process, but what provides the process itself?
  2. Some resources, like pipes, involve multiple processes, thus there must be something in the system beyond specific processes.
  3. Hardware is limited – a system only has a limited amount of DRAM (memory) – and something has to determine how to hand it out to processes.

The OS kernel’s relationship to processes and libraries. Each OS has a kernel, in this case the Linux Kernel. The kernel provides resources such as file systems, channels (domain sockets, pipes, etc…), and it implements processes (including fork, exec, wait, etc…). The kernel is separated from processes so that a failure in a process won’t spread beyond the process. This means that when a process wants to interact with the kernel, it must make a special system call. This behaves like a function call, but it maintains isolation of the kernel from the process. This is a system with four processes. The second is making a write system call, the third is using printf which is implemented in libc as part of the stream API in stdio.h. The stream implementation makes a write system call to send the data to the kernel (e.g. to display on our terminal). On the far right, we can see a process that doesn’t even have any libraries and makes system calls directly.

The system, as we’ve discussed it so far, is called the kernel. One of the main components of every operating system is the kernel. Every process in the system uses the kernel’s functionality. It provides the logic for most of the calls we’ve learned in the class, fork, exec, pipe, read, write, open, creat, etc… and also most of the resources we’ve discussed including channels, domain sockets, pipes, files, and the filesystem. It is some of the most trusted code on the system. If it has a bug, the whole system can crash – or worse!

We have to be able to call functions in the kernel from any process. This means we want to make what seem like function calls to the kernel. And we want to make sure that the faults of one process don’t spread to another! This means that somehow the kernel must be isolated from each process.

System calls are the mechanism that behaves like a function call (and looks like one to us!), but that switches over to the kernel. System calls are a special instruction in the instruction set architecture (Intel x86 or ARM) that triggers the switch to the kernel. When the kernel’s done with its functionality, it can return (via a special instruction) to the process.

So system calls behave like function calls, but they inherently involve switching from the process to the kernel and back.

Question:

  • So how does UNIX decide what functionality to put in a library, and what to put in the kernel?

10.1.1 System Calls and Processes

In UNIX, system calls are available in unistd.h, so whenever you include that file to use a function, you know the function is likely a direct system call! You can also tell if a UNIX function is a system call by looking at the sections of man (from man man):

MANUAL SECTIONS
    The standard sections of the manual include:

    1      User Commands
    2      System Calls
    3      C Library Functions
    4      Devices and Special Files
    5      File Formats and Conventions
    6      Games et. al.
    7      Miscellanea
    8      System Administration tools and Daemons

Questions:

  • Do man write. What is this documentation for?
  • Do man 2 write. What is this documentation for? What does the 2 mean?

    You can always tell which section you’re looking at in the header line of the man output: WRITE(2) shows that we’re reading about write in the “System Calls” section.

  • Is malloc a system call or a C Library Function?

We know what a system call looks like, as we’ve used them many times throughout the class. Let’s look at the various ways to make the same system call. First, the normal way:
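
The code isn’t shown in these notes, but it is likely similar to this minimal sketch:

#include <unistd.h>

int
main(void)
{
	/* write is a thin wrapper around the system call of the same name */
	write(STDOUT_FILENO, "hello world\n", 12);

	return 0;
}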

Program output:

hello world

The C standard library even provides you a “generic” way to invoke any system call. The syscall function lets you make an integer-identified system call. Each system call on the system is identified with a unique integer. In this case, we’ll use SYS_write.
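
A sketch of what that might look like:

#define _GNU_SOURCE /* for syscall() */
#include <sys/syscall.h>
#include <unistd.h>

int
main(void)
{
	/* Invoke system call number SYS_write directly, with write's arguments */
	syscall(SYS_write, STDOUT_FILENO, "hello world\n", 12);

	return 0;
}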

Program output:

hello world

Let’s look at a few of these integer values to see what’s happening here:
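
The program producing the output below is likely similar to this sketch, printing the numbers defined in sys/syscall.h:

#include <stdio.h>
#include <sys/syscall.h>

int
main(void)
{
	printf("write: %d\nread:  %d\nopen:  %d\nclose: %d\n",
	       SYS_write, SYS_read, SYS_open, SYS_close);

	return 0;
}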

Program output:

write: 1
read:  0
open:  2
close: 3

Note, these integer values are not related to descriptors (e.g. STDIN_FILENO == 0). The system calls in Linux (and in all operating systems!) are each assigned an integer with which to identify them. You can see all of the system calls and their integer numbers, and some documentation in man syscalls (link).

Exercise:

  • Find three interesting system calls that you didn’t know about.

OK, now let’s do a little more of a deep dive. You might not be familiar with Intel’s/AMD’s x86-64 assembly language that follows, but the comments on the right give you the gist.

We compile this assembly with make (in 10/). Program output:

hello world

We can see that the requirements of making a system call are quite minimal! A special instruction (on x86-64, syscall) provides the magical incantation to make what feels like a function call to the kernel.

How does this work? Recall that ELF, the file format for objects and programs, includes many pieces of information necessary to execute the program. Among them is the starting or entry address, which is the initial instruction to start executing when the program runs.

$ readelf -a ./asm_syscall  | grep "Entry"
  Entry point address:               0x401000
$ readelf -a ./asm_syscall  | grep "401000"
  Entry point address:               0x401000
...
     6: 0000000000401000     0 NOTYPE  GLOBAL DEFAULT    1 _start

We can see that the “entry point address” for the binary is 0x401000, and we can see that the _start symbol has that address. The exec logic of the kernel will set the program to start executing at that _start address.

Note that this same analysis rings true for the normal C system call code above (not only for the assembly)! The _start function is where all execution starts in each program. This explains why our program above labels its code with _start!

10.2 Observing System Calls

The strace program lets us monitor the system calls a program makes. Running it on the assembly program above:

$ strace ./asm_syscall
execve("./asm_syscall", ["./asm_syscall"], 0x7fff3c06ae10 /* 50 vars */) = 0
write(1, "hello world\n", 12hello world
)           = 12
exit(0)                                 = ?
+++ exited with 0 +++

This likely has what you’d expect: We see the exec that kicks off the program, the write, and, finally, the exit. You’d likely expect the same from the normal C program that also writes out hello world.

$ strace ./normal_syscall
execve("./normal_syscall", ["./normal_syscall"], 0x7ffc036c1520 /* 50 vars */) = 0
brk(NULL)                               = 0x5638ed59d000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffd8bfb49c0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=138102, ...}) = 0
mmap(NULL, 138102, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7efddf831000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360A\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\237\333t\347\262\27\320l\223\27*\202C\370T\177"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2029560, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efddf82f000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\237\333t\347\262\27\320l\223\27*\202C\370T\177"..., 68, 880) = 68
mmap(NULL, 2037344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7efddf63d000
mmap(0x7efddf65f000, 1540096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7efddf65f000
mmap(0x7efddf7d7000, 319488, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19a000) = 0x7efddf7d7000
mmap(0x7efddf825000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7efddf825000
mmap(0x7efddf82b000, 13920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7efddf82b000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7efddf830540) = 0
mprotect(0x7efddf825000, 16384, PROT_READ) = 0
mprotect(0x5638eb775000, 4096, PROT_READ) = 0
mprotect(0x7efddf880000, 4096, PROT_READ) = 0
munmap(0x7efddf831000, 138102)          = 0
write(1, "hello world\n", 12hello world
)           = 12
exit_group(0)                           = ?
+++ exited with 0 +++

Whaaaaa!? If you look carefully, at the bottom we see the expected output. But we also see a lot of system calls prior to that. Between when we start execution at _start and when we execute in main, quite a lot happens.

Question:

  • What do you think is causing all of these system calls? What is the program doing before running main?

Why are there so many system calls in a simple C program? Let’s compare what happens in the C versus assembly programs. The process on the left is normal_syscall above, while asm_syscall is on the right. The assembly program avoids much of the C runtime support, thus starts execution at _start, and makes system calls directly. In contrast, the C program starts execution in the dynamic linker ld (i.e. at _start in ld) – step (1). ld links in the libc dynamic library (2), then executes _start in our program (3). Our program actually calls libc (4) before executing main to do library initialization, which then calls main. After main returns (5), it returns into libc, which then terminates our program with the _exit system call. So why are there so many system calls made in the C program? There’s a lot of execution that happens outside of our logic! Most of the system calls are from ld.

10.3 System Call Overhead

There are many reasons why some functionality might be in a library, and accessed via function calls, or in the kernel, and accessed with system calls.

Let’s look at an example of memory allocation. We can, of course, use malloc and free to allocate and free memory. But we can also call mmap and munmap, which make system calls to allocate and free memory.

Questions:

  • Which will be faster, malloc/free or mmap/munmap?
  • How much overhead would you guess a system call has?
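
The measurements below come from a microbenchmark roughly like the following sketch (assumptions: x86-64, a cycles function built on the rdtsc time-stamp counter, and error handling elided; a careful benchmark would also serialize the pipeline):

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <x86intrin.h> /* __rdtsc is x86-specific */

#define ITER 1024

/* How many cycles have elapsed since the system booted? */
static inline unsigned long long
cycles(void)
{
	return __rdtsc();
}

int
main(void)
{
	unsigned long long start, end;
	int i;

	start = cycles();
	for (i = 0; i < ITER; i++) free(malloc(4096));
	end = cycles();
	printf("malloc + free overhead (cycles): %llu\n", (end - start) / ITER);

	start = cycles();
	for (i = 0; i < ITER; i++) {
		void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		munmap(mem, 4096);
	}
	end = cycles();
	printf("mmap + munmap overhead (cycles): %llu\n", (end - start) / ITER);

	return 0;
}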

Program output:

malloc + free overhead (cycles): 422

mmap + munmap overhead (cycles): 9037

What is a “cycle”? When you read that you have a 3GHz processor, that essentially says that the internal “clock” that drives the logic in the processor cycles 3 billion times per second. That’s a lot. Each cycle takes 1/3 of a nanosecond. A cycle is the smallest unit of measurement on a processor, and the cycles function we implement here gives us how many cycles have elapsed since the system booted.

We can also look at the output APIs like write and compare them to the stream APIs that buffer their output.

Questions:

  • If we write a byte to a file using write or to a stream (via a FILE) using fwrite, what performance would you expect?
  • How about if we did an fflush after the fwrite?

Program output:

write overhead (cycles): 3758

fwrite (stream) overhead (cycles): 40

fwrite + fflush overhead (cycles): 3756

10.4 Library vs. Kernel Trade-offs in Memory Allocation

We’ve seen that

  1. the kernel is there to facilitate access to hardware resources (memory, processing, etc…),
  2. the kernel provides processes (fork, exec) and inter-process interactions (wait, pipe, etc…) as these functionalities require implementation beyond any single process,
  3. functions in dynamic libraries are much faster to invoke than system calls, which have significant overhead.

This leads us to the conclusion that for performance, we’d like to implement as many functions as possible in libraries, and to rely on the kernel when processes require modifications (exec, fork), when there are multiple processes involved (IPC), or when the finite resources of the system must be split up between processes (open, creat, sbrk).

Here, we’re going to dive into the design of the memory management functionalities in UNIX to better understand the trade-offs. Specifically, how do malloc and free work!?

10.4.1 Library Tracking of Memory

So malloc, calloc, realloc, and free are implemented in a library (in libc), and when we request memory from them, they might call sbrk (or mmap on some systems) to get memory from the kernel, but from that point on, they will manage that memory themselves. What does this mean? When the heap doesn’t have enough free memory to satisfy a malloc request, we call the kernel to expand the heap. To do so, we ask the kernel for memory via sbrk or mmap, and we generally get a large chunk of memory. For example, we might ask sbrk for 64KB of memory with mem = sbrk(1 << 16). Then it is malloc’s job to split up that memory into the smaller chunks that we’ll return in each subsequent call to malloc. When memory is freed, it is returned to the pool of memory that we’ll consider for future allocations. Because we’re reusing freed memory, we’re avoiding system calls when the heap contains enough freed memory to satisfy a request.

In (a), we see the heap with red rectangles that designate allocated memory. (b) we get a malloc request. First we check to see if there is any span of unallocated memory we can use to satisfy the request. (c) shows what we must avoid: the overlapping of two different allocations. As we cannot find a sufficient span of memory to satisfy the malloc request, in (d), we expand the heap using sbrk. (e) shows how we can satisfy the request from the newly expanded heap memory.

This is what it means for our malloc/free code in the libc library to track the memory in the heap. We need to track where allocated memory is located and where freed spans of memory are located, to enable intelligent allocation decisions for future malloc calls.

Questions:

  • What data-structures should we use to track allocated/freed memory?
  • Do you foresee any challenges in using data-structures to track memory?

10.4.2 Malloc Data-structures

The malloc and free implementation must track which chunks of memory are allocated, and which are freed, but it cannot use malloc to allocate data-structures. That would result in malloc calling itself recursively, and infinitely! If malloc requires memory allocation for its tracking data-structures, who tracks the memory for the tracking data-structures? Yikes.

We solve this problem by allocating the tracking data-structures along with the actual memory allocation itself. That is to say, we allocate a header – a struct defined by malloc – directly before an actual memory allocation. Because the header is part of the same contiguous memory chunk, we allocate the header at the same time as we allocate memory! When memory is freed, we use the header to track the chunk of memory (the header plus the free memory directly following the header) in a linked list of free chunks of memory called a freelist.

Memory allocations through various sets of details. The red boxes are allocated memory, the white box is memory that we’ve retrieved in the heap from sbrk, the green boxes are the header data-structures that track free chunks of memory we can use for future allocations (they are arranged in a linked list called the freelist), and the orange boxes are headers for allocated chunks of memory that track how large each allocation is so that, when we free it, we know how large the chunk of freed memory will be.
    1. the heap with some memory allocated, and some free spans of memory.
    2. we need a data-structure to track where the free spans of memory are, so we add a structure into each of the free chunks, and use it to construct a linked list of the free chunks.
    3. When we malloc, we can find a free chunk large enough for the allocation, remove it from the freelist, …
    4. …and return it!
    5. If we want to free that memory, we must somehow understand (i.e. must have a structure that tracks) the size of the allocation, so that we can add it back onto the freelist.
    6. We can track the size of each allocation in a header directly before the memory we return. We are tracking the memory in the same contiguous span of memory we use for the allocation!
    7. So when we free, we’re just adding that memory back onto the freelist.
    8. If we ever receive a malloc request that can’t be satisfied by the freelist, only then do we make a system call to allocate more heap.
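
Concretely, such a header might look something like the following sketch (field layouts differ across real allocators, and smalloc below simplifies this further):

#include <stddef.h>

/* Tracking data placed directly before each chunk of memory. */
struct header {
	size_t         size; /* how large is the chunk following this header? */
	struct header *next; /* when free: the next chunk in the freelist */
};

/* The freelist threads through the free chunks themselves. */
static struct header *freelist;

/* malloc returns the memory just past a chunk's header, i.e. (void *)(h + 1);
 * free recovers the header from a pointer with ((struct header *)ptr) - 1. */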

10.4.3 Exercise: smalloc

Look at the 10/smalloc.c (link) file for a simple implementation of a malloc/free that assumes only allocations of a specific size. This means that we don’t need to track the sizes of allocations and free chunks in the headers, simplifying our implementation. Build, and run the program:

$ make
$ ./smalloc.bin

It currently implements this simple malloc! Read through the smalloc and sfree functions, and the main. Answer QUESTION 1-3. Then, move on to QUESTION 4. In expand_heap, we make the system call to allocate more memory from the system with sbrk. We need to set up that newly allocated heap memory to include memory chunks that we can allocate with smalloc. Answering how this is done is the core of QUESTION 4.

11 UNIX Security: Users, Groups, and Filesystem Permissions

This material is adapted from Prof. Aviv’s material.

We’ve seen how the filesystem is used to store and name files and channels, and we’ve seen the APIs to access directories and files. If everyone had access to everyone else’s files, then there would be no way to secure your information. Thus, the kernel provides means to control different users’ access to different files and directories. This means that I cannot access your files, and vice-versa!

To understand the core concept of access control in UNIX, we have to understand

  1. how UNIX thinks about, implements, and organizes users and groups of users,
  2. how files and directories provide various access permissions to different users and how to alter those permissions, and
  3. how services, even when executed by a user, can have heightened access to system resources.

11.1 File and Directory Permissions

The filesystem stores files and directories that are somehow associated with each of us, separately. Let’s check out the properties of a file that are returned by running ls -l:

.- Directory?
|    .-------Permissions                   .- Directory Name
| ___|___     .----- Owner                 |
v/       \    V     ,---- Group            V
drwxr-x--x 4 aviv scs 4096 Dec 17 15:14 im_a_directory
-rw------- 1 aviv scs 400  Dec 19  2013 .ssh/id_rsa.pub
                       ^   \__________/    ^
File Size -------------'       |           '- File Name
  in bytes                     |
                               |
   Last Modified --------------'

An example from the class’ server where we create a file, a directory, and see the results:

$ echo hi > file.txt
$ mkdir hello
$ ls -l
-rw-r--r-- 1 gparmer dialout    3 Mar 18 12:19 file.txt
drwxr-xr-x 2 gparmer dialout 4096 Mar 18 12:20 hello

We see that each file and directory has an owner and a group associated with it.

Question:

  • What do you think this means? Why do files have owners and groups?

11.1.1 File Ownership and Groups

The owner of a file is the user that is directly responsible for the file and has special status with respect to the file permissions. Users can also be collected together into a set called a group: a collection of users who possess the same permissions. A file is associated with a group. A file’s or directory’s owner and group have different permissions to access the file. Before we dive into that, let’s understand users and groups.

You all are already aware of your username. You can get your username with the whoami command:

$ whoami
gparmer

To have UNIX tell you your username and connection information on this machine, use the command, who:

$ who
gparmer  pts/1        2022-03-18 12:12 (128.164...)

The first part of the output is the username. The rest of the information in the output refers to the terminal you’re using (it is a “pseudo-terminal device”, something that pretends it is a typewriter from the 60s), the time the terminal was created, and from which host you are connected.

You can determine which groups you are in using the groups command.

$ groups
dialout CS_Admins

On the class’ server, I’m in the dialout and CS_Admins groups. On my work computer, I’m in a few more groups:

$ groups
ycombinator adm cdrom sudo dip plugdev lpadmin sambashare

The good news is that I can use my cdrom. Fantastic. It will pair well with my pseudo-typewriter. Cool.

11.1.2 File Permissions

The permissions associated with each file/directory specify who should be able to have the following access:

  • r - the ability to read the file, or read the contents of a directory.
  • w - the ability to write to the file, or update the contents of a directory.
  • x - the ability to execute a file (i.e. a program), or chdir into a directory.

We think about different users as having different sets of these permissions to access files and directories. When we see the permissions string for a file, we see three sets of these rwx permissions:

-rw-r--r-- 1 gparmer dialout    3 Mar 18 12:19 file.txt

Here we have three sets of permissions: rw- (read and write), r-- (read only), and r-- (read only). These three sets specify, in order, the access permissions for the owner, the group, and everyone else (the “global” permissions).

 .-- Directory Bit
|
|       ,--- Global Permission (i.e. "everyone else")
v      / \
-rwxr-xr-x
 \_/\_/
  |  `--Group Permission
  |
   `-- Owner Permission

Given this, now we can understand the permissions for each file in:

$ ls -l 11/perm_example/
total 0
-rw-r--r-- 1 gparmer gparmer 0 Mar 25 18:43 bar
-rw-rw-r-- 1 gparmer gparmer 0 Mar 25 18:43 baz
-rwx------ 1 gparmer gparmer 0 Mar 25 18:42 foo

Question:

  • Who has what access permissions to which files?

11.1.3 Numerical Representation of Permissions

Each permission has a numerical representation. We represent it in octal, which is a fancy way to say that we think of each digit of our numerical representation as being between 0-7 – it is “base 8” (in contrast to the normal “base 10” numbers we use every day). Thus, each of the three sets of permissions fits in 3 bits – one octal digit.

Each of the three permissions, r, w, and x, corresponds to a bit, from the most-significant to the least-significant bit in each octal digit. For example:

rwx -> 1 1 1 -> 7
r-x -> 1 0 1 -> 5
--x -> 0 0 1 -> 1
rw- -> 1 1 0 -> 6

On the left we see a permission, its bit representation, and the associated octal digit. As shorthand, you can think of it like this:

  • r = 4 as it is represented with the most significant bit
  • w = 2 as the second bit
  • x = 1 as it is the least significant bit

Thus each octal digit is simply the sum of the numeric values of the granted access rights. We can combine these digits to encapsulate permissions for the owner, group, and everyone else:

-rwxrwxrwx -> 111 111 111 -> 7 7 7
-rwxrw-rw- -> 111 110 110 -> 7 6 6
-rwxr-xr-x -> 111 101 101 -> 7 5 5

So now we know how to interpret, and understand these permission values, and how to represent them as digits.

11.1.4 Updating File/Directory Permissions

To change a file’s permissions, you use the chmod command, or the chmod system call. In each, we specify the file to update, and the new permissions in octal. Let’s play around with permissions and see their impact:

$ gcc helloworld.c -o helloworld.bin
$ ls -l helloworld.bin
-rwxrwxr-x 1 gparmer gparmer 41488 Mar 22 08:52 helloworld.bin
$ ./helloworld.bin
hello world!

I can execute the program, and I could read the program, for example using nm. I could even overwrite the file.

Note: on the server, the default permissions are more restrictive – -rwxr-xr-x 1 gparmer dialout 16464 Mar 25 19:16 helloworld.bin. We don’t want anyone other than us writing to the file on a shared server!

What if I removed all permissions for everyone but the owner, and even removed my own ability to execute?

$ chmod 600 helloworld.bin
$ ls -l helloworld.bin
-rw------- 1 gparmer gparmer 41488 Mar 22 08:52 helloworld.bin
$ ./helloworld.bin
-bash: ./helloworld.bin: Permission denied

We removed our own ability to execute the file, so we no longer can! Note that permission 6 is 4 + 2, which is r + w. If I wanted everyone on the system to be able to execute the program:

$ chmod 711 helloworld.bin
$ ls -l helloworld.bin
-rwx--x--x 1 gparmer gparmer 41488 Mar 22 08:52 helloworld.bin
$ ./helloworld.bin
hello world!

Recall, 7 = 4 + 2 + 1, which is r + w + x. We can see that we’ve given everyone on the system execute permissions, so the file is again able to execute! Any other user on the system would also be able to execute the program.
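
The chmod system call makes the same kind of update from within a program. A minimal sketch, assuming helloworld.bin is in the current directory:

#include <sys/stat.h>
#include <stdio.h>

int
main(void)
{
        /* 0711 = rwx--x--x; the leading 0 makes this an octal literal in C */
        if (chmod("helloworld.bin", 0711) == -1) {
                perror("chmod");
                return 1;
        }

        return 0;
}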

Questions:

  • How do you think permissions change access to directories? Try out setting the permissions of a directory (recall mkdir), and seeing what accesses cannot be made if permissions are removed.
  • You and a classmate should sit down together, and use chmod to concretely understand when you can access each other’s files. Remember, you can find out where your files are with pwd, and what groups you’re in with groups. Which files can another user access at that path, given sufficient permissions? You can use the /tmp/ directory to store files that you can both mutually access, permissions allowing.

11.1.5 Changing File/Directory Owner and Group

We’ve seen that each file and directory is associated with a user, and with a group. This means that, of course, we can change those settings! There are two commands (and their corresponding system call: chown):

  • chown <user> <file/directory> - change owner of the file/directory to the user
  • chgrp <group> <file/directory> - change group of the file to the group

UNIX limits who is able to execute these operations.

  • Only a superuser called root is allowed to chown a file or directory. root is special in that they are able to change the permissions of any file/directory, and most of the most sensitive files on the system are owned by root. For example, the devices (think, hard-drives holding the filesystem, monitors, and network devices) are only accessible by root (check out /dev/*).
  • Only the owner of a file (and root) is allowed to change the group of a file, and only to a group that user is in (recall groups).

We can see this:

$ chown root helloworld.bin
chown: changing ownership of 'helloworld.bin': Operation not permitted
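
The corresponding system call fails in the same way. A minimal sketch; run as a normal user, this chown call should fail with EPERM:

#include <unistd.h>
#include <stdio.h>

int
main(void)
{
        /* attempt to give the file to root (uid 0); -1 leaves the group alone */
        if (chown("helloworld.bin", 0, -1) == -1) {
                perror("chown"); /* as a normal user: "Operation not permitted" */
                return 1;
        }

        return 0;
}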

Question:

  • Why do you think we aren’t allowed to change the owner of a file, nor change the group to one in which we don’t belong?

11.2 Security and Programming with Users and Groups

We’ve seen that a lot of UNIX security centers around users, groups, and the corresponding permissions associated with files and directories. When a process is forked, it inherits the user and group of the parent process. Thus, when you’re typing at a shell, the shell process is what really is executing as your user id, and when you create programs, they inherit your user id. These are the core of the security primitives, so let’s dive into a little more detail.

11.2.0.1 APIs for Accessing a Program’s Privilege Settings

If every process has a user and group, we must be able to access them, right? There are two basic system calls for retrieving user and group information within a program.

  • uid_t getuid(void) - Returns the real user id of the calling process.
  • gid_t getgid(void) - Returns the real group id of the calling process.

Let’s look at an example program.
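
A minimal sketch of such a program:

#include <unistd.h>
#include <stdio.h>

int
main(void)
{
        printf("uid=%d gid=%d\n", (int)getuid(), (int)getgid());

        return 0;
}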

Program output:

uid=1000 gid=1000

Every user has different identifiers, as does each group. If another user were to run the same program, they’d get a different value.

Question:

  • In groups, one of you compile this program, and the other run it! Use chmod to make sure that they can! What do you observe?
  • If the kernel thinks of users as numeric identifiers, how does whoami print out a human-readable string? Use strace whoami to come up with a strong guess.
  • How do we get human-readable groups with groups (investigate with strace groups)?

Interestingly, whoever executes the program will have their uid/gid printed out; it won’t use the uid/gid of the owner of the program. Remember that we inherit the uid/gid on fork, thus when the shell runs the program, it uses our shell’s (thus our) uid/gid.

11.2.1 User and Group Strings

If the UNIX programmatic APIs provide access to user and group identifiers (i.e. integers), how is it that we can see strings corresponding to these numeric values? The mapping of the identifiers to strings is defined in two places. The first is a file called /etc/passwd which manages all the users of the system. Here is the /etc/passwd entry on my local system (the server uses another means to store user info):

$ cat /etc/passwd | grep gparmer
gparmer:x:1000:1000:Gabe Parmer,,,:/home/ycombinator:/bin/bash
   |    |   |     |      |                 |             |
   V    |   V     |      V                 V             V
 user   | user id |  human-readable id  home directory  shell to use upon login
        V         V
    password    group id

We can lookup our username and associate it with a numerical user id, and a group id for that user. These numbers are what the system uses to track users, but UNIX nicely converts these numbers into names for our convenience. The file additionally holds the path to the user’s home directory, and their shell. These are used upon the user logging in: the login logic will switch to the user’s home directory, and execute their shell. The password entry is deprecated. The translation between userid and username is in the password file. The translation between groupid and group name is in the group file, /etc/group. Here is the entry for the administrators on the class’ server:

$ grep CS_Admins /etc/group
CS_Admins:x:1004:aaviv,timwood,gparmer,...

There you can see that the users aaviv, timwood, and gparmer are all in the group of administrators.
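
This is also a strong hint for how whoami and groups can print human-readable names: the C library provides lookup functions over this user and group information. A sketch using getpwuid and getgrgid:

#include <pwd.h>
#include <grp.h>
#include <unistd.h>
#include <stdio.h>

int
main(void)
{
        struct passwd *pw = getpwuid(getuid()); /* our /etc/passwd entry */
        struct group  *gr = getgrgid(getgid()); /* our /etc/group entry */

        if (!pw || !gr) return 1;
        printf("user=%s group=%s home=%s shell=%s\n",
               pw->pw_name, gr->gr_name, pw->pw_dir, pw->pw_shell);

        return 0;
}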

All of this information, including identifiers, and human-readable strings can be retrieved with id.

$ id
uid=1000(gparmer) gid=1000(gparmer) groups=1000(gparmer),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),120(lpadmin),131(lxd),132(sambashare)

Question: Take a moment to explore these files and the commands. See what groups you are in.

11.3 Application-Specific Privileges using set-user-id/set-group-id

Systems that support multiple users are quite complex. They have services used by many users, but those services must be protected from the potentially accidental or malicious actions of users. These programs require more privileges than a normal user in order to do their job. A few examples:

  • Logging is a very important functionality in systems. It enables services to log actions that are taken, for example, users logging in to ssh and clients connecting to webservers. The logs are stored in files (often in /var/log/), and users must not be able to directly modify (and potentially corrupt) the logs. But they must be able to append to the logs. How can we support this contradiction when a logging process will run with our user id when we execute it!?
  • Coordinating actions such as running commands regularly, or at specific points in time. The programs at and cron enable exactly this! They require adding entries into system files (e.g. crontab) to record the actions, but one user must not corrupt the entries of another user. Again, we want to be able to add entries to the files, but not to modify existing entries!

We already know one way to solve this problem. We can write a service, and have clients request (e.g. via IPC and domain sockets) that it perform actions on our behalf. So long as a service runs as a different user, its files will be inaccessible to clients.

However, there is another way to solve the above contradictions. We really want a program to run with a different user’s permissions – those that go beyond our own. We want the logger to run with the permissions of the logging user, and we want cron to run with the permissions of the cron service. This is possible with the set-uid-bits (and set-gid-bits).

The set-*-bits can be seen in the man chmod manual page. We’re used to permissions being three digits (for the owner, the group, and everyone else), but it can be specified as four digits. The most significant digit can be set on executable binaries, and will result in the binary executing with the owner’s uid/gid.

That is, we previously assumed a permission string contained 3 octal digits, but really there are 4 octal digits. The missing octal digit is that for the set-bits. There are three possible set-bit settings and they are combined in the same way as other permissions:

  • 4 - set-user-id: sets the program’s effective user id to the owner of the program
  • 2 - set-group-id: sets the program’s effective group id to the group of the program
  • 1 - the sticky bit: which, when set on a directory, prevents others from deleting files from that directory. We won’t cover the sticky bit.

These bits are used in much the same way as the other permission modes. For example, we can change the permission of our get_uidgid program from before like so:

chmod 6751 get_uidgid

And we can interpret the octal digits like so:

set      group
bits user |  other
  |   |   |   |
  V   V   V   V
 110 111 101 001
  6   7   5   1

When we look at the ls -l output of the program, the permission string reflects these settings with an “s” in the execute part of the string for user and group.

$ ls -l get_uidgid
-rwsr-s--x 1 aviv scs 8778 Mar 30 16:45 get_uidgid

11.3.0.1 Real vs. Effective Capabilities

With the set-bits, when the program runs, the capabilities of the program are effectively that of the owner and group of the program. However, the real user id and real group id remain that of the user who ran the program. This brings up the concept of effective vs. real identifiers:

  • real user id (or group id): the identifier of the actual user who executed a program.
  • effective user id (or group id): the identifier for the capabilities or permissions settings of an executing program.

The system calls getuid and getgid return the real user and group identifiers, but we can also retrieve the effective user and group identifiers with

  • uid_t geteuid(void) - return the effective user identifier for the calling process.
  • gid_t getegid(void) - return the effective group identifier for the calling process.

We now have enough to test set-x-bit programs using the following program that prints both the real and effective user/group identities.
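
A sketch of such a program:

#include <unistd.h>
#include <stdio.h>

int
main(void)
{
        printf(" uid=%d  gid=%d\n", (int)getuid(),  (int)getgid());
        printf("euid=%d egid=%d\n", (int)geteuid(), (int)getegid());

        return 0;
}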

As the owner of the file, after compilation, the permissions can be set to add set-user-id:

$ gcc 11/get_uidgid.c -o get_uidgid
$ chmod 6755 get_uidgid
$ ls -l get_uidgid
-rwsr-sr-x   1 gparmer        gparmer     16880 2022-03-29 08:30 get_uidgid

Notice the s values to denote the set-*-bits.

Let’s test this set-uid stuff!

$ cp get_uidgid /tmp/
$ chmod 4755 /tmp/get_uidgid

Now you should all have access to the file, with it being set as set-uid.

Question:

11.3.0.2 Programmatically Downgrading/Upgrading Capabilities

The set-bits automatically start a program with the effective user or group id set; however, there are times when we might want to downgrade privilege or change permission dynamically. There are two system calls to change user/group settings of an executing process:

  • setuid(uid_t uid) - change the effective user id of a process to uid
  • setgid(gid_t gid) - change the effective group id of a process to gid

The requirement of setuid() (for all users other than root) is that the effective user id can be changed to the real user id of the program, or to an effective user id as described in the set-bits. The root user, however, can downgrade to any user id and upgrade back to the root user. For setgid(), the user can change the group id to any group the user belongs to, or as allowed by the set-group-id bit.

Now we can look at a program that downgrades and upgrades its privilege dynamically:
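
A sketch of what such a program might look like, here using seteuid (a sibling of setuid that changes only the effective user id):

#include <unistd.h>
#include <stdio.h>

int
main(void)
{
        uid_t real = getuid();
        uid_t priv = geteuid(); /* with set-uid, this is the binary's owner */

        printf("euid=%d\n", (int)geteuid());

        /* downgrade: act with the real user's permissions */
        if (seteuid(real) == -1) return 1;
        printf("downgraded: euid=%d\n", (int)geteuid());

        /* upgrade: return to the owner's permissions (the saved id allows it) */
        if (seteuid(priv) == -1) return 1;
        printf("upgraded: euid=%d\n", (int)geteuid());

        return 0;
}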

Question:

  • What do you think will happen if you have a peer run it on the server after you set the set-uid bit?

11.4 Applications of Identity and Permission Management

11.4.1 Logging Events: Selectively Increasing Access Rights

We previously covered the general desire to have a log in the system of the events for different activities. For example, logging when each user logs in, when clients make web requests of an http server, etc…

We might support two high-level operations:

  1. adding a logging event, and
  2. reading the log.

If we don’t care to censor the log, we can support the second very easily by, for example, making the log file’s group the log group, and giving read (4) permissions to the log for anyone in that group. Now any user in the log group can read the log.

However, to support adding logging events to the log, we might use the set-uid-bit to ensure that a log program runs with the effective user-id of the logger user. Thus, it will be able to append to the log in a straightforward manner.

11.4.2 Logging In: Carefully Decreasing Access Rights

One example of how you might want to use the identity management APIs is when logging in. The process of logging in requires that the login program access sensitive files that include (encrypted) user passwords (/etc/shadow). Understandably, users don’t generally have access to these files. Thus, the act of logging in requires programs running as root to read in username and password. Only if they match the proper values for a user, will the login logic then change uid/gid to the corresponding user and group id, and execute a shell for the user (as the user). See the figure for more details.

The login procedure on a system. On a system that uses a service manager like systemd, it will ensure that terminals are available (either attached to the keyboard/monitor, or remote) by executing the getty, or “get tty”, program (or ssh for remote, or gdm for graphical) which makes available a terminal for a user to interact with. getty executes the login program that reads the username and password from the user. If it is unable to find a match, it exits, letting systemd start again. If the username and password match what is in the /etc/passwd and /etc/shadow (password) files, it will change the user and group ids to those for the user in the /etc/passwd file, change to the home directory specified in that same file, and exec the corresponding shell.

11.4.3 sudo and su: Increasing Access Rights

There are many times in systems that we want to increase the privileges of a human. For example, for security, it is often not possible to login as root. But it is certainly true that sometimes humans need to be able to run commands as root. For example, updating some software, installing new software, creating new users, and many other operations, all require root.

To control who is able to execute commands as root, the sudo and su programs enable restricted access to root privileges. Specifically, sudo and su will execute a command as a specified user or switch to a specified user. By default, these commands execute as the root user, and you need to know the root password or have sudo access to use them.

We can see this as the case if I were to run the get_euidegid program using sudo. First notice that it is no longer set-group or set-user:

$ sudo ./get_euidegid
[sudo] password for gparmer:
 uid=0  gid=0
euid=0 egid=0

After sudo authenticated me, the program’s effective and real user identification become 0, which is the uid/gid of the root user.

11.4.3.1 sudoers

Who has permission to run sudo commands? This is important because on many modern unix systems, like ubuntu, there is no default root password.

Question:

  • Why don’t we just have a root account with a password instead of having sudo and su?

Instead, certain users are deemed to be sudoers or privileged users. These are set in a special configuration file called /etc/sudoers.

aviv@saddleback: lec-23-demo $ cat /etc/sudoers
cat: /etc/sudoers: Permission denied
aviv@saddleback: lec-23-demo $ sudo cat /etc/sudoers
#
# This file MUST be edited with the 'visudo' command as root.
#
# Please consider adding local content in /etc/sudoers.d/ instead of
# directly modifying this file.
#
# See the man page for details on how to write a sudoers file.
#
Defaults    env_reset
Defaults    mail_badpass
Defaults    secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Host alias specification

# User alias specification

# Cmnd alias specification

# User privilege specification
root    ALL=(ALL:ALL) ALL

# Members of the admin group may gain root privileges
%admin ALL=(ALL) ALL

# Allow members of group sudo to execute any command
%sudo   ALL=(ALL:ALL) ALL

# See sudoers(5) for more information on "#include" directives:

#includedir /etc/sudoers.d

Notice that only root has access to read this file, and since I am a sudoer on the system I can get access to it via sudo. If you look carefully, you can perform a basic parse of the settings. The root user has full sudo permissions, and other sudoers are determined based on group membership. Users in the sudo or admin group may run commands as root, and I am a member of the sudo group:

$ id
uid=1000(gparmer) gid=1000(gparmer) groups=1000(gparmer),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),120(lpadmin),131(lxd),132(sambashare)

Question:

  • How do you think that sudo is implemented? What mechanisms from this week’s lecture is it using to provide the requested functionality?

11.4.3.2 Assignment Submission

What if we implemented an assignment submission program that copied a student’s directory into the instructor’s home directory? For example, if you wanted to submit the directory of code at my_hw/ for homework HW1,

$ ./3410_submit HW1 my_hw

This program, run by you, has your user and group permissions, but it is able to take your submission and copy/save those submissions to my home directory, with my permissions at a location where you do not have access to write. How is that possible?

Naively, we’d have a problem:

  • The submission program is written and provided by the instructor.
  • The students would be executing the program, thus it would execute with their identity.
  • We don’t want the submitted data to be visible to any students.
  • However, if the students run the submission program, it will run as their uid/gid, and 1. it won’t have access to the instructor’s home directory, and 2. any files it creates (during the copy) will belong to the user.

Yikes.

Question:

  • How could we go about implementing a homework submission program?

12 Security: Attacks on System Programs

This material is adapted from Prof. Aviv’s material.

It is an unfortunate truth of security that all programs have faults because humans have faults — humans write programs. As such, we will take some time to understand the kinds of mistakes you, me, and all programmers may make that can lead to security violations.

This is a broad topic area, and we only have a small amount of time to talk about it. We will focus on three classes of attack that are very common for the kinds of programs we’ve been writing in this course.

  • Path Lookup Attacks: Where the attacker leverages path lookup to compromise an executable or library
  • Injection Attacks: Where an attacker can inject code, usually in the form of bash, into a program that will get run.
  • Overflow Attacks: Where the attacker overflows a buffer to alter program state.

In isolation, each of these attacks can just make a program misbehave; however, things get interesting when an attack can lead to privilege escalation. Privilege escalation enables a user to maliciously access resources (e.g. files) normally unavailable to the user, by exploiting a program with heightened privilege (e.g. due to setuid bits, or explicit id upgrades) to perform arbitrary tasks. The ultimate goal of an attacker is to achieve privilege escalation to root, which essentially gives them access to all system resources.

You might think that the kernel has more privilege than the root user. However, the root user is given access to the /dev/mem pseudo-device that is “all of memory” including all of the kernel’s memory, and root has the ability to insert dynamic libraries (called “kernel modules”) into the kernel to dynamically update kernel code! Being root truly is a super-power!

Each of these topics is quite nuanced, and the hope is to give you a general overview so you can explore the topic more on your own.

12.1 Path Attacks

We’ve been using path lookup throughout the class. Perhaps the best example is when we are in the shell and type a command:

$ cat Makefile
BINS=$(patsubst %.c,%,$(wildcard *.c))

all: $(BINS)

%: %.c
        gcc -o $@ $<

clean:
        rm -rf $(BINS)

The command cat is run, but the program that actually cats the file exists in a different place in the file system. We can find that location using the which command:

$ which cat
/bin/cat

So when we type cat, the shell looks for the cat binary, finds it in /bin/cat, and executes that. Recall from when we discussed environment variables that binaries are found by looking in all of the directories in the PATH environment variable.

$ echo $PATH
/home/gparmer/.local/bin/:/home/gparmer/.local/bin:/home/gparmer/.cargo/bin:/home/gparmer/.local/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/gparmer/local/bin:/usr/racket/bin/:/home/gparmer/local/bin:/usr/racket/bin/

Let’s look at that by converting the :s, which separate the directories, into newlines:

$ echo $PATH | tr : "\n"
/home/gparmer/.local/bin/
/home/gparmer/.local/bin
/home/gparmer/.cargo/bin
/home/gparmer/.local/bin/
/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin
/usr/games
/usr/local/games
/snap/bin
/home/gparmer/local/bin
/usr/racket/bin/
/home/gparmer/local/bin
/usr/racket/bin/

Each of the directories listed is searched, in order, until the command’s program is found.

Environment variables are global variables set across programs that provide information about the current environment. The PATH environment variable is a perfect example of this, and of the customizability of environment variables. If you look at the directories along my path, I have a bin directory in my home directory so I can load custom binaries. The fact that I can customize the PATH in this way can lead to some interesting security situations.

12.1.1 system()

To help this conversation, we need to introduce a library function that works much like execvp() combined with a fork(), like we’ve done all along, but more compact. Here’s an abridged description from the manual page:

NAME
       system - execute a shell command

SYNOPSIS
       #include <stdlib.h>

       int system(const char *command);

DESCRIPTION

       system() executes a command specified in command by calling
       /bin/sh -c command, and returns after the command has been
       completed.  During execution of the command, SIGCHLD will be
       blocked, and SIGINT and SIGQUIT will be ignored.

That is, the system() function will run an arbitrary shell command, inheriting all of the file descriptors of the caller of system. Let’s look at a very simple example, a “hello world” that uses two commands.
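
The system_cat program used below boils down to a sketch like this:

#include <stdlib.h>

int
main(void)
{
        /* `cat` with no arguments copies stdin to stdout */
        return system("cat");
}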

$ echo "Hello World" | ./system_cat
Hello World

The system_cat program runs cat with system(), thus it will print whatever it reads from stdin to the screen (recall: cat essentially echoes stdin to stdout). It turns out that this program, despite its simplicity, has a relatively bad security flaw. Let’s quickly consider what might happen if we were to change the PATH value to include our local directory:

$ export PATH=.:$PATH
$ echo $PATH
./:/home/gparmer/.local/bin/:/home/gparmer/.local/bin:/home/gparmer/.cargo/bin:/home/gparmer/.local/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/gparmer/local/bin:/usr/racket/bin/:/home/gparmer/local/bin:/usr/racket/bin/
$ echo $PATH | tr : "\n" | head -n 2
./
/home/gparmer/.local/bin

The export builtin command in our shell updates an environment variable. We see that the start of the PATH variable is now the current directory! system (and execvp) will look for the program in the current directory before the others. So now the local directory is on the path, thus if I were to create a program named cat, that cat would run instead of the one you would expect. For example, here is such a program (in fake_cat.c, sketched here):
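
#include <stdio.h>

/* a sketch of the imposter: it ignores stdin and prints its own message */
int
main(void)
{
        printf("Goodbye World\n");

        return 0;
}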

This is our imposter program! In this case it is innocuous, but you can imagine it being more nefarious. To do this, we’ll make a new cat command in the current directory.

$ echo "Hello World" | ./system_cat
$ cp fake_cat cat
$ echo "Hello World" | ./system_cat
Goodbye World

This is not just a problem with the system() command, but also execvp(), which will also look up commands along the path.

$ rm ./cat
$ echo "Hello World" | ./execvp_cat
Hello World
$ cp fake_cat cat
$ echo "Hello World" | ./execvp_cat
Goodbye World

Question:

  • Explain what has happened here to your teammates.
  • How do you think we can prevent the intended execution of our program from being taken over so easily?

12.1.1.1 Fixing Path Attacks

How do we fix this? There are two solutions:

  1. Always use full paths. Don’t specify a command to run by its name, instead describe exactly which program is to be executed.
  2. Fix the path before executing using setenv()

You can actually control the current PATH setting during execution. To do this, you can set environment variables using setenv() and getenv():
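
A sketch of how setenv_cat might protect itself:

#include <stdlib.h>

int
main(void)
{
        /* force PATH to a known-good value (the final 1 means "overwrite") */
        if (setenv("PATH", "/bin:/usr/bin", 1) == -1) return 1;

        return system("cat");
}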

$ rm ./cat
$ echo "Hello World" | ./setenv_cat
Hello World
$ cp fake_cat cat
$ echo "Hello World" | ./setenv_cat
Hello World

Success! Because we used a controlled and known PATH, we avoid executing the attacker-chosen cat.

12.1.2 Path Attacks with Set-user-id Bits for Privilege Escalation

The previous examples were bad, as we made the program execute differently than the programmer intended. But we didn’t necessarily see the risk of privilege escalation yet. So let’s see what happens when we introduce set-uid to the mix. This time, I’ve set the program system_cat to be set-user-id in a place that others can run it:

$ chmod 4775 system_cat
$ ls -l system_cat
-rwsrwxr-x 1 gparmer gparmer 16704 Apr  4 13:01 system_cat

Now let’s imagine there is another user on the system, hacker-mchackyface, and as that user we update the PATH

$ export PATH=.:$PATH

…and do something this time that is not so innocuous. Let’s run a shell after changing our real user id to that of gparmer. Yikes.
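
A sketch of what the attacker’s cat might look like:

#include <unistd.h>

/* a sketch of a malicious "cat": adopt the set-uid owner's identity as
   our real user id, then run a shell as that user */
int
main(void)
{
        setuid(geteuid());
        execl("/bin/bash", "bash", (char *)NULL);

        return 1; /* reached only if exec fails */
}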

This time when we run the system_cat program, it will run bash as the set-user-id user. Now we’ve just escalated privilege.

$ whoami
hacker-mchackyface
$ ./system_cat
$ whoami  <---- new shell!
gparmer
$

Uh oh. Now user hacker-mchackyface has escalated their privilege to that of user gparmer!

Path Attacks Summary: Multiple UNIX APIs rely on the PATH environment variable to help us find a program we want to execute. The attacker can modify PATH by adding their attack path, and ensure that a program with the same name as the target is in that path. This will lead to the program executing the attacker’s code! Should this attack exist in a program that is set-uid, the attacker can now execute the program at the higher privilege.

12.1.3 LD_PRELOAD Attacks?

We might be clever, and think that we can use LD_PRELOAD to take over some of the library APIs with our own code! Fortunately, this doesn’t work. The dynamic linker (ld.so) will only use the LD_PRELOAD environment variable if the real user id is the same as the effective user id! A simple solution to a threatening problem.

12.2 Injection Attacks

Now, let’s consider a situation where you use system() in a safer way: you call all programs with absolute paths, or update and control PATH before using it. Unfortunately, even that is not enough. You must also consider injection attacks, which are when the attacker can inject code that will run. In our case, the injected code will be bash.

Consider this program which prompts the user for a file to cat out.
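
A sketch of such a program, consistent with its behavior below (note the small, fixed-size input buffer, which will matter again in the overflow section):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
        char cmd[1024] = "/bin/cat ./"; /* will append the file name to this */
        char input[40];

        printf("What input do you want to 'cat' (choose from below)\n");
        system("ls");

        printf("input > ");
        fflush(stdout);
        /* note: no bounds checking on the read -- this returns to haunt us */
        if (scanf("%s", input) != 1) return 1;

        strcat(cmd, input);
        printf("Executing: %s\n", cmd);

        return system(cmd);
}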

If we were to run this program, it hopefully does what you’d expect:

$ ./inject_system
What input do you want to 'cat' (choose from below)
01_lecture.md  cat  execvp_cat  execvp_cat.c  fake_cat  fake_cat.c  inject_system  inject_system.c  Makefile  setenv_cat  setenv_cat.c  system_cat  system_cat.c
input > Makefile
Executing: /bin/cat ./Makefile
BINS=$(patsubst %.c,%,$(wildcard *.c))

all: $(BINS)

%: %.c
        gcc -o $@ $<

clean:
        rm -rf $(BINS)

Ok, now consider if we were to provide input that doesn’t fit this model. What if we were to provide shell commands as input instead of simply the file name?

$ ./inject_system
What input do you want to 'cat' (choose from below)
01_lecture.md  cat  execvp_cat  execvp_cat.c  fake_cat  fake_cat.c  inject_system  inject_system.c  Makefile  setenv_cat  setenv_cat.c  system_cat  system_cat.c
input > ;echo
Executing: /bin/cat ./;echo
/bin/cat: ./: Is a directory

$

The input we provided was ;echo: the semi-colon closes off a bash command, allowing a new one to start. Specifically, the programmer thought that the command to be executed would only take the shape /bin/cat ./<file> for some user-specified <file>. Instead, we make the command execute /bin/cat ./;echo! Notice that there is an extra newline printed; that was the echo printing. This is pretty innocuous; can we get this program to run something more interesting?

We still have the cat program we wrote that prints “Goodbye World” and the PATH is set up to look in the local directory. Setting that up, we get the following result:

$ ./inject_system
What input do you want to 'cat' (choose from below)
01_lecture.md  cat  execvp_cat  execvp_cat.c  fake_cat  fake_cat.c  inject_system  inject_system.c  Makefile  setenv_cat  setenv_cat.c  system_cat  system_cat.c
input > ;cat
Executing: /bin/cat ./;cat
/bin/cat: ./: Is a directory
Goodbye World

At this point, we own this program (pwn in the parlance) – we can essentially get it to execute whatever we want. If the program were set-uid, we could escalate our privilege.

Injection Attacks Summary: When the user can enter commands that are passed to functions that will execute them, we must be exceedingly careful about sanitizing the input. For shell commands, this often means avoiding ; and | in inputs. Interestingly, one of the most common attacks on webpages is SQL injection. When user input is used to create SQL commands, an attacker might add ; followed by additional queries. For example, we might append ; DROP TABLE USERS to destructively harm the database!

SQL injection can be weaponized in many different ways.

12.3 Overflow Attacks

Two attacks down, moving on to the third. Let’s now assume that the programmer has wised up to the two previous attacks. Now we are using full paths to executables, and we are scrubbing the input to remove any potential bash commands prior to execution. The result is the following program:

$ ./overflow_system
What input do you want to 'cat' (choose from below)
cat    execvp_cat    inject_system    inject_system.c~  Makefile  overflow_system    overflow_system.c~  run_foo.c  setenv_cat    setenv_cat.c~  system_cat    #system-ex.c#
cat.c  execvp_cat.c  inject_system.c  input.txt     mal-lib   overflow_system.c  run_foo         run_foo.o  setenv_cat.c  shared-lib     system_cat.c
input > ;cat
Executing: /bin/cat ./
/bin/cat: ./: Is a directory

This time, no dice, but the jig is not up yet. There is an overflow attack.

Question:

  • Inspect the program. What happens when we increase the size of the string input into the program?
  • At any point does the program start to do “bad things”?

To increase the input size programmatically, we’ll use a small trick to print a bunch of As:

$ python -c "print 'A'*10"
AAAAAAAAAA
$ python -c "print 'A'*30"
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
$ python -c "print 'A'*50"
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This produces strings of varying lengths 10, 30, and 50. Those strings can be sent to the target program using a pipe. And we can see the result:

$ python -c "print 'A'*10" | ./inject_system
What input do you want to 'cat' (choose from below)
01_lecture.md  cat  execvp_cat  execvp_cat.c  fake_cat  fake_cat.c  inject_system  inject_system.c  Makefile  overflow_system.c  setenv_cat  setenv_cat.c  system_cat  system_cat.c
input > Executing: /bin/cat ./AAAAAAAAAA
/bin/cat: ./AAAAAAAAAA: No such file or directory
$ python -c "print 'A'*30" | ./inject_system
What input do you want to 'cat' (choose from below)
01_lecture.md  cat  execvp_cat  execvp_cat.c  fake_cat  fake_cat.c  inject_system  inject_system.c  Makefile  overflow_system.c  setenv_cat  setenv_cat.c  system_cat  system_cat.c
input > Executing: /bin/cat ./AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
/bin/cat: ./AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: No such file or directory
$ python -c "print 'A'*50" | ./inject_system
What input do you want to 'cat' (choose from below)
01_lecture.md  cat  execvp_cat  execvp_cat.c  fake_cat  fake_cat.c  inject_system  inject_system.c  Makefile  overflow_system.c  setenv_cat  setenv_cat.c  system_cat  system_cat.c
input > Executing: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
sh: 1: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: not found

Something changes when the input string is 50 bytes long: we overflow the input buffer.

Question:

  • Inspect the code, or run the program above, and deduce what values of the string length cause the first failure to execute /bin/cat.

Recall that the input buffer is only 40 bytes in size, and it is placed adjacent to the cmd buffer:

char cmd[1024] = "/bin/cat ./"; /* will append to this string */
char input[40];

When the input buffer overflows, we begin to write As to cmd which replaces /bin/cat with whatever string is written at that location! Yikes. Finally, we concatenate cmd with input, resulting in a long string of As for the command being executed by system().

How do we leverage this error to pwn this program? The program is trying to execute a command that is AAA..., and we can control the PATH. Let’s get dangerous. Let’s create a program named AAA..., so that it is executed when we overflow!

$ cp cat AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
$ ./AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Goodbye World

Now when we execute the overflow, we get our desired “Goodbye World” result:

python -c "print 'A'*50" | ./inject_system
What input do you want to 'cat' (choose from below)
01_lecture.md                                          cat           fake_cat       inject_system.c    setenv_cat    system_cat.c
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     execvp_cat    fake_cat.c     Makefile           setenv_cat.c
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  execvp_cat.c  inject_system  overflow_system.c  system_cat
input > Executing: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Goodbye World

Uh-oh: we’re executing attacker-controlled code again!

Question:

  • How would you fix this bug?
  • How should we generally fix overflow bugs?

The most direct way is to always “bounds check” accesses to strings, just like Java and Python do. For example, always use library functions like strncpy() or strncat() to prevent input from overflowing, though even that is sometimes not sufficient. This can all get quite complex. This is an introduction to the issue, and not the final say.
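
For example, a sketch of a bounded version of the vulnerable read-and-concatenate logic from before:

#include <stdio.h>
#include <string.h>

int
main(void)
{
        char cmd[1024] = "/bin/cat ./";
        char input[40];

        /* fgets writes at most sizeof(input) bytes, including the '\0' */
        if (fgets(input, sizeof(input), stdin) == NULL) return 1;
        input[strcspn(input, "\n")] = '\0'; /* strip the trailing newline */

        /* strncat bounds the copy by the space remaining in cmd */
        strncat(cmd, input, sizeof(cmd) - strlen(cmd) - 1);
        printf("Executing: %s\n", cmd);

        return 0;
}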

13 The Android UNIX Personality

Find the slides.

14 UNIX Containers

Find the slides.

15 Appendix: C Programming Conventions

C gives you quite a bit of freedom in how you write your code. Conventions are necessary in C because they bring order to the unrestricted freedom that allows code that is too complicated to debug, too clever to understand, and completely unmaintainable. Conventions are a set of rules that we follow that inherently limit the way in which we write code. The intention is to put guard rails onto our development process and to only deviate from the conventions in exceptional cases. An important aspect of this is that once you understand the conventions, it is drastically easier to read the code because you only have to consider how convention-based code is written. This means that there is a significant importance to uniformly applying conventions. Because of this, it is less important if one likes the conventions, than it is to adhere to them for uniformity.

Put another way, it is easy for any programmer to code themself into a corner where the program is more complex than our brains can handle. Being an experienced programmer is much less about being able to handle more complexity, and much more about understanding that you need to spend significant development time simplifying your code. Conventions are a set of codified simplifications, in some sense. They provide you a set of rules (that you don’t need to think about, you just follow) to simplify the structure of your code.

OK, onward to conventions!

15.1 Naming Conventions

There are many potential naming conventions, for example, camel case, snake case, and hungarian notation. Some are antiquated (hungarian), but most are a matter of taste and convention.

Convention. Most C uses lower-case snake_case for variables, types, and functions, capital SNAKE_CASE for constants and macros.

Stick to this convention. A new convention will likely not feel “right” because it is new and different. Remember that conventions are subjective, but the aim is to apply them uniformly.

For names that are visible across .c files (i.e. functions in a .h file), use names that are the shortest they can be to convey the functionality. Within functions, names can be short, and here there are a few conventions as well:

  • ret is a variable that stores a return value (i.e. a value that will later be used as in return ret;).
  • i, j, k, l are iterator variables.

15.2 Namespacing

When you create a global variable, type, or function, you’re creating a symbol that can be accessed from any .c file. This has a significant downside: if you choose a name that is already used in a library, or that is common, you might get a linker error along the lines of:

multiple definition of `foo'; /tmp/ccrvA8in.o:a.c:(.text+0x0): first defined here
...
collect2: error: ld returned 1 exit status

This means that the naming choices each application and library makes can easily cause such a problem.

Convention. Thus, the convention is that you should “namespace” your functions and variables by pre-pending them with a functionality-specific prefix. For example, a key-value store might have all of its types, variables, and functions prefixed with kv_, for example, kv_get.

15.3 Headers and Visibility

C does not have classes and the rules therein to support per-class/object encapsulation. Instead, C views each .c file as a unit of visibility. Functions and global variables (known as “symbols”) can be hooked up with references to them in other objects by the linker. However, it is possible to write code in a .c file that is not visible outside of that file. In contrast to the linking error above emphasizing good naming, if a symbol is marked as static, no other .c file will be able to reference that symbol.
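
For example, a sketch:

/* both symbols are visible only within this .c file */
static int global_var;

static void
foo(void)
{
        global_var++;
}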

No other .c file can define code that will be able to access global_var or call foo. You can see the list of all symbols within a .c file that are accessible outside of it by compiling into a .o file (using gcc -c ...), and running nm -g --defined-only x.o | awk '{print $3}' (replace x with your object file name).

Convention. All functions and variables that are not part of the public interface with which to interact with the .c file should be static so that they cannot be used from other objects.

This convention enables .c files to support encapsulation. It is also important to support encapsulation at compile time (which is before linking). We want our compiler to tell us if we try and access a symbol we shouldn’t be able to access. Header files are #included into .c files, thus providing access to the symbol names and types, so that the .c file can compile while referencing symbols in another.

Convention. The focus of header files should be to provide the types for the data and functions of a public interface, and nothing more.

When we include a file, it just copies the contents of that file in at the point of the #include. Thus, if the same header is included separately into two .c files, then it effectively replicates the header’s contents into the two .o files. This can cause problems if the two .o files are linked together. An example, sketched minimally here (the key being that c.h defines, rather than merely declares, a function):

a.c:
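
#include "c.h"

int
main(void)
{
        return foo();
}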

b.c:
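
#include "c.h"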

c.h:
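
/* a function definition (not just a declaration!) in a header */
int
foo(void)
{
        return -1;
}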

After compiling these files, we get

$ gcc -c a.c -o a.o
$ gcc -c b.c -o b.o
$ gcc a.o b.o -o test
/usr/bin/ld: b.o: in function `foo':
b.c:(.text+0x0): multiple definition of `foo'; a.o:a.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status

Because foo was replicated into each .o file, when they were linked together the linker said “I see two foos!”. Note that we would have avoided this if we put only the function declaration (i.e. the type information) in the .h file (e.g. int foo(void);).

Convention. Never put global data in a .h file. Only put type information (e.g. function declarations) in .h files27.

15.4 What to Include?

You must include the header files that provide the functionality that you rely on and use. This sounds obvious, but it is very easy to mess this up. A couple of examples:

Example 1: Implementing a library. You’re implementing a library, and you have a .h file for it that exposes the types of the library’s functions. You test your library, and it works wonderfully – victory! However, because it is a library, it might be used by other people’s code, and all they are going to do is #include <your_lib.h>, and expect everything to work.

What if the following happened? Your your_lib.h file does something like this (a sketch; your_lib_alloc is a hypothetical function):
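
void *your_lib_alloc(size_t size); /* uses size_t without including stddef.h! */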

In your test file, you do this:
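
#include <stddef.h>    /* accidentally provides size_t... */
#include <your_lib.h>  /* ...so this compiles despite the header's bug */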

Your tests of your library work wonderfully! But if the user of your library simply does:
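
#include <your_lib.h>  /* size_t is not yet defined: compiler error! */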

The user will get a compiler error saying that it doesn’t understand what size_t is! The core problem is that your library’s header file didn’t include all of the headers it relies on. Instead, it should be updated to look like so:
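
#include <stddef.h>    /* the header now includes what it relies on */

void *your_lib_alloc(size_t size);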

A good way to code defensively and minimize the chance of this error: in your .c files, include your own header files before the rest of the headers. That way, a missing include in one of your headers will cause a compiler error in your own tests, rather than in your user’s code. The only real solution is to be very careful, and always include the headers that are necessary for the functionality you depend on.

Example 2: implementing a program or test. A .c file, let’s say my_prog.c, includes some header file, head.h. It compiles just fine. But if my_prog.c relies on some header file – let’s say dep.h – that is only included itself in the head.h file, then if head.h is updated and removes the include, it will break any program that depends on it! In this case, my_prog.c should explicitly include dep.h: include the headers that are required for your functionality!

How do you know which headers to include? If you’re using a system function, the man page for it is explicit about which headers provide the necessary functionality. If you’re implementing a library in the class, we are explicit about what headers are required.

15.5 Depth = Complexity

Syntactic nesting, or depth, often significantly increases the cognitive complexity in C28. Two examples of this:

Depth through indentation nesting. Each additional nesting level (each nested set of {}) adds another condition that you must keep in your mind29 when understanding the code. This includes the conditions, or looping conditions associated with the nesting level. This is a lot to keep in your head, and makes developing and reading code quite a bit more difficult.

What can you do to avoid this nesting? First, return early for errors. Instead of the following code…
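
A sketch of such code, using a hypothetical function that reads from a descriptor:

#include <unistd.h>

/* the "normal case" logic is buried inside nested error checks */
int
foo(int fd)
{
        int ret = 0;

        if (fd >= 0) {
                char buf[16];

                if (read(fd, buf, sizeof(buf)) >= 0) {
                        /* ...the actual logic, two levels deep... */
                } else {
                        ret = -1;
                }
        } else {
                ret = -1;
        }

        return ret;
}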

…do the following…
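
The same sketch, restructured to return early on errors:

#include <unistd.h>

/* errors return early; the main logic stays at the top nesting level */
int
foo(int fd)
{
        char buf[16];

        if (fd < 0) return -1;
        if (read(fd, buf, sizeof(buf)) < 0) return -1;
        /* ...the actual logic, with no extra nesting... */

        return 0;
}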

Though seemingly trivial, it enables you to separate your understanding of error conditions from the complexity of the following code. Try to perform error checking as soon as it possibly can be done, so that you can switch into “normal case” thinking.

Second, if you nest deeper than four levels of brackets, then pull code out into appropriately-named functions. Long functions aren’t necessarily bad, but complex functions are.

Convention.

  1. Separate error handling where possible, and put it as close to the top of the function as possible.
  2. If you nest brackets deeper than 4 levels, pull logic out into separate, appropriately named functions.

I use indentation levels in C of eight visual spaces to make it visually clear when I’m indenting too much.

Depth through field nesting. It can be quite difficult to follow code that nests access to fields. For example:
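
/* a sketch of needlessly deep field nesting (fake code) */
a.b->c[d].e->f = g.h->i;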

Of course, this is fake code, but even if it wasn’t, keeping track of the types of each of the levels of increasing depth of field reference is nearly impossible. This will make writing the code fickle and challenging, and reading the code, needlessly complex.

The main means we have to remove this complexity is to

  1. Create a set of typed intermediate variables (see the sketch after this list)

    It is up to you to decide where it is right to provide an “intermediate” type for a certain depth in the fields, and when to directly dive into a few fields.
  2. If any of the structs (or sub-structs, or arrays) represents an API layer, then you can call a function that performs the necessary operation on that part of the data-structure. This both provides abstraction over the data access, and documents the types.
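
For example, a sketch of the first option (the structures here are the same fake ones as above, with hypothetical types):

/* name the intermediate level we act on, making its type explicit */
struct e_type *e = a.b->c[d].e;

e->f = g.h->i;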

Convention. When faced with walking through nested composite data-structures, use intermediate variables to make types explicit, or functions to abstract operations.

15.6 Object Orientation: Initialization, Destruction, and Methods

C is an imperative programming language as it came before Object Orientation (OO) took hold. That said, the OO concept of encapsulation behind an abstract data type is a universal requirement of programming in complex environments. Though C doesn’t support this explicitly, conventions can provide much of the same benefit. Such an abstraction requires

  1. a datatype that stores the information required for the abstraction (think: a list, a key-value store, etc…),
  2. a set of methods to operate on that datatype, and
  3. means to create and destroy objects of that datatype.

If we wanted to provide a key-value datatype, kv, we’d provide these in C by

  1. providing a struct kv for the datatype in which the rest of the data is contained,
  2. publicly providing (in kv.h) a set of functions with names all prefixed with kv_, that take the kv to operate on as the first argument (e.g. void *kv_get(struct kv *kv, int key);) – this first argument acts as “this” in Java, and
  3. allocate and deallocate functions, for example, struct kv *kv_alloc(...) and void kv_free(struct kv *kv)30.
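
A sketch of what the corresponding kv.h might look like:

/* kv.h -- a sketch of these conventions applied to a key-value store */
#ifndef KV_H
#define KV_H

struct kv; /* the datatype; its fields stay hidden in kv.c */

/* allocation and deallocation */
struct kv *kv_alloc(void);
void kv_free(struct kv *kv);

/* methods: the object is always the first argument, like "this" */
void *kv_get(struct kv *kv, int key);
int kv_put(struct kv *kv, int key, void *value);

#endif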

C can support polymorphism, inheritance, dependency injection and composition as well, but we’ll leave that for a future discussion.

Convention.

  1. Meticulously name your header files, your structure, and the methods to act on the datatype to associate them with each other.
  2. Pass the object to be acted on as the first argument to each of the “methods” to act as the C equivalent of “this”.
  3. Provide allocation and deallocation functions for the datatype.

With this, we encourage the client (user of the API) to think of the datatype as encapsulated.


  1. Which is provided by gcc and can be found in /usr/include/gcc/x86_64-linux-gnu/9/include/stddef.h.

  2. In /usr/include/limits.h.

  3. Arrays differ from pointers as 1. the sizeof operator on an array variable gives the total size of the array (including all elements) whereas for pointers, it returns the size of the pointer (i.e. 8 bytes on a 64 bit architecture); 2. & on an array returns the address of the first item in the array whereas for pointers, it returns the address of the pointer; and 3. assignments to pointers end up initializing the pointer to an address, or doing math on the pointer whereas for arrays, initialization will initialize items in the array, and math/assignment after initialization are not allowed.

  4. Not always an easy feat. See the previous discussion on Common Errors.

  5. For example, in /usr/include/asm/errno.h.

  6. You might need to do apt-get install moreutils if you’re using your own system.

  7. C was also created as part of the original UNIX as a necessary tool to implement the OS!

  8. This is a little bit like when an agent transforms from a normal citizen in the Matrix.

  9. Yes, yes, yes, I know this is getting redundant. Key-value stores are everywhere!

  10. Hint: make sure to read the man pages for exit (man 1 exit).

  11. ps prints out process information, grep filters the output to only lines including the argument (gparmer here), and awk enables us to print out specific columns (column 2 here). You should start getting familiar with your command-line utilities!

  12. “Channel” is a term that is heavily overloaded, but I’ll inherit the general term from the glibc documentation.

  13. We see why the name “file” in file descriptors is not very accurate. Recall that I’m using the traditional “file descriptors” to denote all of our generic descriptors.

  14. We’ll discuss the kernel later, and dual-mode execution to protect the implementation of these calls in OS.

  15. A hallmark of bad design is functionality that is not orthogonal with existing functionality. When a feature must be considered in the logic for many other features, we’re adding a significant complexity to the system.

  16. Please, please do not try this.

  17. This is useful to redirect shell output to if you don’t care about the output.

  18. I’ll try and call it a “file descriptor” when we know the descriptor is to a file.

  19. This is called a “race condition”. File “locking” helps solve this issue, but we’ll delay discussing locking until the OS class.

  20. ELF is a file format for binary programs in the same way that html is the data format for webpages, .c for C files, .png for images, and .pdf for documents. It just happens to contain all of the information necessary to link and execute a program! It may be surprising, but comparable formats even exist for java (.class) files as well.

  21. Note: these are hexadecimal values, not base-10 digits.

  22. This is increasingly not true as many compilers support Link-Time Optimization (LTO). This goes beyond the scope of this class.

  23. We can even choose to use a different program interpreter, for example, interp.so, using gcc -Wl,--dynamic-linker,interp.so.

  24. This sharing of library memory across processes is why dynamic libraries are also called shared libraries.

  25. What are “flags”? The notion of “flags” is pretty confusing. Flags are often simply a set of bits (in this case, the bits of the int) where each bit specifies some option. In this case, there’s a bit for RTLD_NOW which says to load all of the symbol dependencies now, and a separate bit for RTLD_LOCAL which says that subsequent libraries should not see this library’s symbols. In short, these are just single bits in the “flags” integer, thus they can be bitwise or-ed together (thus the |) to specify options – they are simply a way to specify a set of options.

  26. Note, this is just 216, using shifts to emulate the power operator.

  27. There are sometimes optimization-driven motivations to place function definitions in .h files. In such cases, the functions must be marked static to avoid the above problem.

  28. …and in most languages that don’t support first class functions/closures.

  29. Short term memory only holds 7 ± 2 items, and each nesting level adds at least one.

  30. We also often provide an initialization function that initializes a struct kv instead of allocating a new one (e.g. int kv_init(struct kv *kv)). These are useful since C allows stack and global allocation of the datatype, and we want to provide a means to initialize one of those allocations.