Understanding how containers work by building one

October 09, 2017

When I first used (Docker) containers, I thought of them as something like a lightweight virtual machine. It turns out that's not true: containers are not virtual machines at all. They are just processes running in isolation on the same Linux kernel as the host. The isolation is so effective that it feels as if we're on a VM. It is achieved through two amazing Linux kernel features: namespaces and cgroups.

I'm going to dive in and try to get a minimal container working and note things that work out well and things that don't.

Let's first list the things we want from our container. To run processes in isolation from the host, we need a few things to be isolated. For now I can think of the following:

  • File system isolation: A process inside the container should not be able to see files outside the container.
  • Process namespace isolation: A process inside the container should not be able to see processes outside the container.
  • Network isolation: A process inside the container should not be able to use network interfaces on the host.

Additionally, we should be able to control the amount of resources (CPU, memory, etc.) the container uses. This is a critical requirement: if something resource-hungry goes wild inside the container, a proper container should keep it contained so the host machine is not affected.

For now, those are all the requirements I can think of. Now let's start building our container.

Here's our setup. We'll try to achieve all the different types of isolation we want for a bash process. So that's going to be our first "container": an isolated bash process. The first thing we're going to try is to give it an isolated file system.

The magic of chroot

One way to have a separate file system for our process would be to use the chroot command. The man page of chroot says:

chroot - run command or interactive shell with special root directory

Seems to be exactly what we need. Ok, let's create a new root directory and try running bash with chroot. I will also place an oldroot marker file on my host so that I know where I am. Just in case ¯\_(ツ)_/¯

So that's not gonna work because our new root filesystem is empty. So there is no /bin/bash executable there. To make this work we need a copy of a root file system. After searching around for a minute I found this page that has Ubuntu root file systems as tarballs for all the platforms. I'm going to grab the one for amd64 and unpack it into the new-rootfs directory.
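To make the failure concrete, here is a small Python sketch (not the commands used in this post) of why an empty new root can't run bash. Inside the chroot, /bin/bash resolves to new-rootfs/bin/bash on the host, so that file has to exist there. The helper names host_path and missing_in_rootfs are made up for illustration, and the check is deliberately naive: it ignores symlinks that point outside the new root and the shared libraries bash needs.

```python
import os
import tempfile


def host_path(new_root: str, inside_path: str) -> str:
    """Map an absolute path as seen inside the chroot to its location on the host."""
    return os.path.join(new_root, inside_path.lstrip("/"))


def missing_in_rootfs(new_root: str, required=("/bin/bash",)) -> list:
    """Return the required paths that do not exist under new_root."""
    return [p for p in required if not os.path.exists(host_path(new_root, p))]


if __name__ == "__main__":
    # A throwaway empty directory stands in for the empty new root.
    with tempfile.TemporaryDirectory() as empty_root:
        print(missing_in_rootfs(empty_root))  # prints ['/bin/bash']
```

Pointed at a directory with a root file system unpacked into it, the same check should come back with an empty list.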

Ok. Seems to be done. Now let's try again.

That worked!! (We know that because the /old-root file we created was not present when I did ls in the new root, and also because the username changed to root in the prompt.) Look Ma! Isolated file system! Let's do some additional checks too.

Ok, we can't see what's on the host machine outside our new root fs. That'll do. So let's move on to process namespace isolation. Let's get a list of all running processes we can see with ps.

Hmm, that didn't work. Let's find out why.

Quick guide to ps and /proc

So here's the deal. Every running Linux process shows up in the /proc directory as a sort of virtual directory named after its PID. /proc is not really an actual file system on disk, and it doesn't occupy any disk space. It's a special type of virtual file system called procfs. We can check this by running ls -l inside /proc: almost all files in there have size 0. The files in /proc are just in-memory data with runtime information from the kernel. ps and a bunch of other commands read this data to produce their output. So if a procfs is not mounted at /proc, ps can't really do its job.
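As a rough illustration of what ps depends on, here is a minimal Python sketch that builds a process list purely by reading a procfs directory. The function names list_pids and process_name are mine, not part of any tool. The sketch assumes a Linux-style procfs where each process has a numeric directory containing a comm file.

```python
import os


def list_pids(proc_root: str = "/proc") -> list:
    """Return the PIDs visible under proc_root, sorted numerically."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())


def process_name(pid: int, proc_root: str = "/proc") -> str:
    """Read the short command name the kernel exposes in <proc_root>/<pid>/comm."""
    with open(os.path.join(proc_root, str(pid), "comm")) as f:
        return f.read().strip()


if __name__ == "__main__":
    if os.path.isdir("/proc/self"):
        for pid in list_pids():
            try:
                print(pid, process_name(pid))
            except OSError:
                pass  # the process exited between listing and reading
    else:
        # This is the situation ps runs into inside a bare chroot.
        print("no procfs mounted at /proc")
```

With nothing mounted at /proc the script has nothing to read, which is the position ps is in when it fails inside the new root.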

All right. Let's mount it then.

Hmm, now we can see a bunch of processes from inside our container. But there are just too many. That's because the list includes the processes on the host machine. What's worse is that we can actually kill these processes from inside our container. Ugh, that sucks. Ideally we should only be able to interact with processes running inside our container. We'll do this using unshare.

unshare and Namespaces 101

So the man page about unshare says:

unshare - run program with some namespaces unshared from parent

It lets us run a program with namespaces isolated from its parent. Cool. So what namespaces are there? I did some quick reading (in the unshare man page itself) and it seems there are 6 namespaces:

  • Mount namespace: For mounting and unmounting filesystems without affecting the rest of the system.
  • UTS namespace: For setting the hostname.
  • IPC namespace: It says the process will have an independent namespace for System V message queues, semaphore sets and shared memory segments. I don't fully understand what that means for the moment.
  • Network namespace: Separate networking stack for the process.
  • PID namespace: Provides a distinct set of PID-to-process mappings for the process.
  • User namespace: Provides a distinct set of UIDs, GIDs and capabilities for the process.
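One way to see which namespaces a process belongs to is to read the symlinks under /proc/<pid>/ns, whose targets look like uts:[4026531838]. The Python sketch below does that and compares two such mappings. The function names read_namespaces and differing_namespaces are assumptions of mine for illustration. On newer kernels the directory may list more entries than the six above (cgroup, for example).

```python
import os


def read_namespaces(pid="self", proc_root="/proc") -> dict:
    """Map namespace name -> identifier by reading the links in <proc_root>/<pid>/ns."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


def differing_namespaces(a: dict, b: dict) -> list:
    """Names of namespaces present in both mappings whose identifiers differ."""
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):
        for name, ident in read_namespaces().items():
            print(name, ident)
    else:
        print("no /proc/self/ns here (not Linux, or procfs is not mounted)")
```

Two processes that share a namespace show the same identifier for it, so comparing a shell on the host with one started under unshare should report only the namespaces that were actually unshared.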

We can specify which namespaces we want unshared by setting options on the unshare call. Let's try it with the hostname (UTS) namespace first. It seems to be the most straightforward to test. Let's start with a failing test case.

Hmm, now let's try it with unshare.

That worked! Now let's try it with the --pid option for a separate PID namespace.

What happened there? Brb, googling.

So I found an incredibly well-explained post on Stack Overflow about why we got that "Cannot allocate memory" error. Basically, we need to tell unshare to fork a new process and start bash in it, so that bash becomes PID 1 in our container. Otherwise, the first process created by bash will get PID 1, and after that exits, the kernel won't allow us to fork any more processes in that namespace. (Read the SO post linked above for a more detailed explanation.)

Ok then how do we tell unshare to fork a new process and start bash in that? By using the --fork option. Let's do that.

Ok, new problem: we need to mount /proc. We can do that like we did earlier, or we can tell unshare to do it for us with the --mount-proc option.

Yay! That worked. Now we have an isolated PID namespace. Similarly, we can use the --net flag to get separate network interfaces, iptables rules, etc.
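To summarize how the flags fit together, here is a small Python sketch that only assembles an unshare command line; it does not run it, since that needs root. The helper build_unshare_argv is hypothetical. The flags --pid, --fork, --mount-proc and --net are the ones discussed above, and --uts is my assumption for the flag that unshares the hostname namespace. This is not necessarily the exact invocation used in this post.

```python
import shlex


def build_unshare_argv(command=("/bin/bash",), pid=True, net=False, uts=False) -> list:
    """Assemble an unshare invocation for the requested namespaces."""
    argv = ["unshare"]
    if uts:
        argv.append("--uts")  # separate hostname
    if net:
        argv.append("--net")  # separate network stack
    if pid:
        # --fork runs the command as a child of unshare, so it becomes PID 1
        # in the new namespace; --mount-proc mounts a procfs that matches it.
        argv += ["--pid", "--fork", "--mount-proc"]
    argv += list(command)
    return argv


if __name__ == "__main__":
    # Print the command instead of executing it.
    print(shlex.join(build_unshare_argv(net=True)))
```

Keeping --pid, --fork and --mount-proc together in one branch mirrors the lesson above: a new PID namespace is only usable when the shell is forked into it and /proc is remounted to match.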

The next step is to control how many resources our container can use, with cgroups. But I'll write that up in another post. This post has become quite long as it is.

Thanks for reading. Bye!