October 09, 2017
When I first used (Docker) containers I thought of them as something like a lightweight virtual machine. But it turns out that's not true: containers are not virtual machines at all. They are just processes running on the same Linux kernel in isolation. And the isolation is so effective that it makes us feel as if we're on a VM. This isolation is achieved through two amazing Linux kernel features, namely namespaces and cgroups.
I'm going to dive in and try to get a minimal container working and note things that work out well and things that don't.
Let's first list out the things we want from our container. To run processes in isolation from the host, we need a few things to be isolated. For now I can think of the following
Additionally, we should be able to control the amount of resources (CPU, memory, etc.) the container uses. This is a critical requirement because if something resource-hungry goes wild inside the container, a proper container should keep it contained so the host machine is not affected.
For now, those are all the requirements I can think of. Now let's start building our container.
So our setup is this. We'll try to achieve all the different types of isolation we want for a bash process. So that's going to be our first "container": an isolated bash process. The first thing we're going to try is to have an isolated file system.
chroot
One way to have a separate file system for our process would be to use the chroot command. The man page of chroot says:
chroot - run command or interactive shell with special root directory
Seems to be exactly what we need. Ok, let's create a new root directory and try running bash with chroot. I will also place an oldroot marker file on my host so that I know where I am. Just in case ¯\_(ツ)_/¯
So that's not gonna work because our new root filesystem is empty, so there is no /bin/bash executable there. To make this work we need a copy of a root file system. After searching around for a minute I found this page that has Ubuntu root file systems as tarballs for all the platforms. I'm going to grab the one for amd64 and unpack it into the new-rootfs directory.
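The steps so far can be sketched roughly as below. This is only a sketch under assumptions, not the exact commands from the session: the helper names rootfs_has_bash and enter_rootfs are made up for illustration, the tarball path is whatever you downloaded, and the privileged part assumes root on a Linux host. The part that runs unprivileged is the check that explains the first failure: chroot can only start /bin/bash if that file exists inside the new root.

```shell
# Hypothetical helper: true only if DIR contains an executable bin/bash,
# i.e. if `chroot DIR /bin/bash` would have something to execute.
rootfs_has_bash() {
    [ -x "$1/bin/bash" ]
}

# Hypothetical helper (needs root): unpack a rootfs tarball into a directory
# and start bash with that directory as its root.
#   $1 = path to the tarball (an assumption: any Ubuntu base rootfs tarball)
#   $2 = target directory, e.g. new-rootfs
enter_rootfs() {
    mkdir -p "$2" &&
        tar -xf "$1" -C "$2" &&
        rootfs_has_bash "$2" &&
        chroot "$2" /bin/bash
}

# An empty directory fails the check, which is why the first chroot failed.
empty_root=$(mktemp -d)
if rootfs_has_bash "$empty_root"; then
    echo "ready to chroot"
else
    echo "no bin/bash under $empty_root, so chroot cannot start bash"
fi
rmdir "$empty_root"
```

As root, a call like `enter_rootfs ubuntu-base.tar.gz new-rootfs` would then drop into the new root; the tarball name here is a placeholder.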
Ok. Seems to be done. Now let's try again.
That worked!! (We know that because the /old-root file we created was not present when I did ls in the new root, and also because the username changed to root in the prompt.) Look Ma! Isolated file system! Let's do some additional checks too.
Ok, we can't see what's on the host machine outside our new root fs. That'll do. So let's move on to process namespace isolation. Let's list all the running processes we can see with ps.
Hmm, that didn't work. Let's find out why.
ps and /proc
So here's the deal. Every Linux process can be found in the /proc directory as a sort of virtual file. /proc is not really an actual on-disk file system and doesn't occupy any space on the disk. It's a special type of virtual file system called procfs. We can actually check this by running ls -l inside /proc. What we'll notice is that most files in there report a size of 0. The files in /proc are just in-memory data with runtime information from the kernel. ps and a bunch of other commands just use this data to show us their respective output. So if a procfs is not mounted at /proc, then ps can't really do its job.
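A small check along these lines makes the "size 0 but still full of data" point concrete. It assumes a Linux host where procfs is already mounted at /proc and GNU-style stat; the variable names are just for illustration.

```shell
# Assumes Linux with procfs mounted at /proc (the usual default on a host).
status=/proc/self/status

# Size as reported by the file system: procfs files like this one report 0.
reported_size=$(stat -c %s "$status")

# Bytes actually produced on read: the kernel generates them on demand.
bytes_read=$(wc -c < "$status")

echo "reported size: $reported_size, bytes read: $bytes_read"

# The Name: line is the kind of runtime data that tools like ps read.
grep '^Name:' "$status"
```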
All right. Let's mount it then.
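Mounting procfs is a one-liner with mount. The sketch below wraps it in a hypothetical helper, ensure_proc_mounted, that only mounts when /proc looks empty; it assumes root and is meant to be run inside the chroot, where /proc refers to the new root's /proc directory.

```shell
# Hypothetical helper: mount procfs at /proc unless it is already there.
# Mounting needs root; inside the chroot this is the new root's /proc.
ensure_proc_mounted() {
    if [ -r /proc/self/status ]; then
        echo "/proc is already mounted"
    else
        mount -t proc proc /proc
    fi
}
```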
Hmm, now we can see a bunch of processes running in our container. But there are just too many. That's because what we are seeing is the list of processes including those on the host machine. What's worse is that we can actually kill these processes from inside our container. Ugh! That sucks. Ideally we should only be able to interact with processes running inside our container. We'll do this using unshare.
unshare and Namespaces 101
So the man page about unshare says:
unshare - run program with some namespaces unshared from parent
It lets us run programs in namespaces that are isolated from the parent's. Cool. So what namespaces are there? I did some quick reading (in the unshare man page itself) and it seems there are 6 namespaces:
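Presumably these are the six that older versions of the unshare man page describe: mount, UTS, IPC, network, PID and user (newer kernels add more, such as cgroup). A quick way to see which namespaces a process belongs to is to look under /proc/self/ns. The sketch below assumes Linux with /proc mounted; the set of entries it prints depends on the kernel.

```shell
# Assumes Linux with procfs mounted at /proc.
# Every entry under /proc/self/ns is a symlink; its target names the
# namespace type plus an identifier, in the form "type:[number]".
for ns in /proc/self/ns/*; do
    printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
done

# readlink runs as a child of this shell and inherits its namespaces, so the
# target below also identifies the PID namespace this shell lives in.
# Two processes share a PID namespace when these targets match.
pid_ns=$(readlink /proc/self/ns/pid)
echo "PID namespace: $pid_ns"
```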
We can specify which namespaces we want to be unshared by setting options on the unshare call. Let's try it with the hostname namespace first. It seems to be the most straightforward to test. Let's start with a failing test case first.
Hmm, now let's try it with unshare.
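A non-interactive version of this experiment might look like the sketch below. It assumes util-linux's unshare and root privileges; "container" is just an example hostname, and the fallback message is there so the script still runs where unshare is not permitted.

```shell
# Assumes util-linux `unshare` and root; "container" is an arbitrary name.
host_before=$(uname -n)

# Set a hostname inside a new UTS namespace and print what that namespace
# sees. If unshare or hostname fails (e.g. not root), record that instead.
inside=$(unshare --uts sh -c 'hostname container && uname -n' 2>/dev/null) \
    || inside="(could not run unshare --uts here)"

host_after=$(uname -n)

echo "inside the namespace: $inside"
echo "host before: $host_before, host after: $host_after"

# Interactive equivalent for the chroot from earlier:
#   unshare --uts chroot new-rootfs /bin/bash
```

The point of the check is the last line: whatever happens inside the new UTS namespace, the host's own hostname stays the same.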
That worked! Now let's try it with the --pid option for a separate PID namespace.
What happened there? Brb, googling.
So I found an incredibly well-explained post on Stack Overflow as to why we got that "Cannot allocate memory" error. Basically we need to tell unshare to fork a new process and start bash in that so that bash becomes PID 1 in our container. Otherwise bash itself stays in the old PID namespace, the first process it creates gets PID 1 in the new one, and once that process exits the kernel won't allow us to fork any more processes in that namespace. (Read the SO post linked above for a more detailed explanation.)
Ok, then how do we tell unshare to fork a new process and start bash in that? By using the --fork option. Let's do that.
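To see the effect of --fork without an interactive shell, something like the following sketch can help. It assumes util-linux's unshare and root privileges, and falls back to "unavailable" where that is not the case.

```shell
# Assumes util-linux `unshare` and root privileges.
# With --fork, unshare forks first and runs the command in the child, so the
# command itself becomes PID 1 of the new PID namespace.
pid_inside=$(unshare --pid --fork sh -c 'echo $$' 2>/dev/null) \
    || pid_inside="unavailable"
echo "PID seen by the shell inside the namespace: $pid_inside"

# Interactive equivalent for the chroot from earlier:
#   unshare --pid --fork chroot new-rootfs /bin/bash
```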
Ok, new problem: we need to mount /proc. We can do that like we did earlier, or we can tell unshare to do it for us with the --mount-proc option.
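As a rough check that the freshly mounted /proc really belongs to the new PID namespace, the sketch below lists the numeric entries in /proc from inside it. It assumes util-linux's unshare and root privileges. The commented chroot variant at the end is an assumption about how to combine the two: --mount-proc accepts an optional mount point, so it can be pointed at the /proc directory inside new-rootfs.

```shell
# Assumes util-linux `unshare` and root privileges.
# --mount-proc mounts a fresh procfs (in a private mount namespace) that
# reflects the new PID namespace, so only its own processes are listed.
pids_visible=$(unshare --pid --fork --mount-proc sh -c '
    for p in /proc/[0-9]*; do echo "${p#/proc/}"; done
' 2>/dev/null) || pids_visible="unavailable"
echo "PIDs visible in /proc inside the namespace: $pids_visible"

# Possible chroot variant (an assumption, adjust paths to your setup):
#   unshare --pid --fork --mount-proc="$PWD/new-rootfs/proc" \
#       chroot new-rootfs /bin/bash
```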
Yay! That worked. Now we have an isolated PID namespace. Similarly, we can use the --net flag to get separate network interfaces, iptables rules, etc.
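The same kind of quick check works for the network namespace. This sketch assumes util-linux's unshare and root privileges; it reads /proc/net/dev, which lists the interfaces of the calling process's network namespace, so no extra networking tools are needed.

```shell
# Assumes util-linux `unshare` and root privileges.
# A new network namespace starts out with its own interface list, typically
# just a loopback device (lo) that is still down.
net_devs=$(unshare --net cat /proc/net/dev 2>/dev/null) \
    || net_devs="unavailable"
echo "$net_devs"

# Interactive equivalent for the chroot from earlier:
#   unshare --net chroot new-rootfs /bin/bash
```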
Now we can control how many resources our container can use with cgroups. But I'll write that up in another post. This post has become quite long as it is.
Thanks for reading. Bye!