Editor's note: I initially published this in 2019, but subsequently tore down the property hosting it. I wanted to share it with a colleague, so I'm reposting it. It might be slightly out of date.
Containers have recently become a common way of packaging, deploying and running software across various machines in various environments. With the initial release of Docker in March 2013[1], containers have become ubiquitous in modern software deployment, with 71% of Fortune 100 companies running it in some capacity[2]. Containers can be used for:
Running user-facing, production software
Running a software development environment
Compiling software with its dependencies in a sandbox
Analysing the behaviour of software within a sandbox
Like their namesake in the shipping industry, containers are designed to easily "lift and shift" software to different environments and execute that software similarly across those environments.
Containers have thus earned their place in the modern software development toolkit. However, to understand how container technology fits into our modern software architecture, it’s worth understanding how we arrived at containers, as well as how they work.
History
The "birth" of containers was denoted by Bryan Cantrill as March 18th, 1982[3], with the addition of the chroot
syscall in BSD. From the FreeBSD website[4]:
According to the SCCS logs, the chroot call was added by Bill Joy on March 18, 1982 approximately 1.5 years before 4.2BSD was released. That was well before we had ftp servers of any sort (ftp did not show up in the source tree until January 1983). My best guess as to its purpose was to allow Bill to chroot into the /4.2BSD build directory and build a system using only the files, include files, etc contained in that tree. That was the only use of chroot that I remember from the early days.
— Dr Marshall Kirk Mckusick
chroot is used to put a process into a "changed root": a new root filesystem with limited or no access to the parent root filesystem. An extremely minimal chroot can be created on Linux as follows[5]:
# Get a shell
$ cd $(mktemp -d)
$ mkdir bin
$ cp $(which sh) bin/sh
$ cp $(which sh) bin/bash
# Find shared libraries required for shell
$ ldd bin/sh
linux-vdso.so.1 (0x00007ffe69784000)
/lib/x86_64-linux-gnu/libsnoopy.so (0x00007f6cc4c33000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6cc4a42000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6cc4a21000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6cc4a1c000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6cc4c66000)
# Duplicate libraries into root
$ mkdir -p lib64 lib/x86_64-linux-gnu
$ cp /lib/x86_64-linux-gnu/libsnoopy.so \
/lib/x86_64-linux-gnu/libc.so.6 \
/lib/x86_64-linux-gnu/libpthread.so.0 \
/lib/x86_64-linux-gnu/libdl.so.2 \
lib/x86_64-linux-gnu/
$ cp /lib64/ld-linux-x86-64.so.2 lib64/
# Change into that root
$ sudo chroot .
# Test the chroot
# ls
/bin/bash: 1: ls: not found
#
There were problems with this early implementation of chroot, such as being able to exit the chroot by running cd ..[3], but these were resolved in short order. Seeking to provide better security, FreeBSD extended the chroot into the jail[3][4], which allowed software that wanted to run as root to do so within a confined environment: root within that environment, but not root elsewhere on the system.
This work was further built upon in the Solaris operating system to provide fuller isolation from the host[3][6]:
User separation (similar to jail)
Filesystem separation (similar to chroot)
A separate process space
This provided something similar to the modern concept of containers: isolated processes running on the same kernel. Later, similar work took place in the Linux kernel to isolate kernel structures per process under "namespaces"[7].
However, in parallel, Amazon Web Services (AWS) launched their Elastic Compute Cloud (EC2) product, which took a different approach to separating workloads: virtualising the entire hardware[3]. This has different tradeoffs: it limits the impact of exploits against the host kernel or the isolation implementation, but running an additional operating system and hypervisor means far less efficient use of resources.
Virtualisation continued to dominate workload isolation until the company dotCloud (now Docker), then operating a "platform as a service" (PaaS) offering, open-sourced the software they used to run their PaaS. With that software and a good deal of luck, containers proliferated rapidly and Docker became the powerhouse it is now.
Shortly after Docker released their container runtime, they expanded their product offerings into build, orchestration and server management tooling[8]. Unhappy with this, CoreOS created its own container runtime, rkt, which had the stated goal of interoperating with existing services such as systemd, following the UNIX philosophy of "Write programs that do one thing and do it well"[9].
The Open Container Initiative was established to reconcile these disparate definitions of a container[10], after which Docker donated its schema and runtime as a de facto container standard.
There are now several container implementations and standards to define their behaviour.
Definition
It might be surprising to learn that a "container" is not a single concrete thing but a specification. At the time of writing, this specification has implementations on[11]:
Linux
Windows
Solaris
Virtual Machines
In turn, containers are expected to be[12]:
Consumable with a set of standard, interoperable tools
Consistent regardless of what type of software is being run
Agnostic to the underlying infrastructure the container is being run on
Designed in a way that makes automation easy
Of excellent quality
Specifications dictate how containers should reach these principles by defining how they should be executed (the runtime specification[11]), what a container should contain (the image specification[13]) and how to distribute container "images" (the distribution specification[14]).
These specifications mean that various tools can be used to interact with containers. The canonical tool in most common use is Docker, which, in addition to running containers, provides container build tooling and some limited orchestration. However, there are many other container runtimes, as well as other tools that help with building or distributing images.
Lastly, extensions to the existing standards, such as the container networking interface, define additional behaviour where the standards are not yet clear enough.
Implementation
While the standards give us some idea as to what a container is and how it should work, it's perhaps helpful to understand how a container implementation works. Not all container runtimes are implemented this way; notably, Kata Containers uses hardware virtualisation, much like the EC2 approach mentioned earlier.
The problems being solved by containers are:
Isolating one or more processes
Distributing those processes
Connecting those processes to other machines
With that said, let’s dive into the Docker implementation[15]. This uses a series of technologies exposed by the underlying kernel:
Kernel feature isolation: namespaces
The namespaces man page (man 7 namespaces) defines namespaces as follows:
A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.
Paraphrased, a namespace is a slice of the system; from within that slice, a process cannot see the rest of the system.
A process must make a system call to the Linux kernel to change its namespace. There are several relevant system calls:
clone: Creates a new process. When used in conjunction with CLONE_NEW* flags, it creates a namespace of the kind specified. For example, if used with CLONE_NEWPID the process will enter a new pid namespace and become pid 1.
setns: Allows the calling process to join an existing namespace, specified under /proc/[pid]/ns.
unshare: Moves the calling process into a new namespace.
There is a user command also called unshare which allows us to experiment with namespaces. We can put ourselves into a separate process and network namespace with the following command:
# Scratch space
$ cd $(mktemp -d)
# Fork is required to spawn new processes, and proc is mounted to give accurate process information
$ sudo unshare \
--fork \
--pid \
--mount-proc \
--net
# Here we see that we only have access to the loopback interface
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# Here we see that we can only see the first process (bash) and our `ps aux` invocation
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.0 8304 5092 pts/7 S 05:48 0:00 -bash
root 5 0.0 0.0 10888 3248 pts/7 R+ 05:49 0:00 ps aux
Docker uses the following namespaces to limit the ability of a process running in the container to see resources outside that container:
The pid namespace: Process isolation (PID: Process ID).
The net namespace: Managing network interfaces (NET: Networking).
The ipc namespace: Managing access to IPC resources (IPC: InterProcess Communication).
The mnt namespace: Managing filesystem mount points (MNT: Mount).
The uts namespace: Isolating kernel and version identifiers (UTS: Unix Timesharing System).
These provide reasonable separation between processes such that workloads should not be able to interfere with each other. However, there is a notable caveat: we can disable some of this isolation[16].
This is an extremely useful property. One example would be for system daemons needing access to the host network to bind ports on the host[17], such as running a DNS service or service proxy in a container.
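A sketch of sharing the host's network namespace with a container (the alpine image is used purely for illustration):
# Share the host's network namespace with the container;
# `ip addr` inside the container now lists the host's interfaces
$ docker run --rm --network host alpine ip addr
# Compare with the default, isolated network namespace
$ docker run --rm alpine ip addr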
TIP: Process #1, or the init process, in Linux systems has some additional responsibilities. When processes terminate in Linux they are not automatically cleaned up, but rather enter a terminated ("zombie") state. It is the responsibility of the init process to "reap" those processes, deleting them so that their process IDs can be reused[18]. Accordingly, the first process run in a Linux pid namespace should be an init process, and not a user-facing process like mysql. This is known as the zombie reaping problem.
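Docker can inject a small init process (tini[18]) for exactly this reason via its --init flag. A minimal sketch, using the alpine image purely for illustration:
# Without --init, the workload itself runs as PID 1
$ docker run --rm alpine ps -o pid,comm
# With --init, a small init runs as PID 1, reaping zombies and
# forwarding signals to the workload
$ docker run --rm --init alpine ps -o pid,comm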
Resource isolation: control groups
The kernel documentation for cgroups defines the cgroup as follows:
Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.
That doesn’t really tell us much, though. Luckily it expands:
On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup.
So, cgroups are groups of "jobs" (processes) to which other kernel subsystems can assign meaning. Subsystems that currently hook into cgroups include the cpu, cpuset, memory, blkio, devices and pids controllers, as well as various others.
cgroups are manipulated by reading and writing files in the cgroup filesystem, typically mounted under /sys/fs/cgroup. For example:
# Create a cgroup called "me"
$ sudo mkdir /sys/fs/cgroup/memory/me
# Allocate the cgroup a maximum of 100MB of memory
$ echo '100000000' | sudo tee /sys/fs/cgroup/memory/me/memory.limit_in_bytes
# Move this process into the cgroup
$ echo $$ | sudo tee /sys/fs/cgroup/memory/me/cgroup.procs
5924
That's it! This process should now be limited to 100MB of memory in total.
Docker uses the same functionality in its --memory and --cpus arguments, and it is employed by the orchestration systems Kubernetes and Apache Mesos to determine where to schedule workloads.
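For example, the following sketch (with an illustrative image and arbitrary limits) is translated by Docker into cgroup settings much like the ones above:
# Limit the container to 100MB of memory and half a CPU
$ docker run --rm --memory 100m --cpus 0.5 alpine sleep 60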
TIP: Although cgroups are most commonly associated with containers, they are already used for other workloads. The best example is perhaps systemd, which automatically puts all services into a cgroup if the CPU scheduler is enabled in the kernel[20]. systemd services are … kind of containers!
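We can see this on a machine running systemd. A sketch, where nginx.service is a hypothetical unit and MemoryMax assumes a reasonably recent systemd:
# Show the cgroup hierarchy that systemd maintains for its units
$ systemd-cgls
# Apply a cgroup memory limit to a service, much like `docker run --memory`
$ sudo systemctl set-property nginx.service MemoryMax=100M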
Userland isolation: seccomp
While both namespaces and cgroups go a significant way towards isolating processes into their containers, Docker goes further than that to restrict what access the process has to the Linux kernel itself. This is enforced on supported operating systems via "SECure COMPuting with filters", also known as seccomp-bpf or simply seccomp.
The Linux kernel user-space API guide defines seccomp as:
Seccomp filtering provides a means for a process to specify a filter for incoming system calls. The filter is expressed as a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data operated on is related to the system call being made: system call number and the system call arguments.
BPF, in turn, is a small, in-kernel virtual machine language used in several kernel tracing, networking and other tasks[21]. Whether the system supports seccomp can be determined by running the following command[22]:
$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
# Our system supports seccomp
CONFIG_SECCOMP=y
Practically, this limits a process's ability to ask the kernel to do certain things. Any system call can be restricted, and Docker allows the use of arbitrary seccomp "profiles" via its --security-opt argument[22]:
docker run --rm \
-it \
--security-opt seccomp=/path/to/seccomp/profile.json \
hello-world
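As a sketch of what such a profile can contain (assuming a reasonably recent Docker; blocking mkdir is chosen purely for illustration), the following allows every syscall except directory creation:
$ cat <<EOF > profile.json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    {
      "names": ["mkdir", "mkdirat"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
EOF
# mkdir inside the container should now fail with "Operation not permitted"
$ docker run --rm --security-opt seccomp=$(pwd)/profile.json alpine mkdir /test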
However, most usefully, Docker provides a default security profile that limits some of the more dangerous system calls that processes run from a container should never need to make, including:
clone: The ability to clone new namespaces
bpf: The ability to load and run bpf programs
add_key: The ability to access the kernel keyring
kexec_load: The ability to load a new Linux kernel
as well as many others. The full list of syscalls blocked by default is available on the Docker website.
In addition to seccomp there are other ways to ensure containers are behaving as expected, including Linux kernel capabilities[23], mandatory access control systems such as SELinux and AppArmor, and runtime security tools such as Sysdig Falco[24]. Each of these takes a slightly different approach to ensuring the process only executes within expected behaviour. It's worth spending time investigating the tradeoffs of each security decision, or simply delegating the choice to a competent third-party provider.
Additionally, it's worth noting that even though Docker defaults to enabling the seccomp policy, orchestration systems such as Kubernetes may disable it[25].
Distribution: the union file system
To generate a container image, Docker requires a set of "build instructions" in the form of a Dockerfile. A trivial image could be:
# Scratch space
$ cd $(mktemp -d)
# Create a docker file
$ cat <<EOF > Dockerfile
FROM debian:buster
# Create a test directory
RUN mkdir /test
# Create a bunch of spam files
RUN echo $(date) > /test/a
RUN echo $(date) > /test/b
RUN echo $(date) > /test/c
EOF
# Build the image
$ docker build .
Sending build context to Docker daemon 4.096kB
Step 1/5 : FROM debian:buster
---> ebdc13caae1e
Step 2/5 : RUN mkdir /test
---> Running in a9c0fa1a56c7
Removing intermediate container a9c0fa1a56c7
---> 6837541a46a5
Step 3/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/a
---> Running in 8b61ca022296
Removing intermediate container 8b61ca022296
---> 3ea076dcea98
Step 4/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/b
---> Running in 940d5bcaa715
Removing intermediate container 940d5bcaa715
---> 07b2f7a4dff8
Step 5/5 : RUN echo Sat 30 Mar 18:05:24 CET 2019 > /test/c
---> Running in 251f5d00b55f
Removing intermediate container 251f5d00b55f
---> 0122a70ad0a3
Successfully built 0122a70ad0a3
This creates a Docker image with the id 0122a70ad0a3, containing the output of date in the files a, b and c. We can verify this by starting the container and examining its contents:
$ docker run \
--rm=true \
-it \
0122a70ad0a3 \
/bin/bash
$ cd /test
$ ls
a b c
$ cat *
Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
However, in the docker build command earlier, Docker created several intermediate images. If we run the image from the step after only a and b have been created, we will not see c:
$ docker run \
--rm=true \
-it \
07b2f7a4dff8 \
/bin/bash
$ ls test
a b
Docker is not creating a whole new filesystem for each of these images. Instead, each of the images is layered on top of each other. If we query Docker, we can see each of the layers that go into a given image:
$ docker history 0122a70ad0a3
IMAGE CREATED CREATED BY SIZE COMMENT
0122a70ad0a3 5 minutes ago /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019… 29B
07b2f7a4dff8 5 minutes ago /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019… 29B
3ea076dcea98 5 minutes ago /bin/sh -c echo Sat 30 Mar 18:05:24 CET 2019… 29B
6837541a46a5 5 minutes ago /bin/sh -c mkdir /test 0B
ebdc13caae1e 12 months ago /bin/sh -c #(nop) CMD ["bash"] 0B
<missing> 12 months ago /bin/sh -c #(nop) ADD file:2219cecc89ed69975… 106MB
This allows Docker to reuse vast chunks of what it downloads. For example, given the image we built earlier, we can see that it uses:
A layer called ADD file:… — this is the Debian Buster root filesystem, at 106MB
A layer for a that renders the data to disk, at 29B
A layer for b that renders the data to disk, at 29B
And so on. Docker will reuse the ADD file:… Debian Buster root for all images that start with FROM debian:buster.
This allows Docker to be highly space efficient, reusing the same operating system image for multiple executions.
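We can see this reuse directly: building a second, unrelated image from the same base does not duplicate the 106MB root filesystem; only the new layers are stored. A sketch:
# A second image based on the same root filesystem
$ cat <<EOF > Dockerfile
FROM debian:buster
RUN echo hello > /hello
EOF
# The build reuses the cached debian:buster layers rather than storing new copies
$ docker build .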
TIP: Even though Docker is hugely space efficient, the Docker library on disk can grow extremely large, and transferring large Docker images over the network can become expensive. Therefore, try to reuse image layers where possible and prefer smaller operating systems or the scratch (empty) image where possible.
These layers are implemented via a Union Filesystem or UnionFS. There are various "backends" or filesystems that can implement this approach:
overlay2
devicemapper
aufs
Generally speaking, the Docker package provided by our machine's package manager will select an appropriate storage driver; Docker supports many:
$ docker info | grep Storage
Storage Driver: overlay2
We can replicate this implementation with our own overlay mount fairly easily[26]:
# Scratch space
$ cd $(mktemp -d)
# Create some directories to represent the layers
$ mkdir \
lower \
upper \
workdir \
overlay
# Create some files that represent the layers' content
$ touch lower/i-am-the-lower
$ touch upper/i-am-the-upper
# Create the layered filesystem at overlay with lower, upper and workdir
$ sudo mount -t overlay overlay \
-o lowerdir=lower,upperdir=upper,workdir=workdir \
./overlay
# List the directory
$ ls overlay/
i-am-the-lower i-am-the-upper
Docker stacks these layers on top of one another until the complete, multi-layered filesystem for the image has been assembled.
Files that are written from within the container are written back to the upper directory in the case of overlay2. However, Docker will generally dispose of this writable layer when the container is removed.
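Continuing the overlay example above, anything written through the merged mount lands in the upper layer, while the lower layer remains untouched:
# Write a new file through the merged mount
$ sudo touch overlay/i-am-new
# The new file is stored in the upper (writable) layer
$ ls upper/
i-am-new  i-am-the-upper
# The lower layer is unchanged
$ ls lower/
i-am-the-lower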
TIP: Generally speaking, all software needs access to shared libraries found at well-known paths in Linux operating systems. Accordingly, it is the convention to simply ship a stripped-down version of an operating system's root filesystem so that applications can find the libraries they expect. However, it is possible to use an empty filesystem and a statically compiled binary with the scratch image type.
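A sketch of what that can look like, assuming a statically compiled Go program in main.go (the image tag, paths and names are illustrative):
# A multi-stage build: compile a static binary, then copy it into the
# empty "scratch" image so the final image contains only that binary
$ cat <<EOF > Dockerfile
FROM golang:1.22 AS build
WORKDIR /src
COPY main.go .
RUN CGO_ENABLED=0 go build -o /app main.go

FROM scratch
COPY --from=build /app /app
ENTRYPOINT ["/app"]
EOF
$ docker build .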
Connectivity: networking
As mentioned earlier, containers make use of Linux namespaces. Of particular interest when understanding container networking is the network namespace. This namespace gives the process separate:
(Virtual) ethernet devices
Routing tables
iptables rules
For example,
# Create a new network namespace
$ sudo unshare --fork --net
# List the ethernet devices with associated ip addresses
$ ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# List all iptables rules
root@sw-20160616-01:/home/andrewhowden# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
# List all network routes
$ ip route show
By default, the container has no network connectivity — not even the loopback adapter is up. We cannot even ping ourselves!
$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
ping: sending packet: Network is unreachable
We can start setting up the expected network environment by bringing up the loopback adapter:
$ ip link set lo up
root@sw-20160616-01:/home/andrewhowden# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
# Test the loopback adapter
$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.092 ms
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.068 ms
However, we cannot access the outside world. In most environments, our host machine is connected via ethernet to a given network and either has an IP assigned to it by the cloud provider or, in the case of a development or office machine, requests an IP via DHCP. However, our container is in a network namespace of its own and cannot see the ethernet device attached to the host. We need to employ a veth device to connect the container to the host.
veth, or "Virtual Ethernet Device", is defined by man veth as:
The veth devices are virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices.
This is precisely what we need! Because unshare creates an anonymous network namespace, we need to determine the pid of the process started in that namespace[27][28]:
$ echo $$
18171
We can then create the veth device:
$ sudo ip link add veth0 type veth peer name veth0 netns 18171
We can see these virtual ethernet devices appear on both the host and the guest. However, neither has an IP attached nor any routes defined:
# Container
$ ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth0@if7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 16:34:52:54:a2:a1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
$ ip route show
# No output
To address that, we add an IP and define the default route:
# On the host
$ sudo ip addr add 192.168.24.1 dev veth0
# Within the container
$ ip address add 192.168.24.10 dev veth0
From there, bring the devices up:
# Both host and container
$ ip link set veth0 up
Add a route such that traffic for 192.168.24.0/24 goes out via veth0:
# Both host and container (on the host, run with sudo)
$ ip route add 192.168.24.0/24 dev veth0
And voilà! We have connectivity to the host namespace and back:
# Within container
$ ping 192.168.24.1
PING 192.168.24.1 (192.168.24.1): 56 data bytes
64 bytes from 192.168.24.1: icmp_seq=0 ttl=64 time=0.149 ms
64 bytes from 192.168.24.1: icmp_seq=1 ttl=64 time=0.096 ms
64 bytes from 192.168.24.1: icmp_seq=2 ttl=64 time=0.104 ms
64 bytes from 192.168.24.1: icmp_seq=3 ttl=64 time=0.100 ms
However, that does not give us access to the wider internet. While the veth adapter functions as a virtual cable between our container and our host, there is currently no path from our container to the internet:
# Within container
$ ping google.com
ping: unknown host
To create such a path we need to modify our host such that it functions as a "router" between its own, separated network namespaces and its internet-facing adapter.
Luckily, Linux is well set up for this purpose. First, we need to change the default behaviour of Linux, which is to drop packets not destined for one of its own IP addresses, and instead allow it to forward packets from one adapter to another:
# On the host
$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
This means that when we send traffic for public-facing IPs from within our container, via our veth adapter to the host's veth adapter, the host won't simply drop those packets.
From there we employ iptables rules on the host to forward traffic from the host veth adapter to the internet-facing adapter — in this case wlp2s0:
# On the host
# Forward packets from the container to the host adapter
$ sudo iptables -A FORWARD -i veth0 -o wlp2s0 -j ACCEPT
# Forward packets from established connections back from the host adapter to the container
$ sudo iptables -A FORWARD -i wlp2s0 -o veth0 -m state --state ESTABLISHED,RELATED -j ACCEPT
# Relabel the IPs for the container so return traffic will be routed correctly
$ sudo iptables -t nat -A POSTROUTING -o wlp2s0 -j MASQUERADE
We then tell our container to send any traffic it doesn't otherwise know how to route down the veth adapter:
# Within the container
$ ip route add default via 192.168.24.1 dev veth0
And the internet works!
# Within the container
$ ping google.com
PING google.com (172.217.22.14): 56 data bytes
64 bytes from 172.217.22.14: icmp_seq=0 ttl=55 time=16.456 ms
64 bytes from 172.217.22.14: icmp_seq=1 ttl=55 time=15.102 ms
64 bytes from 172.217.22.14: icmp_seq=2 ttl=55 time=34.369 ms
64 bytes from 172.217.22.14: icmp_seq=3 ttl=55 time=15.319 ms
As mentioned, each container implementation can implement networking differently. There are implementations that use the aforementioned veth pair, vxlan, BPF or other cloud-specific mechanisms. However, when designing containers we need some way to reason about what behaviour we should expect.
To help address this, the "Container Network Interface" (CNI) tooling has been designed. This allows consistent network behaviour to be defined across network implementations, as well as models such as Kubernetes' shared lo adapter between several containers in a pod.
The networking side of containers is an area undergoing rapid innovation, but relying on:
A lo interface
A public-facing eth0 (or similar) interface
being present seems a fairly stable guarantee.
Landscape review
Given our understanding of the implementation of containers, we can now take a look at some of the classic Docker discussions.
System Updates
One of the oft-overlooked parts of containers is the necessity to keep them and the host system up to date.
In modern systems, it is pretty common to enable automatic updates on host systems, and so long as we stick to the system package manager and ensure updates stay successful, the system will keep itself both up-to-date and stable.
However, containers take a very different approach. They’re effectively giant static binaries deployed into a production system. In this capacity, they can do no self-maintenance.
Accordingly, even if there are no updates to the container's own software, containers should be periodically rebuilt and redeployed to the production system, lest they accumulate vulnerabilities over time.
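A periodic rebuild can be as simple as the following sketch, where the tag name is illustrative:
# Re-pull the base image and rebuild without cached layers, so that
# updated packages in the base image are picked up
$ docker build --pull --no-cache -t myapp:$(date +%Y%m%d) .
# ...then redeploy the resulting image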
Init within container
Given our understanding of containers, it's reasonable to consider the "one process per container" advice and conclude that it is an oversimplification of how containers work, and that it makes sense in some cases to do service management within a container with a system like runit.
This allows multiple processes to be executed within a single container including things like:
syslog
logrotate
cron
And so forth.
In cases where Docker is the only system being used, it is indeed reasonable to think about doing service management within the container, particularly when hitting the constraints of shared filesystem or network state. However, systems such as Kubernetes, Swarm or Mesos have replaced much of the need for these init systems; tasks such as log aggregation, restarting services or colocating services are taken care of by those tools.
Accordingly, it's best to keep containers simple so that they are maximally composable and easy to debug, delegating the more complex behaviour elsewhere.
In Conclusion
Containers are an excellent way to ship software to production systems. They solve a swathe of interesting problems and cost very little as a result. However, their rapid growth has meant some confusion in the industry as to exactly how they work, whether they're stable and so forth. Containers are a combination of both old and new Linux kernel technologies, such as namespaces, cgroups, seccomp and Linux networking tooling, but are as stable as any other kernel technology (so, very) and well suited to production systems.
<3 for making it this far.
References
[1] “Docker.” https://en.wikipedia.org/wiki/Docker_(software) .
[2] “Cloud Native Technologies in the Fortune 100.” https://redmonk.com/fryan/2017/09/10/cloud-native-technologies-in-the-fortune-100/ , Sep. 2017.
[3] B. Cantrill, “The Container Revolution: Reflections After the First Decade.” , Sep. 2018.
[4] “Papers (Jail).” https://docs.freebsd.org/44doc/papers/jail/jail.html .
[5] “An absolutely minimal chroot.” https://sagar.se/an-absolutely-minimal-chroot.html , Jan. 2011.
[6] J. Beck et al., “Virtualization and Namespace Isolation in the Solaris Operating System (PSARC/2002/174).” https://us-east.manta.joyent.com/jmc/public/opensolaris/ARChive/PSARC/2002/174/zones-design.spec.opensolaris.pdf , Sep. 2006.
[7] M. Kerrisk, “Namespaces in operation, part 1: namespaces overview.” https://lwn.net/Articles/531114/ , Jan. 2013.
[8] A. Polvi, “CoreOS is building a container runtime, rkt.” https://coreos.com/blog/rocket.html , Jan. 2014.
[9] “Basics of the Unix Philosophy.” http://www.catb.org/esr/writings/taoup/html/ch01s06.html .
[10] P. Estes and M. Brown, “OCI Image Support Comes to Open Source Docker Registry.” https://www.opencontainers.org/blog/2018/10/11/oci-image-support-comes-to-open-source-docker-registry , Oct. 2018.
[11] “Open Container Initiative Runtime Specification.” https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/spec.md , Mar. 2018.
[12] “The 5 principles of Standard Containers.” https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/principles.md , Dec. 2016.
[13] “Open Container Initiative Image Specification.” https://github.com/opencontainers/image-spec/blob/db4d6de99a2adf83a672147d5f05a2e039e68ab6/spec.md , Jun. 2017.
[14] “Open Container Initiative Distribution Specification.” https://github.com/opencontainers/distribution-spec/blob/d93cfa52800990932d24f86fd233070ad9adc5e0/spec.md , Mar. 2019.
[15] “Docker Overview.” https://docs.docker.com/engine/docker-overview/ .
[16] J. Frazelle, “Containers aka crazy user space fun.” , Jan. 2018.
[17] “Use Host Networking.” https://docs.docker.com/network/host/ .
[18] Krallin, “Tini: A tini but valid init for containers.” https://github.com/krallin/tini , Nov. 2018.
[19] https://chromium.googlesource.com/chromium/src.git/+/HEAD/docs/linux_sandboxing.md .
[20] L. Poettering, “systemd for Administrators, Part XVIII.” http://0pointer.de/blog/projects/resources.html , Oct. 2012.
[21] A. Howden, “Coming to grips with eBPF.” https://www.littleman.co/articles/coming-to-grips-with-ebpf/ , Mar. 2019.
[22] “Seccomp security profiles for docker.” https://docs.docker.com/engine/security/seccomp/ .
[23] “Linux kernel capabilities.” https://docs.docker.com/engine/security/security/#linux-kernel-capabilities .
[24] M. Stemm, “SELinux, Seccomp, Sysdig Falco, and you: A technical discussion.” https://sysdig.com/blog/selinux-seccomp-falco-technical-discussion/ , Dec. 2016.
[25] “Pod Security Policies.” https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp .
[26] Programster, “Example OverlayFS Usage.” https://askubuntu.com/a/704358 , Nov. 2015.
[27] “How do I connect a veth device inside an ’anonymous’ network namespace to one outside?” https://unix.stackexchange.com/a/396210 , Oct. 2017.
[28] D. P. García, “Network namespaces.” https://blogs.igalia.com/dpino/2016/04/10/network-namespaces/ , Apr. 2016.