Simpler production with containers
As part of my job I’ve been responsible for crafting the design, implementation and continued delivery of various software systems. I am currently responsible for maybe six user-facing websites (both internal and external) comprising around 80 services, and regularly assist with the remaining services that my organisation publishes.
This is a story about the different approaches that I’ve tried to make this level of complexity manageable, with a recommended solution for the ongoing development of these applications.
The problem of machine management
Over time, managing the number of machines that we have has become a significant cost. Machine management is inherently complex; we need to be able to:
Create the initial machine definition such that it performs the required tasks
Ensure the machine is kept up to date as new security issues are discovered and must be protected against
Ensure the service the machine provides is continually operational when it is required
Maintain the service as requirements change over time
We, as an organisation, have gone through several different approaches to managing these machines as our requirements have changed over the company’s 20-odd years. In this version of the company, in which we build web services, we started with a small number of multi-tenant physical production machines that were carefully attended to by systems administrators. Then, moving to the cloud, we split those tenants off to separate machines but still managed them manually. Next, we improved this process by prepacking the “base” configuration into an EC2 image and using that image for new deployments, maintaining the machines by hand following their release. The standard approach as of the time of writing is to take a base operating system (usually Ubuntu 16.04) and apply a series of pre-crafted Ansible roles to it, maintaining sets of machines by updating those roles.
Though we have vastly increased our efficiency at this machine maintenance, it remains a significant and difficult-to-justify cost for both the company and the merchants that we service. We’ve also looked at hosting partners in the past, but I think the mixed workloads we have mean specialist hosting partners are usually not quite the right fit.
Most recently, however, we have been looking at containers as the model in which we’d like to manage software going forward.
Containers: The solution?
What’s in a container
A container is defined by Docker as:
A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings.
To understand the implications of this we need to step back a little bit into how the Linux operating system works. We have:
The “kernel”, which provides the core functionality of the machine: things such as the ability to read and write to disk, run tasks on the CPU and make network connections.
The rest of the things on a machine, usually located on the “root filesystem” (at /): all the software that runs on Linux*.
A container is a package that contains everything except the kernel; usually, an entire root file-system.
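To make that concrete, here’s a small sketch (assuming a machine with Docker installed) that unpacks a stock NGINX image and lists what’s inside. What comes out is an ordinary root filesystem:
$ docker create --name inspect nginx:latest
$ docker export inspect | tar --list --file - | head
# An ordinary root filesystem: bin/, etc/, usr/, var/ and so on,
# with no kernel to be found
$ docker rm inspect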
Cheap machines
Containers are extremely cheap to run: far cheaper* than their nearest analogue, virtual machines. We can run an extremely large number of them on a single machine.
This provides the opportunity for a new way of thinking about deploying software. Traditionally, when designing the architecture for an application, I start by considering each application and its demands against a budget of ${N} machines, along with the network and failure characteristics of those machines. I then make a decision as to which service to place on which machine based on nothing more than experience and a guess as to how that machine will behave. Given the typical PHP application we might have:
2x Machine type A with NGINX + Varnish
4x Machine type B with PHP and NGINX/Apache
1x Machine type C with Redis and MySQL*
That decision is basically never revisited as a whole; once the initial spec is written and deployed, it’s super hard to change.
Containers let us shift our model of thinking from “Machines” to “Services”. Instead of the above list of specific machines, we can simply have:
2x Varnish
4x PHP
1x Redis
1x MySQL
and not care where they are or what they’re doing*. This allows us to do things like base them on different operating systems, deploy them separately from one another, shift them easily from one node to another and so on.
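As a rough sketch (using stock Docker Hub images as stand-ins for our real ones), that list translates almost directly into runnable commands:
$ docker run --detach --name varnish varnish:latest
$ docker run --detach --name php php:fpm
$ docker run --detach --name redis redis:latest
$ docker run --detach --name mysql --env MYSQL_ROOT_PASSWORD=example mysql:latest
Running four copies of PHP rather than one, wiring the services together and spreading them across machines is the job of the orchestration tooling covered below.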
The delivery process
Without some sort of architecture behind the above it sounds somewhat magical; something like the cloud’s promise of “Yo, give me an SQL service”, to which the cloud replies “Yeah, sure! 172.17.4.1”.
However, the technologies that drive this containerised approach are all open standards that have been implemented now by all cloud providers. It’s perhaps worth going through the (hypothetical) process of building and deploying such a service to examine how it all works. First, let’s start with:
The build
Containers can be constructed in any number of ways, but an extremely common tool for doing so is Docker. Docker was among the first to provide an easy way to handle containers, and it enjoys a position as the de facto container management solution for the local development of containers.
It provides a configuration syntax known as the “Dockerfile”. An example is below:
FROM nginx:latest
# Define the snake repo in a handy environment variable for easy replacement
ENV SNAKE_REPO="https://github.com/PKief/Snake.git"
# Install git so we can download the snake repo
RUN apt-get update && \
    apt-get install --yes \
        git
# Clone the repo to a directory that NGINX can access
RUN git clone ${SNAKE_REPO} /var/www/snake
# Modify NGINX configuration to serve our snake game
RUN sed --in-place 's/usr\/share\/nginx\/html/var\/www\/snake/' /etc/nginx/conf.d/default.conf
This creates a service that will expose the game “Snake” by the author PKief over HTTP.
The syntax is super familiar to those who are used to bash as a language, but it’s worth covering:
FROM, which takes an existing image (here NGINX) and builds our service on top of it
ENV, which sets an environment variable for all processes run subsequently to that declaration
RUN, which runs a command with sh -c inside the containerised environment.
From this we get a “container image”, which we can tag with an address and version. This one is available as quay.io/littlemanco/snake:latest.
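Building and trying the image locally looks something like this (assuming the Dockerfile above is saved in the current directory):
$ docker build --tag quay.io/littlemanco/snake:latest .
$ docker run --detach --publish 8080:80 quay.io/littlemanco/snake:latest
Snake should then be available at http://localhost:8080.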
Storing the image for later consumption
One of the most attractive properties of containers is that they package an entire deliverable software service into a single data blob that can be stored for later use and deployed an essentially infinite number of times. While this was previously possible with something like EC2 images or HashiCorp’s Packer, it was much, much more expensive.
The aforementioned Docker provides super simple primitives for managing these images:
$ docker push quay.io/littlemanco/snake:latest
will push the image to the address quay.io/littlemanco/snake and tag it with “latest” (simply overwriting the previous latest).
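The reverse is just as simple; any machine that can reach the registry can fetch the exact same blob:
$ docker pull quay.io/littlemanco/snake:latest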
More recently there is a trend of registry services scanning the images in place with static analysis tools and providing information such as whether the images are vulnerable. In the case above, the image is in fact vulnerable (spoiler: all of them are). This provides a fundamentally new service, previously unavailable with traditional virtualised-machine deployment types.
Additionally, there is work to build cryptographic assertions of trust into the software delivery pipeline, ensuring that only software that has been signed by a trusted organisation or member of the team is able to run in the production system.
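One such mechanism (a sketch of one approach; registry support varies) is Docker’s “content trust”, which signs images as they are pushed:
$ export DOCKER_CONTENT_TRUST=1
$ docker push quay.io/littlemanco/snake:latest
# The push is now signed with a local key; pulls performed with content
# trust enabled will reject unsigned or tampered-with images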
Deployment
Perhaps the most exciting part of this story is deployment. Not long after Docker became successful, the team at Google released a competitor to Docker’s production deployment tooling called “Kubernetes”, which has become essentially the de facto way of managing containers in production.
Kubernetes is an extremely clever, reconciliation-driven system that is designed to take up to ~5000 machines per cluster and treat them as one contiguous pool of compute*. Essentially it allows us to take our earlier machines, plaster a layer of Kubernetes on top and then run our services on top of Kubernetes, completely ignoring where they are or what they’re doing*.
To the developer Kubernetes appears as a set of YAML files. Something like:
$ kubectl run snake --image=quay.io/littlemanco/snake:latest --dry-run --output=yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    run: snake
  name: snake
spec:
  replicas: 1
  selector:
    matchLabels:
      run: snake
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: snake
    spec:
      containers:
      - image: quay.io/littlemanco/snake:latest
        name: snake
        resources: {}
        ports:
        - containerPort: 80
          name: http
          protocol: TCP
status: {}
Given that YAML, Kubernetes will take the image quay.io/littlemanco/snake:latest and run it “somewhere”. By adding a new declaration:
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    run: snake
  name: snake
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: snake
status:
  loadBalancer: {}
Kubernetes will expose the newly created container over the network, handling the routing, assignment of IPs and so forth automatically.
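Assuming the two documents above are saved as deployment.yaml and service.yaml, standing all of this up is a couple of commands:
$ kubectl apply --filename deployment.yaml --filename service.yaml
$ kubectl get pods,services --selector run=snake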
In short, Kubernetes takes the complex work of managing many different containers across many different machines and unifies it through a single API and set of primitives that are designed to solve exactly this problem.
It additionally has a host of other super nice properties around health checking, failure handling, load balancing and so forth that make the practical work of managing these services much easier.
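As a taste of that health checking, the sketch below (the probe itself is illustrative) tells Kubernetes to restart the snake container whenever it stops answering HTTP:
$ kubectl patch deployment snake --patch '
spec:
  template:
    spec:
      containers:
      - name: snake
        livenessProbe:
          httpGet:
            path: /
            port: 80
'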
Simpler?
While it might not seem that buying into these new approaches of managing systems through containers is simpler, I assure you that it is. Not in the sense that it’s simpler to craft the design of the system; doing it this way forces you to rethink assumptions that you’ve previously made about how things should work, and those assumptions will not easily go away.
But it’s easier in the sense that it’s a solution designed for the management of many different services. It’s designed for the less elegant work of rescuing unhealthy systems, backing up and restoring data, ensuring that services are available across multiple regions and so on.
It is not simpler to start with Kubernetes if you have a single simple service and you wish to simply stand it up and get it running. However, if you’re like us and your responsibility has grown to managing a large number of different services, Kubernetes provides you with a way to think about them all in a common way, and to take certain guarantees for granted when running them.
Caveats
Containers still have problems
While containers do solve a large number of operational problems, they bring their own new and interesting problems into the mix. Those problems do have solutions, but they can trip up a new container venture. Things such as managing updates to those containers, ensuring that the containers are built in a secure way and allowing users access to only the resources they need in the production system are all problems that we’ve faced and had to find solutions to.
There are some half-truths here
While writing this, I could think of examples where this is not strictly true. It happened a few times, but for the purposes of this article it’s not worth delving into this area a lot. Like all articles, this captures a simplified version of the truth — shoot me a message and I’ll explore this in depth further if you’d like.
Thanks
My team, for putting up with my nonsense and learning these things as I have.
The broader Magento community, who has at least somewhat been willing to accompany me on this journey.
The various mentors in the Kubernetes community who have been readily kind with their knowledge.
Shem Taljaard for an early review.
Ben Sonassi for an early review and critique.
Wifey. I wrote this on a Sunday and she put up with me!
Innumerable more people.