Deploying on Kubernetes #10: Health Checking
This is the tenth in a series of blog posts detailing the journey of deploying a service on Kubernetes. Its purpose is not to serve as a tutorial (there are many out there already), but rather to discuss some of the approaches we take.
To read this it’s expected that you’re familiar with Docker, and have perhaps played with building Docker containers. Additionally, some experience with docker-compose is useful, though not strictly required.
So far in this series we’ve covered getting the service deployed and running; this post adds health checking.
Application health checking is commonly (though not always) employed in a traditional load balancing architecture. Consider the diagram below:
           +-------+
           |       |
     +---->|❣◕ ‿ ◕❣|
     |     |       |
+------+   +-------+
|  LB  |
+------+   +-------+
     |     |       |
     +---->|❣◕ ‿ ◕❣|
           |       |
           +-------+
In this diagram, the ingress (called “LB” or “Load Balancer”) here must determine if the application should receive traffic; if it is “healthy”. To do this, it usually makes a fairly cheap request designed to identify the health of the application, often relying on simple HTTP status codes. So long as the application stays healthy, it will continue to receive traffic.
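To make that decision concrete, the healthy/unhealthy call a load balancer makes from a status code can be sketched as a small shell function. This is purely illustrative; real load balancers implement this natively:

```shell
# Sketch of the decision a load balancer makes from a health check
# response. Status codes from 200 up to (but not including) 400 count
# as healthy; anything else marks the node unhealthy.
check_health() {
  local status="$1"  # HTTP status code returned by the health endpoint
  if [ "$status" -ge 200 ] && [ "$status" -lt 400 ]; then
    echo "healthy"
  else
    echo "unhealthy"
  fi
}

check_health 200  # healthy: keep routing traffic to this node
check_health 503  # unhealthy: remove this node from the pool
```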
However, there are two conditions in which applications either “start” or “become” unhealthy:
Long bootstrap times (for example, Java applications)
Application overload, or other upstream connection problems
When they are no longer healthy, the load balancer should not send traffic to a node it knows is not going to function:
           +-------+
           |       |
     +---->|❣◕ ‿ ◕❣|
     |     |       |
+------+   +-------+
|  LB  |
+------+   +-------+
           |       |
           |(✖╭╮✖) |
           |       |
           +-------+
In this way, we can spare our users suffering from our unhealthy workloads.
In a similar way, a Kubernetes service generally will not route traffic to a pod until the pod enters the “ready” status, as determined by Kubernetes. Kubernetes calculates readiness based on:
All containers in the pod running
Any health or readiness checks configured in the deployment
Having a process running seems a fairly logical measure. However, we can reach further than this and determine whether our application (kolide/fleet) is healthy.
By spying on the router file for kolide, we can see that it helpfully provides an endpoint at /healthz specifically for this purpose. This handler runs additional checks to ensure that kolide can connect to both its data store and query result store. Practically, that means that so long as kolide can connect to both MySQL and Redis, it considers itself healthy.
The health check
There are three types of health checks that are available for pods:
ExecAction: Run a command inside the container, and consider it healthy if that command exits 0. Useful for batch jobs and controllers.
TCPSocketAction: Attempt to connect via TCP to a container’s IP address on a specified port. If this is successful, the container is healthy. Useful for non-HTTP services such as MySQL.
HTTPGetAction: Makes an HTTP GET request against the container at the configured endpoint. Any status code from 200 up to (but not including) 400 is OK. Useful for all HTTP-powered applications.
Of these, I recommend HTTPGetAction as much as possible; it allows the application to be clear about its health, and generates logs to debug issues that occur.
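To illustrate, the three types might look something like the fragments below in a pod spec. The field names come from the Kubernetes API, but the specific command and port values here are hypothetical:

```yaml
# ExecAction: healthy if the command exits 0
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]

# TCPSocketAction: healthy if a TCP connection to the port can be opened
livenessProbe:
  tcpSocket:
    port: 3306

# HTTPGetAction: healthy on any status code from 200 up to (but not
# including) 400
livenessProbe:
  httpGet:
    path: /healthz
    port: http
```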
The starter chart documents these extremely well. Accordingly, I will not be reviewing them all, but simply using the HTTP check. Though this will be opaque, the command I used to transform the templates/deployment.yaml into what I will use is:
# See https://stackoverflow.com/questions/2112469/delete-specific-line-numbers-from-a-text-file-using-sed
$ sed --in-place '145,156d;158,204d;206,213d;215,218d;220,223d;224,227d;229,231d;233,236d;238,240d;242,249d' templates/deployment.yaml
This epic sed command removes a bunch of commented out documentation. If you’re curious specifically what, check the commit at the end of this post. This leaves us with:
volumeMounts:
  - name: "fleet-configuration"
    readOnly: true
    mountPath: "/etc/fleet"
  - name: "fleet-tls"
    readOnly: true
    mountPath: "/etc/pki/fleet"
# livenessProbe:
#   httpGet:
#     path: /healthz
#     port: "http"
#   initialDelaySeconds: 5
#   timeoutSeconds: 1
#   failureThreshold: 3
# readinessProbe:
restartPolicy: "Always"
Here we see two probes commented out:
Liveness Probe: Is the application still alive?
Readiness Probe: Has the application booted successfully?
It’s useful to undertake both probes to make sure both that the application has been verified as booted successfully, and that it is still up and running.
Let’s look first at the commented out liveness probe:
# livenessProbe:
#   httpGet:
#     path: /healthz
#     port: "http"
#   initialDelaySeconds: 5
#   timeoutSeconds: 1
#   failureThreshold: 3
We can simply un-comment and fill it in. The sections are as follows:
httpGet: The aforementioned HTTP health check type
path: The path to query over HTTP. /healthz is actually correct here; we can see that in the kolide source.
port: The port at which to perform the health check. We can use the named reference defined earlier with the port definition.
initialDelaySeconds: How long to wait before we start checking whether the app is alive
timeoutSeconds: How long until we decide a connection has failed
failureThreshold: How many failures to tolerate before restarting the container
There are also two additional configuration options which I somehow missed, but added to the starter chart while writing these notes:
periodSeconds: How long between probes
scheme: Whether to connect via HTTP or HTTPS
The values defined are all the defaults, and are largely fine. This leaves us with:
livenessProbe:
  httpGet:
    path: /healthz
    port: "http"
    scheme: "HTTPS"
  periodSeconds: 10
  initialDelaySeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
The only thing we amended is scheme: HTTPS. Kolide/fleet is only available over TLS, and without that the health checks will fail. That’s it! That will ensure our application is up and running, or Kubernetes will restart it.
Given this configuration, the readiness probe is remarkably simple. Simply copy the liveness probe, and rename it readinessProbe. Perfect! This will ensure our application actually boots before it’s included in the available service pool:
readinessProbe:
  httpGet:
    path: /healthz
    port: "http"
    scheme: "HTTPS"
  periodSeconds: 10
  initialDelaySeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
In order to test whether our health checks are actually functioning as expected, let’s make the application unhealthy. Let’s hide MySQL:
$ kubectl scale deployment kolide-fleet-mysql --replicas=0
deployment.extensions "kolide-fleet-mysql" scaled
This scales the MySQL deployment down to zero replicas, removing the MySQL container without removing any of its associated resources. Pretty quickly we can see the fleet container has become “unhealthy”:
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-7hbl2   0/1     Running   1          3m
Additionally, we can see that it’s been taken out of our service:
$ kubectl describe service kolide-fleet-fleet
Name:                     kolide-fleet-fleet
Namespace:                __REDACTED__
Labels:                   app=kolide-fleet-fleet
                          chart=fleet-1.0.6-1
                          heritage=Tiller
                          release=kolide-fleet
Annotations:              prometheus.io/path=/metrics
                          prometheus.io/port=9102
                          prometheus.io/probe=
                          prometheus.io/scrape=false
                          service.beta.kubernetes.io/external-traffic=OnlyLocal
Selector:                 app=kolide-fleet-fleet,release=kolide-fleet
Type:                     LoadBalancer
IP:                       __REDACTED__
LoadBalancer Ingress:     __REDACTED__
Port:                     http  443/TCP
TargetPort:               http/TCP
NodePort:                 http  __REDACTED__/TCP
Endpoints:                # Here there should be some endpoints. But
                          # there are none -- kolide is unhealthy.
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
The pod will shortly enter a state of CrashLoopBackOff, where it will be restarted at exponentially longer intervals to prevent cascading failure:
$ kubectl get pods
NAME                                  READY   STATUS             RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-7hbl2   0/1     CrashLoopBackOff   5          5m
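That backoff can be sketched numerically. Assuming the kubelet’s documented defaults (a 10 second initial delay, doubled after each crash and capped at five minutes), the wait before each restart looks roughly like this. The helper function below is hypothetical, not a real kubectl command:

```shell
# Approximate delay before restart N under CrashLoopBackOff:
# min(10 * 2^(N-1), 300) seconds.
backoff_delay() {
  local n="$1" delay=10 i=1
  while [ "$i" -lt "$n" ]; do
    delay=$(( delay * 2 ))
    [ "$delay" -gt 300 ] && delay=300
    i=$(( i + 1 ))
  done
  echo "$delay"
}

backoff_delay 1  # 10
backoff_delay 4  # 80
backoff_delay 7  # 300 (capped at five minutes)
```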
Let’s see what happens if we create a new pod that is unprepared:
$ kubectl delete pod kolide-fleet-fleet-68c766dd57-7hbl2
pod "kolide-fleet-fleet-68c766dd57-7hbl2" deleted

$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-76tdw   0/1     Running   0          8s
Looks pretty bad. Let’s have a closer look:
Warning  Unhealthy  3s (x2 over 13s)  kubelet, gke-sitewards-internal-n2-stardard-2-4d8b8f10-tzdk  Liveness probe failed: Get https://10.0.9.34:8080/healthz: dial tcp 10.0.9.34:8080: getsockopt: connection refused
Warning  Unhealthy  1s (x3 over 21s)  kubelet, gke-sitewards-internal-n2-stardard-2-4d8b8f10-tzdk  Readiness probe failed: Get https://10.0.9.34:8080/healthz: dial tcp 10.0.9.34:8080: getsockopt: connection refused
Yep, it’s definitely unhealthy. Okay, let’s rescue it:
$ kubectl scale deployment kolide-fleet-mysql --replicas=1
deployment.extensions "kolide-fleet-mysql" scaled
With watch -n 1 kubectl get pods we can see it come alive again:
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-76tdw   1/1     Running   4          1m
It’s alive! 4 restarts later. Kubernets doesn’t know whether the application is dead, or something else — it will simply restart things until they work again.
Health checks allow us to build safeties into our application such that our users are not served from broken workloads. Further, they account for a wide range of error conditions in which the app can become arbitrarily “unhealthy”: resource exhaustion, transient network failures when connecting to upstream services, and so on. Our applications become resilient simply by relying on the Kubernetes infrastructure and giving it the means to accurately determine the health of our application.
As usual, you can see the changes here:
Check out the next in the series at: