Deploying on Kubernetes #10: Health Checking
This is the tenth in a series of blog posts detailing the journey of deploying a service on Kubernetes. Its purpose is not to serve as a tutorial (there are many out there already), but rather to discuss some of the approaches we take.
To read this it’s expected that you’re familiar with Docker, and have perhaps played with building Docker containers. Additionally, some experience with docker-compose is useful, though not strictly required.
So far in this series we’ve covered getting the service deployed and running; this post adds health checking.
Application health checking is commonly (though not always) employed in a traditional load balancing architecture. Consider the diagram below:
           +-------+
           |       |
     +---->|❣◕ ‿ ◕❣|
     |     |       |
+------+   +-------+
|  LB  |
+------+   +-------+
     |     |       |
     +---->|❣◕ ‿ ◕❣|
           |       |
           +-------+
In this diagram, the ingress (called “LB” or “Load Balancer”) here must determine if the application should receive traffic; if it is “healthy”. To do this, it usually makes a fairly cheap request designed to identify the health of the application, often relying on simple HTTP status codes. So long as the application stays healthy, it will continue to receive traffic.
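To make that decision concrete, the healthy/unhealthy call a load balancer makes from a status code can be sketched as a small shell function. This is purely illustrative; real load balancers implement this natively:

```shell
# Sketch of the decision a load balancer makes from a health check
# response. Status codes from 200 up to (but not including) 400 count
# as healthy; anything else marks the node unhealthy.
check_health() {
  local status="$1"  # HTTP status code returned by the health endpoint
  if [ "$status" -ge 200 ] && [ "$status" -lt 400 ]; then
    echo "healthy"
  else
    echo "unhealthy"
  fi
}

check_health 200  # healthy: keep routing traffic to this node
check_health 503  # unhealthy: remove this node from the pool
```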
However, there are two conditions in which applications either “start” or “become” unhealthy:
Long bootstrap times (for example, Java applications)
Application overload, or other upstream connection problems
When they are no longer healthy, the load balancer should not send traffic to a node it knows is not going to function:
           +-------+
           |       |
     +---->|❣◕ ‿ ◕❣|
     |     |       |
+------+   +-------+
|  LB  |
+------+   +-------+
           |       |
           |(✖╭╮✖) |
           |       |
           +-------+
In this way, we can spare our users suffering from our unhealthy workloads.
In a similar way, a Kubernetes service generally will not route traffic to a pod until the pod enters the “ready” status, as determined by Kubernetes. Kubernetes calculates readiness based on:
All containers in the pod running
Any health or readiness checks configured in the deployment
Having a process running seems a fairly logical measure. However, we can reach further than this and determine whether our application (kolide/fleet) is healthy.
By spying on the router file for kolide, we can see that it helpfully provides an endpoint at /healthz specifically for this purpose. This handler runs additional checks to ensure that kolide can connect to both its data store and query result store. Practically, that means that so long as kolide can connect to both MySQL and Redis, it considers itself healthy.
The health check
There are three types of health checks that are available for pods:
ExecAction: Run a command inside the container, and consider it healthy if that command exits 0. Useful for batch jobs and controllers.
TCPSocketAction: Attempt to connect via TCP to a container’s IP address on a specified port. If this is successful, the container is healthy. Useful for non-HTTP services such as MySQL.
HTTPGetAction: Makes an HTTP GET request against the container at the configured endpoint. Any status code from 200 up to (but not including) 400 is OK. Useful for all HTTP-powered applications.
Of these, I recommend HTTPGetAction as much as possible; it allows the application to be clear about its health, and generates logs to debug issues that occur.
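To illustrate, the three types might look something like the fragments below in a pod spec. The field names come from the Kubernetes API, but the specific command and port values here are hypothetical:

```yaml
# ExecAction: healthy if the command exits 0
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]

# TCPSocketAction: healthy if a TCP connection to the port can be opened
livenessProbe:
  tcpSocket:
    port: 3306

# HTTPGetAction: healthy on any status code from 200 up to (but not
# including) 400
livenessProbe:
  httpGet:
    path: /healthz
    port: http
```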
The starter chart documents these extremely well. Accordingly, I will not be reviewing them all, but simply using the HTTP check. Though this will be opaque, the command I used to transform the templates/deployment.yaml into what I will use is:
# See https://stackoverflow.com/questions/2112469/delete-specific-line-numbers-from-a-text-file-using-sed
$ sed --in-place '145,156d;158,204d;206,213d;215,218d;220,223d;224,227d;229,231d;233,236d;238,240d;242,249d' templates/deployment.yaml
This epic sed command removes a bunch of commented out documentation. If you’re curious specifically what, check the commit at the end of this post. This leaves us with:
volumeMounts:
  - name: "fleet-configuration"
    readOnly: true
    mountPath: "/etc/fleet"
  - name: "fleet-tls"
    readOnly: true
    mountPath: "/etc/pki/fleet"
# livenessProbe:
#   httpGet:
#     path: /healthz
#     port: "http"
#   initialDelaySeconds: 5
#   timeoutSeconds: 1
#   failureThreshold: 3
# readinessProbe:
restartPolicy: "Always"
Here we see two probes commented out:
Liveness Probe: Is the application still alive?
Readiness Probe: Has the application booted successfully?
It’s useful to undertake both probes to make sure both that the application has been verified as booted successfully, and that it is still up and running.
Let’s look first at the commented out liveness probe:
# livenessProbe:
#   httpGet:
#     path: /healthz
#     port: "http"
#   initialDelaySeconds: 5
#   timeoutSeconds: 1
#   failureThreshold: 3
We can simply un-comment and fill it in. The sections are as follows:
httpGet: The aforementioned HTTP health check type
path: The path to query over HTTP. /healthz is actually correct here; we can see that in the kolide source.
port: The port at which to perform the health check. We can use the named reference defined earlier with the port definition.
initialDelaySeconds: How long to wait before we start checking whether the app is alive
timeoutSeconds: How long until we decide a connection has failed
failureThreshold: How many failures to tolerate before restarting the container
There are also two additional configuration options which I somehow missed, but added to the starter chart while writing these notes:
periodSeconds: How long between probes
scheme: Whether to connect via HTTP or HTTPS
The values defined are all the defaults, and are largely fine. This leaves us with:
livenessProbe:
  httpGet:
    path: /healthz
    port: "http"
    scheme: "HTTPS"
  periodSeconds: 10
  initialDelaySeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
The only thing we amended is scheme: HTTPS. Kolide/fleet is only available over TLS, and without that the health checks will fail. That’s it! That will ensure our application is up and running, or Kubernetes will restart it.
Given this configuration, the readiness probe is remarkably simple. Simply copy the liveness probe, and rename it readinessProbe. Perfect! This will ensure our application actually boots before it’s included in the available service pool:
readinessProbe:
  httpGet:
    path: /healthz
    port: "http"
    scheme: "HTTPS"
  periodSeconds: 10
  initialDelaySeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
In order to test whether our health checks are actually functioning as expected, let’s make the application unhealthy. Let’s hide MySQL:
$ kubectl scale deployment kolide-fleet-mysql --replicas=0
deployment.extensions "kolide-fleet-mysql" scaled
This scales the MySQL deployment down to zero replicas, removing the MySQL container without removing any of its associated resources. Pretty quickly we can see the fleet container has become “unhealthy”:
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-7hbl2   0/1     Running   1          3m
Additionally, we can see that it’s been taken out of our service:
$ kubectl describe service kolide-fleet-fleet
Name:                     kolide-fleet-fleet
Namespace:                __REDACTED__
Labels:                   app=kolide-fleet-fleet
                          chart=fleet-1.0.6-1
                          heritage=Tiller
                          release=kolide-fleet
Annotations:              prometheus.io/path=/metrics
                          prometheus.io/port=9102
                          prometheus.io/probe=
                          prometheus.io/scrape=false
                          service.beta.kubernetes.io/external-traffic=OnlyLocal
Selector:                 app=kolide-fleet-fleet,release=kolide-fleet
Type:                     LoadBalancer
IP:                       __REDACTED__
LoadBalancer Ingress:     __REDACTED__
Port:                     http  443/TCP
TargetPort:               http/TCP
NodePort:                 http  __REDACTED__/TCP
Endpoints:                # Here there should be some endpoints. But
                          # there are none -- kolide is unhealthy.
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
The pod will shortly enter a state of CrashLoopBackOff, where it will be restarted at exponentially longer intervals to prevent cascading failure:
$ kubectl get pods
NAME                                  READY   STATUS             RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-7hbl2   0/1     CrashLoopBackOff   5          5m
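That backoff can be sketched numerically. Assuming the kubelet’s documented defaults (a 10 second initial delay, doubled after each crash and capped at five minutes), the wait before each restart looks roughly like this. The helper function below is hypothetical, not a real kubectl command:

```shell
# Approximate delay before restart N under CrashLoopBackOff:
# min(10 * 2^(N-1), 300) seconds.
backoff_delay() {
  local n="$1" delay=10 i=1
  while [ "$i" -lt "$n" ]; do
    delay=$(( delay * 2 ))
    [ "$delay" -gt 300 ] && delay=300
    i=$(( i + 1 ))
  done
  echo "$delay"
}

backoff_delay 1  # 10
backoff_delay 4  # 80
backoff_delay 7  # 300 (capped at five minutes)
```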
Let’s see what happens if we create a new pod that is unprepared:
$ kubectl delete pod kolide-fleet-fleet-68c766dd57-7hbl2
pod "kolide-fleet-fleet-68c766dd57-7hbl2" deleted

$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-76tdw   0/1     Running   0          8s
Looks pretty bad. Let’s have a closer look:
Warning  Unhealthy  3s (x2 over 13s)  kubelet, gke-sitewards-internal-n2-stardard-2-4d8b8f10-tzdk  Liveness probe failed: Get https://10.0.9.34:8080/healthz: dial tcp 10.0.9.34:8080: getsockopt: connection refused
Warning  Unhealthy  1s (x3 over 21s)  kubelet, gke-sitewards-internal-n2-stardard-2-4d8b8f10-tzdk  Readiness probe failed: Get https://10.0.9.34:8080/healthz: dial tcp 10.0.9.34:8080: getsockopt: connection refused
Yep, it’s definitely unhealthy. Okay, let’s rescue it:
$ kubectl scale deployment kolide-fleet-mysql --replicas=1
deployment.extensions "kolide-fleet-mysql" scaled
With watch -n 1 kubectl get pods we can see it come alive again:
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-76tdw   1/1     Running   4          1m
It’s alive! 4 restarts later. Kubernets doesn’t know whether the application is dead, or something else — it will simply restart things until they work again.
Health checks allow us to build safeties into our application such that our users are not served from broken workloads. Further, they account for a wide range of error conditions in which the app can become arbitrarily “unhealthy”: resource exhaustion, transient network failures when connecting to upstream services, and so on. Our applications become resilient simply by relying on the Kubernetes infrastructure and giving it the means to accurately determine the health of our application.
As usual, you can see the changes here:
Check out the next in the series at: