Deploying on Kubernetes #10: Health Checking
This is the tenth in a series of blog posts that hope to detail the journey of deploying a service on Kubernetes. Its purpose is not to serve as a tutorial (there are many out there already), but rather to discuss some of the approaches we take.
Assumptions
To read this it’s expected that you’re familiar with Docker, and have perhaps played with building Docker containers. Additionally, some experience with docker-compose is perhaps useful, though it is not directly related to this post.
Necessary Background
So far we’ve been able to:
Application Health
Application health checking is commonly (though not always) employed in a traditional load balancing architecture. Consider the diagram below:
                  +-------+
                  |       |
            +-----+❣◕ ‿ ◕❣|
            |     |       |
   +------+ |     +-------+
   |      | |
-->|  LB  +-+
   |      | |
   +------+ |     +-------+
            |     |       |
            +-----+❣◕ ‿ ◕❣|
                  |       |
                  +-------+
In this diagram, the ingress (called “LB” or “Load Balancer”) must determine whether the application should receive traffic; that is, whether it is “healthy”. To do this, it usually makes a fairly cheap request designed to identify the health of the application, often relying on simple HTTP status codes. So long as the application stays healthy, it will continue to receive traffic.
However, there are two conditions in which applications either “start” or “become” unhealthy:
Long bootstrap times (for example, Java applications)
App overload or other upstream connection problems.
When they are no longer healthy, the load balancer should not bother sending traffic to that node, since it knows the requests are not going to succeed:
                  +-------+
                  |       |
            +-----+❣◕ ‿ ◕❣|
            |     |       |
   +------+ |     +-------+
   |      | |
-->|  LB  +-+
   |      |
   +------+       +-------+
                  |       |
                  |(✖╭╮✖) |
                  |       |
                  +-------+
In this way, we can spare our users from suffering our unhealthy workloads.
Kubernetes Services
In a similar way, a Kubernetes service generally will not add a pod to the set of pods routable via that service until the pod enters the “Ready” status, as determined by Kubernetes. Kubernetes calculates readiness based on:
All containers in the pod running
Any health or readiness checks configured in the deployment
Having a process running seems a fairly logical measure. However, we can go further than this and determine whether our application (kolide/fleet) is actually healthy.
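To see this in practice, the endpoints object behind a service lists only the pods that are currently ready. The following is a rough sketch using the service name from later in this post; the output values are illustrative rather than copied from a real cluster:
# Illustrative output; a pod’s address only appears here once it is Ready
$ kubectl get endpoints kolide-fleet-fleet
NAME                 ENDPOINTS        AGE
kolide-fleet-fleet   10.0.9.34:8080   1d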
Kolide’s Health
By spying on the router file for kolide, we can see that it helpfully provides an endpoint at /healthz specifically for this purpose. This handler runs additional checks to ensure that kolide can connect to both its data store and its query result store. Practically, that means that so long as kolide can connect to both MySQL and Redis, it considers itself healthy.
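Should we want to verify this by hand, the endpoint can be queried directly. A minimal sketch, assuming a hypothetical hostname (fleet.example.com); --insecure is only required if the TLS certificate is not trusted, and a healthy instance should answer with a 200:
# Hypothetical hostname; prints only the HTTP status code of the health check
$ curl --insecure --silent --output /dev/null --write-out '%{http_code}\n' https://fleet.example.com/healthz
200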
The health check
There are three types of health checks that are available for pods:
ExecAction: Run a command inside the container, and consider it healthy if that command exits with status 0. Useful for batch jobs and controllers.
TCPSocketAction: Attempt to connect via TCP to the container’s IP address on a specified port. If the connection succeeds, the container is healthy. Useful for non-HTTP services such as MySQL.
HTTPGetAction: Make an HTTP GET request against the container on a specified port and path. Any status code from 200 up to (but not including) 400 is considered OK. Useful for all HTTP-powered applications. (A sketch of all three follows this list.)
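As a rough sketch of how each of these looks in a pod specification (the container names, images and ports below are illustrative, not taken from the kolide chart):
# Illustrative only: one container per probe handler type
apiVersion: v1
kind: Pod
metadata:
  name: probe-examples
spec:
  containers:
    - name: batch-worker
      image: example/batch-worker:latest     # hypothetical image
      livenessProbe:
        exec:                                # ExecAction: healthy if the command exits 0
          command: ["cat", "/tmp/healthy"]
    - name: mysql
      image: mysql:5.7
      livenessProbe:
        tcpSocket:                           # TCPSocketAction: healthy if the TCP connection succeeds
          port: 3306
    - name: web
      image: example/web:latest              # hypothetical image
      livenessProbe:
        httpGet:                             # HTTPGetAction: healthy on a 2xx or 3xx response
          path: /healthz
          port: 8080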
Of these, I recommend HTTPGetAction as much as possible; it allows the application to be explicit about its health, and it generates logs that make debugging issues easier when they occur.
The starter chart documents these extremely well. Accordingly, I will not be reviewing them all, but will simply be using the HTTP variant. Though this will be opaque, the command I used to transform templates/deployment.yaml into what I will use is:
# See https://stackoverflow.com/questions/2112469/delete-specific-line-numbers-from-a-text-file-using-sed
$ sed --in-place '145,156d;158,204d;206,213d;215,218d;220,223d;224,227d;229,231d;233,236d;238,240d;242,249d' templates/deployment.yaml
This epic sed command removes a bunch of commented-out documentation. If you’re curious about exactly what, check the commit at the end of this post. This leaves us with:
# templates/deployment.yaml:138-145
    volumeMounts:
      - name: "fleet-configuration"
        readOnly: true
        mountPath: "/etc/fleet"
      - name: "fleet-tls"
        readOnly: true
        mountPath: "/etc/pki/fleet"
    # livenessProbe:
    #   httpGet:
    #     path: /healthz
    #     port: "http"
    #   initialDelaySeconds: 5
    #   timeoutSeconds: 1
    #   failureThreshold: 3
    # readinessProbe:
  restartPolicy: "Always"
Here we see two probes commented out:
Liveness Probe: Is the application still alive?
Readiness Probe: Has the application booted successfully?
It’s useful to employ both probes, to verify both that the application has booted successfully and that it is still up and running.
Let’s look first at the commented out liveness probe:
# templates/deployment.yaml:145-151
# livenessProbe:
#   httpGet:
#     path: /healthz
#     port: "http"
#   initialDelaySeconds: 5
#   timeoutSeconds: 1
#   failureThreshold: 3
We can simply uncomment it and fill it in. The sections are as follows:
httpGet: The aforementioned HTTP health check type
path: The path to query over HTTP. /healthz is actually correct here; we can see that in the kolide source.
port: The port on which to perform the health check. We can use the named reference defined earlier in the port definition.
initialDelaySeconds: How long to wait before we start checking whether the app is alive
timeoutSeconds: How long until we decide a connection has failed
failureThreshold: How many failures to tolerate before restarting the container
There are also two additional configuration options which I somehow missed, but added to the starter chart while writing these notes:
periodSeconds: How long to wait between probes
scheme: Whether to connect via HTTP or HTTPS
The values defined are all the defaults, and are largely fine. This leaves us with:
# templates/deployment.yaml:145-154
livenessProbe:
  httpGet:
    path: /healthz
    port: "http"
    scheme: "HTTPS"
  periodSeconds: 10
  initialDelaySeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
The only thing we amended is the scheme, from HTTP to HTTPS. Kolide/fleet is only available over TLS, and without that change the health checks will fail. That’s it! This will ensure our application is up and running, or Kubernetes will restart it.
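As an aside, port: "http" works because the container declares a named port elsewhere in the template. The snippet below is a sketch of what such a definition looks like rather than a copy from the chart; the container port number is an assumption based on the probe logs shown later in this post:
ports:
  - name: "http"          # the name the probes refer to
    containerPort: 8080   # assumed; matches the port seen in the probe failures below
    protocol: TCP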
Given this configuration, the readiness probe is remarkably simple. Simply copy the liveness probe, and rename livenessProbe to readinessProbe. Perfect! This will ensure our application has actually booted before it’s included in the available service pool:
# templates/deployment.yaml:154-162
readinessProbe:
  httpGet:
    path: /healthz
    port: "http"
    scheme: "HTTPS"
  periodSeconds: 10
  initialDelaySeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3
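With both probes in place, the release needs to be upgraded for the change to take effect. The exact invocation depends on where the chart lives; assuming the release name kolide-fleet used throughout this series and a local chart directory, it looks roughly like:
# Assumption: release name and local chart path; adjust to your own setup
$ helm upgrade kolide-fleet ./fleet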
Becoming Unhealthy
In order to test whether our health checks are actually functioning as expected, let’s make the application unhealthy. Let’s hide MySQL:
$ kubectl scale deployment kolide-fleet-mysql --replicas=0
deployment.extensions "kolide-fleet-mysql" scaled
This removes the MySQL container without actually removing the deployment or anything else associated with it. Pretty quickly we can see the container has become “unhealthy”:
$ kubectl get pods
NAME                                  READY     STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-7hbl2   0/1       Running   1          3m
Additionally, we can see that it’s been taken out of our service:
$ kubectl describe service kolide-fleet-fleet
Name:                     kolide-fleet-fleet
Namespace:                __REDACTED__
Labels:                   app=kolide-fleet-fleet
                          chart=fleet-1.0.6-1
                          heritage=Tiller
                          release=kolide-fleet
Annotations:              prometheus.io/path=/metrics
                          prometheus.io/port=9102
                          prometheus.io/probe=
                          prometheus.io/scrape=false
                          service.beta.kubernetes.io/external-traffic=OnlyLocal
Selector:                 app=kolide-fleet-fleet,release=kolide-fleet
Type:                     LoadBalancer
IP:                       __REDACTED__
LoadBalancer Ingress:     __REDACTED__
Port:                     http  443/TCP
TargetPort:               http/TCP
NodePort:                 http  __REDACTED__/TCP
Endpoints:                # Here there should be some endpoints, but
                          # there are none -- kolide is unhealthy.
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
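We can confirm the same thing from the endpoints object itself; with no ready pods behind the service, there is nothing to route to. The output below is illustrative:
$ kubectl get endpoints kolide-fleet-fleet
NAME                 ENDPOINTS   AGE
kolide-fleet-fleet   <none>      1d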
The pod will shortly enter a state of CrashLoopBackOff, in which it restarts at exponentially increasing intervals to prevent cascading failure:
$ kubectl get pods
NAME                                  READY     STATUS             RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-7hbl2   0/1       CrashLoopBackOff   5          5m
Let’s see what happens if we create a new pod that is unprepared:
$ kubectl delete pod kolide-fleet-fleet-68c766dd57-7hbl2
pod "kolide-fleet-fleet-68c766dd57-7hbl2" deleted
$ kubectl get pods
NAME                                  READY     STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-76tdw   0/1       Running   0          8s
Looks pretty bad. Let’s have a closer look:
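The events below presumably come from describing the new pod; the original post shows only the events themselves, but a command along these lines will surface them:
# The pod name is taken from the output above
$ kubectl describe pod kolide-fleet-fleet-68c766dd57-76tdw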
Warning Unhealthy 3s (x2 over 13s) kubelet, gke-sitewards-internal-n2-stardard-2-4d8b8f10-tzdk Liveness probe failed: Get https://10.0.9.34:8080/healthz: dial tcp 10.0.9.34:8080: getsockopt: connection refused
Warning Unhealthy 1s (x3 over 21s) kubelet, gke-sitewards-internal-n2-stardard-2-4d8b8f10-tzdk Readiness probe failed: Get https://10.0.9.34:8080/healthz: dial tcp 10.0.9.34:8080: getsockopt: connection refused
Yep, it’s definitely unhealthy. Okay, let’s rescue it:
$ kubectl scale deployment kolide-fleet-mysql --replicas=1
deployment.extensions "kolide-fleet-mysql" scaled
Then, using watch -n 1 kubectl get pods, we can see it come alive again:
$ kubectl get pods
NAME                                  READY     STATUS    RESTARTS   AGE
kolide-fleet-fleet-68c766dd57-76tdw   1/1       Running   4          1m
It’s alive, four restarts later! Kubernetes doesn’t know whether the application is dead or suffering from something else entirely; it will simply restart things until they work again.
In Summary
Health checks allow us to build safeties into our application such that our users do not get served incorrect data. Further, they account for a wide range of error conditions in which the app can become arbitrarily “unhealthy”: resource exhaustion, for example, or a failure to connect to an upstream service due to transient network conditions. Our applications become resilient almost for free, simply by relying on the Kubernetes infrastructure and giving it the means to accurately determine the health of our application.
As usual, you can see the changes here:
AD-HOC feat (kolide-fleet): Add liveness checking to the deployment ·…
In previous work a deployment was crafted which deploys the kolide application. However, there was no health checking…github.com
Check out the next in the series at:
https://medium.com/@andrewhowdencom/deploying-on-kubernetes-11-annotations-ec78f8b285b1