Kubernetes, etcd and disk throughput
TL;DR: ensure sufficient disk read and write speeds for your etcd cluster nodes. Any minimally decent hard drive will do, but don’t put /var/lib/etcd on a thumb drive.
Moving to Kubernetes
I don’t really need Kubernetes, but as I grow up I’m finding out that practice is the only true path to mastery.
Learning Kubernetes is something I’ve put off for too long already, so I planned to migrate all my services from Docker Swarm to Kubernetes.
That was a week ago[1].
I wrote Ansible playbooks to configure the single node that would comprise my Kubernetes “cluster” (again, I don’t need k8s) and refactored my Compose files into Kubernetes objects with the help of the priceless Kubernetes API Reference Docs.
Tested deployments on a DigitalOcean Kubernetes cluster, everything worked. Tested playbooks and deployments on a similarly-specced droplet, everything checked out.
I repeated those steps more times than I remember, as I iterated over little changes to the configuration. Once I settled, I notified my handful of users about “planned downtime for maintenance”, and I started backing up and taking everything offline.
Installed the OS, prepared for Ansible, ran the playbooks, kubeadm init, untainted the master nodes (or rather, node), installed Calico and created all my objects.
I had done it a thousand times already.
The scheduler and the controller-manager went into alternating CrashLoopBackOff states. 😨🔥
Why? I had carefully repeated the very same steps I had followed many times on a VPS, so why was it failing now?
And so the journey begins.
Root cause analysis
Root cause A: the NetworkPolicies
Googling the following messages in the
…led me to this comment on a GitHub issue which instantly made me believe the culprits of this chaos were my NetworkPolicies.
It made sense: I had one global NetworkPolicy that prohibited traffic between pods, and I whitelisted certain paths by writing per-deployment NetworkPolicies, except those between kube-system pods, so I instantly convinced myself that I had denied traffic between them and
I temporarily deleted all NetworkPolicies, and was fooled into thinking that everything was 👌 because the scheduler and the controller-manager were back online.
After 10 minutes of k get pods[2], everything broke loose again.
Reason: a NetworkPolicy object only affects pods in the namespace where it exists. The kube-system namespace was not affected by my NetworkPolicy objects because they were applied in a different namespace.
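To make the namespace scoping concrete, here is a minimal sketch that builds a default-deny ingress NetworkPolicy as a plain manifest. The helper names (default_deny_ingress, applies_to) and the "default" namespace are assumptions for illustration, not my actual objects. The point is only that the policy carries a metadata.namespace, and an empty podSelector means "every pod in that namespace", not "every pod in the cluster".

```python
import json


def default_deny_ingress(namespace):
    """Build a NetworkPolicy manifest that denies all ingress to pods in one namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # Empty selector: matches every pod, but only inside metadata.namespace.
            "podSelector": {},
            # Ingress is listed with no allow rules, so all ingress is denied.
            "policyTypes": ["Ingress"],
        },
    }


def applies_to(policy, pod_namespace):
    """A policy can only ever select pods that live in its own namespace."""
    return policy["metadata"]["namespace"] == pod_namespace


# "default" is an assumed namespace for this example.
policy = default_deny_ingress("default")
print(json.dumps(policy, indent=2))
print("affects default:", applies_to(policy, "default"))
print("affects kube-system:", applies_to(policy, "kube-system"))
```

kubectl accepts JSON manifests as well as YAML, so the printed policy could be piped into kubectl apply -f - on a test cluster. Listing policies across namespaces with kubectl get networkpolicy --all-namespaces is a quick way to see which namespaces are actually covered.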
Root cause B: the pod network
After some more googling, I found some random issue on GitHub that suggested similar connectivity issues when using Calico.
I didn’t have the time for a thorough comparison between CNI plugins, so I chose Weave[3] because why not? No red flags, and it comes with traffic encryption. I’ll take it, thanks!
Removed Calico, installed Weave, everything okay for a couple minutes. Left it some more time, and everything broke down on me again.
Reason: why in the world would a field-tested pod network be guilty of denying traffic between pods? It made no sense. At times I make these silly attempts to fix things in ways that don’t even make theoretical sense, out of desperation.
Root cause C: disk I/O speeds
After that, I looked at the kube-apiserver logs out of desperation, only to find messages like:
I don’t know what Notify is, but that’s etcd’s port!
Intuition told me that Notify is some kind of agent that notifies kube-apiserver of available etcd nodes. If I was right, that meant that every once in a while (and randomly, which is what made this issue so hard to track), there were no available etcd nodes!
Took a glance at etcd’s logs, which contained tons of lines like:
Google returned these two issues (1, 2) suggesting low disk read speed as the cause of those messages in the logs.
I discarded that as the spinner in the server is relatively fast, and it was 6 in the morning (too late, or too early, you choose), so I went to sleep with no clear answer.
Woke up, gave it a thought, and… I was wrong!
The data for the Kubernetes pods is stored on the spinner, but the OS files are on a thumb drive, including /var/lib/etcd, etcd’s data directory!
Quick coffee, took everything offline, reinstalled the OS with /var/lib on a partition of the spinning disk, re-ran the playbooks, redeployed Kubernetes and voilà!
Twenty minutes in and everything is healthy!
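In hindsight, a quick probe of the disk would have saved me a night. etcd syncs its write-ahead log to disk on every commit, so the number that matters is how long a small write plus a sync takes, not raw sequential throughput. Below is a rough sketch of such a probe. The function names, the 2300-byte block size and the number of writes are my own arbitrary choices, and it uses os.fsync on a scratch file, which only approximates what etcd does. If I recall etcd’s documentation correctly, it suggests keeping the 99th percentile of WAL sync latency under roughly 10 ms, so treat that as a guideline to verify rather than a hard rule.

```python
import math
import os
import statistics
import tempfile
import time


def fsync_latencies(directory, writes=100, block_size=2300):
    """Append small blocks to a scratch file in `directory`, fsync after each
    write, and return the per-sync latencies in milliseconds."""
    payload = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to the device, like a WAL commit would
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)  # leave no scratch file behind
    return latencies


def summarize(latencies):
    """Return median, approximate 99th percentile and max latency in ms."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, math.ceil(0.99 * len(ordered)) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


# Probe the system temp directory here; point it at a directory on the
# disk under test instead (the filesystem that will hold /var/lib/etcd).
print(summarize(fsync_latencies(tempfile.gettempdir())))
```

Running it once against a directory on the thumb drive and once against one on the spinner should make the difference obvious. The script needs write permission in the target directory and cleans up its scratch file when it finishes.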
Don’t put /var/lib on a USB 2.0 thumb drive
Give late-night decisions a second thought, at least
Keep breaking stuff because troubleshooting mini-“production” outages is how one learns best
A note about resiliency
As I troubleshot this issue, one thing that surprised me was how, even during times when two of the core components of Kubernetes (scheduler and controller-manager) were down, and the single most core one (apiserver) couldn’t talk to its database, the pods in my deployments (and the services on those pods) kept running.
It threw me off a bunch of times because it made me think everything was back online, but it says a lot about Kubernetes’ design to be resilient. Even when the control plane itself is unhealthy, it runs your services!
Note: external access to my pods had serious latency issues, because the Service objects have to look up in etcd which pods are serving that traffic, so when etcd was unreachable, latency spiked to many seconds.
[1] I didn’t learn Kubernetes in a week. I’ve been testing it locally for some time, I’ve watched more Kelsey Hightower keynotes than I can remember, and I skimmed through Kubernetes Up & Running, which, to be respectfully frank, is not worth the time.
[2] Do yourself a favor and alias k=kubectl, thank me later.
[3] I ended up staying with Weave even though I had planned to use Calico, for two reasons: I forgot to undo the change 😅, and I liked how you can modify the configuration file by adding query string parameters to the YAML URL. If the developers pay that much attention to detail, I might as well stay with Weave.