Kubernetes fail compilation: but they keep getting worse

I’ve never been a big gambler. I might have placed a small bet in the past to spice up a Super Bowl I wasn’t that invested in, but nothing crazy. There is a level of certainty required to actually put your money on the line that I rarely have for any sporting event, electoral outcome, or future prediction. There are very few certainties in tech. Job security isn’t a given, industry trends ebb and flow, and the tools and tech stack you work on every day will more than likely evolve over time. Despite this sea of uncertainty, there is one thing you can safely bet your money on: at some point, you will suffer an outage.

Kubernetes engineers who have been around for any amount of time can attest to this reality. That being the case, it makes little sense to fear failure or perform herculean feats to ensure 100% availability. If anything, mistakes and outages should be welcomed as learning opportunities: a necessary evil in any environment that aspires to mature and deliver a high-quality service reliably.

The best way to effectively process and digest an outage is to systematically perform a post-mortem. Post-mortems are the sharpest tools we have for finding patterns and synthesizing the lessons an outage has to offer. And speaking of patterns, a few common ones emerge across Kubernetes cluster failures.

DNS, networking, and default resource allocation are some of the key culprits.

In this article, we will analyze some of these post-mortems and try our best to absorb what others have had to learn the hard way.

Failure severity scale

Not every outage has the same impact, so I’ve created a very scientific categorization system to understand the impact of each Kubernetes outage:

🤷 – Oopsie Daisy: Trivial Kubernetes failures
😅 – Non-prod, but still annoying: no customers were affected, but learnings were made
🤬 – Houston, we have a problem in production: where customers were impacted and career choices were questioned.

Before I forget, let me thank Glasskube for allowing me to take the time to create content just like this. If this is the first time you’ve heard of us, we are working to build the next generation Package Manager for Kubernetes.

If you’d like to support us on this mission, we would appreciate it if you could
⭐️ Star Glasskube on GitHub 🙏


Oopsie Daisy 🤷

If you can’t laugh at yourself, who can you laugh at?

Clusters and node groups

The first story comes from your humble correspondent, who recently spun up a test Kubernetes cluster using the AWS console for a quick proof of concept. It had been a while since I’d created a cluster without using eksctl or some form of Infrastructure as Code definition file.

So, I logged in, accessed the EKS console, named my Kubernetes cluster, and hit “create.” I then followed the CLI instructions to configure the kubeconfig file and connect to my newly created cluster via the terminal.

Eager to test the newest version of Glasskube, I installed it in the cluster. However, I was surprised by how long the pods were taking to schedule. Reflecting on it now, I’m embarrassed to admit how long it took me to realize that I hadn’t provisioned a node group. No wonder the pods weren’t being scheduled.
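
To avoid this trap in the future, the whole thing can be declared up front. Here’s a minimal eksctl ClusterConfig sketch that includes a managed node group; the cluster name, region, and sizes are illustrative placeholders, not what I actually used:

```yaml
# cluster.yaml - a minimal sketch; name, region, and sizes are placeholders
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: glasskube-test      # hypothetical cluster name
  region: eu-central-1

# Without a node group (or Fargate profile), pods have nowhere to run
managedNodeGroups:
  - name: default-workers
    instanceType: t3.medium
    desiredCapacity: 2
    minSize: 1
    maxSize: 3
```

Running `eksctl create cluster -f cluster.yaml` then provisions the control plane and the worker nodes in one go.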

Call the fire brigade, I forgot to add resource limits.

Another true story comes from a fellow Glasskube team member who overloaded his laptop by installing too many components (GitLab) in his local Minikube cluster. The laptop nearly burned a hole through his desk: a good reminder to set resource requests and limits.
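
For anyone who wants to avoid the same fate, here is a minimal sketch of what requests and limits look like on a pod; the names, image, and values are arbitrary examples, not the actual GitLab manifests:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gitlab-demo               # hypothetical pod name
spec:
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      resources:
        requests:                 # what the scheduler reserves for the container
          cpu: "250m"
          memory: "256Mi"
        limits:                   # hard ceiling enforced at runtime
          cpu: "500m"
          memory: "512Mi"
```

With limits in place, a misbehaving container gets throttled or OOM-killed instead of taking the whole laptop down with it.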

Non-prod, but still annoying 😅

Moving on to some real incidents. Luckily, these were localized to clusters that didn’t impact paying customers.

Incident #1: Venafi’s unresponsive Webhooks

Venafi is a control plane for machine identities (recently acquired by CyberArk) that ran into some issues with OPA. Full post-mortem here.

Impact: Intermittent API server timeouts leading to unhealthy nodes.
Involved: Open Policy Agent, Node readiness

Play-by-play
During a scheduled cluster upgrade and despite warnings and successful prior upgrades, the master upgrade failed, leading to API server timeouts and node instability. The root cause was a timeout during a ConfigMap update, triggered by an unresponsive OPA webhook. Deleting the webhook restored service, and they’ve since restricted it to specific namespaces, added a liveness probe for OPA, and updated documentation.
They emphasized the need for API response time alerts, workload probes, and possibly using a Helm chart for deployment to avoid similar issues in the future. They continue to monitor improvements in functionality and offer insights through their Flightdeck service.
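
To make the mitigations more concrete, here is a hedged sketch (not Venafi’s actual manifest) of how an admission webhook can be scoped to specific namespaces, given a short timeout, and configured to fail open so a hung webhook can’t block cluster-wide writes:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: opa-validating-webhook           # illustrative name
webhooks:
  - name: validating-webhook.openpolicyagent.org
    failurePolicy: Ignore                # fail open instead of blocking API requests
    timeoutSeconds: 5                    # don't let a slow webhook stall ConfigMap updates
    namespaceSelector:                   # only intercept namespaces that opt in
      matchLabels:
        opa-webhook: enabled
    clientConfig:
      service:
        name: opa
        namespace: opa
        path: /v1/admit                  # hypothetical admission path
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
```

Whether failing open is acceptable depends on how critical the policy checks are, but it is the difference between a degraded policy engine and an unreachable API server.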

Learnings:

The need for alerting on API server response times.
Liveness probes are needed for all workloads.
Using package management for more granular configuration.

💡 This incident highlights one of the use cases Glasskube aims to address. While Glasskube doesn’t yet support the OPA operator, we believe this issue could have been avoided with a robust Kubernetes package manager. Glasskube allows for easy configuration of key features, assists in upgrades, and applies a GitOps approach to package operator management, including rollbacks and specific namespace allocation. Try it here.

Incident #2: When crypto miners sneak in

JW Player was targeted by bitcoin mining malware. Check out the full post-mortem here.

Impact: A non-prod cluster was infiltrated by bitcoin miners
Involved: Root access exploitation

Play-by-play
The DevOps team at JW Player discovered a cryptocurrency miner on their Kubernetes clusters after Datadog alerted them to high load averages in their staging and development environments. Initial investigation pointed to a gcc process consuming 100% CPU, which was found to be a miner launched by Weave Scope, a monitoring tool. The miner exploited a public-facing Weave Scope load balancer that allowed command execution in containers.
Immediate actions included stopping Weave Scope, isolating affected nodes, and rotating them out. The incident caused high CPU usage but no service disruption or data compromise. The team identified manual security group edits overridden by Kubernetes as a key issue and emphasized the need for proper configuration practices to prevent such vulnerabilities.
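
As a hedged sketch of the kind of configuration fix they point to (not JW Player’s actual remediation), the restriction can be declared on the Kubernetes Service itself, so the cloud controller enforces it instead of silently overwriting manual security group edits:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: weave-scope-app        # illustrative name
  namespace: monitoring
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:    # declared in Kubernetes, so it survives reconciliation
    - 10.0.0.0/8               # example internal CIDR; never 0.0.0.0/0 for admin UIs
  selector:
    app: weave-scope
  ports:
    - port: 80
      targetPort: 4040         # Weave Scope's default UI port
```

Better still, a tool that allows command execution in containers arguably shouldn’t sit behind a public LoadBalancer at all; a ClusterIP Service reached through a VPN or `kubectl port-forward` keeps it off the internet entirely.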

Learnings:

Monitoring load averages is not the best way to detect cluster issues.
Runtime security tools like Falco or Sysdig might be needed.
More robust Docker image and container scanning needed.
Some areas of the architecture need revisiting.
More cross-team data sharing and communication is needed.

Incident #3: GKE ran out of IP addresses

Impact: A high node-count cluster ran out of IP addresses and couldn’t schedule new pods.
Involved: Subnets, Default IP allocations per node

Play-by-play
An incident arose when a team member reported unusually long deployment times for their application. They discovered quickly that while some newly deployed pods were serving traffic, the rest remained in a pending state. Analysis revealed a FailedScheduling warning indicating insufficient resources. Despite having a cluster autoscaler in place, the issue persisted, as they saw an alarming “0/256 nodes available” message. Further examination uncovered that GKE pre-allocates 110 IPs per node, resulting in unexpected high IP consumption. Once this was known, they adjusted the pod allocation per node, reducing overall IP usage by 30%. Additionally, they explored options like subnet expansion and increasing node sizes to mitigate IP exhaustion, eventually optimizing node pool instance sizes to better utilize resources.
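
The back-of-the-envelope math shows why this bites so quickly: with the default maximum of 110 pods per node, GKE reserves a /24 (256 pod IPs) for every node, and lowering the max pods per node shrinks that reservation. Assuming, purely for illustration, a /16 pod secondary range:

```
/16 pod range            = 65,536 addresses
110 pods/node (default)  -> /24 per node = 256 addresses reserved
maximum nodes            = 65,536 / 256  = 256 nodes

64 pods/node             -> /25 per node = 128 addresses reserved
maximum nodes            = 65,536 / 128  = 512 nodes
```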

Learnings:

The importance of knowing the default values set by GKE.
Subnet expansion is a nifty tool to have at your disposal (not much documentation on secondary ranges, though).
Increasing the node pool instance size can do the job too (running more pods per node and therefore needing fewer nodes).

Houston, we have a problem in production 🤬

These are the types of outages that keep SREs up at night. When customers are impacted and business value is on the line, the most important lessons emerge and heroes are made.

Incident #1: Skyscanner only needed a couple of characters to bring their site down

Here we see that an architecture optimized for resiliency was still susceptible to failure due to just one line of code. Full post-mortem here.

Impact: Global Skyscanner website and mobile apps were inaccessible
Involved: IaC manifest

Play-by-play
In August 2021, Skyscanner faced a global outage lasting over four hours due to an inadvertent change to a root file in its infrastructure provisioning system. The change, which lacked the expected {{ }} templating, unexpectedly triggered the deletion of critical microservices across the globe, rendering the website and mobile apps inaccessible.

They swiftly addressed the issue, leveraging GitOps to restore configurations and prioritize critical services.
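
To make the failure mode concrete, here is a purely hypothetical sketch (not Skyscanner’s actual file) of how dropping a {{ }} templating expression turns “render one value per region” into “this value is empty everywhere”, which a declarative provisioning system will then happily reconcile by deleting whatever no longer appears in the desired state:

```yaml
# Intended: the value is templated and rendered per region/cell
cells: {{ cell_name }}

# Accidentally committed without the braces: evaluates as empty,
# i.e. "no cells should exist", in every region at once
cells:
```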

Learnings:

Don’t do global config deploys.
More drastic “worst case scenario“ planning is needed.
Verify the backup/restore process.
Keep runbooks up-to-date.
Potential over-automation.

Incident #2: Monzo Bank’s linkerd fiasco

The British digital bank found a critical Kubernetes bug the hard way. Full post-mortem here.

Impact: Prepaid cards and new current accounts were down for about 1.5 hours
Involved: Linkerd, kube-apiserver, etcd

Play-by-play:
The incident began when a routine deployment caused payment processing failures. Attempts to roll back the change were unsuccessful, leading to an internal outage declaration. Engineers identified and restarted unhealthy linkerd instances, but a configuration issue with kube-apiserver prevented new linkerd instances from starting, escalating the outage to a full platform failure. The root cause was traced to a bug in Kubernetes and etcd, triggered by a recent cluster reconfiguration. This caused linkerd to fail to receive network updates, compounded by a compatibility issue between Kubernetes and linkerd. The incident was resolved by updating linkerd and removing empty Kubernetes services.

Learnings:

A new version of Linkerd was needed.
The k8s bug needed to be fixed (now fixed).
Improve health checks, dashboard, and alerting.
Procedural improvements to improve internal communication during outages.

Incident #3: The Redis operator threw a curveball

Palark is a DevOps service provider that tried to protect its Redis cluster and ended up rueing the day. Here is the full post-mortem.

Impact: Production Redis data was lost after adding replicas
Involved: Redis operator

Play-by-play
They encountered an incident involving the well-known in-memory key-value store, Redis, which they installed via the Redis Operator for running Redis failover. Initially deployed with one Redis replica, they expanded to two replicas to enhance database reliability. However, this seemingly minor change proved catastrophic during a rollout, leading to data loss. The incident exposed flaws in the Redis Operator, primarily its readiness probe, triggering unintended master promotion and subsequent data destruction. Further analysis using tools like Redis-memory-analyzer revealed insights into database size and contents, which then helped developers optimize the database and application code to prevent future incidents.

Learnings:

To be very careful when using Kubernetes operators (make sure they are mature and well-tested).
They found a crucial bug associated with the Redis Operator’s readiness probe that made replica scale-out prone to data loss (since fixed).
Redis-memory-analyzer is the best tool for troubleshooting Redis databases.

Incident #4: Datadog’s multi-region nightmare

Multiple Datadog regions went down after systemd-networkd forcibly deleted the routes managed by the Container Network Interface (CNI) plugin. Full post-mortem here.

Impact: Users in multiple regions were left without API and platform access.
Involved: systemd update, Cilium

Play-by-play
Starting on March 8, 2023, Datadog experienced a major outage affecting multiple regions, preventing users from accessing the platform, APIs, and monitors, and impacting data ingestion. The issue, triggered by an automatic security update to systemd on numerous VMs, caused network disruptions that took tens of thousands of nodes offline. Recovery involved restoring compute capacity, addressing service-specific issues, and providing continuous updates to customers. The root cause was identified as a misconfiguration that allowed the automatic update, which has since been disabled.

Learnings:

More robust chaos testing.
Improved communication with customers during outages is needed.
The status page was inadequate during the outage.
Automatic updates are inherently risky and should be employed with care.

Incident #5: Reddit’s Pi-Day Outage

Reddit suffered the consequences of rapid organic growth when they faced the crushing reality that many of their critical Kubernetes clusters were unstandardized and susceptible to outages. Full Pi-Day outage post-mortem here.

Impact: Significant cross-platform outage lasting 314 minutes
Involved: Calico, Kubernetes version update

Play-by-play
In March 2023, Reddit experienced a significant outage lasting 314 minutes, coincidentally occurring on Pi Day. Users trying to access the site encountered either an overwhelmed Snoo mascot, error messages, or an empty homepage. This outage was triggered by an upgrade from Kubernetes 1.23 to 1.24, which introduced a subtle, previously unseen issue. The engineering team, having emphasized improvements in availability over recent years, found themselves in a challenging situation where a rollback, though risky, became the best option.

During the restore, complications arose from mismatches in TLS certificates and AWS capacity limits, but the team managed to navigate these challenges and reestablish a high-availability control plane.

Further investigation revealed that the root cause was related to an outdated route reflector configuration for Calico, which became incompatible with Kubernetes 1.24 due to the removal of the “master” node label.
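
As a rough illustration of how such a selector goes stale (an assumed example, not Reddit’s actual config), a Calico BGPPeer that picks its route reflectors via the old label simply stops matching any node once the “master” label disappears in Kubernetes 1.24:

```yaml
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-route-reflectors        # illustrative name
spec:
  nodeSelector: all()
  peerSelector: has(node-role.kubernetes.io/master)
  # ^ matches nothing on 1.24+; the surviving label is
  #   node-role.kubernetes.io/control-plane
```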

Learnings:

The importance of improving the pre-prod cluster for testing purposes.
The need for improved Kubernetes component lifecycle management tooling.
Need for more homogeneous environments.
Also, the need to expand their IaC coverage and internal technical documentation.

Conclusion

As you can see, the law of entropy easily applies to Kubernetes clusters: it’s much easier to break them than to keep them happy. Outages are usually triggered by changes such as upgrades, rollouts, scale-outs, and deployments, so you might feel inclined to minimize them. But that isn’t an option for organizations fighting to lead their market segments and meet changing customer needs. The best we can hope for is to learn by doing and learn by failing. On the upside, the tech industry is generally open to learning and upfront about failures (for the most part). The fact that many large enterprises publicly share post-mortem summaries for the greater community to learn from is a best practice grounded in the assumption that failures and outages are a matter of “when” and not “if.” The best way to protect ourselves is to learn from them once they have passed.

🫵 And what about you? Have you weathered any particularly difficult outages and come out the other side to tell the tale? If so, please share your experience in the comments below. I’m sure many of us would love to hear about it.

Help us make more content like this!

At Glasskube we’re putting a lot of effort into content just like this, as well as building the next generation package manager for Kubernetes.

If you get value from the work we do, we’d appreciate it if you could
⭐️ Star Glasskube on GitHub 🙏