Bridging the Gap: Uniting Chaos Engineering and Platform Engineering

In a world where automation has become the norm, practices like chaos engineering, with their inherently uncertain outcomes, can be challenging for developers to embrace.

The solution for encouraging developers to adopt best practices when deploying to the cloud lies in platform engineering.

To help you catch up, in case you missed the train, here’s a concise definition:

Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. — Luca Galante

How can we seamlessly integrate chaos engineering practices into the platform?

Provide a workflow dedicated to chaos engineering

When an application is created using the platform engineering tool, it requires a series of experiments to evaluate its behaviour under turbulent conditions. To analyse the outcomes of those experiments, we must also have proper observability around the chaotic events. And finally, to gauge the progress made in terms of reliability, we need a comprehensive means of assessment.

In summary:

1. Generate experiments for the application.
2. Provide observability around experiment executions and insights.
3. Track the progress in terms of reliability.

1. Generate experiments for the application

To streamline the mass production of experiments for a multitude of applications, we recognized the need for experiment templates. To build these templates, we conducted workshops involving multiple teams. These collaborative sessions yielded templates capturing recurring reliability characteristics to be rigorously tested:

Graceful Downtime Handling: It became evident that applications often struggle to gracefully handle downtime from their dependencies, sometimes failing to provide proper error responses.

Observability Gaps: The observability of these applications did not always adequately cover downtime scenarios, revealing the need for proactive tuning before real incidents occur.

Latency Challenges: While applications generally survived latency issues, problems often slipped unnoticed into production due to insufficient observability in this domain.

With these insights in hand, we embarked on the creation of templates for various teams. These templates are now accessible within the Reliability Hub of Steadybit, our current tool. The beauty of the hub is that it’s open-source, allowing us to harness templates developed by other clients as well.

Example of one experiment

So, we’ve got the templates — step one completed. Now, the challenge lies in configuring the right parameters for each experiment. Each experiment is tailored to target a specific Kubernetes deployment within a particular Kubernetes namespace. Some of these experiments require a URL to assess the application’s availability.

To facilitate this, we’ve leveraged our in-house tools for deployment and asset conformity checks. Presently, our new tool relies on two vital dependencies: the Steadybit API for acquiring, creating, and executing experiment templates, and our platform portal for extracting critical deployment information, including the Kubernetes deployment name, namespace, and health checks. Equipped with this information, we can also link the Datadog monitors related to the application to see how they behave during the experiment.
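
To make that wiring concrete, here is a minimal TypeScript sketch of how a template could be filled with the information pulled from the portal. The types and the buildExperiment function are illustrative, not the actual Resilient-Quest code or the Steadybit API.

// Illustrative shapes: the asset comes from our platform portal, the template
// placeholders are an assumption about how an experiment could be parameterized.
interface Asset {
  kubernetes_namespace: string;
  kubernetes_deployment: string;
  healthcheck: string;
}

interface ExperimentTemplate {
  name: string;
  parameters: Record<string, string>;
}

// Fill a template with the deployment information extracted from the portal.
function buildExperiment(template: ExperimentTemplate, asset: Asset) {
  return {
    name: template.name,
    parameters: {
      ...template.parameters,
      namespace: asset.kubernetes_namespace,
      deployment: asset.kubernetes_deployment,
      healthcheckUrl: asset.healthcheck,
    },
  };
}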

Once more, our dependency management hinges on the asset’s conformity: each dependency is mapped to the tags carried by the experiment templates of the Steadybit hub. This mapping results in a structured “tree” of experiments categorized by domains such as Kubernetes, RabbitMQ, Kafka, Postgres, and Redis.

{
  "asset": {
    "id": "booking-88",
    "name": "booking-web",
    "tags": [
      "Redis",
      "kubernetes"
    ],
    "kubernetes_namespace": "booking",
    "kubernetes_deployment": "booking-web",
    "healthcheck": "https://int.platform/booking/monitoring/healthcheck"
  },
  "reliability_domains": [
    {
      "name": "Oops, Redis Cache is Gone!",
      "tag": "Redis",
      "experiments": [
        {
          "name": "Verify graceful degradation while Redis is unavailable"
        },
        {
          "name": "Verify graceful degradation while Redis suffers a high latency"
        }
      ],
      "achievement": "Redis Resiliency Rockstar"
    },
    {
      "name": "My Kubernetes pods get jostled?",
      "tag": "kubernetes",
      "experiments": [
        {
          "name": "Load balancing hides a single container failure of Gateway for end users"
        },
        {
          "name": "Faultless redundancy of Gateway during rolling update"
        },
        {
          "name": "Verify Time to Readiness of Deployment"
        }
      ],
      "achievement": "Kubernetes Resiliency Champion"
    }
  ]
}
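
To show how this tree is derived, here is a small TypeScript sketch that keeps only the reliability domains whose tag matches one of the asset’s tags. The shapes mirror the JSON above; the catalog of domains itself is assumed to come from the template mapping.

// Build the "chaos tree" for an asset: keep only the reliability domains
// whose tag matches one of the asset's tags (shapes mirror the JSON above).
interface ReliabilityDomain {
  name: string;
  tag: string;
  experiments: { name: string }[];
  achievement: string;
}

function buildChaosTree(assetTags: string[], catalog: ReliabilityDomain[]): ReliabilityDomain[] {
  const tags = assetTags.map((tag) => tag.toLowerCase());
  return catalog.filter((domain) => tags.includes(domain.tag.toLowerCase()));
}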

2. Provide observability around experiment executions and insights

Now that we have the essentials ready for launch (the JSON above omits detailed experiment state information for the sake of brevity), we need to provide users with the means to comprehend the events that transpired during the experiment. To achieve this, we leveraged the events sent by Steadybit to Datadog, utilizing them as markers for our analysis.

This approach simplifies the task of visualizing what happens before, during, and after the experiment, leveraging Kubernetes metrics at our disposal. We’re fortunate to have dedicated SREs who have crafted an intuitive dashboard to aid in troubleshooting Kubernetes deployment issues. (And have we mentioned that chaos often strikes when least expected?)
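
As an illustration, a sketch along these lines could pull those events back from Datadog around the experiment window so they can be overlaid as markers on a dashboard. The source:steadybit tag and the Datadog site are assumptions; only the v1 events endpoint and its headers come from the documented Datadog API.

// Fetch events around the experiment window from the Datadog Events API (v1).
// The "source:steadybit" tag and the EU site are assumptions for this sketch.
async function fetchExperimentMarkers(startUnix: number, endUnix: number) {
  const url =
    `https://api.datadoghq.eu/api/v1/events?start=${startUnix}&end=${endUnix}` +
    `&tags=source:steadybit`;
  const response = await fetch(url, {
    headers: {
      'DD-API-KEY': process.env.DD_API_KEY ?? '',
      'DD-APPLICATION-KEY': process.env.DD_APP_KEY ?? '',
    },
  });
  if (!response.ok) throw new Error(`Datadog API error: ${response.status}`);
  return (await response.json()).events;
}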

In the event of an unsuccessful experiment, we make it a priority to offer enhancements within our tool. We provide valuable resources, including links to our documentation and Datadog dashboards, to facilitate the troubleshooting and improvement process.

3. Track the progress in terms of reliability

Each attempt is meticulously logged in the JSON “chaos tree,” allowing you to gradually push the known boundaries of your application’s behaviour a bit further whenever you find a spare moment. You can then return to it at your convenience to analyze the results and make informed decisions.
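
For illustration, recording an attempt on that tree can be as simple as the following sketch; the attempts field is a hypothetical extension of the JSON shown earlier, not the real storage format.

// Log one execution of an experiment on the chaos tree. The "attempts" field
// is a hypothetical extension of the JSON shown earlier in this article.
interface Attempt {
  executedAt: string;
  success: boolean;
}

interface ExperimentNode {
  name: string;
  attempts?: Attempt[];
}

function recordAttempt(experiment: ExperimentNode, success: boolean): void {
  experiment.attempts = experiment.attempts ?? [];
  experiment.attempts.push({ executedAt: new Date().toISOString(), success });
}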

However, there’s an additional layer to consider. We acknowledge that reliability isn’t always tied to tangible aspects, and it can manifest differently from one company to another (more on that in this fantastic webinar). Keeping this in mind, we’ve also dedicated some thought to our platform portal, an in-house tool for platform engineering, to evaluate reliability in a more comprehensive and adaptable way.

Chaos engineering, powered by our tool, affectionately named Resilient-Quest :), forms an integral part of our reliability initiatives. However, our efforts extend well beyond this practice. We’re committed to raising the performance scores of each application and capturing a broader spectrum of key indicators. Here’s a sneak peek into our comprehensive reliability report:

Our assessment encompasses a range of critical aspects, including identifying observability gaps and addressing flappiness issues. We delve into deployment safety by ensuring a correct configuration of Kubernetes liveness, readiness, and replica settings. Chaos engineering practice takes its rightful place in the report, and not to be overlooked, we gauge operational readiness by verifying the presence of an up-to-date runbook within the last six months. Our holistic approach ensures that reliability is viewed through a multidimensional lens.
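
As a rough illustration, such a report can boil down to a composite score. The dimensions below mirror the ones described above, but the equal weighting is an assumption rather than the real scoring of the report.

// Illustrative composite score for the reliability report. The dimensions
// mirror the ones described above; the equal weighting is an assumption.
interface ReliabilityChecks {
  observabilityGapsClosed: boolean;
  flappinessAddressed: boolean;
  deploymentSafetyConfigured: boolean; // liveness, readiness, replicas
  chaosExperimentsPassed: boolean;
  runbookUpdatedWithinSixMonths: boolean;
}

function reliabilityScore(checks: ReliabilityChecks): number {
  const results = Object.values(checks);
  const passed = results.filter(Boolean).length;
  return Math.round((passed / results.length) * 100); // percentage, 0 to 100
}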

4. (Bonus!) Gamify the experience

We want people to practice chaos engineering autonomously, but can we have a little fun while doing it? Yes! Let’s not forget the essence of the GameDays that we have run at ManoMano!

We wrapped the experiments and their outcomes as challenges, and each success increases the score of the asset. Each completed domain delivers an achievement, and we represent all of that with a superb castle rising from the ground. Hopefully the metaphor doesn’t stop there, and the application is also here to stay.
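
Here is a minimal sketch of that scoring, reusing the attempt log from the earlier sketch: one point per experiment that has succeeded at least once, and the domain’s achievement unlocks when every experiment in it has succeeded. The exact rules are assumptions, not the real Resilient-Quest logic.

// Gamification sketch: one point per experiment that has succeeded at least
// once; a domain's achievement unlocks when all of its experiments succeeded.
interface ScoredExperiment {
  name: string;
  attempts?: { success: boolean }[];
}

interface ScoredDomain {
  name: string;
  achievement: string;
  experiments: ScoredExperiment[];
}

function scoreAsset(domains: ScoredDomain[]): { score: number; achievements: string[] } {
  let score = 0;
  const achievements: string[] = [];
  for (const domain of domains) {
    const succeeded = domain.experiments.filter((experiment) =>
      (experiment.attempts ?? []).some((attempt) => attempt.success),
    );
    score += succeeded.length;
    if (succeeded.length === domain.experiments.length) {
      achievements.push(domain.achievement);
    }
  }
  return { score, achievements };
}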

Can we have an auto game mode?

Just like mobile games offer an “Auto” mode for those times when you’re short on time, we understand that our users may appreciate a similar convenience.

That’s why we’ve introduced a workflow powered by Temporal. It’s designed to automatically execute all experiments and, if desired, continue even in the event of an error. This feature streamlines the process, ensuring a hassle-free experience for our users.
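
For the curious, here is a minimal sketch of what such a workflow could look like with the Temporal TypeScript SDK. The runExperiment activity and its timeout are assumptions; only the proxyActivities call comes from the SDK itself.

// Minimal "auto mode" sketch with the Temporal TypeScript SDK. The
// runExperiment activity is hypothetical; the SDK call itself is real.
import { proxyActivities } from '@temporalio/workflow';

const { runExperiment } = proxyActivities<{ runExperiment(key: string): Promise<void> }>({
  startToCloseTimeout: '30 minutes',
});

export async function autoChaosRun(
  experimentKeys: string[],
  continueOnError: boolean,
): Promise<void> {
  for (const key of experimentKeys) {
    try {
      await runExperiment(key); // execute one experiment after another
    } catch (err) {
      if (!continueOnError) throw err; // stop at the first failure unless told otherwise
    }
  }
}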

Parameters of the workflow accessible in the tool

What’s Next?

Currently, our chaos engineering practice relies on volunteers, but the integration of experiments with applications opens up an intriguing possibility. We can envision a dedicated pipeline step closely associated with the application release process. What if chaos engineering became an integral part of the journey to production? It might be less playful, but the prospect is undeniably compelling.

Our commitment doesn’t stop here. We’re dedicated to offering fresh challenges to everyone by introducing new experiment templates. Moreover, for the avid chaos engineering enthusiasts among us, we’re exploring the idea of providing a platform where they can craft their own GameDay experiences, complete with customizable experiments. The future holds exciting prospects for us all.

Resilient-Quest Architecture

For those curious about the workings of Resilient-Quest, we’ve aimed to keep it straightforward and easy to understand.

Special Thanks

I would like to express my heartfelt gratitude to the Pulse team, with a special shout-out to John, who played a pivotal role in helping me bring order to my chaotic React codebase. Your support and guidance have been invaluable, and I’m deeply appreciative of your assistance.

Kudos to JB as well. I’ve never come across a manager with graphic design superpowers quite like his.

I’d also like to extend my gratitude to Steadybit. Without this exceptional tool, none of this would have been achievable.

