Chaos Engineering the AWS Way: 101


Over the last decade or so, we’ve seen a shift from the era of monolithic applications to a modern, more efficient, microservice-based development model. This has meant that we need to consider a much wider testing landscape to ensure our applications are resilient and perform as expected.

Build me a monolith.

In the ‘good old days’ of software development, applications were designed and written as monoliths. These large, self-contained codebases simplified initial development but led to challenges as the scope of an application grew. Scaling was difficult, as the whole application often had to be duplicated rather than scaling individual components as needed. Deploying updates or new functionality was also a complex process, often requiring extended downtime for the entire application.

Testing of these applications was often carried out manually by test teams who concentrated on the functional requirements. If non-functional requirements were tested at all, this was usually limited to performance-related tasks such as ensuring that a defined hardware configuration could handle a specific level of user traffic and respond within a given timeframe.

Transition to microservices.

As the use of Cloud providers became more prevalent, new serverless functionality allowed us to change our approach to development, leading to the spread of microservices. These allowed us to break monolithic functionality down into smaller, independently developed and deployed services. Each microservice tended to focus on a specific piece of business functionality, allowing teams to work in parallel in a more agile manner.

Applications now often have a distributed architecture, with code running on servers, serverless or container-based platforms, or even client-side in browsers. We’re using databases with multiple read (or even write) hosts, caching, load balancers and other components, all communicating over network links to form what is known as a distributed system. This enables us to scale individual services as needed, leading to more efficient use of resources and increased fault tolerance, but it introduces new challenges around communication, consistency and the need for greater observability.

This new paradigm has also enabled us to improve our development and QA practices through automated deployments, often including automated testing that ranges from unit tests through to behavioural testing. But again, these have tended to concentrate on functional requirements – which means that whilst the complexity of our application landscapes has grown, our non-functional testing hasn’t kept pace with these now complex architectures.

This challenge in managing distributed systems led L. Peter Deutsch and James Gosling to articulate what are known as the ‘8 fallacies of distributed computing’ [1]:

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn’t change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

Acknowledging these fallacies is essential for designing robust, resilient distributed systems and their associated applications.

Release the apes of chaos.

To counter this lack of testing around complexity, a new discipline started to emerge in the early 2000s – chaos engineering.

Chaos engineering can be considered as:

the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. [2]

Whilst there were some early, basic attempts by Amazon and Google, Netflix is generally credited with defining this new approach: taking the 8 fallacies and developing engineering solutions that would test against them to ensure reliability within their systems.

In 2011, Netflix developed a tool known as ‘Chaos Monkey’, which intentionally disabled servers, databases and applications at random in the production environment to test the resilience of their worldwide video-streaming network. The name came from the idea of measuring what might happen if a wild monkey ran through their data centres and Cloud environments with a weapon, smashing servers and chewing through cabling [3].

They quickly realised the value this provided to their engineers, allowing them to design more highly available services, and expanded the tooling into what is known as the ‘Simian Army’, with a variety of tools such as:

Latency Monkey – a tool that introduced artificial delays in the network layer between different services. It allowed Netflix to simulate a server becoming unavailable, or even losing an entire service.
Conformity Monkey – looked for servers not configured according to their best practices and shut them down. For example, it would flag servers that weren’t part of an auto-scaling group, and so had limited resilience to an unplanned shutdown.
Chaos Gorilla – whilst Chaos Monkey targeted individual servers, Chaos Gorilla tested the outage of an entire availability zone.

These and many other approaches to testing Netflix’s resilience soon gained recognition within the wider engineering community, and other organisations began to re-use or re-engineer similar tools.

AWS enters the fray.

It’s safe to say that Netflix’s approach would have caught the eye of Amazon Web Services, the major player in the Cloud engineering space. After all, for many years, Netflix was the single biggest customer using the Cloud provider’s services.

One of AWS’s oft-quoted ideals is the aim of reducing ‘undifferentiated heavy lifting’: they look for tasks that are widely adopted by their customers and could be offered as a managed service, giving customers the opportunity to reduce workload and complexity (whilst, no doubt, providing AWS with an income stream). Even so, AWS’s announcement [4] at their 2020 re:Invent conference that they would provide a managed chaos engineering service still came as a surprise to some.

Amazon’s Fault Injection Service, or FIS [5], offered a set of scenarios that could be deployed in customers’ accounts, initially allowing testing against Amazon’s EC2, ECS, EKS and RDS services. Since then, the offering has expanded to include cross-account testing and the simulation of errors at the control plane and network layers, covering availability zone failures, API throttling and a wide range of other failure scenarios.
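To give a feel for how the service is exposed, here is a minimal sketch using the boto3 SDK that asks FIS for its catalogue of available fault actions. It assumes AWS credentials are already configured, and the region is purely illustrative.

    import boto3

    # Ask FIS for its catalogue of fault actions (for example
    # aws:ec2:stop-instances or aws:fis:inject-api-throttle-error).
    fis = boto3.client("fis", region_name="eu-west-1")  # region is illustrative

    response = fis.list_actions()
    for action in response["actions"]:
        print(action["id"], "-", action.get("description", ""))

Each action returned here can then be referenced from an experiment template, which is where the real work happens.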

FIS and associated AWS services allow engineering teams to follow what is now seen as a standard set of practices within chaos engineering:

Baseline performance – services such as Amazon CloudWatch give teams a deep understanding of an application’s normal operating parameters, letting them measure and track interactions, dependencies and various service metrics.

Hypothesise – once there is an understanding of the components of a service, architects and engineers can start to think about ‘what if’ – would the service remain available under increased network latency, API throttling or the unplanned termination of components?

FIS enables these hypotheses to be codified using what are known as ‘experiment templates’, which describe the tests to be carried out along with which components should be stressed (a minimal sketch bringing these practices together follows this list).

Experiment – FIS allows the experiment templates to be deployed and executed in an AWS environment.

Blast Radius – any chaos engineering tool should have the option to terminate an experiment if it starts to affect service in an unplanned way. FIS allows CloudWatch Alarms to be configured to halt experiments and roll back the effects that had been put in place.

Measurement – once again, CloudWatch provides metrics, logs and alarms, allowing application designers to understand how their services reacted to the experiments put in place.
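Putting these practices together, the sketch below uses boto3 to codify a hypothesis as an experiment template, wire a CloudWatch alarm in as a stop condition to limit the blast radius, and start the experiment. The role ARN, account ID, alarm name and tag values are placeholders – you would substitute your own resources and a pre-existing IAM role for FIS.

    import boto3

    fis = boto3.client("fis")

    # Hypothesis: the application stays healthy if half of the EC2 instances
    # tagged chaos-ready=true are stopped for ten minutes.
    # The role ARN, account ID, alarm ARN and tag values are placeholders.
    template = fis.create_experiment_template(
        clientToken="chaos-101-template",
        description="Stop 50% of tagged instances for 10 minutes",
        roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
        targets={
            "tagged-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},
                "selectionMode": "PERCENT(50)",
            }
        },
        actions={
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT10M"},
                "targets": {"Instances": "tagged-instances"},
            }
        },
        # Blast radius: halt the experiment if this CloudWatch alarm fires.
        stopConditions=[
            {
                "source": "aws:cloudwatch:alarm",
                "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:HighErrorRate",
            }
        ],
    )

    # Experiment: run the template and note the id so the results can be
    # measured against the CloudWatch baseline afterwards.
    experiment = fis.start_experiment(
        clientToken="chaos-101-run",
        experimentTemplateId=template["experimentTemplate"]["id"],
    )
    print("Experiment started:", experiment["experiment"]["id"])

Because the stop condition references a CloudWatch alarm, the same alarm that already guards production traffic can also act as the experiment’s safety net.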

FIS also brings an important element to the world of chaos engineering – control. While it’s good to test in development environments to understand how services will react to unplanned scenarios, one of chaos engineering’s tenets is that the most valuable insights will be gained by testing against production services. However, AWS customers will want to control how and when these experiments are deployed – this is achieved using IAM permissions to control who can define and execute FIS scenarios.
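As a rough sketch of what that control might look like – the policy name and the exact list of actions here are assumptions about how a team might scope access, not an AWS-prescribed policy – a team could create a dedicated IAM policy and attach it only to the engineers allowed to define and run experiments:

    import json
    import boto3

    iam = boto3.client("iam")

    # Illustrative policy: only principals granted this policy may define,
    # start and stop FIS experiments. Name and scope are placeholders.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowChaosTeamToRunExperiments",
                "Effect": "Allow",
                "Action": [
                    "fis:CreateExperimentTemplate",
                    "fis:StartExperiment",
                    "fis:StopExperiment",
                    "fis:GetExperiment",
                    "fis:ListExperiments",
                ],
                "Resource": "*",
            }
        ],
    }

    iam.create_policy(
        PolicyName="chaos-engineering-operators",
        PolicyDocument=json.dumps(policy_document),
    )

Attaching the policy to a dedicated group or role keeps the ability to run experiments – particularly in production – limited to the teams who own them.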

Conclusion.

Organisations and engineers working with complex, distributed systems within Cloud environments should look to adopt the principles of chaos engineering, ensuring that it becomes not just a best practice but a strategic imperative.

Amazon’s FIS empowers engineering teams to proactively address the challenges of distributed systems, ensuring the robustness and resilience of applications in the dynamic and unpredictable cloud environment. Working to their principle of reducing undifferentiated heavy lifting, AWS has positioned chaos engineering as a managed service, aligning with their commitment to reducing complexity and empowering customers to navigate the intricacies of modern cloud-based architectures.

 

1. https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
2. https://principlesofchaos.org/
3. https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116
4. https://www.youtube.com/watch?v=VndV2j8fulo
5. http://aws.amazon.com/fis
