Chaos Engineering: Simulate AZ Failure in AWS
In cloud computing, keeping your applications and systems reliable is critical. Amazon Web Services (AWS) provides a robust and flexible platform, but even the best infrastructure can fail. To prepare for those failures we use Chaos Engineering, which helps us find and fix weaknesses before they cause real outages. In this article, we’ll learn how to simulate the failure of an AWS Availability Zone so we can verify that our systems can handle it.
Understanding Availability Zones (AZs)
Before diving into Chaos Engineering, it’s essential to understand what AWS Availability Zones are. AZs are isolated data centers within an AWS Region, each with its own power, cooling, and networking. They are designed to provide high availability and fault tolerance by allowing you to distribute your resources and applications across multiple AZs within a single Region. AWS guarantees that these AZs are physically separate and independent from one another, reducing the risk of simultaneous failures affecting your applications.
Why Simulate AZ Failure?
AWS Availability Zones are designed to be highly reliable, but they can still degrade or fail due to maintenance, hardware faults, or other problems. To make sure your applications can tolerate these outages, we test them. By simulating an AZ failure, we can:
- Find Weak Spots: Discover weaknesses in your architecture that would only surface during an AZ outage.
- Improve Recovery: Practice responding to a failure so you can recover faster when a real AZ goes down.
- Test Auto Scaling Group Configuration: Verify that your Auto Scaling groups for EC2 instances behave correctly in a multi-AZ setup.
Chaos Engineering
Chaos Engineering is the discipline of experimenting on a system to uncover weaknesses proactively.
It is like a safety test for technology: instead of checking that everything works when conditions are perfect, we deliberately make things go wrong to see how well the system handles it.
Imagine you have a car. Normally, you drive it smoothly, and it works fine. But in chaos engineering, you might suddenly slam on the brakes to see if they work well under pressure. Or you might turn off the headlights to check if you can still drive safely in the dark. By doing these tests, you can find problems in your car before they become big issues.
In the same way, chaos engineering is about causing controlled problems in your computer systems, like making a part of a website slow or turning off a server, to see if your systems can still work properly and recover gracefully.
Simulating AZ Failure:
By default, AWS doesn’t provide a straightforward way to turn off or “stop” an AZ. However, there are two common methods to simulate it:
Method 1: Using Network Access Control Lists (NACLs)
You can create a new Network Access Control List (NACL) and replace the existing NACL in your AZ with this new one. A NACL is like a fence that controls what traffic can come in or out of a part of your AWS network.
Here’s how it works:
- A newly created custom NACL denies all inbound and outbound traffic until you add rules to it.
- When you associate this empty NACL with the subnets in the target AZ, all traffic to and from those subnets is blocked, making it look as though the AZ isn’t responding.
- This is a quick and easy way to simulate an AZ failure, but it’s important to remember that it’s not the same as a real AZ failure. It just stops traffic temporarily.
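A minimal boto3 sketch of this approach, assuming the VPC ID and the IDs of the subnets in the target AZ are already known (the IDs below are placeholders):

import boto3

ec2 = boto3.client("ec2")

VPC_ID = "vpc-0123456789abcdef0"                      # placeholder VPC ID
AZ_SUBNET_IDS = ["subnet-aaa111", "subnet-bbb222"]    # placeholder subnets in the target AZ

# A newly created custom NACL denies all inbound and outbound traffic by default.
blackhole_acl = ec2.create_network_acl(VpcId=VPC_ID)["NetworkAcl"]["NetworkAclId"]

original_acls = {}
for subnet_id in AZ_SUBNET_IDS:
    # Find the NACL currently associated with the subnet.
    acl = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )["NetworkAcls"][0]
    assoc = next(a for a in acl["Associations"] if a["SubnetId"] == subnet_id)
    original_acls[subnet_id] = assoc["NetworkAclId"]

    # Swap in the "deny all" NACL; the subnet now drops all traffic,
    # which is roughly what an AZ outage looks like from the outside.
    ec2.replace_network_acl_association(
        AssociationId=assoc["NetworkAclAssociationId"],
        NetworkAclId=blackhole_acl,
    )

# To end the simulation, re-associate the original NACLs and delete the temporary one.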
Method 2: Using AWS Fault Injection Simulator (FIS)
AWS offers a tool called AWS Fault Injection Simulator (FIS), which is designed specifically for simulating chaos in your AWS environments. Here’s how you can use it:
- AWS FIS helps you create experiment templates for testing how your AWS resources and applications handle chaos.
- You can use FIS to simulate different types of failures, including AZ failures, in a more controlled and realistic way.
- This method is more advanced but gives you a better understanding of how your system responds to AZ failures.
Here is a step-by-step guide using AWS FIS:
In this Fault Injection Simulator (FIS) experiment, we will employ an Auto Scaling group distributed across two AZs.
Auto Scaling Group Capacity Configuration:
- AZ Distribution: The Auto Scaling group will span two Availability Zones (Private Subnet 1 and Private Subnet 2).
- Instance Capacity: Within this Auto Scaling group, we will set the minimum capacity to 4, the desired capacity to 4, and the maximum to 6.
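A minimal boto3 sketch of that capacity configuration, assuming a launch template and the two private subnet IDs already exist (the names and IDs below are placeholders):

import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="fis-demo-asg",          # placeholder name
    LaunchTemplate={"LaunchTemplateName": "fis-demo-lt", "Version": "$Latest"},
    MinSize=4,
    DesiredCapacity=4,
    MaxSize=6,
    # Private Subnet 1 and Private Subnet 2, one per Availability Zone.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222",
)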
Application and Load Balancer Configuration:
- Internal Application Load Balancer (ALB): All instances in the Auto Scaling group will be positioned behind an internal Application Load Balancer (ALB).
EC2 Instances Configuration:
- HTTPD Configuration: Each instance within the Auto Scaling group will be preconfigured with the Apache HTTP server (httpd). This server will be set up to serve a simple “Hello World” website, which will be accessible at the root (“/”) route of the instances.
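One way to pre-configure httpd on every instance is through user data in the launch template. A sketch, where the template name, AMI ID, and instance type are placeholders:

import base64
import boto3

ec2 = boto3.client("ec2")

# Installs Apache and serves a plain "Hello World" page at "/".
user_data = """#!/bin/bash
yum install -y httpd
echo "Hello World" > /var/www/html/index.html
systemctl enable --now httpd
"""

ec2.create_launch_template(
    LaunchTemplateName="fis-demo-lt",               # placeholder name
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",         # placeholder Amazon Linux AMI
        "InstanceType": "t3.micro",
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)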
Target Group and Health Checks:
- Target Group Configuration: A Target Group will be created and associated with the internal ALB. This Target Group will be responsible for directing traffic to the instances.
- Health Checks: Within the Target Group, health checks will be configured to monitor the status of instances. These health checks will periodically send requests to the root (“/”) route of each instance’s web server (httpd) to assess their health and availability.
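A sketch of the target group, health checks, and internal ALB with boto3; the VPC ID, subnet IDs, and resource names are placeholders:

import boto3

elbv2 = boto3.client("elbv2")

# Target group with health checks against the "/" route of each instance.
target_group_arn = elbv2.create_target_group(
    Name="fis-demo-tg",                             # placeholder name
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",                  # placeholder VPC
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/",
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
)["TargetGroups"][0]["TargetGroupArn"]

# Internal ALB spanning both private subnets.
alb_arn = elbv2.create_load_balancer(
    Name="fis-demo-alb",                            # placeholder name
    Scheme="internal",
    Subnets=["subnet-aaa111", "subnet-bbb222"],
)["LoadBalancers"][0]["LoadBalancerArn"]

# Listener that forwards traffic to the target group.
elbv2.create_listener(
    LoadBalancerArn=alb_arn,
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
)

The target group ARN can then be attached to the Auto Scaling group (for example with autoscaling.attach_load_balancer_target_groups) so that new instances register with the ALB automatically.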
In the AWS Management Console, navigate to the AWS FIS service. Click on “Create experiment” to start creating a new FIS experiment. Add a name and a description.
Choose the AWS actions that you want to simulate during the experiment. FIS supports various actions, such as stopping instances, causing latency, and more.
For this experiment we need to select the “Network” action type, and the Scope must be “all”. Do not select the “availability-zone” scope, as that only blocks inter-AZ traffic. Here is a detailed guide about actions and scopes: https://docs.aws.amazon.com/fis/latest/userguide/fis-tutorial-disrupt-connectivity.html
Make sure to click the Save button.
In the Targets section, click Edit and then select the subnet in which you want to simulate the failure.
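The same experiment template can also be created programmatically. A sketch with boto3, assuming placeholder subnet and IAM role ARNs (the role is discussed in the next step), using the network disrupt-connectivity action with scope set to “all”:

import boto3

fis = boto3.client("fis")

fis.create_experiment_template(
    description="Simulate an AZ failure by disrupting subnet connectivity",
    roleArn="arn:aws:iam::123456789012:role/fis-demo-role",   # placeholder role ARN
    targets={
        "az-subnets": {
            "resourceType": "aws:ec2:subnet",
            # Placeholder ARN of the subnet whose AZ we want to "fail".
            "resourceArns": [
                "arn:aws:ec2:us-east-1:123456789012:subnet/subnet-aaa111"
            ],
            "selectionMode": "ALL",
        }
    },
    actions={
        "disrupt-connectivity": {
            "actionId": "aws:network:disrupt-connectivity",
            "parameters": {"scope": "all", "duration": "PT10M"},
            "targets": {"Subnets": "az-subnets"},
        }
    },
    stopConditions=[{"source": "none"}],
)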
AWS FIS, like other AWS services, requires an IAM role. You can use an existing one or choose to create a new one.
You can also log the actions of AWS FIS to a CloudWatch Logs log group or to an S3 bucket. However, you will need to add the corresponding IAM policies to the IAM role, for example:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"logs:CreateLogDelivery",
"logs:CreateLogStream",
"logs:PutResourcePolicy",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams",
"logs:CreateLogGroup",
"logs:DescribeResourcePolicies",
"logs:PutLogEvents"
],
"Resource": "*"
}
]
}
Add the previous policy to the role if you selected the option to log to CloudWatch Logs.
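If you enable CloudWatch logging, the policy above can be attached to the FIS role as an inline policy, for example with boto3 (the role and policy names are placeholders):

import json
import boto3

iam = boto3.client("iam")

log_delivery_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogDelivery",
                "logs:CreateLogStream",
                "logs:PutResourcePolicy",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:CreateLogGroup",
                "logs:DescribeResourcePolicies",
                "logs:PutLogEvents",
            ],
            "Resource": "*",
        }
    ],
}

# Attach the logging policy inline to the IAM role used by the FIS experiment.
iam.put_role_policy(
    RoleName="fis-demo-role",                  # placeholder role name
    PolicyName="fis-cloudwatch-logging",       # placeholder policy name
    PolicyDocument=json.dumps(log_delivery_policy),
)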
Once the experiment template is created, select it and click on “Start experiment”.
You can check the progress in the Experiments section.
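Starting the experiment and polling its state can also be done with boto3 (the template ID below is a placeholder):

import time
import boto3

fis = boto3.client("fis")

# Start an experiment from the template created earlier.
experiment = fis.start_experiment(experimentTemplateId="EXT1a2b3c4d5e6f7")  # placeholder ID
experiment_id = experiment["experiment"]["id"]

# Poll the experiment state until it reaches a terminal status.
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    print("Experiment state:", status)
    if status in ("completed", "stopped", "failed"):
        break
    time.sleep(30)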
In this case I can confirm that the experiment is working because the health checks are turning unhealthy in the specified AZ.
Now that we’ve verified the functionality of the AWS Fault Injection Simulator (FIS) experiment, our next step is to ensure the proper functioning of our Auto Scaling group. This experiment focuses on the utilization of Application Load Balancer (ALB) health checks within the Auto Scaling Group (ASG).
If you’ve configured ALB health checks within your ASG, it means you’ve set up a mechanism to monitor the health status of instances. In the event that an instance falls into an unhealthy state, the ASG will take appropriate action by replacing that instance with a new one. This ensures the continuity of your application’s availability and performance.
You can check this configuration in the Auto Scaling Group.
However, if you haven’t configured ALB health checks within your ASG, the ASG will only replace an instance if it is terminated for some reason. In such cases, it will not proactively address instances that are experiencing issues but are still running. Therefore, it’s vital to have ALB health checks enabled within your ASG to maintain the health and resilience of your AWS resources effectively.
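Checking, and if necessary enabling, ELB health checks on the Auto Scaling group can be done like this (the ASG name is a placeholder):

import boto3

autoscaling = boto3.client("autoscaling")

asg = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["fis-demo-asg"]          # placeholder ASG name
)["AutoScalingGroups"][0]
print("Current health check type:", asg["HealthCheckType"])  # "EC2" or "ELB"

# Switch to ELB health checks so the ASG replaces instances
# that the target group reports as unhealthy.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="fis-demo-asg",
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)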
In this scenario, the Auto Scaling Group (ASG) detects that certain instances have become unhealthy, prompting it to launch new instances to replace the ones that are no longer healthy.
After a few minutes, the experiment will finish. It is crucial to conduct these experiments regularly, or at least before going to production, to validate that our systems operate as anticipated.