
Disaster recovery strategies with Terraform

The total cost of unplanned outages has been rising exponentially each year. A 2016 study conducted by the Ponemon Institute found that the mean total cost per minute of an unplanned outage was $8,851, a 32% increase since 2013 and an 81% increase since 2010. A 2022 study by EMA Research puts that number at $12,900. These figures underscore how crucial it is for organizations to have a solid, well-thought-out disaster recovery strategy in place to minimize downtime and data loss when disaster strikes.

Ensuring business continuity and safeguarding mission-critical systems against unexpected failures can be time-consuming, expensive, and difficult to maintain, especially as systems scale. It is also not uncommon for disaster recovery (DR) solutions to cost enterprises anywhere from several hundred thousand to millions of dollars per year, placing significant strain on IT budgets.

However, setting up and maintaining DR infrastructure doesn’t have to be cumbersome or costly. This is where leveraging infrastructure as code (IaC) within your DR plan comes into play.

This blog post shows how HashiCorp Terraform can be used to effectively set up, test, and validate your DR environments in a cost-efficient, practical, and consistent manner by codifying the infrastructure provisioning process.


»Why use Terraform with your DR strategy?

If you have gone through the process of selecting and using DR tooling in the past, you most likely encountered one, or more, of the following problems:

  • Cost: As I previously mentioned, disaster recovery tools can be extremely expensive. Licensing fees coupled with ongoing costs of maintaining redundant, idle infrastructure can be a significant strain on IT budgets.
  • Lack of flexibility: DR toolsets are typically tied to a particular platform. This results in additional complexity and reduced flexibility when implementing DR strategies across multiple cloud providers. The same applies to managed solutions from the major public clouds: while a cloud-specific DR solution may be convenient at first, it will limit your options for multi-cloud and hybrid strategies as you expand.
  • Performance: These tools can also be slow when it comes to recovery speed. Legacy DR solutions typically rely on complex mechanisms that are slow and error prone, making desired recovery time objectives (RTO) and recovery point objectives (RPO) difficult to achieve.

Terraform not only helps solve all these issues, but provides several other key advantages when it is leveraged within your disaster recovery strategy:


  • State synchronization: The -refresh-only flag can update the Terraform state file to match the actual infrastructure state without modifying the infrastructure itself. This can be used after a backup or recovery operation to sync Terraform state and reduce drift (a command example follows the module snippet below).
  • Pilot Light and Active/Passive: Terraform conditional expressions can be leveraged to deploy only the infrastructure components needed for a Pilot Light while keeping other resources dormant, or to mark an Active/Passive configuration as on/off. Once a DR event occurs, conditionals can trigger resource scaling to full production capacity, ensuring minimal downtime and operational impact. The next section of this post shows an example of this Active/Passive cutover.
  • Multi-Region Active/Active: Terraform modules can be used to encapsulate and reuse infrastructure components. This plays a crucial role in maintaining consistency in large-scale, multi-region environments while simplifying infrastructure management through a single source of truth for your infrastructure code. For example, you can parameterize your modules by region to deploy the same infrastructure across multiple regions:
#Terraform modules parameterized by region
#(illustrative sketch - the module sources and arguments below are assumptions, not the exact code from the post)
 
module "vpc" {
  source = "./modules/vpc"
  region = var.region
}
 
module "compute" {
  source = "./modules/compute"
  region = var.region
  vpc_id = module.vpc.vpc_id
}

It is also worth noting that the Terraform import command can be a valuable tool within your DR strategy, allowing infrastructure created outside of Terraform to be brought under Terraform management.
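For instance, a minimal sketch of importing an unmanaged resource into state (the resource address and instance ID below are hypothetical):

#Importing an existing EC2 instance into Terraform state (hypothetical resource address and instance ID)
$ terraform import aws_instance.legacy_webserver i-0abc1234def567890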

»Active/Passive DR example

In this example, you will deploy a simple web server hosted on an Amazon EC2 instance behind Route 53 (refer to Figure 2 below).

The complete code repository for this example can be found here.

Note: I will be using my own domain already set up as an AWS Route 53 Hosted Zone (andrecfaria.com). If you are following along, this value should be replaced with whatever domain you set up within your Terraform configuration.
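The existing hosted zone is referenced through a data source lookup; a minimal sketch of what that might look like (the exact code is in the linked repository):

#Looking up the existing Route 53 hosted zone by domain name
data "aws_route53_zone" "selected" {
  name = var.domain_name
}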

In a real-world scenario, your environment typically will be much more robust, most likely including:

  • Multiple web servers across several availability zones
  • Load balancers sitting in front of the web servers
  • Databases in both regions with cross-region replication in place
  • And more

However, for simplicity, this example uses only a single EC2 instance in each region.

Figure 2 – Web server hosted on an Amazon EC2 instance behind Route 53

This scenario employs the Active/Passive DR strategy, with all of your infrastructure provisioned and managed through Terraform. However, the infrastructure required for a DR failover is only provisioned when you trigger the failover itself, preventing ongoing costs for idle compute instances and other cloud resources. After running terraform apply, you see the following outputs:

Outputs:
 
current_active_environment = "Production"
dns_record = "test.andrecfaria.com"
production_public_ip = "18.234.86.230"
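These outputs are driven by the dr_switchover variable covered later in this post. A minimal sketch of how current_active_environment and dr_public_ip might be defined in outputs.tf (the exact definitions are in the linked repository):

#Sketch of conditional outputs based on the dr_switchover flag (assumed, not the exact repository code)
 
output "current_active_environment" {
  value = var.dr_switchover ? "Disaster Recovery" : "Production"
}
 
output "dr_public_ip" {
  # one() returns the single DR instance IP when it exists, or null when no DR instance is deployed
  value = one(aws_instance.dr_webserver[*].public_ip)
}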

You can use the dig command to verify that your DNS record points to the production IP address:

$ dig test.andrecfaria.com
 
; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> test.andrecfaria.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58089
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
 
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.andrecfaria.com.      	IN  	A
 
;; ANSWER SECTION:
test.andrecfaria.com.   60  	IN  	A   	18.234.86.230
 
;; Query time: 9 msec
;; SERVER: 10.255.255.254#53(10.255.255.254) (UDP)
;; WHEN: Mon Feb 10 16:04:47 EST 2025
;; MSG SIZE  rcvd: 65

You can also run a curl command to view the contents of your production webpage:

$ curl "
<h1>Hello World from Production!</h1>

Looking at the Terraform code, within the variables.tf file you can find the following dr_switchover variable:

variable "dr_switchover"  true = Disaster Recovery)"
  default     = false

This variable is a key component of the DR configuration: with its default value of false, the Route 53 DNS record points to the production web server; setting it to true creates the infrastructure resources required for the DR failover and switches the record over to the DR web server.

This is accomplished by leveraging the conditional expressions functionality of Terraform when setting the records argument within the aws_route53_record resource declaration, as well as leveraging the count argument within the DR resources.

# Route53 Record - Conditional based on dr_switchover
 
resource "aws_route53_record" "test" {
  zone_id = data.aws_route53_zone.selected.zone_id
  name    = "${var.subdomain}.${var.domain_name}"
  type    = "A"
  ttl     = 60
  records = [var.dr_switchover ? aws_instance.dr_webserver[0].public_ip : aws_instance.prod_webserver.public_ip]
}
# Disaster Recovery EC2 Instance
 
resource "aws_instance" "dr_webserver" {
  count                  = var.dr_switchover ? 1 : 0
  provider               = aws.dr
  ami                    = var.dr_ami_id
  instance_type          = var.instance_type
  key_name               = var.key_name
  vpc_security_group_ids = [aws_security_group.dr_sg.id]
  user_data              = <<-EOF
              #!/bin/bash
              sudo yum update -y
              sudo yum install -y nginx
              sudo systemctl start nginx
              sudo systemctl enable nginx
              echo "" | sudo tee /usr/share/nginx/html/index.html
              EOF
  tags = {
    Name        = "dr-instance"
    Environment = "Disaster Recovery"
  }
  depends_on = [aws_security_group.dr_sg]
}

The only change required in order to cutover to the DR environment is setting the value of the dr_switchover variable to true:

$ terraform apply -var="dr_switchover=true" -auto-approve

Below are the actions and output that Terraform will display when creating the DR EC2 instance and performing an in-place update to the Route 53 record resource, changing the records argument to point to your DR web server IP address instead of the production IP address:

Terraform will perform the following actions:
 
  # aws_instance.dr_webserver[0] will be created
  + resource "aws_instance" "dr_webserver" {
  	...
    }
 
  # aws_route53_record.test will be updated in-place
  ~ resource "aws_route53_record" "test" {
    	id = "Z0441403334ANN7OFVRF1_test.andrecfaria.com_A"
    	name = "test.andrecfaria.com"
  	~ records = [
      	- "18.234.86.230",
    	] -> (known after apply)
    	# (7 unchanged attributes hidden)
	}
 
Plan: 1 to add, 1 to change, 0 to destroy.
 
Changes to Outputs:
  ~ current_active_environment = "Production" -> "Disaster Recovery"
  + dr_public_ip  = (known after apply)
 
 
Outputs:
 
current_active_environment = "Disaster Recovery"
dns_record = "test.andrecfaria.com"
dr_public_ip = "54.219.217.97"
production_public_ip = "18.234.86.230"
 

Once the Terraform run is complete, you can validate that the DNS record now points to the DR web server by using the same dig and curl commands as before:

#dig command results showing DR IP address
 
$ dig test.andrecfaria.com
 
; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> test.andrecfaria.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19471
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
 
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.andrecfaria.com.      	IN  	A
 
;; ANSWER SECTION:
test.andrecfaria.com.   60  	IN  	A   	54.219.217.97
 
;; Query time: 19 msec
;; SERVER: 10.255.255.254#53(10.255.255.254) (UDP)
;; WHEN: Mon Feb 10 16:16:25 EST 2025
;; MSG SIZE  rcvd: 65
#curl command showcasing DR webpage contents
 
$ curl "
<h1>Hello World from Disaster Recovery!</h1>

Finally, you can fail back to production by simply running the terraform apply command again, this time setting the dr_switchover variable back to false. This also destroys all the infrastructure created during the failover to DR, preventing unnecessary spend on idle resources.

#Setting the dr_switchover variable value via CLI
 
$ terraform apply -var="dr_switchover=false" -auto-approve
#Terraform apply run output
 
Terraform will perform the following actions:
 
  # aws_instance.dr_webserver[0] will be destroyed
  # (because index [0] is out of range for count)
  - resource "aws_instance" "dr_webserver" {
  	...
    }
 
  # aws_route53_record.test will be updated in-place
  ~ resource "aws_route53_record" "test" {
    	id = "Z0441403334ANN7OFVRF1_test.andrecfaria.com_A"
    	name = "test.andrecfaria.com"
  	~ records = [
      	- "54.219.217.97",
      	+ "18.234.86.230",
    	]
    	# (7 unchanged attributes hidden)
	}
 
Plan: 0 to add, 1 to change, 1 to destroy.
 
Changes to Outputs:
  ~ current_active_environment = "Disaster Recovery" -> "Production"
  - dr_public_ip = "54.219.217.97" -> null

»Learn more

To learn more, visit the HashiCorp developer portal, where you can find more information regarding best practices, integrations, and reference architectures.


