The cost of unplanned outages has kept rising for more than a decade. A 2016 study by the Ponemon Institute found that the mean total cost per minute of an unplanned outage was $8,851, a 32% increase since 2013 and an 81% increase since 2010. A 2022 study by EMA Research puts that number at $12,900. These figures show how crucial it is for organizations to have a solid, well-thought-out disaster recovery strategy in place to minimize downtime and data loss when disaster strikes.
Ensuring business continuity and safeguarding mission-critical systems against unexpected failures can be time-consuming, expensive, and difficult to maintain, especially as systems scale. It is also not uncommon for disaster recovery (DR) solutions to cost enterprises anywhere from several hundred thousand to millions of dollars per year, creating significant strain on IT budgets.
However, setting up and maintaining DR infrastructure doesn’t have to be cumbersome or costly. This is where leveraging infrastructure as code (IaC) within your DR plan comes into play.
This blog post shows how HashiCorp Terraform can be used to set up, test, and validate your DR environments in a cost-efficient, practical, and consistent manner by codifying the infrastructure provisioning process.
»Why use Terraform with your DR strategy?
If you have selected and used DR tooling in the past, you most likely encountered one or more of the following problems:
- Cost: As I previously mentioned, disaster recovery tools can be extremely expensive. Licensing fees coupled with ongoing costs of maintaining redundant, idle infrastructure can be a significant strain on IT budgets.
- Lack of flexibility: DR toolsets are typically tied to a particular platform. This results in additional complexity and reduced flexibility when it comes to setting DR strategies across multiple cloud providers. This also applies to leveraging a managed solution from one of the major public clouds. While leveraging a cloud-specific DR solution may be convenient at first, it will limit your options for multi-cloud and hybrid strategies in the future as you expand.
- Performance: These tools can also be slow to recover workloads. Legacy DR solutions typically rely on complex mechanisms that are slow and error-prone, making the desired recovery time objective (RTO) and recovery point objective (RPO) difficult to achieve.
Terraform not only helps solve all these issues, but provides several other key advantages when it is leveraged within your disaster recovery strategy:
- Automation: Terraform allows you to automate the entire infrastructure deployment and recovery process, minimizing the need for manual intervention and greatly reducing risk of human error. This also ensures consistency and repeatability within your DR infrastructure setup.
- Repeatability: With Terraform, you are adopting an infrastructure as code mindset, meaning that you ensure consistent infrastructure configuration across multiple environments by defining your infrastructure once in a codified manner. This mitigates configuration drift and ensures that your DR environment accurately mirrors your production setup.
- Scalability: Terraform enables you to scale your environments as needed with ease, allowing you to test your DR infrastructure plans at scale, ensuring they can handle real-world scenarios.
- Cost efficiency: Terraform allows you to dynamically provision and destroy ephemeral resources as needed, resulting in greatly reduced infrastructure costs as you only pay for the resources utilized during your DR exercise instead of incurring ongoing costs from resources that remain idle most of the time.
- Flexibility: Because Terraform is cloud-agnostic, you can not only spin up infrastructure in different availability zones or regions within a single cloud provider, but also provision and manage resources across multiple cloud providers.
»
- The -refresh-only flag (used with terraform plan or terraform apply) can update the Terraform state file to match the actual infrastructure state without modifying the infrastructure itself. This can be used after a backup or recovery operation to sync Terraform state and reduce drift.
- Pilot Light and Active/Passive: Terraform conditional expressions can be used to deploy only the infrastructure components needed for a Pilot Light while keeping other resources dormant, or to toggle an Active/Passive configuration on or off until a DR event occurs. Once a DR event occurs, conditionals can trigger resource scaling to full production capacity, ensuring minimal downtime and operational impact. The next section of this post shows an example of this Active/Passive cutover.
- Multi-Region Active/Active: Terraform modules can be used to encapsulate and reuse infrastructure components. This plays a crucial role in maintaining consistency in large-scale, multi-region environments while simplifying infrastructure management by providing a single source of truth for your infrastructure code. As an example, you can parameterize your modules by region, ensuring you deploy the same infrastructure across various regions:
# Terraform modules parameterized by region
module "vpc" {
  ...
}
module "compute" {
  ...
}
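The Pilot Light and multi-region ideas from the list above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the configuration from the example repository: the ./modules/compute path, the instance_count input, the dr_event variable, the regions, and the capacities are all hypothetical.

```hcl
# Sketch only. Module path, input names, regions, and capacities are assumptions.
variable "dr_event" {
  type        = bool
  description = "Set to true to scale the DR region to full capacity"
  default     = false
}

# One aliased provider configuration per region.
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "dr"
  region = "us-west-1"
}

# Production region: always at full capacity.
module "compute_primary" {
  source         = "./modules/compute" # hypothetical local module
  providers      = { aws = aws.primary }
  instance_count = 4
}

# DR region: the same module, with no application servers until a DR event.
module "compute_dr" {
  source         = "./modules/compute" # same hypothetical module, different region
  providers      = { aws = aws.dr }
  instance_count = var.dr_event ? 4 : 0
}
```

Because both regions instantiate the same module, any change to the module is picked up by production and DR alike, which is what keeps the two environments from drifting apart.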
It is also worth noting that the Terraform import command can be a valuable tool within your DR strategy by ensuring existing infrastructure created outside of Terraform is integrated and managed.
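On Terraform 1.5 and later, an import can also be declared in configuration with an import block instead of running the CLI command. The sketch below is illustrative only: the resource name legacy_webserver, the variable, and the instance ID are placeholders, not values from this post.

```hcl
# Sketch only: requires Terraform 1.5 or later.
# The resource name, variable, and instance ID are placeholders.
variable "legacy_ami_id" {
  type        = string
  description = "AMI of the instance that was created outside of Terraform"
}

# Tells Terraform to adopt the existing instance into state on the next apply.
import {
  to = aws_instance.legacy_webserver
  id = "i-0123456789abcdef0"
}

# The configuration should describe the existing instance as closely as possible;
# any differences will show up as proposed changes in the plan.
resource "aws_instance" "legacy_webserver" {
  ami           = var.legacy_ami_id
  instance_type = "t3.micro"
}
```

Running terraform plan after adding the block shows the pending import alongside any attribute differences, so the adoption can be reviewed before it is applied.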
»
Amazon EC2 instance behind Route 53 (Refer to Figure 2 below).
The complete code repository for this example can be found here.
Note: I will be using my own domain already set up as an AWS Route 53 Hosted Zone (andrecfaria.com). If you are following along, this value should be replaced with whatever domain you set up within your Terraform configuration.
In a real-world scenario, your environment typically will be much more robust, most likely including:
- Multiple web servers across several availability zones
- Load balancers sitting in front of the web servers
- Databases in both regions with cross-region replication in place
- And more
However, for simplicity, this example uses only a single EC2 instance per environment.

Figure 2 – Web server hosted on an Amazon EC2 instance behind Route 53
This scenario employs the Active/Passive DR strategy, with all of your infrastructure provisioned and managed through Terraform. However, the infrastructure required for a DR failover is provisioned only when you trigger the failover itself, preventing ongoing costs from idle compute instances and other cloud resources. After running terraform apply, you see the following outputs:
Outputs:
current_active_environment = "Production"
dns_record = "test.andrecfaria.com"
production_public_ip = "18.234.86.230"
You can use the dig command to verify that your DNS record points to the production IP address:
$ dig test.andrecfaria.com
; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> test.andrecfaria.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58089
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.andrecfaria.com. IN A
;; ANSWER SECTION:
test.andrecfaria.com. 60 IN A 18.234.86.230
;; Query time: 9 msec
;; SERVER: 10.255.255.254#53(10.255.255.254) (UDP)
;; WHEN: Mon Feb 10 16:04:47 EST 2025
;; MSG SIZE rcvd: 65
You can also run a curl command to view the contents of your production webpage:
$ curl "
<h1>Hello World from Production!</h1>
Looking at the Terraform code, the variables.tf file contains the following dr_switchover variable:
variable "dr_switchover" {
  type        = bool
  description = "Flag to control environment switchover (false = Production, true = Disaster Recovery)"
  default     = false
}
This variable is a key component of the DR configuration because it defines whether the Route 53 DNS record points to the production web server (the default value of false) or switches over to the DR web server (a value of true), in which case Terraform also creates the infrastructure resources required for the DR failover.
This is accomplished by using a Terraform conditional expression to set the records argument of the aws_route53_record resource, and by using the count argument on the DR resources.
# Route53 Record - Conditional based on dr_switchover
resource "aws_route53_record" "test" {
  zone_id = data.aws_route53_zone.selected.zone_id
  name    = "${var.subdomain}.${var.domain_name}"
  type    = "A"
  ttl     = 60
  # dr_webserver uses count, so its single instance must be referenced by index [0]
  records = [var.dr_switchover ? aws_instance.dr_webserver[0].public_ip : aws_instance.prod_webserver.public_ip]
}
# Disaster Recovery EC2 Instance
resource "aws_instance" "dr_webserver" {
  # Created only while dr_switchover is true
  count                  = var.dr_switchover ? 1 : 0
  provider               = aws.dr
  ami                    = var.dr_ami_id
  instance_type          = var.instance_type
  key_name               = var.key_name
  vpc_security_group_ids = [aws_security_group.dr_sg.id]
  user_data              = <<-EOF
    #!/bin/bash
    sudo yum update -y
    sudo yum install -y nginx
    sudo systemctl start nginx
    sudo systemctl enable nginx
    echo "<h1>Hello World from Disaster Recovery!</h1>" | sudo tee /usr/share/nginx/html/index.html
  EOF
  tags = {
    Name        = "dr-instance"
    Environment = "Disaster Recovery"
  }
  depends_on = [aws_security_group.dr_sg]
}
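The outputs shown throughout this example could be produced by an outputs.tf along the following lines. This is a sketch that assumes the resource and variable names used in this post (aws_instance.prod_webserver, aws_instance.dr_webserver, aws_route53_record.test, and var.dr_switchover); the actual file in the repository may differ.

```hcl
# Sketch only: assumes the resources and variable declared elsewhere in this example.
output "current_active_environment" {
  description = "Which environment the DNS record currently points to"
  value       = var.dr_switchover ? "Disaster Recovery" : "Production"
}

output "dns_record" {
  value = aws_route53_record.test.name
}

output "production_public_ip" {
  value = aws_instance.prod_webserver.public_ip
}

output "dr_public_ip" {
  # The splat yields an empty list when count = 0, and one() turns that into null,
  # so this output simply disappears while the DR instance does not exist.
  value = one(aws_instance.dr_webserver[*].public_ip)
}
```

Using one() with a splat expression avoids indexing into a resource that may have zero instances, which would otherwise fail whenever dr_switchover is false.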
The only change required to cut over to the DR environment is setting the value of the dr_switchover variable to true:
$ terraform apply -var="dr_switchover=true" -auto-approve
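As an alternative to passing the flag on the command line, the same value could be kept in a variable definitions file so that the switchover is recorded in version control. The file name dr.tfvars is an assumption and is not part of the original repository; it would be applied with terraform apply -var-file="dr.tfvars".

```hcl
# dr.tfvars (hypothetical file name)
# Overrides the default of false declared in variables.tf
dr_switchover = true
```

Either approach results in the same plan.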
Below are the actions and output that Terraform will display when creating the DR EC2 instance and performing an in-place update to the Route 53 record resource, changing the records argument to point to your DR web server IP address instead of the production IP address:
Terraform will perform the following actions:
# aws_instance.dr_webserver[0] will be created
+ resource "aws_instance" "dr_webserver" {
...
}
# aws_route53_record.test will be updated in-place
~ resource "aws_route53_record" "test" {
id = "Z0441403334ANN7OFVRF1_test.andrecfaria.com_A"
name = "test.andrecfaria.com"
~ records = [
- "18.234.86.230",
] -> (known after apply)
# (7 unchanged attributes hidden)
}
Plan: 1 to add, 1 to change, 0 to destroy.
Changes to Outputs:
~ current_active_environment = "Production" -> "Disaster Recovery"
+ dr_public_ip = (known after apply)
Outputs:
current_active_environment = "Disaster Recovery"
dns_record = "test.andrecfaria.com"
dr_public_ip = "54.219.217.97"
production_public_ip = "18.234.86.230"
Once the Terraform run is complete, you can validate that the DNS record now points to the DR web server by using the same dig and curl commands as before:
#dig command results showing DR IP address
$ dig test.andrecfaria.com
; <<>> DiG 9.18.28-0ubuntu0.22.04.1-Ubuntu <<>> test.andrecfaria.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19471
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;test.andrecfaria.com. IN A
;; ANSWER SECTION:
test.andrecfaria.com. 60 IN A 54.219.217.97
;; Query time: 19 msec
;; SERVER: 10.255.255.254#53(10.255.255.254) (UDP)
;; WHEN: Mon Feb 10 16:16:25 EST 2025
;; MSG SIZE rcvd: 65
#curl command showcasing DR webpage contents
$ curl "
<h1>Hello World from Disaster Recovery!</h1>
Finally, you can fail back to production by running the terraform apply command again, this time setting the dr_switchover variable back to false. This also destroys all the infrastructure created when failing over to DR, preventing unnecessary spend on idle resources.
#Setting the dr_switchover variable value via CLI
$ terraform apply -var="dr_switchover=false" -auto-approve
#Terraform apply run output
Terraform will perform the following actions:
# aws_instance.dr_webserver[0] will be destroyed
# (because index [0] is out of range for count)
- resource "aws_instance" "dr_webserver" {
...
}
# aws_route53_record.test will be updated in-place
~ resource "aws_route53_record" "test" {
id = "Z0441403334ANN7OFVRF1_test.andrecfaria.com_A"
name = "test.andrecfaria.com"
~ records = [
- "54.219.217.97",
+ "18.234.86.230",
]
# (7 unchanged attributes hidden)
}
Plan: 0 to add, 1 to change, 1 to destroy.
Changes to Outputs:
~ current_active_environment = "Disaster Recovery" -> "Production"
- dr_public_ip = "54.219.217.97" -> null
»
HashiCorp developer portal, where you can find more information regarding best practices, integrations, and reference architectures.