Terraforming 1Password

A few days ago I posted this tweet:

The tweet generated quite a bit of interest from people who run or manage their own services, and I thought I would share some of the cool things we are working on.

This post will go into technical details and I apologize in advance if I explain things too quickly. I tried to make up for this by including some pretty pictures but most of them ended up being code snippets. 🙂

1Password and AWS

1Password is hosted by Amazon Web Services (AWS). We’ve been using AWS for several years now, and it is incredible how easy it was to scale our service from zero users three years ago to several million happy customers today.

AWS has many geographical regions. Each region consists of multiple independent data centres located close together. We are currently using three regions:

  • N. Virginia, USA (us-east-1)
  • Montreal, Canada (ca-central-1)
  • Frankfurt, Germany (eu-central-1)

In each region we have four environments running 1Password:

  • production
  • staging
  • testing
  • development

If you are counting, that’s 12 environments across three regions, including three production environments: 1password.com, 1password.ca, and 1password.eu.

Every 1Password environment is more or less identical and includes these components:

  • Virtual Private Cloud
  • Amazon Aurora database cluster
  • Caching (Redis) clusters
  • Subnets
  • Routing tables
  • Security groups
  • IAM permissions
  • Auto-scaling groups
  • Elastic Compute Cloud (EC2) instances
  • Elastic Load Balancers (ELB)
  • Route53 DNS (both internal and external)
  • Amazon S3 buckets
  • CloudFront distributions
  • Key Management System (KMS)

Here is a simplified diagram:

[Diagram: a simplified view of a single 1Password environment, with components duplicated across two Availability Zones (shown in blue)]

As you can see, there are many components working together to provide the 1Password service. One of the reasons it is so complex is the need for high availability. Most of the components are deployed as a cluster to make sure there are at least two of each: database, cache, server instance, and so on.

Furthermore, every AWS region has at least two data centres that are also known as Availability Zones (AZs) – you can see them in blue in the diagram above. Every AZ has its own independent power and network connections. For example, the Canadian region ca-central-1 has two data centres: ca-central-1a and ca-central-1b.

If we deployed all 1Password components into just a single Availability Zone, then we would not be able to achieve high availability because a single problem in the data centre would take 1Password offline. This is why when 1Password services are deployed in a region, we make sure that every component has at least one backup in the neighbouring data centre. This helps to keep 1Password running even when there’s a problem in one of the data centres.
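
To make this concrete, here is a minimal, hypothetical sketch of one component duplicated across two Availability Zones, written in Terraform (the tool introduced in the next section). The names and CIDR ranges are made up, and the VPC and launch configuration are assumed to be defined elsewhere; this is not our actual configuration.

# Hypothetical sketch: one subnet per Availability Zone, plus an Auto Scaling
# group that always keeps at least two instances, one in each AZ. The VPC and
# launch configuration referenced here are assumed to exist elsewhere.
resource "aws_subnet" "app_a" {
  vpc_id            = "${aws_vpc.main.id}"
  cidr_block        = "10.0.1.0/24"
  availability_zone = "ca-central-1a"
}

resource "aws_subnet" "app_b" {
  vpc_id            = "${aws_vpc.main.id}"
  cidr_block        = "10.0.2.0/24"
  availability_zone = "ca-central-1b"
}

resource "aws_autoscaling_group" "app" {
  name                 = "app-asg"
  min_size             = 2
  max_size             = 3
  launch_configuration = "${aws_launch_configuration.app.name}"
  vpc_zone_identifier  = ["${aws_subnet.app_a.id}", "${aws_subnet.app_b.id}"]
}

If the data centre hosting one of these subnets has a problem, the Auto Scaling group can keep the service running from the other.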

Infrastructure as Code

It would be very challenging and error-prone to manually deploy and maintain 12 environments, especially when you consider that each environment consists of at least 50 individual components.

This is why so many companies have switched away from updating their infrastructure manually and embraced Infrastructure as Code. With Infrastructure as Code, the hardware becomes software and can take advantage of all software development best practices. When we apply these practices to infrastructure, every server, every database, and every open network port can be described in code, committed to GitHub, peer-reviewed, and then deployed and updated as many times as necessary.

For AWS customers, there are two major languages that can be used to describe and maintain infrastructure:

  • AWS CloudFormation
  • HashiCorp Terraform

CloudFormation is an excellent option for many AWS customers, and we successfully used it to deploy 1Password environments for over two years. At the same time we wanted to move to Terraform as our main infrastructure tool for several reasons:

  • Terraform has a more straightforward and powerful language (HCL) that makes it easier to write and review code.
  • Terraform has the concept of resource providers that allows us to manage resources outside of Amazon Web Services, including services like DataDog and PagerDuty, which we rely on internally (see the sketch after this list).
  • Terraform is completely open source and that makes it easier to understand and troubleshoot.
  • We are already using Terraform for smaller web apps at AgileBits, and it makes sense to standardize on a single tool.
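
To illustrate the resource providers point above, here is a hedged sketch of what it could look like to manage a DataDog monitor from the same Terraform codebase as the AWS resources. The variables, monitor name, and query are hypothetical and not taken from our configuration:

# Hypothetical sketch only: the keys, monitor name, and query are made up.
variable "datadog_api_key" {}
variable "datadog_app_key" {}

provider "datadog" {
  api_key = "${var.datadog_api_key}"
  app_key = "${var.datadog_app_key}"
}

# A CPU alert defined right next to the infrastructure it watches.
resource "datadog_monitor" "b5_cpu" {
  name    = "High CPU on B5 app servers"
  type    = "metric alert"
  message = "CPU is high on a B5 instance. @pagerduty"
  query   = "avg(last_5m):avg:aws.ec2.cpuutilization{env:prd} > 80"
}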

Compared to the JSON or YAML files used by CloudFormation, Terraform HCL is both a more powerful and a more readable language. Here is a small snippet that defines the subnets for the application servers. As you can see, the Terraform code is a quarter of the size, more readable, and easier to understand.

CloudFormation

"B5AppSubnet1": {
    "Type": "AWS::EC2::Subnet",
    "Properties": {
        "CidrBlock": { "Fn::Select" : ["0", { "Fn::FindInMap" : [ "SubnetCidr", { "Ref" : "Env" }, "b5app"] }] },
        "AvailabilityZone": { "Fn::Select" : [ "0", { "Fn::GetAZs" : "" } ]},
        "VpcId": { "Ref": "Vpc" },
        "Tags": [
            { "Key" : "Application", "Value" : "B5" },
            { "Key" : "env", "Value": { "Ref" : "Env" } },
            { "Key" : "Name", "Value": { "Fn::Join" : ["-", [ {"Ref" : "Env"}, "b5", "b5app-subnet1"]] } }
        ]
    }
},

"B5AppSubnet2": {
    "Type": "AWS::EC2::Subnet",
    "Properties": {
        "CidrBlock": { "Fn::Select" : ["1", { "Fn::FindInMap" : [ "SubnetCidr", { "Ref" : "Env" }, "b5app"] }] },
        "AvailabilityZone": { "Fn::Select" : [ "1", { "Fn::GetAZs" : "" } ]},
        "VpcId": { "Ref": "Vpc" },
        "Tags": [
            { "Key" : "Application", "Value" : "B5" },
            { "Key" : "env", "Value": { "Ref" : "Env" } },
            { "Key" : "Name", "Value": { "Fn::Join" : ["-", [ {"Ref" : "Env"}, "b5", "b5app-subnet2"]] } }
        ]
    }
},

"B5AppSubnet3": {
    "Type": "AWS::EC2::Subnet",
    "Properties": {
        "CidrBlock": { "Fn::Select" : ["2", { "Fn::FindInMap" : [ "SubnetCidr", { "Ref" : "Env" }, "b5app"] }] },
        "AvailabilityZone": { "Fn::Select" : [ "2", { "Fn::GetAZs" : "" } ]},
        "VpcId": { "Ref": "Vpc" },
        "Tags": [
            { "Key" : "Application", "Value" : "B5" },
            { "Key" : "env", "Value": { "Ref" : "Env" } },
            { "Key" : "Name", "Value": { "Fn::Join" : ["-", [ {"Ref" : "Env"}, "b5", "b5app-subnet3"]] } }
        ]
    }
},

Terraform

resource "aws_subnet" "b5app" {
  count             = "${length(var.subnet_cidr["b5app"])}"
  vpc_id            = "${aws_vpc.b5.id}"
  cidr_block        = "${element(var.subnet_cidr["b5app"],count.index)}"
  availability_zone = "${var.az[count.index]}"

  tags {
    Application = "B5"
    env         = "${var.env}"
    type        = "${var.type}"
    Name        = "${var.env}-b5-b5app-subnet-${count.index}"
  }
}

Terraform has another gem of a feature that we rely on: terraform plan. It allows us to visualize the changes that will happen to the environment without performing them.

For example, here is what would happen if we changed the server instance size from t2.medium to t2.large:

Terraform Plan Output

#
# Terraform code changes
#
# variable "instance_type" {
#    type        = "string"
# -  default     = "t2.medium"
# +  default     = "t2.large"
#  }


$ terraform plan 
Refreshing Terraform state in-memory prior to plan...

...

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

-/+ module.b5site.aws_autoscaling_group.asg (new resource required)
      id:                                 "B5Site-prd-lc20180123194347404900000001-asg" =>  (forces new resource)
      arn:                                "arn:aws:autoscaling:us-east-1:921352000000:autoScalingGroup:32b38032-56c6-40bf-8c57-409e9e4a264a:autoScalingGroupName/B5Site-prd-lc20180123194347404900000001-asg" => 
      default_cooldown:                   "300" => 
      desired_capacity:                   "2" => "2"
      force_delete:                       "false" => "false"
      health_check_grace_period:          "300" => "300"
      health_check_type:                  "ELB" => "ELB"
      launch_configuration:               "B5Site-prd-lc20180123194347404900000001" => "${aws_launch_configuration.lc.name}"
      load_balancers.#:                   "0" => 
      max_size:                           "3" => "3"
      metrics_granularity:                "1Minute" => "1Minute"
      min_size:                           "2" => "2"
      name:                               "B5Site-prd-lc20180123194347404900000001-asg" => "${aws_launch_configuration.lc.name}-asg" (forces new resource)
      protect_from_scale_in:              "false" => "false"
      tag.#:                              "4" => "4"
      tag.1402295282.key:                 "Application" => "Application"
      tag.1402295282.propagate_at_launch: "true" => "true"
      tag.1402295282.value:               "B5Site" => "B5Site"
      tag.1776938011.key:                 "env" => "env"
      tag.1776938011.propagate_at_launch: "true" => "true"
      tag.1776938011.value:               "prd" => "prd"
      tag.3218409424.key:                 "type" => "type"
      tag.3218409424.propagate_at_launch: "true" => "true"
      tag.3218409424.value:               "production" => "production"
      tag.4034324257.key:                 "Name" => "Name"
      tag.4034324257.propagate_at_launch: "true" => "true"
      tag.4034324257.value:               "prd-B5Site" => "prd-B5Site"
      target_group_arns.#:                "2" => "2"
      target_group_arns.2352758522:       "arn:aws:elasticloadbalancing:us-east-1:921352000000:targetgroup/prd-B5Site-8080-tg/33ceeac3a6f8b53e" => "arn:aws:elasticloadbalancing:us-east-1:921352000000:targetgroup/prd-B5Site-8080-tg/33ceeac3a6f8b53e"
      target_group_arns.3576894107:       "arn:aws:elasticloadbalancing:us-east-1:921352000000:targetgroup/prd-B5Site-80-tg/457e9651ad8f1af4" => "arn:aws:elasticloadbalancing:us-east-1:921352000000:targetgroup/prd-B5Site-80-tg/457e9651ad8f1af4"
      vpc_zone_identifier.#:              "2" => "2"
      vpc_zone_identifier.2325591805:     "subnet-d87c3dbc" => "subnet-d87c3dbc"
      vpc_zone_identifier.3439339683:     "subnet-bfe16590" => "subnet-bfe16590"
      wait_for_capacity_timeout:          "10m" => "10m"

-/+ module.b5site.aws_launch_configuration.lc (new resource required)
      id:                                 "B5Site-prd-lc20180123194347404900000001" =>  (forces new resource)
      associate_public_ip_address:        "false" => "false"
      ebs_block_device.#:                 "0" => 
      ebs_optimized:                      "false" => 
      enable_monitoring:                  "true" => "true"
      iam_instance_profile:               "prd-B5Site-instance-profile" => "prd-B5Site-instance-profile"
      image_id:                           "ami-263d0b5c" => "ami-263d0b5c"
      instance_type:                      "t2.medium" => "t2.large" (forces new resource)
      key_name:                           "" => 
      name:                               "B5Site-prd-lc20180123194347404900000001" => 
      name_prefix:                        "B5Site-prd-lc" => "B5Site-prd-lc"
      root_block_device.#:                "0" => 
      security_groups.#:                  "1" => "1"
      security_groups.4230886263:         "sg-aca045d8" => "sg-aca045d8"
      user_data:                          "ff8281e17b9f63774c952f0cde4e77bdba35426d" => "ff8281e17b9f63774c952f0cde4e77bdba35426d"


Plan: 2 to add, 0 to change, 2 to destroy.

Overall, Terraform is a pleasure to work with, and that makes a huge difference in our daily lives. DevOps people like to enjoy their lives too. 🙌

Migration from CloudFormation to Terraform

It is possible to simply import the existing AWS infrastructure directly into Terraform, but there are certain downsides to it. We found that the naming conventions are quite different, which would make it more challenging to maintain our environments in the future. A simple import would also not let us take advantage of newer Terraform features. For example, instead of hard-coding the identifiers of the Amazon Machine Images used for deployment, we started using the aws_ami data source to find the most recent image dynamically:

aws_ami

data "aws_ami" "bastion_ami" {
  most_recent = true
  
  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
  filter {
    name   = "name"
    values = ["bastion-*"]
  }
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
  name_regex = "bastion-.*"
  owners     = ["92135000000"]
}

It took us a couple of weeks to write the code from scratch. After we had the same infrastructure described in Terraform, we recreated all non-production environments where downtime wasn’t an issue. This also allowed us to create a complete checklist of all the steps required to migrate the production environment.
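
To give a rough idea of how the new code is organized, resources live in reusable modules such as the b5site module that shows up in the plan output above. The following is only a hypothetical sketch; the source path and input values are made up:

# Hypothetical sketch: the b5site module name matches the plan output above,
# but the source path and inputs shown here are made up.
module "b5site" {
  source        = "./modules/b5site"
  env           = "prd"
  type          = "production"
  instance_type = "t2.medium"
  az            = ["us-east-1a", "us-east-1b"]
}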

Finally, on January 21, 2018, we completely recreated 1Password.com. We had to bring the service offline during the migration. Most of our customers were not affected by the downtime because the 1Password apps are designed to function even when the servers are down or when an Internet connection is not available. Unfortunately, our customers who needed to access the web interface during that time were unable to do so, and we apologize for the interruption. Most of the 2 hours and 39 minutes of downtime were related to data migration. The 1Password.com database is just under 1TB in size (not including documents and attachments), and it took almost two hours to complete the snapshot and restore operations.

We are excited to finally have all our development, test, staging, and production environments managed with Terraform. There are many new features and improvements we have planned for 1Password, and it will be fun to review new infrastructure pull requests on GitHub!

I remember when we were starting out, we hosted our very first server with 1&1. It would have taken weeks to rebuild the very simple environment there. The world has come a long way since we first launched 1Passwd 13 years ago. I am looking forward to what the next 13 years will bring! 😃

Questions

A few questions and suggestions about the migration came up on Twitter:

By “recreating” you mean building out a whole new VPC with Terraform? Couldn’t you build it then switch existing DNS over for much less down time?

This is pretty much what we ended up doing. Most of the work was performed before the downtime. Then we updated the DNS records to point to the new VPC.
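
For the curious, the cutover itself is mostly a matter of where the public DNS records point. Here is a hedged sketch of a Route53 alias record aimed at the new environment's load balancer; the zone variable, record name, and load balancer reference are all hypothetical:

# Hypothetical sketch: the zone ID variable, record name, and load balancer
# referenced here are made up. Repointing this record at the new environment's
# load balancer completes the switch-over.
resource "aws_route53_record" "app" {
  zone_id = "${var.public_zone_id}"
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = "${aws_lb.new_b5.dns_name}"
    zone_id                = "${aws_lb.new_b5.zone_id}"
    evaluate_target_health = true
  }
}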

Couldn’t you’ve imported all online resources? Just wondering.

That is certainly possible, and it would have allowed us to avoid downtime. Unfortunately, it also requires manual mapping of all existing resources. Because of that, it’s hard to test, and the chance of human error is high – and we know humans are pretty bad at this. As a wise person on Twitter said: “If you can’t rebuild it, you can’t rebuild it.”
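
For context, an import means running something like the following for every single resource and then hand-writing HCL that matches each one exactly (the subnet ID below is made up):

$ terraform import 'aws_subnet.b5app[0]' subnet-0123abcd

Multiply that by the 50-plus components in each of our 12 environments and the appeal of rebuilding from clean code becomes clear.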

If you have any questions, let us know in the comments, or ask me (@roustem) and Tim (@stumyp), our Beardless Keeper of Keys and Grounds, on Twitter.

Comments
  1. Adeel says:

    “Furthermore, every AWS region has at least two data centres that are also known as Availability Zones (AZs)” – Actually… I recently went to AWS for training in London, and they told me that “Each AZ is one or more data centres – most of the time being the latter”.

    Which makes a total of 11 AZs across your regions – very likely 22 data centres.

    • Roustem Karimov says:

      AWS keeps growing and adding more regions and AZs. When I look at AWS Console today, I can see 6 AZs in us-east-1, 2 AZs in ca-central-1, and 3 AZs in eu-central-1.

    • DM says:

      AWS AZs are not static. Most people assume 1a and 1b are always the same, but in fact that is only true at an account level.

      Also each zone is backed by one or more physical data centers. Some have up to 6 today if memory serves.

      A single AZ can span multiple data centers but no two zones share a data center. And for load balancing, at the DC level, Amazon independently maps zones to identifiers at the account level. This means that us-east-1a for me will not be the same for you.

    • Roustem Karimov says:

      Wow, thank you for sharing this info! I had no idea, I don’t think AWS mentions it anywhere.

    • Winson Smith says:

      AWS AZs are ‘static’ in that once AZs are assigned to an account, they do not change. What you see as “ca-central-1a” will always be the same physical data centre.

      New AWS Regions will always have three AZs at launch, at least as of the middle of last year. Some recent regions (London, I’m looking at you) were launched with only two AZs, but that will not be the norm going forward. Too many AWS services rely on three AZs to function, so launching in a ‘crippled’ state is a bad customer experience.

    • Cobus Bernard says:

      Having the AZs differ between accounts has interesting implications for reserved instances – ping your account manager, they are able to align them for you if you ask nicely. Just spanning 2 AZs can sometimes be a risk: when 1 AZ goes down, there is storming due to all the ASGs trying to spin up in the other available AZs (which is worse if there is only 1 other to pick from). Luckily they don’t go down often, but it is something to be aware of if you buy Reserved Instances (the non-AZ specific ones as they don’t have a capacity guarantee, just a discount).

  2. Esmond Kane says:

    Did you consider making the legacy cloud-formation driven web-interface read-only while the migration was occurring vs taking it offline and impacting the user?

    • Roustem Karimov says:

      It is an interesting idea! We never really thought about it. One of the challenges here is that our activity log keeps track of logins as well, so the database would have to be updated even in “read-only” mode.

  3. Winson Smith says:

    First, congrats on the successful migration. Glad to hear that it went well.

    One minor nitpick in the blogpost, though. “Canadian region ca-central-1 has two data centres: ca-central-1a and ca-central-2a.” The naming convention is “country-region-number-a” and “…-b”. So the second AZ in ca-central-1 is ca-central-1b.

    “ca-central-2a” would imply a second region in central Canada.

  4. Cobus Bernard says:

    Another idea that we are currently testing is to build out our base AMI with Packer on a weekly basis with the latest AMI of our distro of choice (Ubuntu) using the same search that you have for the AMI. That way you get all the updates on a weekly basis in a fresh image. You then have terraform create the launch config with a timestamp component (like you do), but add the name to the lifecycle ignore. That means you will only recreate the LC if there is a non-name change (like the AMI changing). Lastly, you then set your ASGs to scale up 2x required capacity & down to normal again at a specific time, i.e. Monday after coffee when everyone is in (for the dev env). That will cycle out the older instances and you have freshly patched instances.

    • Roustem Karimov says:

      I think this is something that we could use as well.

      One more change that I would like to see in our environment is a regular recycling of instances.

      We do that now for every new deployment but it would be great to make sure that none of the instances lives longer than a few hours.

  5. Myron A. Semack says:

    I notice that you are using three regions, but many of the services you listed (Aurora) do not cluster across regions. Is customer data replicated between regions? If so, can you share how you are handling that? (The common solution seems to be copying snapshots between regions combined with S3 replication.)

    A region-wide failure is unlikely, but it certainly is possible (ex. a hurricane rolling through Virginia).

    • Roustem Karimov says:

      The data is not replicated. We intentionally keep these three regions completely separated from each other. Part of it is related to the upcoming GDPR regulations, which are easier to comply with if we do not transfer data outside the EU.

      I agree with your point about having backups in another region (or maybe even in other services like Google Cloud or Azure). We hope to get it implemented in 2018.

  6. Sam Gendler says:

    I’d be interested in hearing about your terraform infrastructure – not the infrastructure it manages, but the stuff surrounding terraform itself.

    Are your engineers IAM users in your AWS account(s) or are you using 3rd party auth into amazon with assumed roles? If so, what’s your solution for authenticating and maintaining temporary credentials?
    How are you sharing stored state in terraform – the s3 backend or something with more sophisticated locking?
    Are your environments spread across multiple AWS accounts so that prod can be isolated or is everything in a single account (it’s interesting because it presents auth and identity difficulties if you use 3rd party auth, because you cannot assume another role from an assumed role and 3rd party users only get assumed roles)?
    Finally, how big was the team responsible for migrating your cloud formation config to terraform? You mention several weeks of effort, but I’m curious about man-weeks.

    Did you have to write any custom providers/provisioners to do the things you wanted?
    Does CI/CD use terraform or modify terraform configs or are they completely separate?
    Are you using any tooling to generate terraform configs at plan/apply-time – terragrunt or a tool to render templates as JSON, or anything else to make working with terraform in a team a little easier and to make complex templates easier to write than HCL often allows?

    • Roustem Karimov says:

      Hi Sam,

      Thank you for your questions. I will try to answer them here but we probably need another blog post to cover everything :)

      We don’t use any 3rd party auth at the moment. I am not very happy with how credentials are managed — we have to manually rotate the secrets. Thankfully, we have very few people who require this level of access, and many operations can be done using IAM Roles instead of keys. Many operations are performed by a special control instance that runs within the environment and is only accessible via a bastion host. I did not show it on the diagram.

      We use multiple AWS accounts because they make it simpler to lock down access to production environments. For example, development and test environments are deployed in a separate account.

      We had two people working on the migration. It is really not that difficult, especially if you can do it in smaller steps and verify the results instead of trying to complete everything at once.

      I have to go now, will ask Tim to answer the rest :)

    • Sam Gendler says:

      A blog post about how you use terraform itself would certainly be welcome. There doesn’t seem to be a ton of discussion about trying to formulate common best practices for how to manage certain aspects of using terraform in a team – managing credentials, keeping secrets secret, managing infrastructure across multiple accounts, etc. I’ve certainly stumbled over a few speed bumps in those areas along the way, but I’m actively modifying our entire infrastructure to more closely adhere to common AWS networking and security best practices while also automating everything in terraform, so it’s a much larger task than just migrating from cloudformation. We had little to no automation of anything prior to my development effort, and I’ve definitely encountered some significant friction from terraform itself in the process. It would be interesting to see how others are solving similar problems.

    • Tim Sattarov says:

      Hi Sam!

      It is a really broad set of questions, and, as Roustem pointed out, we should write a blog post about it :)

      Did you have to write any custom providers/provisioners to do the things you wanted?

      No, we didn’t need one. So far the existing Terraform providers are more than enough for us.

      Does CI/CD use terraform or modify terraform configs or are they completely separate?

      I’m really anxious about giving CI/CD access to our infrastructure; only a few people have access to that.
      Currently the workflow is organized through Pull Requests and tag/commit signing (another blog post pending :) ).

      Are you using any tooling to generate terraform configs at plan/apply-time – terragrunt or a tool to render templates as JSON, or anything else to make working with terraform in a team a little easier and to make complex templates easier to write than HCL often allows?

      Currently we are perfectly satisfied with Terraform and HCL. We use modules a lot; that is our way to keep our code readable.
      With code generator tools I would not have as much control over the generated code, or the flexibility of state manipulation that Terraform gives me. Terraform itself is an abstraction on top of API calls.
      I do not have strong opinions about Terragrunt or other tools. I just don’t see them being useful to us at this moment. That may change in the future.

  7. Michael Smith says:

    Very cool. I’ve been using Terraform for nearly 3 years now and still like it more than CloudFormation. Are all your regions independent of each other? If not, how are you keeping all the data across regions in sync?

    • Roustem Karimov says:

      Yes. We intentionally keep them completely independent from one another. There is no syncing and they never talk to each other.

    • Tim Sattarov says:

      Thanks Bjorn,

      yes we are aware of ChangeSets, although it is not quite the same experience with them. Especially when nested stacks are used.

  8. Jared Allen says:

    Would you be willing to share what the tech looks like running on each of the EC2 instances? How do you manage and deploy the code running on them?

