Multi-Account AWS with Terragrunt

As your organization grows, a single AWS account becomes a liability: one mistake or compromised credential can reach every workload, IAM policies become hard to reason about, and attributing cost to teams depends entirely on tagging discipline. AWS Organizations with multiple accounts addresses these problems, but managing Terraform across 10+ accounts introduces its own complexity. Terragrunt makes it manageable by keeping your configuration DRY and your state safely isolated.

Repository Layout

The key architectural decision is separating Terraform modules (reusable infrastructure definitions) from Terragrunt live configurations (environment-specific parameterization). This separation enables code reuse while keeping each environment's state independent.

infrastructure/
  modules/                          # Reusable Terraform modules
    vpc/
    eks-cluster/
    rds-postgres/
    s3-bucket/
    iam-baseline/
 
  live/                             # Terragrunt live configurations
    terragrunt.hcl                  # Root config (provider, backend defaults)
 
    _envcommon/                     # Shared per-component defaults
      vpc.hcl
      eks.hcl
      rds.hcl
 
    management/                     # Management account (111111111111)
      account.hcl
      us-east-1/
        organization/terragrunt.hcl
        sso/terragrunt.hcl
 
    security/                       # Security account (222222222222)
      account.hcl
      eu-west-1/
        guardduty/terragrunt.hcl
        securityhub/terragrunt.hcl
 
    production/                     # Production account (333333333333)
      account.hcl
      eu-west-1/
        vpc/terragrunt.hcl
        eks/terragrunt.hcl
        rds/terragrunt.hcl
 
    staging/                        # Staging account (444444444444)
      account.hcl
      eu-west-1/
        vpc/terragrunt.hcl
        eks/terragrunt.hcl
        rds/terragrunt.hcl
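
Each account.hcl in the layout above holds the account-level values that the root configuration reads. A minimal sketch, assuming the file defines only the two locals (account_id and account_name) that the root config references:

```hcl
# live/production/account.hcl
# Sketch: only the locals that the root terragrunt.hcl reads are defined here.
locals {
  account_name = "production"
  account_id   = "333333333333"
}
```

The root config also looks for an optional region.hcl, which the layout above does not show. Assuming it sits in the region directory, it could look like this:

```hcl
# live/management/us-east-1/region.hcl (hypothetical file, not in the layout above)
locals {
  region = "us-east-1"
}
```

Without such a file the root config falls back to eu-west-1, so a directory named us-east-1/ would likely need one.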

Root Terragrunt Configuration

The root terragrunt.hcl defines settings inherited by all child configurations: the remote state backend, the provider, and common variables.

# live/terragrunt.hcl
locals {
  account_vars = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  # region.hcl is optional: when no file is found, fall back to an empty default
  region_vars  = read_terragrunt_config(find_in_parent_folders("region.hcl", "empty.hcl"), { locals = {} })
  account_id   = local.account_vars.locals.account_id
  account_name = local.account_vars.locals.account_name
  region       = try(local.region_vars.locals.region, "eu-west-1")
}
 
# Automatically configure the S3 backend with per-account state isolation
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "terraform-state-${local.account_id}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
 
    # One state bucket per account; all of them live in the management account
    role_arn = "arn:aws:iam::111111111111:role/TerraformStateAccess"
  }
}
 
# Generate the AWS provider with cross-account role assumption
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.region}"
 
  assume_role {
    role_arn = "arn:aws:iam::${local.account_id}:role/TerraformExecutionRole"
  }
 
  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Account     = "${local.account_name}"
      Repository  = "infrastructure"
    }
  }
}
EOF
}
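
To make the state isolation concrete, here is roughly what the generated backend.tf would contain for live/production/eu-west-1/vpc. This is a worked example derived from the values above, not captured output: path_relative_to_include() resolves to the child's path relative to live/, and the header comment that Terragrunt adds to generated files is omitted.

```hcl
# backend.tf as generated for live/production/eu-west-1/vpc (illustrative)
terraform {
  backend "s3" {
    bucket         = "terraform-state-333333333333"               # from account_id
    key            = "production/eu-west-1/vpc/terraform.tfstate" # from path_relative_to_include()
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
    role_arn       = "arn:aws:iam::111111111111:role/TerraformStateAccess"
  }
}
```

Every component gets its own state key, and every account gets its own bucket, so a plan in staging never reads or locks production state.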

DRY Environment Configuration

The _envcommon/ directory contains shared defaults for each infrastructure component. Individual environments override only what differs.

# live/_envcommon/vpc.hcl
locals {
  # Git host assumed to be github.com; adjust for your provider
  base_source_url = "git::git@github.com:myorg/infrastructure-modules.git//vpc"
}
 
terraform {
  source = "${local.base_source_url}?ref=v2.3.0"
}
 
inputs = {
  enable_nat_gateway   = true
  single_nat_gateway   = false
  enable_dns_hostnames = true
  enable_flow_logs     = true
 
  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1"
  }
  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }
}

# live/production/eu-west-1/vpc/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}
 
include "envcommon" {
  path           = "${dirname(find_in_parent_folders())}/_envcommon/vpc.hcl"
  merge_strategy = "deep"
}
 
inputs = {
  name       = "production-vpc"
  cidr_block = "10.1.0.0/16"
 
  azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]
  public_subnets  = ["10.1.101.0/24", "10.1.102.0/24", "10.1.103.0/24"]
}

# live/staging/eu-west-1/vpc/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}
 
include "envcommon" {
  path           = "${dirname(find_in_parent_folders())}/_envcommon/vpc.hcl"
  merge_strategy = "deep"
}
 
inputs = {
  name       = "staging-vpc"
  cidr_block = "10.2.0.0/16"
 
  azs             = ["eu-west-1a", "eu-west-1b"]
  private_subnets = ["10.2.1.0/24", "10.2.2.0/24"]
  public_subnets  = ["10.2.101.0/24", "10.2.102.0/24"]
 
  # Cost optimization: single NAT gateway in staging
  single_nat_gateway = true
}
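
As a worked example of the merge, these are the inputs the staging VPC would end up with once the shared defaults and the staging overrides are combined. The block is illustrative: it is assembled by hand from the two files above, not captured from Terragrunt output.

```hcl
# Effective inputs for live/staging/eu-west-1/vpc after the deep merge (illustrative)
inputs = {
  # Inherited unchanged from _envcommon/vpc.hcl
  enable_nat_gateway   = true
  enable_dns_hostnames = true
  enable_flow_logs     = true
  public_subnet_tags   = { "kubernetes.io/role/elb" = "1" }
  private_subnet_tags  = { "kubernetes.io/role/internal-elb" = "1" }

  # Overridden by the staging config (the shared default is false)
  single_nat_gateway = true

  # Defined only in the staging config
  name            = "staging-vpc"
  cidr_block      = "10.2.0.0/16"
  azs             = ["eu-west-1a", "eu-west-1b"]
  private_subnets = ["10.2.1.0/24", "10.2.2.0/24"]
  public_subnets  = ["10.2.101.0/24", "10.2.102.0/24"]
}
```

One caveat with the deep strategy: maps are merged recursively, but lists are concatenated rather than replaced. A list input defined in both _envcommon and the environment config would contain the entries from both, so list-valued inputs are best defined in one place only.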

Cross-Account Dependencies

Terragrunt's dependency block lets you reference outputs from other Terragrunt configurations, even across accounts:

# live/production/eu-west-1/eks/terragrunt.hcl
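dependency "vpc" {
  config_path = "../vpc"

  # Placeholder values so validate and plan work before the VPC has been applied
  mock_outputs = {
    vpc_id          = "vpc-mock"
    private_subnets = ["subnet-mock-1", "subnet-mock-2"]
  }
  # Never fall back to the mocks during apply
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}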
 
inputs = {
  cluster_name    = "production"
  cluster_version = "1.29"
  vpc_id          = dependency.vpc.outputs.vpc_id
  subnet_ids      = dependency.vpc.outputs.private_subnets
}
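
The example above stays within one account. For a dependency that crosses an account boundary, only config_path changes. Terragrunt resolves the outputs from the dependency's own state, so the credentials running the plan need read access to that state. The sketch below assumes a hypothetical vpc-peering component in staging that is not part of the layout above. The input names peer_vpc_id and peer_account_id are likewise assumptions, and the terraform source block is left out.

```hcl
# live/staging/eu-west-1/vpc-peering/terragrunt.hcl
# Hypothetical component: peers the staging VPC with the production VPC.
# The terraform { source = ... } block is omitted from this sketch.
include "root" {
  path = find_in_parent_folders()
}

dependency "staging_vpc" {
  config_path = "../vpc"

  mock_outputs = {
    vpc_id = "vpc-mock-staging"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

dependency "production_vpc" {
  # Three levels up reaches live/, then down into the production account
  config_path = "../../../production/eu-west-1/vpc"

  mock_outputs = {
    vpc_id = "vpc-mock-production"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id          = dependency.staging_vpc.outputs.vpc_id
  peer_vpc_id     = dependency.production_vpc.outputs.vpc_id # hypothetical input name
  peer_account_id = "333333333333"                           # hypothetical input name
}
```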

Environment Promotion Workflow

Promoting infrastructure changes through environments follows the same pattern as application deployments:

1. Bump the module version for staging only by overriding terraform.source in the staging component's terragrunt.hcl. The ref pinned in _envcommon is shared by every environment, so changing it there rolls the new version out to staging and production together.
2. Open a PR. Atlantis (or a GitHub Actions workflow) runs terragrunt plan and posts the plan as a PR comment.
3. Review and apply to staging. Validate with integration tests.
4. Promote to production by updating the production config to use the same module version. Another PR, another review.
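
One way to bump the version for a single environment is to expose the locals of the shared _envcommon file and override terraform.source in the component's own config. A sketch under that assumption, where the v2.4.0 ref is a hypothetical next release:

```hcl
# live/staging/eu-west-1/vpc/terragrunt.hcl (excerpt)
include "root" {
  path = find_in_parent_folders()
}

include "envcommon" {
  path           = "${dirname(find_in_parent_folders())}/_envcommon/vpc.hcl"
  merge_strategy = "deep"
  expose         = true # makes include.envcommon.locals available in this file
}

# Overrides the ref pinned in _envcommon/vpc.hcl for this environment only.
terraform {
  source = "${include.envcommon.locals.base_source_url}?ref=v2.4.0" # hypothetical version
}
```

Once production has been promoted to the same version, the new ref can move into _envcommon/vpc.hcl and the per-environment overrides can be dropped.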

Atlantis Integration

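Atlantis provides a pull-request-driven workflow for Terraform. To have it run Terragrunt instead, define a custom workflow. Workflows defined in a repo-level atlantis.yaml are honored only if the server-side repo config sets allow_custom_workflows: true and lists workflow under allowed_overrides.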

# atlantis.yaml (repo-level config)
version: 3
automerge: false
parallel_plan: true
parallel_apply: false
 
projects:
  - name: production-vpc
    dir: live/production/eu-west-1/vpc
    workflow: terragrunt
    autoplan:
      when_modified: ["*.hcl", "../../../_envcommon/vpc.hcl"]
 
  - name: staging-vpc
    dir: live/staging/eu-west-1/vpc
    workflow: terragrunt
    autoplan:
      when_modified: ["*.hcl", "../../../_envcommon/vpc.hcl"]
 
workflows:
  terragrunt:
    plan:
      steps:
        - env:
            name: TERRAGRUNT_TFPATH
            value: terraform
        - run: terragrunt plan -no-color -out=$PLANFILE
    apply:
      steps:
        - run: terragrunt apply -no-color $PLANFILE

Drift Detection

Schedule regular drift detection runs to catch manual changes made outside of Terraform:

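#!/bin/bash
# scripts/detect-drift.sh
# Plans every component in each account and reports drift to Slack.
set -euo pipefail

# Fail early if the webhook URL is not configured
: "${SLACK_WEBHOOK:?SLACK_WEBHOOK must be set}"

ACCOUNTS=("production" "staging" "security")
DRIFT_FOUND=0

for account in "${ACCOUNTS[@]}"; do
  echo "Checking drift in $account..."

  # Exit code 2 means "changes present", so suspend errexit around the plan;
  # otherwise set -e with pipefail would abort before the code can be inspected.
  # The subshell keeps the cd from leaking into the next iteration.
  set +e
  (
    cd "live/$account" &&
      terragrunt run-all plan -detailed-exitcode -no-color 2>&1
  ) | tee "/tmp/drift-$account.log"
  EXIT_CODE=${PIPESTATUS[0]}
  set -e

  if [ "$EXIT_CODE" -eq 2 ]; then
    echo "DRIFT DETECTED in $account"
    DRIFT_FOUND=1
    # Send Slack notification
    curl -fsS -X POST -H 'Content-Type: application/json' \
      -d "{\"text\":\"Drift detected in *$account* account\"}" "$SLACK_WEBHOOK" \
      || echo "Slack notification failed for $account" >&2
  elif [ "$EXIT_CODE" -ne 0 ]; then
    # Any other non-zero code means the plan itself failed; do not report "no drift"
    echo "Plan FAILED in $account (exit code $EXIT_CODE)" >&2
    DRIFT_FOUND=1
  fi
done

exit "$DRIFT_FOUND"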

Key Takeaways

Separate modules (what to build) from live configs (where and how to build it).
Use _envcommon/ to define component defaults; override only environment-specific values.
Isolate state per account with a dedicated S3 bucket for each account, and use DynamoDB for state locking.
Pin module versions and promote them through environments like application releases.
Automate plan/apply with Atlantis for auditability and team collaboration.
Run scheduled drift detection to catch out-of-band changes before they cause incidents.