Ephemeral CI Runners with Tart VMs

macOS CI runners are expensive. GitHub-hosted large runners cost $0.48/minute. Self-hosted persistent runners accumulate state -- leftover caches, stale Xcode simulators, rogue processes -- that causes flaky builds and security concerns. The solution is ephemeral runners: a fresh macOS VM for every job, destroyed immediately after. This article shows how to build that system using Tart (an open-source macOS VM manager built on Apple's Virtualization.framework), Packer for image pipelines, and a custom Python controller to orchestrate the lifecycle.

Architecture Overview

The system has four components:

Base image pipeline -- Packer builds a macOS VM image with Xcode, Homebrew dependencies, and the GitHub Actions runner agent pre-installed. Runs weekly on a cron schedule.
Controller service -- A Python daemon that polls the GitHub API for queued workflow runs, clones a Tart VM from the base image, starts the runner, and destroys the VM once the job completes.
Cache layer -- A multi-tier caching strategy (APFS clones for VM images, a local HTTP proxy for package downloads) that keeps VM startup time under 30 seconds.
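Host fleet -- One or more Apple Silicon Mac Minis running the controller. Each host can run at most two macOS VMs at a time: Virtualization.framework enforces that limit regardless of available RAM, so capacity scales by adding hosts.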
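
The sizing arithmetic for the host fleet can be sketched in a few lines of Python. The RAM figures and the host reserve below are illustrative assumptions rather than measurements, and the cap of two reflects the limit on concurrently running macOS guests per host.

```python
# Illustrative sizing helper; the default RAM figures are assumptions.
MACOS_GUEST_LIMIT = 2  # a host will not start a third macOS guest


def vm_slots(host_ram_gb: int, vm_ram_gb: int = 8, host_reserve_gb: int = 4) -> int:
    """Estimate how many runner VMs one host can run at the same time."""
    if vm_ram_gb <= 0:
        raise ValueError("vm_ram_gb must be positive")
    by_ram = max(0, (host_ram_gb - host_reserve_gb) // vm_ram_gb)
    return min(by_ram, MACOS_GUEST_LIMIT)


def hosts_needed(peak_jobs: int, slots_per_host: int) -> int:
    """Hosts required to run peak_jobs concurrently (ceiling division)."""
    if slots_per_host <= 0:
        raise ValueError("slots_per_host must be positive")
    return -(-peak_jobs // slots_per_host)
```

With 8 GB per VM, a 16 GB host has room for one runner and a 32 GB host reaches the two-guest cap, so extra RAM beyond that point does not buy more concurrency.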

Building Base Images with Packer

Tart provides a Packer plugin that creates macOS VM images. The build below clones a base image from an OCI registry, installs Homebrew dependencies and Xcode, unpacks the runner agent, and leaves the result as a local Tart VM that the controller can clone.

# packer/macos-runner.pkr.hcl
packer {
  required_plugins {
    tart = {
      source  = "github.com/cirruslabs/tart"
      version = ">= 1.12.0"
    }
  }
}
 
source "tart-cli" "macos" {
  vm_base_name = "ghcr.io/cirruslabs/macos-sonoma-base:latest"
  vm_name      = "macos-ci-runner"
  cpu_count    = 4
  memory_gb    = 8
  disk_size_gb = 80
 
  ssh_username = "admin"
  ssh_password = "admin"
  ssh_timeout  = "120s"
}
 
build {
  sources = ["source.tart-cli.macos"]
 
  # Install Homebrew and common dependencies
  provisioner "shell" {
    inline = [
      "NONINTERACTIVE=1 /bin/bash -c \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\"",
      "eval \"$(/opt/homebrew/bin/brew shellenv)\"",
      "brew install git-lfs jq gh swiftlint fastlane",
      "git lfs install"
    ]
  }
 
  # Install Xcode from a pre-downloaded .xip
  provisioner "file" {
    source      = "/Volumes/Storage/Xcode_15.4.xip"
    destination = "/tmp/Xcode.xip"
  }
 
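  # xip has no -C flag: it always expands into the current directory
  provisioner "shell" {
    inline = [
      "cd /Applications && xip --expand /tmp/Xcode.xip",
      "sudo xcode-select -s /Applications/Xcode.app",
      "sudo xcodebuild -license accept",
      "sudo xcodebuild -runFirstLaunch",
      "rm /tmp/Xcode.xip"
    ]
  }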
 
  # Pre-install GitHub Actions runner
  provisioner "shell" {
    # -f makes curl fail on an HTTP error instead of piping an error page to tar
    inline = [
      "mkdir -p ~/actions-runner && cd ~/actions-runner",
      "curl -fsSL https://github.com/actions/runner/releases/download/v2.319.1/actions-runner-osx-arm64-2.319.1.tar.gz | tar xz"
    ]
  }
 
  # Clean up caches to reduce image size
  provisioner "shell" {
    inline = [
      "brew cleanup --prune=all",
      "rm -rf ~/Library/Caches/*",
      "sudo purge"
    ]
  }
}

Build the image with:

packer build packer/macos-runner.pkr.hcl
# Produces: ~/.tart/vms/macos-ci-runner/
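
The Packer build above pipes the runner tarball straight from curl into tar, so nothing checks that the download is the file you expected. One way to tighten this, sketched here under the assumption that you download the tarball to disk first and copy the expected SHA-256 from the runner's release page, is to verify the digest before unpacking. The function names are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Raise ValueError if the file does not match the expected digest."""
    actual = sha256_of(path)
    expected = expected_sha256.strip().lower()
    if not hmac.compare_digest(actual, expected):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
```

Running a check like this in the image pipeline means a corrupted or tampered download fails the weekly build instead of ending up in every runner VM.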

The Controller Service

The controller is a Python asyncio service that manages the full lifecycle of ephemeral runners. It polls for pending jobs, provisions VMs, and cleans up after completion.

# controller/runner_manager.py
import asyncio
import subprocess
import logging
from dataclasses import dataclass, field
from github import Github
 
logger = logging.getLogger(__name__)
 
@dataclass
class RunnerConfig:
    base_image: str = "macos-ci-runner"
    # A host can run at most two macOS guests at the same time
    max_concurrent: int = 2
    github_org: str = "myorg"
    # default_factory gives each instance its own list
    labels: list[str] = field(
        default_factory=lambda: ["self-hosted", "macOS", "ARM64"]
    )
 
 
class RunnerManager:
    def __init__(self, config: RunnerConfig, github_token: str):
        self.config = config
        self.gh = Github(github_token)
        self.active_vms: dict[str, asyncio.Task] = {}
 
    async def provision_runner(self, job_id: str) -> None:
        """Clone base image, configure runner, start VM."""
        vm_name = f"runner-{job_id}"
        try:
            # Clone VM from base image (instant with APFS clones)
            logger.info(f"Cloning VM {vm_name} from {self.config.base_image}")
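            # Run the blocking CLI call in a worker thread so the event
            # loop keeps serving the other runners
            await asyncio.to_thread(
                subprocess.run,
                ["tart", "clone", self.config.base_image, vm_name],
                check=True, capture_output=True
            )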
 
            # Start the VM in the background
            vm_proc = subprocess.Popen(
                ["tart", "run", vm_name, "--no-graphics"],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
 
            # Wait for SSH to become available
            await self._wait_for_ssh(vm_name, timeout=60)
 
            # Register as a JIT (just-in-time) runner. run.sh expects the
            # encoded JIT config string itself, not a path to a decoded file.
            jit_config = self._create_jit_runner(vm_name)
            await self._ssh_exec(vm_name,
                f"cd ~/actions-runner && ./run.sh --jitconfig '{jit_config}'"
            )
 
            # A JIT runner exits after one job, so the job is finished once
            # _ssh_exec returns (assuming it waits for the remote command).
            # The VM itself keeps running until it is stopped below.
 
        finally:
            # Always destroy the VM
            logger.info(f"Destroying VM {vm_name}")
            await asyncio.to_thread(
                subprocess.run, ["tart", "stop", vm_name], capture_output=True
            )
            await asyncio.to_thread(
                subprocess.run, ["tart", "delete", vm_name], capture_output=True
            )
 
    async def poll_and_dispatch(self) -> None:
        """Main loop: check for queued jobs and provision runners."""
        while True:
            if len(self.active_vms) >= self.config.max_concurrent:
                await asyncio.sleep(5)
                continue
 
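            queued = self._get_queued_jobs()
            for job in queued:
                # Re-check capacity per job so one poll cannot overshoot the limit
                if len(self.active_vms) >= self.config.max_concurrent:
                    break
                # active_vms is keyed by str; the raw int id would never match
                job_id = str(job.id)
                if job_id not in self.active_vms:
                    task = asyncio.create_task(self.provision_runner(job_id))
                    self.active_vms[job_id] = task
                    task.add_done_callback(
                        lambda t, jid=job_id: self.active_vms.pop(jid, None)
                    )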
 
            await asyncio.sleep(10)
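
The listing above calls _get_queued_jobs and _create_jit_runner without showing them. The sketch below is one possible shape for the GitHub-facing parts, not the controller's actual implementation. It assumes jobs arrive as the JSON dictionaries returned by the REST API's workflow job listings, that labels compare case-insensitively, and that runner group 1 is the organization's default group; the function names are hypothetical. The endpoint and the encoded_jit_config response field come from GitHub's JIT runner API.

```python
import json
import urllib.request

API = "https://api.github.com"


def matching_queued_jobs(jobs: list[dict], runner_labels: list[str]) -> list[dict]:
    """Keep queued jobs whose requested labels this runner can satisfy."""
    have = {label.casefold() for label in runner_labels}
    return [
        job for job in jobs
        if job.get("status") == "queued"
        and {label.casefold() for label in job.get("labels", [])} <= have
    ]


def jit_config_request(org: str, token: str, runner_name: str,
                       labels: list[str], runner_group_id: int = 1
                       ) -> urllib.request.Request:
    """Build the POST request that asks GitHub for a one-job runner config."""
    body = {
        "name": runner_name,
        "runner_group_id": runner_group_id,  # assumption: default group
        "labels": labels,
        "work_folder": "_work",
    }
    return urllib.request.Request(
        f"{API}/orgs/{org}/actions/runners/generate-jitconfig",
        data=json.dumps(body).encode(),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def encoded_jit_config(response_body: bytes) -> str:
    """Extract the string that is handed to run.sh --jitconfig."""
    return json.loads(response_body)["encoded_jit_config"]
```

Sending the request with urllib.request.urlopen and passing the response body to encoded_jit_config yields the value for run.sh. Keeping the request builder separate from the network call makes the payload easy to unit test.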

Multi-Tier Caching Strategy

VM startup speed is critical. Without caching, cloning and booting a macOS VM takes 2-3 minutes. With the right caching layers, it drops to under 30 seconds.
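
To check these figures on your own hosts, it helps to time the interval from clone to SSH-ready. The following is a minimal sketch of a polling helper that returns the elapsed time; wait_until and ssh_port_open are hypothetical names, and the VM's address is assumed to come from the tart ip command.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_until(probe: Callable[[], Awaitable[bool]],
                     timeout: float, interval: float = 1.0) -> float:
    """Poll probe() until it returns True; return the elapsed seconds."""
    start = time.monotonic()
    while True:
        if await probe():
            return time.monotonic() - start
        elapsed = time.monotonic() - start
        if elapsed >= timeout:
            raise TimeoutError(f"not ready after {timeout:.0f}s")
        await asyncio.sleep(min(interval, timeout - elapsed))


async def ssh_port_open(host: str, port: int = 22,
                        connect_timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        _, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), connect_timeout
        )
    except (OSError, asyncio.TimeoutError):
        return False
    writer.close()
    await writer.wait_closed()
    return True
```

A call such as await wait_until(lambda: ssh_port_open(ip), timeout=60) returns the boot time in seconds, which can be logged or fed into a metrics histogram.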

Tier 1: APFS Clones

Tart uses APFS copy-on-write clones by default. Cloning a 60GB base image is nearly instantaneous and consumes almost no additional disk space until the VM starts writing to its disk. This is the biggest performance win.

Tier 2: Local HTTP Cache Proxy

Run an Nginx caching proxy on each host to cache Homebrew bottles, CocoaPods specs, and npm packages. This avoids redundant downloads across VMs:

# nginx-cache-proxy.conf
proxy_cache_path /var/cache/ci-proxy levels=1:2
    keys_zone=ci_cache:100m max_size=50g inactive=30d;
 
server {
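    listen 8080;
    # proxy_pass with variables resolves hostnames at request time, which
    # needs a resolver; 1.1.1.1 is only an example address
    resolver 1.1.1.1;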
 
    location / {
        proxy_pass $scheme://$host$request_uri;
        proxy_cache ci_cache;
        proxy_cache_valid 200 30d;
        proxy_cache_use_stale error timeout updating;
        add_header X-Cache-Status $upstream_cache_status;
    }
}

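Inside the VM, point plain-HTTP clients at the proxy. Stock Nginx does not implement the CONNECT method, so it cannot tunnel HTTPS; do not set https_proxy to this address or HTTPS downloads will fail. Sources that are only served over HTTPS need either a dedicated reverse-proxy location per upstream or a proxy that supports CONNECT.

# host.internal is assumed to resolve to the host's address on the VM network
export http_proxy=http://host.internal:8080
# Only useful if the proxy has a location that forwards bottle requests upstream
export HOMEBREW_ARTIFACT_DOMAIN=http://host.internal:8080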
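
The X-Cache-Status header added by the Nginx configuration makes it possible to check whether the proxy is actually saving downloads. As a small sketch, assuming you collect that header's value from a sample of responses by whatever means suits your setup, the hit ratio can be computed like this. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable

# Values of $upstream_cache_status that mean the body came from the cache
SERVED_FROM_CACHE = {"HIT", "STALE", "UPDATING", "REVALIDATED"}


def cache_hit_ratio(statuses: Iterable[str]) -> float:
    """Fraction of responses served from the cache; 0.0 if there are none."""
    counts = Counter(s.strip().upper() for s in statuses if s and s.strip())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    served = sum(counts[status] for status in SERVED_FROM_CACHE)
    return served / total
```

A ratio that stays near zero after several jobs suggests that clients are bypassing the proxy or that upstream responses are not cacheable.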

Tier 3: Pre-warmed DerivedData

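For iOS/macOS projects, rebuilding dependencies from a cold DerivedData directory is often the biggest time sink. Pre-build the project's dependencies in the Packer image so the first build starts warm. The cache is only reused if CI jobs pass the same -derivedDataPath as the image build: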

# In the Packer provisioner
provisioner "shell" {
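  # The image has no signing identity, so signing is disabled; otherwise the
  # device build fails and "|| true" hides the failure, leaving the cache cold
  inline = [
    "cd /tmp && git clone --depth 1 https://github.com/myorg/ios-app.git",
    "cd ios-app && xcodebuild build -scheme App -destination 'generic/platform=iOS' -derivedDataPath ~/DerivedData CODE_SIGNING_ALLOWED=NO || true",
    "rm -rf /tmp/ios-app"
  ]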
}

Monitoring and Observability

Track these metrics to understand your fleet's health:

VM boot time -- From clone to SSH-ready. Target: under 25 seconds.
Queue wait time -- How long jobs wait before a runner picks them up. Alert if over 60 seconds.
Concurrent VM count -- Per host, against the limit of two running macOS guests. If every host stays at the limit during peak hours, add hosts to absorb bursts.
Disk usage -- APFS clones share blocks with the base image. Monitor actual disk consumption, not apparent size.
Job success rate -- Track per-label success rates. A drop may indicate a stale base image.

# Expose metrics via Prometheus
from prometheus_client import Histogram, Gauge, Counter
 
vm_boot_time = Histogram(
    "runner_vm_boot_seconds",
    "Time from clone to SSH-ready",
    buckets=[5, 10, 15, 20, 25, 30, 45, 60]
)
active_vms = Gauge("runner_active_vms", "Currently running VMs", ["host"])
jobs_total = Counter("runner_jobs_total", "Total jobs processed", ["status"])
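
For the disk usage metric, per-file sizes are misleading because each APFS clone reports the blocks it shares with the base image as its own. A simple alternative, sketched below, is to watch free space on the volume that holds the VM storage. The 85% threshold and the function names are illustrative assumptions.

```python
import os
import shutil


def volume_used_fraction(path: str) -> float:
    """Fraction of the volume containing path that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def should_pause_provisioning(used_fraction: float, threshold: float = 0.85) -> bool:
    """True when the volume is full enough that new clones should wait."""
    return used_fraction >= threshold


if __name__ == "__main__":
    # Tart keeps its VMs under ~/.tart; fall back to the home directory
    tart_home = os.path.expanduser("~/.tart")
    target = tart_home if os.path.exists(tart_home) else os.path.expanduser("~")
    print(f"{volume_used_fraction(target):.1%} used")
```

The resulting fraction can be exported as another gauge, and the controller can skip dispatching while should_pause_provisioning returns True.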

Production Deployment

Run the controller as a launchd service on each Mac Mini host:

<?xml version="1.0" encoding="UTF-8"?>
<!-- /Library/LaunchDaemons/com.myorg.ci-controller.plist -->
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.myorg.ci-controller</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/ci-controller/venv/bin/python</string>
        <string>-m</string>
        <string>controller</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/var/log/ci-controller.log</string>
    <key>StandardErrorPath</key>
    <string>/var/log/ci-controller.log</string>
</dict>
</plist>
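
launchd stops a service by sending SIGTERM, so the controller should treat that signal as a request to shut down cleanly; otherwise running VMs can be left behind. The following is a minimal sketch of an entry point pattern for this, with hypothetical function names. It assumes the main coroutine cleans up in finally blocks, as provision_runner does.

```python
import asyncio
import signal
from typing import Awaitable, Callable


async def run_until_stopped(main: Callable[[], Awaitable[None]],
                            stop: asyncio.Event) -> None:
    """Run main() until it finishes or stop is set, then cancel and wait."""
    main_task = asyncio.ensure_future(main())
    stop_task = asyncio.ensure_future(stop.wait())
    await asyncio.wait({main_task, stop_task},
                       return_when=asyncio.FIRST_COMPLETED)
    for task in (main_task, stop_task):
        task.cancel()  # no effect on a task that has already finished
    results = await asyncio.gather(main_task, stop_task, return_exceptions=True)
    # Re-raise a crash in main(); cancellation is not an Exception subclass
    if isinstance(results[0], Exception):
        raise results[0]


async def serve(main: Callable[[], Awaitable[None]]) -> None:
    """Wire SIGTERM and SIGINT to a stop event, then run main()."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)
    await run_until_stopped(main, stop)
```

Calling asyncio.run(serve(manager.poll_and_dispatch)) from the package's __main__ module would fit the python -m controller invocation in the plist. On exit, asyncio.run also cancels any tasks that are still pending, which gives per-job tasks a chance to run their cleanup.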

Key Takeaways

Ephemeral VMs eliminate state drift and improve CI security -- every job starts on a pristine machine.
Tart's APFS clones make VM creation instant and storage-efficient on Apple Silicon.
Packer-built base images with pre-installed tooling keep boot times under 30 seconds.
A custom controller gives full control over scheduling, concurrency, and cleanup.
Multi-tier caching (APFS clones, HTTP proxy, pre-warmed build caches) is essential for fast feedback loops.