Ephemeral CI Runners with Tart VMs
macOS CI runners are expensive. GitHub-hosted large runners cost $0.48/minute. Self-hosted persistent runners accumulate state -- leftover caches, stale Xcode simulators, rogue processes -- that causes flaky builds and security concerns. The solution is ephemeral runners: a fresh macOS VM for every job, destroyed immediately after. This article shows how to build that system using Tart (an open-source macOS VM manager built on Apple's Virtualization.framework), Packer for image pipelines, and a custom Python controller to orchestrate the lifecycle.
Architecture Overview
The system has four components:
- Base image pipeline -- Packer builds a macOS VM image with Xcode, Homebrew dependencies, and the GitHub Actions runner agent pre-installed. Runs weekly on a cron schedule.
- Controller service -- A Python daemon that polls the GitHub API for queued workflow runs, clones a Tart VM from the base image, starts the runner, and destroys the VM once the job completes.
- Cache layer -- A multi-tier caching strategy (APFS clones for VM images, a local HTTP proxy for package downloads) that keeps VM startup time under 30 seconds.
- Host fleet -- One or more Apple Silicon Mac Minis running the controller. Each host runs at most two concurrent macOS VMs: Apple's Virtualization.framework enforces that cap regardless of how much RAM is available.
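As a rough sizing aid, per-host capacity can be estimated from RAM and then clamped to a platform cap. The sketch below is illustrative only: the function name, the 8 GB host reserve, and the default cap of two guests (the limit Virtualization.framework applies to macOS VMs at the time of writing) are assumptions, not part of the controller described later.

```python
def vm_slots(host_ram_gb: int, vm_ram_gb: int = 8,
             host_reserve_gb: int = 8, platform_cap: int = 2) -> int:
    """Estimate how many VMs fit on one host.

    host_reserve_gb is RAM kept back for macOS and the controller itself
    (an assumed figure); platform_cap is the maximum number of concurrent
    guests the platform allows.
    """
    if vm_ram_gb <= 0:
        raise ValueError("vm_ram_gb must be positive")
    usable_gb = max(host_ram_gb - host_reserve_gb, 0)
    return min(usable_gb // vm_ram_gb, platform_cap)


if __name__ == "__main__":
    for ram in (8, 16, 32, 64):
        print(f"{ram} GB host: {vm_slots(ram)} VM slot(s)")
```

With 8 GB guests, as in the Packer template in the next section, a 16 GB host has room for a single VM, and anything larger is bounded by the platform cap rather than by memory.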
Building Base Images with Packer
Cirrus Labs, the maintainers of Tart, publish a Packer plugin for building macOS VM images. The template below clones a published Sonoma base image, installs Homebrew packages, Xcode, and the runner agent, and leaves the result as a local Tart VM.
# packer/macos-runner.pkr.hcl
packer {
  required_plugins {
    tart = {
      source  = "github.com/cirruslabs/tart"
      version = ">= 1.12.0"
    }
  }
}

source "tart-cli" "macos" {
  vm_base_name = "ghcr.io/cirruslabs/macos-sonoma-base:latest"
  vm_name      = "macos-ci-runner"
  cpu_count    = 4
  memory_gb    = 8
  disk_size_gb = 80
  ssh_username = "admin"
  ssh_password = "admin"
  ssh_timeout  = "120s"
}
build {
  sources = ["source.tart-cli.macos"]

  # Install Homebrew and common dependencies
  provisioner "shell" {
    inline = [
      "NONINTERACTIVE=1 /bin/bash -c \"$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)\"",
      "eval \"$(/opt/homebrew/bin/brew shellenv)\"",
      "brew install git-lfs jq gh swiftlint fastlane",
      "git lfs install"
    ]
  }

  # Install Xcode from a pre-downloaded .xip
  provisioner "file" {
    source      = "/Volumes/Storage/Xcode_15.4.xip"
    destination = "/tmp/Xcode.xip"
  }

  # xip expands into the current directory (it has no -C option), and
  # xcodebuild only works once xcode-select points at the full Xcode install
  provisioner "shell" {
    inline = [
      "cd /Applications && xip --expand /tmp/Xcode.xip",
      "sudo xcode-select -s /Applications/Xcode.app/Contents/Developer",
      "sudo xcodebuild -license accept",
      "sudo xcodebuild -runFirstLaunch",
      "rm /tmp/Xcode.xip"
    ]
  }

  # Pre-install GitHub Actions runner
  provisioner "shell" {
    inline = [
      "mkdir -p ~/actions-runner && cd ~/actions-runner",
      "curl -fsSL https://github.com/actions/runner/releases/download/v2.319.1/actions-runner-osx-arm64-2.319.1.tar.gz | tar xz"
    ]
  }

  # Clean up caches to reduce image size
  provisioner "shell" {
    inline = [
      "brew cleanup --prune=all",
      "rm -rf ~/Library/Caches/*",
      "sudo purge"
    ]
  }
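}
Build the image with:
# Install the plugins declared in required_plugins (needed once per machine)
packer init packer/macos-runner.pkr.hcl
packer build packer/macos-runner.pkr.hcl
# Produces: ~/.tart/vms/macos-ci-runner/
The Controller Service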
The controller is a Python asyncio service that manages the full lifecycle of ephemeral runners. It polls for pending jobs, provisions VMs, and cleans up after completion.
# controller/runner_manager.py
import asyncio
import logging
from dataclasses import dataclass, field

from github import Github

logger = logging.getLogger(__name__)


@dataclass
class RunnerConfig:
    base_image: str = "macos-ci-runner"
    max_concurrent: int = 2  # macOS guests are capped at two per host
    github_org: str = "myorg"
    labels: list[str] = field(
        default_factory=lambda: ["self-hosted", "macOS", "ARM64"]
    )


async def run_cmd(*cmd: str, check: bool = True) -> None:
    """Run a command without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    if await proc.wait() != 0 and check:
        raise RuntimeError(f"command failed: {' '.join(cmd)}")


class RunnerManager:
    # The helpers _wait_for_ssh, _ssh_exec, _create_jit_runner and
    # _get_queued_jobs are not shown in this listing.

    def __init__(self, config: RunnerConfig, github_token: str):
        self.config = config
        self.gh = Github(github_token)
        self.active_vms: dict[str, asyncio.Task] = {}

    async def provision_runner(self, job_id: str) -> None:
        """Clone base image, start the VM, run one job, destroy the VM."""
        vm_name = f"runner-{job_id}"
        try:
            # Clone VM from base image (near-instant with APFS clones)
            logger.info("Cloning VM %s from %s", vm_name, self.config.base_image)
            await run_cmd("tart", "clone", self.config.base_image, vm_name)

            # Start the VM in the background
            await asyncio.create_subprocess_exec(
                "tart", "run", vm_name, "--no-graphics",
                stdout=asyncio.subprocess.DEVNULL,
                stderr=asyncio.subprocess.DEVNULL,
            )

            # Wait for SSH to become available
            await self._wait_for_ssh(vm_name, timeout=60)

            # Register as a JIT (just-in-time) runner. run.sh takes the
            # base64-encoded config string as-is and exits after one job,
            # so this call returns once the job has completed.
            jit_config = self._create_jit_runner(vm_name)
            await self._ssh_exec(
                vm_name,
                f"cd ~/actions-runner && ./run.sh --jitconfig '{jit_config}'",
            )
        finally:
            # Always destroy the VM
            logger.info("Destroying VM %s", vm_name)
            await run_cmd("tart", "stop", vm_name, check=False)
            await run_cmd("tart", "delete", vm_name, check=False)

    async def poll_and_dispatch(self) -> None:
        """Main loop: check for queued jobs and provision runners."""
        while True:
            if len(self.active_vms) >= self.config.max_concurrent:
                await asyncio.sleep(5)
                continue
            for job in self._get_queued_jobs():
                if len(self.active_vms) >= self.config.max_concurrent:
                    break  # host is full; remaining jobs wait for the next poll
                job_id = str(job.id)  # dict keys are strings, so compare as strings
                if job_id in self.active_vms:
                    continue
                task = asyncio.create_task(self.provision_runner(job_id))
                self.active_vms[job_id] = task
                task.add_done_callback(
                    lambda t, jid=job_id: self.active_vms.pop(jid, None)
                )
            await asyncio.sleep(10)
Multi-Tier Caching Strategy
VM startup speed is critical. Without caching, cloning and booting a macOS VM takes 2-3 minutes. With the right caching layers, it drops to under 30 seconds.
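To check startup numbers on your own hosts, it helps to time how long a freshly started VM takes to accept SSH connections. The following is a minimal sketch of one way the controller's `_wait_for_ssh` step could be approached, not the article's implementation: `wait_for_port` is a hypothetical helper, and it assumes the guest's address has already been looked up (for example with `tart ip <vm-name>`).

```python
import asyncio
import time


async def wait_for_port(host: str, port: int = 22, timeout: float = 60.0,
                        interval: float = 1.0) -> float:
    """Poll until a TCP connection to host:port succeeds.

    Returns the number of seconds waited and raises TimeoutError once
    `timeout` seconds have passed without a successful connection.
    """
    start = time.monotonic()
    while True:
        remaining = timeout - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError(f"{host}:{port} not reachable after {timeout}s")
        try:
            # Bound each attempt so a silently dropped SYN cannot stall the loop
            _, writer = await asyncio.wait_for(
                asyncio.open_connection(host, port),
                timeout=min(remaining, 5.0),
            )
        except (OSError, asyncio.TimeoutError):
            # Connection refused or timed out: the guest is still booting
            await asyncio.sleep(interval)
            continue
        writer.close()
        return time.monotonic() - start
```

A successful connect only shows that sshd is listening, not that login works, so the returned duration is a lower bound on time-to-ready. It is still a convenient value to compare against the 30-second target.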
Tier 1: APFS Clones
Tart uses APFS copy-on-write clones by default. Cloning the 80 GB base image completes almost immediately and consumes almost no additional disk space until the VM writes to its disk. This is the biggest performance win.
Tier 2: Local HTTP Cache Proxy
Run an Nginx caching proxy on each host to cache Homebrew bottles, CocoaPods specs, and npm packages. This avoids redundant downloads across VMs:
# nginx-cache-proxy.conf
proxy_cache_path /var/cache/ci-proxy levels=1:2
                 keys_zone=ci_cache:100m max_size=50g inactive=30d;

server {
    listen 8080;
    # Needed because proxy_pass uses variables; 1.1.1.1 is only an example,
    # point this at the DNS server your hosts normally use
    resolver 1.1.1.1;

    location / {
        proxy_pass $scheme://$host$request_uri;
        proxy_cache ci_cache;
        proxy_cache_valid 200 30d;
        proxy_cache_use_stale error timeout updating;
        add_header X-Cache-Status $upstream_cache_status;
    }
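}
Inside the VM, point plain-HTTP clients at the proxy. Stock Nginx does not implement the CONNECT method, so HTTPS requests cannot be tunnelled through it; leave https_proxy unset:
# host.internal is a placeholder for the host's address on the VM network
export http_proxy=http://host.internal:8080
# Only useful if the proxy also serves a mirror of the Homebrew bottle registry
export HOMEBREW_ARTIFACT_DOMAIN=http://host.internal:8080
Tier 3: Pre-warmed DerivedData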
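For iOS/macOS projects, compiling dependencies from an empty DerivedData directory often dominates build time. Pre-build the project's dependencies in the Packer image so the first build starts warm. CI jobs benefit only if they pass the same -derivedDataPath:
# In the Packer provisioner
# "|| true" keeps the image build going if the warm-up build fails;
# CODE_SIGNING_ALLOWED=NO avoids needing signing identities in the image
provisioner "shell" {
  inline = [
    "cd /tmp && git clone --depth 1 https://github.com/myorg/ios-app.git",
    "cd ios-app && xcodebuild build -scheme App -destination 'generic/platform=iOS' -derivedDataPath ~/DerivedData CODE_SIGNING_ALLOWED=NO || true",
    "rm -rf /tmp/ios-app"
  ]
}
Monitoring and Observability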
Track these metrics to understand your fleet's health:
- VM boot time -- From clone to SSH-ready. Target: under 25 seconds.
- Queue wait time -- How long jobs wait before a runner picks them up. Alert if over 60 seconds.
- Concurrent VM count -- Per host. Stay under 80% of max to handle burst traffic.
- Disk usage -- APFS clones share blocks with the base image. Monitor actual disk consumption, not apparent size.
- Job success rate -- Track per-label success rates. A drop may indicate a stale base image.
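For the disk usage metric, tools that sum per-file sizes typically count shared blocks once per clone, so totals for the VM directory can overstate real consumption. One hedged alternative is to measure at the volume level. In this sketch the function names and the 0.85 threshold are assumptions, and the path would be whichever volume holds Tart's VM storage (`~/.tart` by default).

```python
import shutil


def volume_used_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the volume containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(path: str, threshold: float = 0.85) -> bool:
    """True when the volume holding path is at or above the threshold."""
    return volume_used_fraction(path) >= threshold


if __name__ == "__main__":
    import os

    tart_home = os.path.expanduser("~")  # volume that holds ~/.tart
    print(f"used: {volume_used_fraction(tart_home):.1%}")
    print(f"alert: {disk_alert(tart_home)}")
```

The resulting fraction can be fed into a gauge alongside the other metrics, which keeps the alert tied to what the volume actually has left rather than to the apparent size of each clone.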
# Expose metrics via Prometheus
from prometheus_client import Histogram, Gauge, Counter

vm_boot_time = Histogram(
    "runner_vm_boot_seconds",
    "Time from clone to SSH-ready",
    buckets=[5, 10, 15, 20, 25, 30, 45, 60]
)
active_vms = Gauge("runner_active_vms", "Currently running VMs", ["host"])
jobs_total = Counter("runner_jobs_total", "Total jobs processed", ["status"])
Production Deployment
Run the controller as a launchd service on each Mac Mini host. A LaunchDaemon runs as root unless the plist sets a UserName key, and Tart keeps VMs under the invoking user's ~/.tart, so build the base image as the same user the controller runs as:
<?xml version="1.0" encoding="UTF-8"?>
<!-- /Library/LaunchDaemons/com.myorg.ci-controller.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.myorg.ci-controller</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/ci-controller/venv/bin/python</string>
        <string>-m</string>
        <string>controller</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/var/log/ci-controller.log</string>
    <key>StandardErrorPath</key>
    <string>/var/log/ci-controller.log</string>
</dict>
</plist>
Key Takeaways
- Ephemeral VMs eliminate state drift and improve CI security -- every job starts on a pristine machine.
- Tart's APFS clones make VM creation instant and storage-efficient on Apple Silicon.
- Packer-built base images with pre-installed tooling keep boot times under 30 seconds.
- A custom controller gives full control over scheduling, concurrency, and cleanup.
- Multi-tier caching (APFS clones, HTTP proxy, pre-warmed build caches) is essential for fast feedback loops.