Fixing Fly.io auto-suspend after unexpected bills

Listen to this articleAI narration

0:00 / 0:00

A background worker runs on Fly.io. It handles AI embeddings for a cloud storage platform—text extraction from PDFs via Unstructured.io, chunking documents, generating vectors with OpenAI's embedding API. Event-driven architecture: Inngest Cloud dispatches jobs, the worker wakes up, processes, suspends. Pay-per-use.

The January bill showed $18 for 16 days. For a machine that should run maybe 10 minutes per day, that's wrong.

Debugging the cost

Cost Explorer showed $17.91 for the worker. Checking the Fly.io dashboard revealed two machines:

Machine 1: suspended
Machine 2: started (running)

┌────────────────────────────────────────┐
│  Expected behavior:                    │
│  • Upload triggers Inngest event       │
│  • Worker wakes, processes, suspends   │
│  • Cost: ~$1-3/month                   │
└────────────────────────────────────────┘

┌────────────────────────────────────────┐
│  Actual behavior:                      │
│  • Machine 2 running continuously      │
│  • Machine 1 suspended (unused)        │
│  • Cost: ~$18/month (39% uptime)       │
└────────────────────────────────────────┘

Reading fly.toml:

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "suspend"
  auto_start_machines = true
  min_machines_running = 1  # ← The problem

min_machines_running = 1 means Fly always keeps one machine running. It defeats the entire purpose of auto-suspend. The worker was configured to never fully sleep.

With a performance-4x VM (8GB RAM) at ~$0.12/hour:

24/7 for 16 days = 384 hours
384 hours × $0.12 = $46.08 potential cost
Actual charge: $18 (the machine scaled down during low activity, but still ran constantly)

The fix

Changed one line:

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = "suspend"
  auto_start_machines = true
- min_machines_running = 1
+ min_machines_running = 0

Deployed the update, destroyed the extra machine:

flyctl deploy --config apps/worker/fly.toml
flyctl machine destroy e827771c03ee28 --app disposal-worker --force

Verified the result:

ID              STATE   CHECKS
68306d3a560d28  stopped 1 warning

One machine. Stopped. The health check warning is expected—it can't pass when the machine is off. That's fine. Inngest will wake it when needed.

Why it happened

When we migrated from Railway, we created the fly.toml based on Fly.io's defaults. The docs recommend min_machines_running = 1 for HTTP services to avoid cold start latency. That makes sense for a user-facing API that needs instant response.

But this isn't user-facing. It's a background worker triggered by Inngest Cloud. Cold start latency doesn't matter—Inngest handles retries, the job runs async, and a 5-second wake time is irrelevant when processing takes 30+ seconds anyway.

We copied the config without questioning whether the defaults fit the use case.

What changed

The worker now:

Suspends after 60 seconds of inactivity
Wakes when Inngest dispatches an event
Charges only for active processing time

Expected monthly cost: $1-3 instead of $18.

The machine is performance-4x with 8GB RAM. That's oversized for this workload—PDF processing is I/O-bound (waiting on Unstructured.io for text extraction, OpenAI for embeddings). We could drop to shared-cpu-2x (2GB) and cut per-minute costs by 75%. But with auto-suspend working, the total cost is low enough that optimization can wait until we see actual memory usage patterns.

Lessons

Read the defaults critically. Copy-paste from docs works until it doesn't. min_machines_running = 1 is a reasonable default for most HTTP services. Not for background workers.

Billing reveals misconfigurations. The worker "worked" in that it processed jobs successfully. The only signal something was wrong was the cost. If we hadn't checked the bill, we'd still be burning $18/month.

Infrastructure assumptions don't always transfer. Railway's auto-scaling meant machines stopped when idle by default. Fly.io's defaults assume you want availability over cost. Both are valid—just different models.

Two machines = double cost. Fly created a second machine during a deploy and never cleaned it up. One was suspended, one kept running. Always verify machine counts after deployments.

The platform works the same. The bill dropped 85%. One line in a config file.