
The cron job ran at 11 PM. By the time I poured coffee the next morning, it had called a paid LLM API a few thousand times.
Nobody touched it. Nobody approved it. It just sat there in the dark, retrying, retrying, retrying — like a vending machine eating a stuck dollar bill, except the dollar bill was billable tokens.
I didn’t find it because of a clever alert. I found it because the provider dashboard had a graph that looked like a cliff face.
The scene
I run a small fleet of unattended automations for a client. One of them enriches records overnight by sending them through a metered LLM endpoint. Cheap per call. Boring. The kind of job you set up once and forget — which is exactly the problem.
That night, the upstream service it depended on got flaky. A handful of calls returned errors. The loop did what badly-written loops always do.
It tried again. Immediately. Forever.
┌────────────────────────────────────────────────────┐
│  THE LOOP (as written)                             │
│                                                    │
│   ┌──────────┐   error   ┌──────────────┐          │
│   │ call API │ ────────► │  retry now   │          │
│   └────┬─────┘           └──────┬───────┘          │
│        ▲                        │                  │
│        └────────────────────────┘                  │
│   no backoff · no cap · no kill switch             │
│                                                    │
│   result: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ thousands of hits │
└────────────────────────────────────────────────────┘
No sleep. No max-retry counter. No circuit breaker. Every failure became an instant request, and every instant request became another line on the invoice.
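Of the three missing pieces, the circuit breaker is the least familiar, so here is roughly what one looks like. This is a minimal sketch, not code from the job: the `CircuitBreaker` class, its threshold, and its cooldown are hypothetical names and values picked for illustration.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency after too many consecutive errors.

    Hypothetical sketch: the threshold and cooldown defaults are illustrative.
    """

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow_call(self):
        """Return True if a call may go out right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has passed, let a probe call through (half-open).
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        """A call worked: reset the failure count and close the circuit."""
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        """A call failed: open the circuit once the threshold is reached."""
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

The loop would check `allow_call()` before each paid request and report the outcome afterwards. If the probe call after the cooldown fails too, the circuit simply opens again for another cooldown.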
The investigation
First thing: confirm it’s us and not a stolen key. I pulled the provider’s usage view and bucketed by hour.
# usage exploded between 23:00 and 07:00 — exactly the cron window
# logs from the job host told the same story
journalctl -u record-enrich.service --since "yesterday 23:00" \
  | grep -c "POST /v1/"
# -> 4,300-something. Overnight. For a job that should make ~80 calls.
Same source IP. Same user-agent. Same service. Not a breach — a self-inflicted wound. Somehow that’s worse, because it means the call was coming from inside the house.
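The same hourly bucketing can be run against the job host's own logs. A rough sketch, assuming every log line starts with an ISO-8601 timestamp (the shape `journalctl -o short-iso` emits) and that request lines contain `POST /v1/`. The function name `requests_per_hour` and the sample lines, including their dates and the `/v1/complete` path, are made up for illustration.

```python
from collections import Counter


def requests_per_hour(lines, marker="POST /v1/"):
    """Count lines containing `marker`, grouped by the hour in their timestamp.

    Assumes each line begins with an ISO-8601 timestamp such as
    2024-05-01T23:04:11+0000, so the first 13 characters identify the hour.
    """
    buckets = Counter()
    for line in lines:
        if marker in line:
            buckets[line[:13]] += 1
    return dict(sorted(buckets.items()))


# Made-up sample lines, only to show the shape of the result.
sample = [
    "2024-05-01T23:04:11+0000 host job[1]: POST /v1/complete 503",
    "2024-05-01T23:04:11+0000 host job[1]: POST /v1/complete 503",
    "2024-05-02T00:00:02+0000 host job[1]: POST /v1/complete 503",
    "2024-05-02T00:00:03+0000 host job[1]: retrying",
]
print(requests_per_hour(sample))
# -> {'2024-05-01T23': 2, '2024-05-02T00': 1}
```

A job that should make about 80 calls a night has no business putting hundreds in a single bucket, so a spike like that is visible at a glance.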
The “aha”
The smoking gun was four lines of code. Paraphrased:
while not done:
    try:
        resp = client.complete(payload)
        done = True
    except Exception:
        continue  # <-- the whole disaster, right here
continue. No delay. No ceiling. The instant the upstream hiccuped, this turned into a tight spin loop firing paid requests as fast as the network would carry them. A retry without backoff isn’t resilience — it’s a denial-of-wallet attack you launch against yourself.
The fix
I treated the key as compromised even though it wasn’t, because the behavior was indistinguishable from a leak. Containment first, blame later.
# 1. Rotate the key immediately — old one dies on the spot
provider keys rotate --name record-enrich --revoke-old
# 2. Disable the API entirely at the provider while I fix the code.
# Kill the bleeding before patching the artery.
provider api disable --service record-enrich
# 3. Re-enable WITH a hard cap + budget alert. The cap is the real fix.
provider budget set --service record-enrich \
  --hard-limit-usd 25 --period monthly
provider alerts set --service record-enrich \
  --notify-at 50% --notify-at 90% --channel telegram
Then the code got the guardrails it should have shipped with:
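import time

MAX_RETRIES = 5

# TransientError stands in for whatever retryable exception your client raises.
for attempt in range(MAX_RETRIES):
    try:
        resp = client.complete(payload)
        break
    except TransientError:
        time.sleep(min(2 ** attempt, 30))  # exponential backoff, capped at 30 s
else:
    # the loop ran out of attempts without ever reaching `break`
    raise RuntimeError("gave up after retries; failing loud, not looping")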
Backoff. A retry ceiling. And a loud failure instead of a silent infinite spin. The provider cap is the seatbelt; this is actually steering the car.
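If more than one job needs these guardrails, they fit in a small helper. The sketch below is a variation, not the code that shipped: it adds random jitter (the "full jitter" strategy, which the loop above does not use) and skips the pointless sleep after the final attempt. `call_with_retries`, the placeholder `TransientError` class, and the injectable `sleep` parameter are all hypothetical names.

```python
import random
import time


class TransientError(Exception):
    """Placeholder for whatever retryable error the real client raises."""


def call_with_retries(fn, max_retries=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Call fn() up to max_retries times, backing off between failures."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                break  # out of attempts, no point sleeping again
            # Full jitter: wait a random time up to the capped exponential delay.
            sleep(random.uniform(0, min(base * 2 ** attempt, cap)))
    raise RuntimeError(f"gave up after {max_retries} attempts")
```

Usage would be something like `call_with_retries(lambda: client.complete(payload))`. Only `TransientError` is retried here; anything else propagates straight away, which is the opposite of catching a blanket `Exception`.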
Why it happened
It happened because “it’s just a small overnight job” is the exact mindset that ships uncapped loops at paid endpoints. The cost-per-call was trivial, so nobody did the multiplication. Trivial times infinity is still a number you have to pay.
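The multiplication nobody did takes a few lines. This sketch uses the two call counts from this incident (about 80 expected, about 4,300 observed) and a deliberately made-up flat price per call. `PRICE_PER_CALL_USD` is a placeholder, not the provider's real rate, and real token-based billing will be lumpier than this.

```python
EXPECTED_CALLS = 80        # what the job should make per night
OBSERVED_CALLS = 4300      # roughly what it made during the incident
PRICE_PER_CALL_USD = 0.01  # hypothetical price, substitute your own rate


def nightly_cost(calls, price_per_call=PRICE_PER_CALL_USD):
    """Cost of one night's run at a flat per-call price."""
    return calls * price_per_call


blowup = OBSERVED_CALLS / EXPECTED_CALLS
print(f"planned:  ${nightly_cost(EXPECTED_CALLS):.2f}")
print(f"incident: ${nightly_cost(OBSERVED_CALLS):.2f} ({blowup:.0f}x the plan)")
```

Whatever the real price is, the ratio holds: one bad night cost about 54 nights' worth of normal runs, and that was with the network acting as the only brake.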
The code had no concept of “too many.” The provider had no concept of “enough.” With neither a ceiling in the app nor a ceiling at the wallet, the only limiter left was how fast the network could move — and that’s a throttle on the damage, not a budget.
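One way to give the code a concept of "too many" is a per-run call budget that raises once it is spent. A minimal sketch: `CallBudget`, `BudgetExceeded`, and the limit of 200 are hypothetical, with the limit chosen only because the job normally makes about 80 calls.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run tries to make more paid calls than it was allowed."""


class CallBudget:
    """Counts paid calls for one run and refuses to go past a fixed ceiling."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.used = 0

    def spend(self):
        """Call this immediately before every paid request, retries included."""
        if self.used >= self.max_calls:
            raise BudgetExceeded(f"call budget of {self.max_calls} exhausted")
        self.used += 1


budget = CallBudget(max_calls=200)  # generous headroom over ~80 expected calls
```

One caveat: a guard like this only works if the retry handler catches a narrow error type. A blanket `except Exception` would swallow `BudgetExceeded` along with everything else and keep spinning.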
Takeaways
- Put a hard spend cap at the provider. It’s the only limit a runaway loop can’t out-code. Set it before you write the first request.
- Backoff and a max-retry count are not optional. A retry without delay or ceiling is a self-DoS. A bare continue on an exception is a loaded gun.
- Wire usage alerts at 50% and 90%. You want a phone buzz at midnight, not a graph shaped like a cliff at breakfast.
- Treat weird usage as a leak until proven otherwise. Rotate the key and disable the endpoint first; debug the code second. Containment beats curiosity.
- “Small overnight job” is a smell. Anything unattended that touches a metered API gets a kill switch, a cap, and an alert — or it doesn’t get deployed.
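The kill switch from that last takeaway can be as dumb as a flag file the job checks before every paid call. A sketch under stated assumptions: the path and the names `kill_switch_engaged` and `guarded_call` are hypothetical, and a feature flag or environment variable would do the same job.

```python
from pathlib import Path

# Hypothetical location; any path an operator can touch without a deploy works.
KILL_SWITCH = Path("/var/run/record-enrich.disabled")


def kill_switch_engaged(flag=KILL_SWITCH):
    """True when an operator has created the flag file to halt paid calls."""
    return Path(flag).exists()


def guarded_call(fn, flag=KILL_SWITCH):
    """Run fn() only while the kill switch is off, otherwise fail loudly."""
    if kill_switch_engaged(flag):
        raise RuntimeError(f"kill switch engaged: {flag} exists")
    return fn()
```

Stopping the job then takes one `touch` on the flag file, with no redeploy and no key rotation. It complements the provider cap rather than replacing it, since the cap still holds when the code itself is the thing that is broken.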