
A customer called on a Tuesday. A critical service had been down for hours. Their first question was the one that ruins your week:
“Why didn’t we get an alert?”
I had built the alerting. I knew the rules existed. I’d watched them fire in testing months ago. So my honest, confident, completely wrong answer was: “You should have.”
That’s the trap. “Should have” is what you say right before you learn something humbling about your own stack.
The investigation
First instinct: the rule didn’t fire. Easy theory, easy to check. So I pulled the alert history.
The rule fired. On time. With the right severity, the right labels, the right everything. It did its job flawlessly and then handed the alert off to… something.
┌──────────┐ ┌────────────┐ ┌──────────┐ ┌────────┐
│ RULE │──► │ RECEIVER │──► │ CHANNEL │──► │ HUMAN │
│ ✓ fired │ │ ✓ matched │ │ ✗ none │ │ ✗ — │
└──────────┘ └────────────┘ └──────────┘ └────────┘
▲
│
the chain dies right here
The rule routed to a receiver. The receiver had a name, a config block, a place in the tree. It looked alive. And its list of channels — the email addresses, the webhook URLs, the push targets, whatever was supposed to carry the message to a person — was empty.
Not broken. Not misconfigured. Empty. The alert was delivered, perfectly, to nobody.
The “aha”
Here’s the part that stung. Everyone — me included — had validated the wrong layer.
We tested that rules existed. We tested that rules fired. We never tested that a fired alert reached a phone, an inbox, a pager, a chat channel — a person. The chain has four links; we’d verified two and assumed the other two by vibes.
An alert is not a notification. A rule firing is not someone finding out. The gap between those two facts is exactly where this outage lived — silent, for months — manufacturing a warm, false confidence that the system “had alerting.”
It did. The alerting just talked to a wall.
The fix
Audit every receiver. Every single one needs at least one channel, and that channel has to actually reach a human or a device. No empties, no dead webhooks, no stale email lists.
Then — and this is the part people skip — add a heartbeat. A “dead-man” alert that is supposed to fire on a schedule. If it stops arriving, your delivery path is broken and you find out on purpose instead of during a customer call.
# 1. Find every receiver with an empty / missing channel list.
amtool config routes show
amtool config show | yq '.receivers[] | select(
([.email_configs, .webhook_configs, .pushover_configs,
.slack_configs, .opsgenie_configs] | flatten | length) == 0
) | .name'
# 2. End-to-end test: fire a synthetic alert and confirm it lands on a device.
amtool alert add deadmans_switch \
severity=info service=alerting \
--annotation summary="heartbeat — if you can read this, delivery works"
# 3. The dead-man heartbeat: a rule that is ALWAYS firing on purpose.
groups:
- name: heartbeat
rules:
- alert: DeadMansSwitch
expr: vector(1)
labels: { severity: heartbeat }
annotations:
summary: "Alerting pipeline is alive. Silence = it isn't."
DELIVERY CHAIN ── post-fix
RULE ─► RECEIVER ─► CHANNEL ─► DEVICE
✓ ✓ ✓ (≥1) ✓ buzzed
+ DeadMansSwitch heartbeat every 60s ►► if it goes quiet, you know
Flip the logic. A normal alert proves a problem exists. A heartbeat proves the path itself still works. You want both, because the path is the thing that fails silently.
Why it happened
Someone — possibly past me — created the receiver as a placeholder, fully intending to fill in the channels later. Later never came. The config was syntactically valid, so nothing complained. There’s no error on an empty channel list; it’s a legal, boring state.
And because the rules were visibly there, every review after that pattern-matched “alerting: configured ✓” and moved on. Nobody walked the whole chain to a buzzing phone. The placeholder calcified into production.
Takeaways
- An alert with no destination is worse than no alert. No alerting at least keeps you honest. A dead receiver hands you false confidence and bills you for it during an outage.
- Validate the entire chain: rule → receiver → channel → device. Verifying any single link tells you almost nothing about the link after it.
- Test delivery, not configuration. “The rule exists” and “a human got paged” are different claims. Only the second one matters at 2 a.m.
- Add a dead-man heartbeat. A signal that should arrive on schedule turns silent failure into a loud, on-purpose one.
- Audit receivers for empties on a cadence. Placeholders, decommissioned inboxes, and rotted webhooks accumulate. Grep them out before they grep you.