The Sensor That Cried Wolf

The page came in at 2:14 a.m. Then 2:16. Then 2:19.

By the time I’d wiped the sleep out of my eyes there were forty-one alerts in the channel, all from the same audio detector at a customer site. “LOUD EVENT DETECTED.” Forty-one of them. My first thought was that someone was kicking the building in.

Nobody was kicking the building. Someone closed a heavy door. Then, a few minutes later, someone in the kitchen dropped a sheet pan on a tile floor. Each one lit up the detector like a fireworks finale.

This is the story of a sensor that screamed at everything and meant nothing.

The investigation

The setup was simple on paper. A mic feeds audio into a small model that’s supposed to flag impact-type events — glass breaks, hard slams, the kind of thing you actually want a human to look at. It fires a webhook, and the webhook pages whoever’s on call.

In practice it fired on anything loud.

I pulled the trigger log and lined it up against what people swore actually happened that night.

┌───────────────────────────────────────────────────────────────┐
│  TRIGGER LOG  vs  GROUND TRUTH                                │
├──────────────┬───────────────┬────────────────┬───────────────┤
│  time        │ detector says │ peak level     │ reality       │
├──────────────┼───────────────┼────────────────┼───────────────┤
│  02:14:07    │ ► IMPACT      │  ▓▓▓▓▓▓▓ 0.91  │ door slam   ✗ │
│  02:16:55    │ ► IMPACT      │  ▓▓▓▓▓▓  0.88  │ door slam   ✗ │
│  02:19:31    │ ► IMPACT      │  ▓▓▓▓▓▓▓ 0.93  │ dropped pan ✗ │
│  03:48:12    │ ► IMPACT      │  ▓▓▓▓▓   0.81  │ HVAC bang   ✗ │
│  (silent)    │   --          │  ▓▓      0.34  │ real break  ✗ │
└──────────────┴───────────────┴────────────────┴───────────────┘
        every loud thing fired ── the one real event didn't

That last row is the one that made my stomach drop. There had been a genuine event earlier in the week — a real break — and it was quieter than a slammed door. The detector had ignored it.

The “aha”

So I read the actual trigger code. And there it was, in one branch:

if peak_amplitude > THRESHOLD:
    fire_alert("IMPACT")

That’s it. That’s the whole brain.

The model attached to this thing was a perfectly good audio classifier. It returned a predicted label and a confidence score for every clip — door, speech, glass_break, hvac, the works. And the trigger logic threw all of it away and looked at one number: was it loud?

Loudness is not meaning. A slammed door is loud and boring. A real break can be quiet and important. We’d built a smoke detector that goes off when you turn the lights on.

Same five sounds. Volume flags the four loud ones and misses the quiet one; label + confidence flags only the one that matters.
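To make the contrast concrete, here is a minimal sketch that replays the five events from the trigger log through both rules. The peak levels come from the log. Everything else is an illustrative assumption: the THRESHOLD value of 0.80, and every label and confidence in the table below, which stand in for what the classifier might have returned rather than its real output.

```python
# Peak levels are taken from the trigger log. THRESHOLD, the labels, and the
# confidences are ASSUMED values for illustration, not real classifier output.
THRESHOLD = 0.80
ALERT_LABELS = {"glass_break", "impact", "alarm"}
MIN_CONFIDENCE = 0.82

# (what really happened, peak level, assumed label, assumed confidence)
EVENTS = [
    ("door slam",   0.91, "door",        0.95),
    ("door slam",   0.88, "door",        0.93),
    ("dropped pan", 0.93, "impact",      0.55),
    ("HVAC bang",   0.81, "hvac",        0.90),
    ("real break",  0.34, "glass_break", 0.91),
]

def fires_on_volume(peak):
    """The original rule: anything loud enough is an IMPACT."""
    return peak > THRESHOLD

def fires_on_label(label, confidence):
    """The replacement rule: an alert-worthy label the model is confident in."""
    return label in ALERT_LABELS and confidence >= MIN_CONFIDENCE

volume_hits = [name for name, peak, _, _ in EVENTS if fires_on_volume(peak)]
label_hits = [name for name, _, label, conf in EVENTS if fires_on_label(label, conf)]

print("volume rule fires on:", volume_hits)
print("label rule fires on: ", label_hits)
```

With these assumed values the volume rule fires on the four loud non-events and stays silent on the break, while the label rule fires on the break alone. The dropped pan is given an alert-worthy label with low confidence on purpose, to show the confidence gate doing its share of the work.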

The fix

The model already knew what it was hearing. We just had to listen to the label instead of the volume — then gate on confidence and stop it from machine-gunning the same event.

import time

# `classifier` and `fire_alert` are provided elsewhere in the service.

ALERT_LABELS = {"glass_break", "impact", "alarm"}
MIN_CONFIDENCE = 0.82
DEBOUNCE_SECONDS = 30

# time.monotonic() has an arbitrary starting point, so start at -inf
# to guarantee the first qualifying event is never debounced away.
_last_fire = float("-inf")

def handle(clip):
    global _last_fire
    label, confidence = classifier.predict(clip)   # what was it, how sure

    # 1) classify by LABEL, not loudness
    if label not in ALERT_LABELS:
        return

    # 2) confidence gate
    if confidence < MIN_CONFIDENCE:
        return

    # 3) debounce repeats within a window
    now = time.monotonic()
    if now - _last_fire < DEBOUNCE_SECONDS:
        return
    _last_fire = now

    fire_alert(label, confidence)
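
Debounce logic is awkward to test when it reads the real clock. One way around that, sketched below, is to wrap the same three checks in a small class that accepts a clock function. AlertGate and its method names are hypothetical helpers invented for this example, not part of the production code, and the sample labels and confidences are made up.

```python
import time

class AlertGate:
    """Label + confidence + debounce gate with an injectable clock (hypothetical)."""

    def __init__(self, alert_labels, min_confidence, debounce_seconds,
                 clock=time.monotonic):
        self.alert_labels = set(alert_labels)
        self.min_confidence = min_confidence
        self.debounce_seconds = debounce_seconds
        self.clock = clock
        # Start at -inf so the first qualifying event is never suppressed.
        self._last_fire = float("-inf")

    def should_fire(self, label, confidence):
        if label not in self.alert_labels:
            return False
        if confidence < self.min_confidence:
            return False
        now = self.clock()
        if now - self._last_fire < self.debounce_seconds:
            return False
        self._last_fire = now
        return True


# A fake clock lets the example run instantly instead of sleeping.
fake_now = [0.0]
gate = AlertGate({"glass_break", "impact", "alarm"}, 0.82, 30,
                 clock=lambda: fake_now[0])

results = []
for t, label, conf in [
    (0.0,  "glass_break", 0.91),  # fires
    (5.0,  "glass_break", 0.90),  # same event again, debounced
    (6.0,  "door",        0.99),  # not an alert label
    (40.0, "impact",      0.60),  # below the confidence gate
    (45.0, "impact",      0.85),  # fires: 45 s since the last alert
]:
    fake_now[0] = t
    results.append(gate.should_fire(label, conf))

print(results)  # [True, False, False, False, True]
```

Note that only alerts that actually fire reset the window, so a stream of rejected clips never extends the silence. Whether the window should be shared across labels or tracked per label is a design choice this sketch leaves open.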

Then — and this is the part people skip — I refused to trust it until I’d measured it. We had that hand-labeled log from the noisy night, so I scored the new logic against ground truth before letting it page a single human again.

# replay the labeled clips through the new gate, compare to truth
python eval_detector.py \
    --clips ./ground_truth/clips/ \
    --labels ./ground_truth/truth.csv \
    --report precision_recall

# precision  0.94   (almost no false alarms)
# recall     0.88   (catches the real ones, including the quiet break)
# false-positives/night:  41  ->  1

Forty-one false alarms a night down to one. The quiet break? Caught.
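
For anyone who hasn't computed these two numbers by hand: precision is the share of fired alerts that were real, and recall is the share of real events that fired an alert. The internals of eval_detector.py aren't shown here, so the function below is only a sketch of the standard definitions, run on a made-up toy list rather than the real clips.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length lists of booleans.

    predicted[i] is True if the gate fired on clip i;
    actual[i] is True if clip i was a real event per the hand-labeled truth.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # fired, and it was real
    fp = sum(1 for p, a in pairs if p and not a)    # fired, but false alarm
    fn = sum(1 for p, a in pairs if not p and a)    # real event, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data for illustration only, NOT the evaluation set from the post.
predicted = [True, True, False, True, False, False]
actual    = [True, False, False, True, True, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision {precision:.2f}  recall {recall:.2f}")
```

The two numbers pull in opposite directions. Raising MIN_CONFIDENCE tends to lift precision and cost recall, which is why it helps to track both against the same labeled log whenever the threshold moves.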

Why it happened

The amplitude check was the first thing someone wrote — a five-minute proof of concept to confirm the mic and the webhook worked end to end. It worked. It shipped. The classifier got bolted on later for “labeling,” but nobody ever rewired the trigger to actually use it.

So the smart part rode along as a passenger while the dumbest possible heuristic drove. Classic. The proof of concept becomes production because it never visibly breaks — until 2 a.m. on a night with a heavy door.

Takeaways

Loudness is not meaning. Amplitude tells you something happened, not what. If your model predicts a label, trigger on the label.
Gate on confidence. A bare classification with no threshold is just a louder guess. Make the model commit before it pages a human.
Debounce. One physical event becomes many samples. Collapse repeats inside a window or you’ll get a 41-message wall.
Validate against ground truth before you trust it. Keep a hand-labeled event log and measure precision/recall. “Seems better” is not a number.
Audit the trigger path, not just the model. A great classifier is worthless if a five-minute “if loud:” check is still the thing pulling the trigger.
