There is a specific sound. A mouth click. A wet little lip-smack between words. A sharp inhale before a sentence.

You probably don’t hear them. I hear nothing else.

Misophonia means certain sounds don’t annoy me — they hijack me. So I did what any sysadmin does with a problem that won’t leave: I decided to make it someone else’s job. Specifically, a script’s. The pitch was simple. Detect the trigger sounds in any audio or video, scrub them out before playback, hand me back a clean track.

The pitch was simple. The first version was a disaster.

The naive approach

My first instinct was the dumbest possible one, which is usually where I start.

Triggers feel violent, so I assumed they were loud. Threshold the loudness, mute anything that spikes, done by lunch.

# v1: the volume-gate theory of everything
import numpy as np

def scrub_naive(samples, sr, thresh_db=-18.0):
    frame = int(0.02 * sr)               # 20ms frames
    out = samples.copy()
    for i in range(0, len(samples) - frame + 1, frame):   # +1 so the last full frame is checked too
        chunk = samples[i:i+frame]
        rms = np.sqrt(np.mean(chunk**2)) + 1e-9
        db = 20 * np.log10(rms)
        if db > thresh_db:               # "too loud" -> kill it
            out[i:i+frame] = 0.0
    return out

I ran it on a podcast. It muted the consonants. It muted laughter. It muted entire emphasized words. And the lip-smacks? Sailed right through, untouched, smug.

Because the triggers were never loud. A lip-smack sits below normal speech in raw energy. I’d built a machine that deleted the wrong things at the wrong volume for the wrong reason.

The aha

I pulled the waveforms apart and actually looked at them. That’s when it clicked.

A vowel is periodic and narrowband — energy stacked in tidy harmonic bands, sustained over time. A mouth click is the opposite animal: short, broadband, transient. A flat smear of energy across the whole spectrum, gone in 30 milliseconds.

   AMPLITUDE-OVER-TIME  (loudness lies)        SPECTRUM  (signature tells the truth)

   speech  ┌─────────────┐   click ┌─┐         speech ▁▃█▇▅▂   (harmonic, banded)
           │ /\  /\  /\   │        │█│         click  ▅▅▆▅▆▅   (flat, broadband)
   ────────┘/  \/  \/  \  └──┐ ────┘ └──        breath ▂▂▃▂▂▃   (broadband, low, hissy)
            loud + sustained    quiet + 30ms
                  ^ v1 chased THIS, the wrong axis
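
To make the two axes concrete, here is a minimal sketch using synthetic stand-ins rather than real recordings. The "vowel" is a stack of harmonics and the "click" is a quiet noise burst; both signals, the 16 kHz sample rate, and the helper names `spectral_flatness` and `rms_db` are assumptions made up for illustration. It measures loudness and spectral flatness side by side.

```python
import numpy as np

def spectral_flatness(x):
    """Geometric mean / arithmetic mean of the power spectrum (near 0 = tonal, near 1 = flat)."""
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

def rms_db(x):
    """RMS level in dB relative to full scale (1.0)."""
    return float(20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-9))

sr = 16000                               # assumed sample rate
n = int(0.03 * sr)                       # 30 ms window
t = np.arange(n) / sr

# Stand-in "vowel": a loud stack of harmonics on a 150 Hz fundamental.
vowel = 0.5 * sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 6))

# Stand-in "click": a much quieter burst of white noise.
rng = np.random.default_rng(0)
click = 0.02 * rng.standard_normal(n)

for name, sig in [("vowel", vowel), ("click", click)]:
    print(f"{name}: {rms_db(sig):6.1f} dBFS, flatness {spectral_flatness(sig):.3f}")
```

The quiet burst loses on level and wins on flatness, so a volume gate and a flatness measure rank the two signals in opposite order.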

The trigger wasn’t a level. It was a shape. A fingerprint smeared across frequency and time. And “find this fingerprint in a stream” is not a thresholding problem.

It’s a classification problem. I’d been trying to solve a recognition task with a ruler.

[Diagram: audio -> frames -> features (spectral flatness, ZCR / onset, MFCC) -> classifier -> p(trigger) + threshold -> clean (duck & fade)]
Stop measuring how loud it is. Describe what it looks like, then let a classifier decide.

The fix

So I threw out the ruler and described the signature instead. Per frame, extract the features that actually separate a click from a vowel — spectral flatness (how broadband it is), zero-crossing rate, onset sharpness, MFCCs. Then a small classifier outputs a probability, and only frames over a confidence threshold get touched.

import librosa, numpy as np

def features(chunk, sr):
    S = np.abs(librosa.stft(chunk, n_fft=512, hop_length=128))
    return np.concatenate([
        [librosa.feature.spectral_flatness(S=S).mean()],   # broadband?
        [librosa.feature.zero_crossing_rate(chunk).mean()],# transient hiss?
        librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13).mean(axis=1),
    ])

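def scrub(samples, sr, clf, p=0.85, debounce_ms=40):
    # samples: mono float array; clf: any fitted classifier exposing predict_proba
    frame, out = int(0.02*sr), samples.copy()
    cooldown, hold = 0, max(1, int(debounce_ms/20))   # debounce counted in 20ms frames
    muted = False                                     # is the gate already closed?
    for i in range(0, len(samples)-frame+1, frame):   # +1 so the last full frame is scored
        prob = clf.predict_proba(features(samples[i:i+frame], sr)[None])[0,1]
        if prob >= p: cooldown = hold          # fire + arm debounce
        if cooldown > 0:                        # fade out once on entry, then hold silence
            out[i:i+frame] *= 0.0 if muted else np.linspace(1, 0, frame)
            cooldown, muted = cooldown - 1, True
        elif muted:                             # gate reopening: fade back in, no hard jump
            out[i:i+frame] *= np.linspace(0, 1, frame)
            muted = False
    return out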
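
The feature list above names onset sharpness, but the `features` function as shown does not compute it. One way it could be approximated is spectral flux: how much the magnitude spectrum jumps upward between consecutive short windows. This is a sketch under that assumption, not necessarily what the real tool does; `onset_sharpness`, `n_fft=256`, and `hop=64` are hypothetical choices.

```python
import numpy as np

def onset_sharpness(chunk, n_fft=256, hop=64):
    """Peak positive spectral flux in a chunk, relative to its average spectral magnitude."""
    window = np.hanning(n_fft)
    starts = range(0, len(chunk) - n_fft + 1, hop)
    mags = np.array([np.abs(np.fft.rfft(chunk[s:s + n_fft] * window)) for s in starts])
    if len(mags) < 2:                        # too short to compare windows
        return 0.0
    rise = np.maximum(np.diff(mags, axis=0), 0.0)   # keep only increases in energy
    flux = rise.sum(axis=1)                          # one flux value per window step
    avg_total = mags.mean() * mags.shape[1]          # average summed magnitude per window
    return float(flux.max() / (avg_total + 1e-9))

sr = 16000                                   # assumed sample rate
n = int(0.03 * sr)
t = np.arange(n) / sr

steady = np.sin(2 * np.pi * 200 * t)         # sustained tone: spectrum barely changes
transient = np.zeros(n)
transient[-100:] = np.random.default_rng(1).standard_normal(100)  # silence, then a burst

print(onset_sharpness(steady), onset_sharpness(transient))
```

A sustained tone scores near zero because its spectrum is stable from window to window, while silence followed by a burst scores high. The value could be appended to the feature vector alongside flatness and ZCR.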

Three details earned their keep. The confidence threshold stops it firing on every sibilant s. The debounce keeps one click from flickering the gate on and off mid-word. And I duck with a short cross-fade instead of a hard zero, so the cut doesn’t itself become a click.
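
The debounce and the fade can also be separated from the classifier and looked at on their own. The sketch below assumes offline processing, so it can look ahead and start fading before the flagged frame. It turns per-frame yes/no decisions into a per-sample gain curve: a hit holds the gate shut for `hold` frames, and each edge gets a linear ramp of `fade` samples. The function name `gain_envelope` and its parameters are hypothetical.

```python
import numpy as np

def gain_envelope(flags, frame, hold=2, fade=None):
    """Turn per-frame trigger flags into a per-sample gain curve in [0, 1]."""
    fade = frame // 2 if fade is None else fade
    muted = np.zeros(len(flags), dtype=bool)
    cooldown = 0
    for k, hit in enumerate(flags):
        if hit:
            cooldown = hold                  # (re)arm the debounce on every hit
        if cooldown > 0:
            muted[k] = True
            cooldown -= 1

    gain = np.repeat(np.where(muted, 0.0, 1.0), frame)

    # Locate the start and end (in samples) of every muted run.
    edges = np.diff(np.concatenate([[0], muted.astype(int), [0]]))
    starts = np.flatnonzero(edges == 1) * frame
    ends = np.flatnonzero(edges == -1) * frame

    down = np.linspace(1, 0, fade, endpoint=False)   # 1 -> just above 0
    up = np.linspace(0, 1, fade, endpoint=False)     # 0 -> just below 1
    for s, e in zip(starts, ends):
        a = max(0, s - fade)                 # ramp down just before the run
        gain[a:s] = np.minimum(gain[a:s], down[fade - (s - a):])
        b = min(len(gain), e + fade)         # ramp up just after the run
        gain[e:b] = np.minimum(gain[e:b], up[:b - e])
    return gain

flags = [0, 0, 1, 0, 0, 0, 0]                # one detected trigger in frame 2
gain = gain_envelope(flags, frame=10, hold=2, fade=5)
# cleaned = samples[:len(gain)] * gain       # how it would be applied to audio
```

Because the ramps sit outside the muted run, the flagged frames stay fully silent and the gain never jumps by more than one ramp step between neighbouring samples.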

On isolated triggers — a click in a gap, a breath between sentences — it works. Validated against a hand-labeled set, and it catches the things that used to make me leave the room.
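
Validating against a labeled set can be as small as sweeping the confidence threshold and counting hits and misses. The sketch below assumes you already have one classifier probability and one hand label per frame. The numbers in `toy_probs` and `toy_labels` are made up purely to show the mechanics and are not results from the real labeled set; `sweep_thresholds` is a hypothetical helper.

```python
import numpy as np

def sweep_thresholds(probs, labels, thresholds=(0.5, 0.7, 0.85, 0.95)):
    """Return (threshold, precision, recall) for each candidate threshold."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    rows = []
    for p in thresholds:
        fired = probs >= p
        tp = int(np.sum(fired & labels))     # fired on a real trigger
        fp = int(np.sum(fired & ~labels))    # fired on something harmless
        fn = int(np.sum(~fired & labels))    # missed a real trigger
        # Convention: with nothing to count, report 1.0 rather than divide by zero.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        rows.append((p, precision, recall))
    return rows

# Made-up example data, NOT measurements from the actual tool.
toy_probs = [0.95, 0.90, 0.60, 0.40, 0.88, 0.20]
toy_labels = [1, 1, 0, 0, 0, 1]

results = sweep_thresholds(toy_probs, toy_labels)
for p, precision, recall in results:
    print(f"p>={p:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the threshold trades missed triggers for fewer mangled words, and a table like this is what turns a number such as 0.85 from a guess into a measured choice.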

Why it happened

I anthropomorphized the problem. The sounds feel loud and aggressive to me, so I encoded my emotional read of them as a signal property. The DSP didn’t care about my feelings. Loudness and identity are different axes, and I’d built my whole v1 on the wrong one.

And I’ll be honest about the part that isn’t solved: triggers that land on top of speech. When a lip-smack overlaps a spoken word, ducking the frame mangles the word too. Source separation in real time is a genuinely harder problem, and pretending I’d nailed it would be the same mistake in a nicer suit. That’s phase two. For now the tool is honest about what it can and can’t pull apart.

Takeaways

  • “Detect this sound” is classification, not thresholding. If you reach for a volume gate, ask whether the thing you want is actually defined by loudness — it usually isn’t.
  • Define the signature in feature space. Spectral flatness, ZCR, onset, MFCCs separate a broadband transient from a harmonic vowel far better than amplitude ever will.
  • Validate against labeled examples. “It feels right” is not a metric. A confidence threshold is only meaningful when you’ve measured it against ground truth.
  • Debounce and cross-fade your edits. A hard cut to fix a click can introduce a new click. Smooth the seams.
  • Be loud about the cases you can’t solve. Isolated triggers: handled. Triggers overlapping speech: not yet, and saying so is the whole point.