A client’s surveillance system had stopped recording. Seventy-five cameras, a 44-terabyte recorder, and a banner on the dashboard reading “Recovering storage.” Internally the system had flipped itself into a read-only “limited mode” and quietly stopped saving video. By the time we dug in, it had been down for two and a half days.
The obvious diagnosis, the one most techs would reach for and the one the symptoms practically beg you to believe, is: a disk is dying, replace it.
That diagnosis would have made everything dramatically worse. Here’s why, and what was actually going on.
The symptom vs. the disease
The logs were screaming the same error, over and over, hundreds of times:
EXT4-fs warning (device md3): I/O error writing to inode 339687425, block 357888066
Buffer I/O error on device md3, logical block 357888066
Same block. Every single time. The filesystem wanted to write to one specific spot on the array and the write kept failing.
When a write to a RAID array fails on the same block forever, instinct says bad sector, failing drive. So the first thing we did was pull the health stats (SMART) off all seven drives.
Every counter that would indicate a failing disk — reallocated sectors, pending sectors, offline-uncorrectable sectors, cable/CRC errors — was zero. On all seven drives. The disks were pristine.
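That "all zeros" check can be scripted. The sketch below is a minimal illustration, assuming the column layout that `smartctl -A` usually prints for ATA drives and the attribute names it typically uses for the four counters named above. The sample text is illustrative, not output captured from this incident, and `nonzero_failure_counters` is a hypothetical helper name.

```python
# Raw SMART counters that usually move when media or cabling is failing.
# Assumption: these are the names smartctl prints for the drive in question.
SUSPECT_ATTRIBUTES = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def nonzero_failure_counters(smartctl_output: str) -> dict[str, int]:
    """Return the suspect attributes whose raw value is above zero."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) < 10 or fields[1] not in SUSPECT_ATTRIBUTES:
            continue
        raw = fields[9]
        if raw.isdigit() and int(raw) > 0:
            flagged[fields[1]] = int(raw)
    return flagged


# Illustrative sample, not real output from the recorder.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

print(nonzero_failure_counters(SAMPLE))  # prints {}
```

An empty result on every member is the "healthy disks" half of the contradiction. Any non-empty result would point back toward a real hardware problem.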
Healthy disks. A write that won’t complete. Same block forever. That contradiction is the whole story.
What’s actually a “bad block list”?
Linux software RAID (the md subsystem, the thing mdadm drives) has a feature called the Bad Block List, or BBL. The idea sounds reasonable: if the array ever has trouble with a specific spot on a member disk, instead of kicking the whole disk out, it just writes that spot down in a little list — “don’t use this block” — and routes around it.
Helpful in theory. In practice, it has been one of the most controversial features in the Linux RAID stack for about fifteen years. Kernel developers have openly argued for removing it. The reason is exactly what we walked into:
- Something causes a transient I/O hiccup on a member — a momentary timeout, a controller blip, a drive that stalls for a fraction of a second under heavy write load. Nothing actually wrong with the media.
- The BBL feature misreads that hiccup as a bad block and writes it down permanently.
- On RAID5, where every stripe is spread across multiple disks, the bad-block entry gets propagated onto other members. Over time, the same logical region ends up flagged on several disks.
We confirmed this precisely. Four of the seven disks carried byte-for-byte identical bad-block ranges — and that is the tell:
RAID5 · seven disks · 44 TB usable
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ sda │ sdb │ sdc │ sdd │ sde │ sdf │ sdg │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│  ·  │ ▓▓▓ │ ▓▓▓ │ ▓▓▓ │  ·  │  ·  │ ▓▓▓ │
│clean│ BBL │ BBL │ BBL │clean│clean│ BBL │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘
╰──── identical phantom ranges ────╯
real bad sectors do NOT line up to the exact
same address across four separate drives.
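The fingerprint in the diagram (identical ranges on several members) can be checked mechanically. The sketch below assumes each member's report comes from `mdadm --examine-badblocks` and that entries appear as rows of the form `<start> for <length> sectors`. The reports and sector numbers here are invented for illustration, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches rows such as "      2163584 for 512 sectors".
RANGE_ROW = re.compile(r"^\s*(\d+) for (\d+) sectors\s*$")


def parse_badblocks(examine_output: str) -> set[tuple[int, int]]:
    """Extract (start_sector, length) pairs from one member's report."""
    ranges = set()
    for line in examine_output.splitlines():
        match = RANGE_ROW.match(line)
        if match:
            ranges.add((int(match.group(1)), int(match.group(2))))
    return ranges


def shared_ranges(reports: dict[str, str], min_members: int = 2):
    """Map each range to the members listing it, keeping only ranges
    that appear on at least min_members devices."""
    owners = defaultdict(list)
    for device, text in reports.items():
        for entry in parse_badblocks(text):
            owners[entry].append(device)
    return {r: sorted(d) for r, d in owners.items() if len(d) >= min_members}


# Hypothetical reports; the sector numbers are made up.
REPORTS = {
    "sda5": "Bad-blocks list is empty in /dev/sda5\n",
    "sdb5": "Bad-blocks on /dev/sdb5:\n   2163584 for 512 sectors\n",
    "sdc5": "Bad-blocks on /dev/sdc5:\n   2163584 for 512 sectors\n",
}

print(shared_ranges(REPORTS))
# prints {(2163584, 512): ['sdb5', 'sdc5']}
```

A range that shows up on two or more members at the same address is a strong hint that the list is bookkeeping, not media damage.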
Why this stops recording entirely
This is the part that turns a phantom bookkeeping error into a system outage.
To write one stripe of video, RAID5 has to write to several disks at once plus update parity. If even one of those disks says “that block is on my no-go list,” the entire stripe write fails and returns an I/O error:
ONE stripe write ──► every data member must accept it
┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐
│ D0 ││ D1 ││ D2 ││ D3 ││ D4 ││ P │
└────┘└─🚫─┘└────┘└─🚫─┘└────┘└────┘
│ │
╰── "on my no-go list" ──╯
│
▼
✗ THE WHOLE STRIPE FAILS → EIO
→ filesystem protectively goes read-only
→ recorder drops to "limited mode"
→ 75 cameras stop saving video
The filesystem above sees the write fail, decides its metadata is now untrustworthy, and protectively remounts read-only. The camera software sees a read-only volume and drops into limited mode. Recording stops. All of it.
So: zero failing hardware, and a total recording outage, caused entirely by a list of imaginary bad spots the software refused to write over.
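The cascade can be written down as a toy model. This is a deliberately simplified sketch of the behavior described above, not how the md driver is implemented; `Member`, `write_stripe`, and `recorder_state` are hypothetical names, and block 42 is an arbitrary example.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    # Blocks this member has recorded on its no-go list.
    bad_blocks: set[int] = field(default_factory=set)


def write_stripe(members: list[Member], block: int) -> str:
    """Toy rule from the text: one refusing member fails the whole write."""
    refusing = [m.name for m in members if block in m.bad_blocks]
    if refusing:
        return "EIO (refused by " + ", ".join(refusing) + ")"
    return "ok"


def recorder_state(write_result: str) -> str:
    """A failed write leads to a read-only filesystem and limited mode."""
    return "recording" if write_result == "ok" else "limited mode (read-only volume)"


MEMBERS = [
    Member("D0"), Member("D1", {42}), Member("D2"),
    Member("D3", {42}), Member("D4"), Member("P"),
]

result = write_stripe(MEMBERS, 42)
print(result)                  # prints EIO (refused by D1, D3)
print(recorder_state(result))  # prints limited mode (read-only volume)
```

Nothing in the model consults disk health. The failure depends only on what is written in the lists, which is why healthy drives could still produce a full outage.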
The trap
Now the dangerous part. The “replace the disk” reflex.
If you pull one of those drives and let the array rebuild onto a replacement, mdadm re-creates a bad-block list on the new member and re-propagates the same poison. You’d do four sequential multi-hour degraded rebuilds, each one re-importing the exact problem you were trying to remove, while running the array in its most fragile state. Replacing disks here doesn’t just fail to fix it — it actively risks the array.
The correct fix is the opposite of intuition: don’t touch the hardware. Erase the lists.
The repair
mdadm has an escape hatch for exactly this — a flag that clears (and disables) the bad-block feature across every member as the array is assembled:
mdadm --assemble --update=force-no-bbl /dev/md3 /dev/sd[a-g]5
The careful sequence around that one command:
- Back up everything first. Recovery codes, configuration, and a full pre-incident database snapshot, all pulled off the box, so nothing could be lost once we touched the array.
- Quiet the system. Stop every service touching the volume so nothing is mid-write. (This box had seventeen interlocking services and a storage daemon that re-assembles the array on its own.)
- Unmount and stop the array.
- Clear the bad-block lists with the command above. Afterward, every member reported the magic words:
sda5 → No bad-blocks list configured
sdb5 → No bad-blocks list configured
sdc5 → No bad-blocks list configured
sdd5 → No bad-blocks list configured
sde5 → No bad-blocks list configured
sdf5 → No bad-blocks list configured
sdg5 → No bad-blocks list configured
☁ poof. gone.
- Check the filesystem. A read-only e2fsck -fn preview came back clean: only two trivial, optional defragmentation suggestions. Zero corruption, zero data to remove. That confirmed the “filesystem damage” was never real damage; it was just the blocked writes.
- Reboot and verify.
The result: array healthy, zero I/O errors, bad-block lists empty, and within seconds of the services coming back, the full camera fleet was writing video again. Total surgery time: about an hour. Zero permanent data loss.
The gotcha worth its own paragraph
These appliances don’t assemble their array from a config file the way a generic Linux server does. A storage daemon brings the array up at boot. That means the daemon will happily re-assemble and re-mount the array out from under you in the middle of a repair if you don’t stop it — and it has to be running again before reboot or the box comes up with no storage at all. Finding and accounting for every one of those interlocking services (seventeen, including media-server processes that respawn instantly when killed) was most of the actual work. The headline command takes one second; safely getting the system into a state where that command is safe to run takes the hour.
The kicker
Once a system surprises you, the right question is: where else do we have this and not know it yet?
So we audited every storage system under management for the same fingerprint. Most weren’t even capable of this failure (different storage technology). One was clean. And one other recorder was sitting in the earliest stage of the exact same bug — two phantom bad-block ranges just beginning to form, still recording fine, no errors yet. Caught months before it would have become an outage. We added automated monitoring across the fleet that checks these lists every 30 minutes, so the next occurrence is a 30-minute alert instead of a two-day blackout.
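A periodic check along those lines might look like the sketch below. It is not the monitoring we deployed, only a minimal illustration: it shells out to `mdadm --examine-badblocks` (which needs root), assumes entries are printed as rows ending in `sectors`, and uses hypothetical function names. The `run` parameter exists so the logic can be exercised with canned text instead of a real array.

```python
import subprocess


def examine(device: str) -> str:
    """Run the mdadm query for one member; needs root and a real device."""
    done = subprocess.run(
        ["mdadm", "--examine-badblocks", device],
        capture_output=True, text=True, check=False,
    )
    return done.stdout + done.stderr


def count_entries(examine_output: str) -> int:
    """Count rows that look like '<start> for <length> sectors'."""
    return sum(
        1 for line in examine_output.splitlines()
        if line.strip().endswith("sectors") and " for " in line
    )


def check_fleet(devices, run=examine):
    """Return {device: entry_count} for members with a non-empty list."""
    alerts = {}
    for dev in devices:
        count = count_entries(run(dev))
        if count:
            alerts[dev] = count
    return alerts


# Canned reports standing in for real mdadm output (invented values).
FAKE = {
    "/dev/sda5": "No bad-blocks list configured on /dev/sda5\n",
    "/dev/sdb5": "Bad-blocks on /dev/sdb5:\n    100 for 8 sectors\n    900 for 8 sectors\n",
}

print(check_fleet(FAKE, run=FAKE.get))  # prints {'/dev/sdb5': 2}
```

Any non-empty result is worth an alert. Scheduling is left out here; a cron entry or timer that runs the check every 30 minutes would be one way to get the interval described above.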
Takeaways
- Symptoms lie. Health data doesn’t. The single most useful five minutes of this entire incident was reading SMART and seeing all zeros. That one fact ruled out the obvious-but-wrong diagnosis and pointed at the real one.
- The bad-block list is a footgun. If you run Linux software RAID, know this feature exists, know it can manufacture failures out of transient hiccups, and know --update=force-no-bbl is how you clear it. Consider monitoring for non-empty lists proactively.
- The intuitive fix can be the destructive one. “Replace the disk” would have deepened the hole. Sometimes the move is to change nothing physical and correct the software’s bad bookkeeping.
- Always back up before array surgery, even when you’re confident. Especially when you’re confident.
- Then go look at everything else. The bug you just fixed is rarely the only instance of it you own.
The hardware was never broken. The drives were never dying. A fifteen-year-old software feature, trying to be helpful, wrote down a problem that didn’t exist and then refused to work around its own mistake. Once you can see that, the fix is almost anticlimactic. Seeing it is the job.