The pager went off at the worst possible hour, which is the only hour pagers know.

Every app talking to the database was throwing the same line:

FATAL:  the database system is starting up

Not “down.” Not “connection refused.” Starting up. As if it were thirty seconds from being fine. It said that thirty seconds ago too. And thirty seconds before that.

The investigation

First instinct: restart it harder. Bad instinct. I sat on my hands.

The host had taken an ugly storage hiccup earlier — the array blipped, the kernel got cranky, and Postgres went down without a clean shutdown. So I went to the only source of truth that matters here: the server log.

sudo sh -c 'tail -f /var/lib/pgsql/data/log/postgresql-*.log'

And there it was, crawling:

LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 3F/A2000028
LOG:  redo in progress, elapsed time: 412.06 s, current LSN: 3F/A2090110

Recovery was happening. It was just happening at the speed of continental drift. In nearly seven minutes of replay it had moved the LSN by a rounding error: about 576 KiB of WAL.
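To put a number on that crawl: an LSN is a 64-bit byte position in the WAL, printed as two hexadecimal halves separated by a slash, so subtracting two LSNs gives the bytes replayed between them. The sketch below is a minimal POSIX shell illustration of that arithmetic using the values from the log above. The function names lsn_to_bytes and replay_rate are made up for this example and are not PostgreSQL tools.

```shell
#!/bin/sh
# Convert an LSN such as 3F/A2090110 into an absolute byte position.
# The part before the slash is the high 32 bits, the part after is the low 32 bits.
lsn_to_bytes() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( (0x${hi} << 32) + 0x${lo} ))
}

# Print the average replay speed in bytes per second.
# Arguments: start LSN, current LSN, elapsed seconds (may be fractional).
replay_rate() {
    start=$(lsn_to_bytes "$1")
    now=$(lsn_to_bytes "$2")
    awk -v b="$(( now - start ))" -v s="$3" 'BEGIN { printf "%.0f\n", b / s }'
}

# Values copied from the recovery log above.
replayed=$(( $(lsn_to_bytes 3F/A2090110) - $(lsn_to_bytes 3F/A2000028) ))
echo "bytes replayed: $replayed"
echo "bytes per second: $(replay_rate 3F/A2000028 3F/A2090110 412.06)"
```

With the log values above this reports 590056 bytes replayed in 412 seconds, roughly 1.4 KiB per second.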

That’s not a Postgres bug. That’s a disk gasping for air.

The “aha”

Here’s the thing every junior forgets at 3 a.m.: after an unclean stop, Postgres doesn’t just “boot.” It replays its write-ahead log to drag the data files back to a consistent state. Until that replay finishes, the server refuses connections, and the polite way it refuses is “the database system is starting up.”

        unclean shutdown
   ┌────────────────────────┐      reads WAL
   │  PostgreSQL  (startup)  │ ───────────────┐
   └────────────────────────┘                 │
                │                              ▼
                │ replays records      ┌───────────────┐
                │ onto data files      │   WAL (pg_wal) │
                ▼                       └───────┬───────┘
   ┌────────────────────────┐                  │
   │   data files on DISK    │ ◄────────────────┘
   └───────────┬────────────┘
               │  if DISK is slow/failing,
               ▼  replay crawls → "starting up" forever
        clients see: FATAL: starting up
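
One way to see this refusal from the client side without parsing error strings is pg_isready, a small utility that ships with PostgreSQL and reports server state through its exit status (0 accepting, 1 rejecting, 2 no response, 3 no attempt made). The sketch below only translates that status into words. The function name describe_pg_isready and the host and port in the commented usage are assumptions for illustration.

```shell
#!/bin/sh
# Map pg_isready's documented exit status to a human-readable state.
describe_pg_isready() {
    case "$1" in
        0) echo "accepting connections" ;;
        1) echo "rejecting connections (for example, still starting up)" ;;
        2) echo "no response" ;;
        3) echo "no attempt made (bad parameters)" ;;
        *) echo "unknown status $1" ;;
    esac
}

# Hypothetical usage (host and port are assumptions):
#   status=0
#   pg_isready -h localhost -p 5432 -q || status=$?
#   describe_pg_isready "$status"
```

During WAL replay the expected answer is status 1: the server is there, it just isn't taking visitors yet.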

WAL replay is only as fast as the disk it’s writing to. The storage hiccup hadn’t killed the disk — it had wounded it. Every read was retrying, every write was waiting. Recovery was real, honest, and effectively never going to end on that hardware.

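The cardinal sin would’ve been to kill -9 the startup process to “speed things up.” It speeds up nothing: the next start replays from the same checkpoint all over again, and on a disk that’s already struggling, every extra crash is another chance to end up with a data directory that’s half-applied and fully useless.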

[Figure: Postgres (startup) replays the WAL (redo log) onto a slow / failing disk. Replay is only as fast as the disk underneath it: wounded disk ⇒ recovery crawls ⇒ "starting up" forever.]
Recovery wasn't stuck — it was throttled by the storage it was trying to heal.

The fix

Rule one of database recovery: a database cannot recover on a broken disk. Fix the floor before you ask anyone to dance on it.

So I stopped Postgres cleanly, checked the hardware, and only then took the bad storage out of the equation.

# 1. Stop the instance cleanly — let it park, don't kill -9
sudo systemctl stop postgresql

# 2. Check the storage BEFORE touching the database
sudo smartctl -a /dev/sda | grep -iE 'reallocated|pending|health'
sudo dmesg --ctime | grep -iE 'i/o error|ata[0-9]+|reset'

# 3. Move the data dir onto healthy storage (rsync, preserve everything)
sudo rsync -aHAX --info=progress2 /var/lib/pgsql/data/ /srv/pgdata-healthy/
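
The grep in step 2 shows the relevant SMART lines but leaves the judgment to tired eyes. As a rough sketch, the filter below flags the classic bad-sector counters when their raw value is non-zero. It assumes the ATA attribute table printed by smartctl -A, where the raw value is the tenth column; NVMe and SAS devices report health differently. The function name smart_bad_sectors is made up for this example.

```shell
#!/bin/sh
# Read a smartctl attribute table on stdin and print any bad-sector counter
# whose raw value (10th column in the ATA layout) is greater than zero.
# Returns 1 if at least one counter is non-zero, 0 otherwise.
smart_bad_sectors() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
            print $2 "=" $10
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Hypothetical usage (the device name is an assumption):
#   sudo smartctl -A /dev/sda | smart_bad_sectors || echo "do not trust this disk"
```

A clean result here is not a clean bill of health. It only means these particular counters are zero, so the kernel log check still matters.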

With the data directory living on a disk that wasn’t actively dying, I pointed Postgres at it and let recovery run — and left it alone.

sudo -u postgres /usr/bin/postgres -D /srv/pgdata-healthy &
sudo sh -c 'tail -f /srv/pgdata-healthy/log/postgresql-*.log'

This time the LSN sprinted instead of crawled:

LOG:  redo in progress, elapsed time: 9.21 s, current LSN: 3F/F1A40020
LOG:  redo done at 3F/F1A40020
LOG:  database system is ready to accept connections

Seconds of replay on good hardware versus an eternity on bad. Same WAL. Same database. Different floor.

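And because I never fully trust a data directory that’s been through a disk event, the first thing I did once it was up was take a logical backup. A dump reads every row of every table, so it doubles as a check that the data is still readable, and it restores from the logical contents rather than from the on-disk blocks: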

pg_dump -Fc -d appdb -f /srv/backups/appdb_$(date +%F).dump
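
A backup that has never been read back is a hope, not a backup. A cheap first check, sketched below, is to ask pg_restore to list the archive's table of contents, which fails if the custom-format file is missing, empty, or unreadable. It does not prove the data restores cleanly; only a test restore does that. The function name verify_dump is an assumption for this example.

```shell
#!/bin/sh
# Sanity-check a custom-format dump (pg_dump -Fc) without restoring it.
# Returns 0 if pg_restore can read the archive's table of contents.
verify_dump() {
    dump=$1
    if [ ! -s "$dump" ]; then
        echo "missing or empty dump: $dump" >&2
        return 1
    fi
    if ! command -v pg_restore >/dev/null 2>&1; then
        echo "pg_restore not found in PATH" >&2
        return 2
    fi
    # --list prints the table of contents and restores nothing.
    pg_restore --list "$dump" >/dev/null
}

# Usage, with the path from the pg_dump command above:
#   verify_dump "/srv/backups/appdb_$(date +%F).dump" && echo "dump is readable"
```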

Why it happened

The unclean shutdown was never the real problem. Postgres handled that exactly as designed — replay the WAL, return to consistency, open the doors.

The problem was that recovery was asked to run on storage that couldn’t keep up. WAL replay is I/O, and I/O on a wounded disk is mostly waiting. “Starting up forever” is what a healthy recovery process looks like when it’s standing on broken ground.

Fix the ground. The database knows what to do.

Takeaways

  • “Starting up” forever almost always means WAL replay — Postgres is recovering from an unclean stop, not hanging for no reason.
  • Watch the server log, not the client errors. “redo in progress” with a moving LSN means it’s working; a stalled or crawling LSN means your storage is the suspect.
  • Fix the storage first. Recovery on a slow or failing disk may never finish in any useful time: verify SMART/dmesg, move to healthy media, then let it replay.
  • Never kill -9 recovery to “speed it up.” A half-applied data directory can be worse than no database at all.
  • Keep logical dumps (pg_dump), stored somewhere other than the database’s own disk. A physical data directory can’t always be salvaged after a disk event; a dump taken beforehand restores without depending on any of its blocks.
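
The “moving LSN” check from the second takeaway can be roughed out in shell. The sketch below pulls the most recent LSN out of the “redo in progress” lines and compares two samples taken some time apart. The function names latest_redo_lsn and replay_state are invented for this example, and the message format is assumed to match the log lines quoted earlier.

```shell
#!/bin/sh
# Print the LSN from the last "redo in progress" line read on stdin.
latest_redo_lsn() {
    sed -n 's/.*redo in progress.*current LSN: \([0-9A-Fa-f]*\/[0-9A-Fa-f]*\).*/\1/p' | tail -n 1
}

# Compare two LSN samples taken at different times.
replay_state() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "unknown"
    elif [ "$1" = "$2" ]; then
        echo "stalled"
    else
        echo "moving"
    fi
}

# Hypothetical usage (the log path and 60-second interval are assumptions):
#   before=$(sudo sh -c 'cat /var/lib/pgsql/data/log/postgresql-*.log' | latest_redo_lsn)
#   sleep 60
#   after=$(sudo sh -c 'cat /var/lib/pgsql/data/log/postgresql-*.log' | latest_redo_lsn)
#   replay_state "$before" "$after"
```

A “moving” result only means the LSN changed between samples. How far it moved is what separates a sprint from a crawl.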