
The ticket said: “The server is possessed.”
That’s a direct quote. An app had been rock-solid for two years, then one Tuesday it “just stopped” — no config change, no deploy, no nothing. The on-call engineer had already rebooted twice, blamed cosmic rays, and was halfway to blaming the building’s wiring.
It wasn’t possessed. It never is.
After enough years you learn that 95% of “haunted network” tickets are the same seven mundane bastards wearing bedsheets. None of them are mysterious. All of them are one check away from being solved. The trick is running the check before you start theorizing about ghosts.
Here’s the scene I keep coming back to.
THE GHOST                      THE CHECK             VERDICT
─────────────────────────      ──────────────────    ──────
"service won't start"      ──► who owns the port?    ✓ orphan
"site loads wrong server"  ──► what does dig say?    ✓ stale DNS
"one client 500s"          ──► force --http1.1       ✓ h2 quirk
"big transfers hang"       ──► ping -M do -s 1472    ✓ MTU/MSS
"intermittent dropouts"    ──► arp -a (two MACs?)    ✓ dup IP
"works one direction"      ──► firewall state table  ✓ asymmetric
"localhost can't connect"  ──► loopback fw rule      ✓ blocked pipe
The Investigation, Compressed Into a Checklist
1. The service that “can’t start.” Symptom: it crashed, you restart it, and now it refuses to bind. The aha: something from the crash is still alive, usually an orphaned child or worker process that inherited the listening socket, so the kernel still considers the port owned. The check is to find the squatter:
# Linux
ss -ltnp 'sport = :8080'
# Windows
netstat -ano | findstr :8080
Kill the orphan PID, the port frees, the service starts. No exorcism required.
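If you'd rather not squint at the output, the owning PID can be pulled straight out of the process column. A minimal sketch, assuming Linux ss from iproute2 and its usual users:(("name",pid=1234,fd=3)) format; port_owner_pids is a hypothetical helper name, and ss generally needs root to show processes owned by other users.

```shell
# Hypothetical helper: read `ss -ltnp` output on stdin and print each owning PID once.
# Assumes the process column looks like users:(("name",pid=1234,fd=3)).
port_owner_pids() {
    grep -o 'pid=[0-9]*' | cut -d= -f2 | sort -un
}

# Usage (Linux):
#   sudo ss -ltnp 'sport = :8080' | port_owner_pids
```

Look at what each PID actually is (ps -fp PID) before killing anything; the squatter is sometimes a healthy sibling, not a corpse.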
2. DNS serving a corpse. Symptom: you migrated a site, half the world sees the new box, this one client sees the old one. The check — ask what’s actually resolving, not what should:
dig +short app.example.com @1.1.1.1
nslookup app.example.com
# flush the local liar:
sudo resolvectl flush-caches # systemd
ipconfig /flushdns # Windows
Stale cache, gone.
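To find out which resolver is the liar, ask several and compare. The sketch below is one way to do it, not the only one: resolve_with and compare_resolvers are hypothetical helper names, 8.8.8.8 is just a second public resolver, and 127.0.0.53 in the usage line assumes the default systemd-resolved stub listener.

```shell
# Ask one resolver for the A records of a name, sorted so answers compare cleanly.
resolve_with() {   # resolve_with <name> <resolver-ip>
    dig +short "$1" A @"$2" | sort | tr '\n' ' '
}

# Print each resolver's answer and flag any that disagrees with the first one.
compare_resolvers() {   # compare_resolvers <name> <resolver>...
    name=$1; shift
    first=
    have_first=0
    for r in "$@"; do
        ans=$(resolve_with "$name" "$r")
        if [ "$have_first" -eq 0 ]; then first=$ans; have_first=1; fi
        if [ "$ans" = "$first" ]; then mark=ok; else mark=DIFFERS; fi
        printf '%-15s %-8s %s\n' "$r" "$mark" "$ans"
    done
}

# Usage:
#   compare_resolvers app.example.com 1.1.1.1 8.8.8.8 127.0.0.53
```

A DIFFERS row is a lead, not a conviction: CDNs and geo-aware DNS hand out different answers on purpose, so check whether the odd answer is actually the old box.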
3. The one client that 500s while everyone else is fine. A common culprit is an HTTP/2 downgrade or framing incompatibility somewhere between that client and the app. Pin the protocol and see whether it heals:
curl -v --http1.1 https://app.example.com/health
If --http1.1 works and the default doesn’t, you found your ghost.
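The same comparison can be scripted so you see both results side by side. A rough sketch, assuming a curl build with HTTP/2 support; status_with and compare_protocols are hypothetical names, and the 10-second timeout is arbitrary.

```shell
# Return the HTTP status code for a URL using the given curl protocol flag.
# curl prints 000 when the request fails outright.
status_with() {   # status_with <--http1.1|--http2> <url>
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" "$2"
}

# Fetch the same URL over HTTP/1.1 and HTTP/2 and compare the status codes.
compare_protocols() {   # compare_protocols <url>
    h1=$(status_with --http1.1 "$1")
    h2=$(status_with --http2 "$1")
    echo "http/1.1=$h1 http/2=$h2"
    if [ "$h1" != "$h2" ]; then
        echo "protocol-dependent: suspect HTTP/2 handling on the path"
    else
        echo "same result on both: look elsewhere"
    fi
}

# Usage:
#   compare_protocols https://app.example.com/health
```

Run it from the affected client, not from your laptop; the whole point is that the path differs.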
4. Small requests fine, big ones hang. Classic MTU/MSS mismatch — pings and tiny payloads sail through, the TLS handshake or a large transfer stalls forever. Probe for the real path MTU:
# 1472 payload + 28 bytes of headers (20 IPv4 + 8 ICMP) = 1500.
# -M do forbids fragmentation (Linux ping), so shrink -s until the ping gets through.
ping -M do -s 1472 192.0.2.1
If 1472 fails but 1400 works, clamp your MSS and the hangs vanish.
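Shrinking by hand gets old fast, so here is a sketch that binary-searches the largest payload that passes and turns it into numbers you can use. It assumes Linux iputils ping, a path that answers ICMP echo, and that the lower bound (1200 by default, an arbitrary choice) already works; probe, max_payload, and mtu_report are hypothetical names.

```shell
# Probe whether an ICMP payload of the given size passes unfragmented.
# Linux iputils ping: -M do sets Don't Fragment, -c 1 sends one packet, -W 1 waits 1 s.
probe() {   # probe <payload-bytes> <host>
    ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1
}

# Binary-search the largest passing payload between lo and hi.
# Assumes lo itself passes; if nothing passes, the result is meaningless.
max_payload() {   # max_payload <host> [lo] [hi]
    host=$1; lo=${2:-1200}; hi=${3:-1472}
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid" "$host"; then lo=$mid; else hi=$(( mid - 1 )); fi
    done
    echo "$lo"
}

# Payload + 28 (20 IPv4 + 8 ICMP) = path MTU; MTU - 40 (20 IPv4 + 20 TCP) = MSS.
mtu_report() {   # mtu_report <payload-bytes>
    mtu=$(( $1 + 28 ))
    echo "path_mtu=$mtu tcp_mss=$(( mtu - 40 ))"
}

# Usage:
#   mtu_report "$(max_payload 192.0.2.1)"
```

A largest payload of 1392, for example, means a 1420-byte path MTU and a 1380-byte MSS to clamp to. The arithmetic is IPv4 only and ignores TCP options.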
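5. The intermittent flapper. Symptom: connectivity that drops for seconds at random, typically caused by a rogue DHCP server or a static IP collision. Check the ARP table for one IP claimed by two MACs. A single host usually caches one MAC per IP at a time, so the entry tends to flip between them; run the check a few times:
arp -a | sort # the same IP showing different MACs across runs = your culprit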
ip neigh show
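Catching the flip by eye takes patience, so it can help to collect several snapshots and let awk do the comparing. A minimal sketch, assuming Linux `ip neigh show` output where the MAC follows the word lladdr; dup_macs is a hypothetical helper name, and the 30 snapshots at 2-second intervals in the usage line are arbitrary.

```shell
# Read one or more `ip neigh show` snapshots on stdin and report any IP
# that has been seen with more than one MAC address.
dup_macs() {
    awk '
        {
            # The MAC is the field right after "lladdr"; FAILED entries have none.
            for (i = 1; i < NF; i++)
                if ($i == "lladdr") {
                    key = $1 SUBSEP $(i + 1)
                    if (!(key in seen)) {
                        seen[key] = 1
                        macs[$1] = macs[$1] " " $(i + 1)
                        count[$1]++
                    }
                }
        }
        END {
            for (ip in count)
                if (count[ip] > 1)
                    print ip ":" macs[ip]
        }
    '
}

# Usage: sample the neighbor table for about a minute, then report.
#   for i in $(seq 30); do ip neigh show; sleep 2; done | dup_macs
```

No output means no IP changed MAC during the sampling window, which is not quite the same as no collision; sample while the dropouts are happening.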
6. Works one way, dies the other. Asymmetric routing where the return path takes a different door and a stateful firewall drops it for having no matching session. Look at the state table:
# Linux conntrack
conntrack -L | grep 192.0.2.50
# pf
pfctl -ss | grep 192.0.2.50
No state entry for the return flow = there’s your “ghost.”
7. The app that “should be local” but can’t reach itself. A host firewall rule or blocked named pipe is eating loopback. Confirm the listener and that localhost itself is allowed:
ss -ltnp | grep 127.0.0.1
sudo iptables -L INPUT -n -v | grep -Ew 'lo|127\.0\.0\.1'
That Tuesday “possession”? Number one. Crashed worker, orphaned port, ss -ltnp named the PID in four seconds. Ghost dispelled.
Why It Happens
None of these are exotic. They’re the natural failure modes of stateful systems: processes die untidily, caches outlive their truth, two devices want the same address, a firewall sees only half of a flow and has no state to match the rest. The supernatural feeling comes entirely from partial symptoms: it works for some and not others, or for small things and not big ones. Partial failure reads as spooky. It’s just state you haven’t looked at yet.
Takeaways
- Run the seven-item checklist before you theorize. The boring answer is almost always right.
- “Some clients work” is a clue, not a contradiction — it points straight at DNS, HTTP/2, or MTU.
- “Small works, big hangs” is MTU/MSS until proven otherwise. Don’t waste an hour on app logs.
- Check state, not config. ss, dig, arp, and the firewall state table tell you what is, not what should be.
- Reboots hide orphans; they don’t explain them. Find the squatter PID before you cycle power and lose the evidence.