
The bot worked. Then it didn’t. Then it worked again.
It was a small Node.js Telegram bot running on a client’s server. Its whole job was to fire off outbound API calls — sendMessage, getUpdates, the usual. Most of the time it was instant. But every so often a call would just… stop. No exception. No 500. No DNS error. The promise sat there until the request timeout finally put it out of its misery seconds later.
Intermittent. Silent. Multi-second. The worst kind of bug there is.
The investigation
First instinct was the obvious one: the upstream API is flaky. So I logged timing around every call. The hangs weren’t correlated with anything — not load, not time of day, not a specific endpoint. Same call, same payload, same host. Sometimes 80ms, sometimes 5,000ms-then-timeout.
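A minimal sketch of that kind of timing wrapper, assuming the bot's API calls are promise-returning functions. The helper name `timed`, the `slowMs` option, and the 1,000 ms threshold are hypothetical and not taken from the original bot.

```javascript
// Hypothetical helper: time any promise-returning call and log slow or failed ones.
const { performance } = require('node:perf_hooks')

async function timed(label, fn, { slowMs = 1000, log = console.warn } = {}) {
  const started = performance.now()
  let outcome = 'ok'
  try {
    return await fn()
  } catch (err) {
    outcome = 'failed'
    throw err // re-throw so callers see the original error
  } finally {
    const ms = Math.round(performance.now() - started)
    // Only log outliers so the hangs stand out from the normal fast calls.
    if (outcome !== 'ok' || ms >= slowMs) {
      log(`${new Date().toISOString()} ${label} ${outcome} ${ms}ms`)
    }
  }
}

// Usage (sendMessage is a stand-in for the bot's real call):
// await timed('sendMessage', () => sendMessage(chatId, text))
```

Logging only the outliers keeps the output small enough to line the slow calls up against load, time of day, and endpoint.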
I pulled strace on the process during a hang and watched the syscalls. The bot wasn’t busy. It was sitting in a connect() that never came back. It wasn’t waiting on the API — it was waiting on a socket that was never going to open.
That reframed everything. This wasn’t a slow server. This was a connection attempt going into the void.
bot ──► undici (happy-eyeballs)
│
├─► resolve host
│ ├─ AAAA (IPv6) ─► :: ► connect() ──► ✗ black hole (no route)
│ └─ A (IPv4) ─► . ► connect() ──► ✓ would work fine
│
▼
races both... but stalls on the v6 attempt
The host had IPv6 addresses configured on its primary NIC. It just had no working IPv6 route to the outside world. Addresses present, connectivity dead. The packets went out and nothing ever came back.
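To see which interfaces make a host look IPv6-capable, something like the following sketch can help. It uses Node's `os.networkInterfaces()`; the function name `routableLookingIPv6` is made up for illustration, and the link-local filter is a simplification.

```javascript
const os = require('node:os')

// Return { ifaceName: [addresses] } for non-internal, non-link-local IPv6 addresses.
function routableLookingIPv6(interfaces = os.networkInterfaces()) {
  const result = {}
  for (const [name, addrs] of Object.entries(interfaces)) {
    const v6 = (addrs || [])
      .filter((a) => a.family === 'IPv6' || a.family === 6) // some Node 18 releases report the number 6
      .filter((a) => !a.internal) // skip loopback
      .filter((a) => !/^fe[89ab]/i.test(a.address)) // skip link-local fe80::/10
      .map((a) => a.address)
    if (v6.length > 0) result[name] = v6
  }
  return result
}

// Usage:
// console.log(routableLookingIPv6())
```

An address showing up here says nothing about whether packets actually get out. It only tells you which interfaces are advertising IPv6 to the rest of the stack.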
The “aha”
The bot used undici as its HTTP client. Modern undici does Happy Eyeballs (RFC 8305) — it resolves both A and AAAA records and races IPv4 and IPv6 connection attempts, preferring the v6 path.
On a healthy host that’s great. On this host it was poison. undici would get an AAAA record, try the IPv6 path first, and that connect() would sail into a black hole. The race timer is supposed to fall back to IPv4 — but with a route that silently drops packets instead of cleanly rejecting them, the fallback timing slipped and the whole call stalled until the app-level timeout.
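For intuition, here is a simplified sketch of what a staggered fallback is supposed to do. It is not undici's or Node's actual implementation; `staggeredRace` and the 250 ms default delay are assumptions made purely for illustration.

```javascript
// Illustration only: start attempts in order, launching the next one when the
// previous attempt fails or when delayMs passes without a result.
function staggeredRace(attempts, delayMs = 250) {
  return new Promise((resolve, reject) => {
    let next = 0
    let pending = 0
    let done = false
    let timer = null
    const errors = []

    const startNext = () => {
      clearTimeout(timer)
      if (done || next >= attempts.length) return
      const attempt = attempts[next++]
      pending++
      Promise.resolve().then(attempt).then(
        (value) => {
          if (done) return
          done = true
          clearTimeout(timer)
          resolve(value) // first success wins
        },
        (err) => {
          pending--
          errors.push(err)
          if (done) return
          if (next < attempts.length) startNext() // fast failure: move on immediately
          else if (pending === 0) {
            done = true
            reject(new AggregateError(errors, 'all attempts failed'))
          }
        }
      )
      // A silent attempt (no success, no error) only gets delayMs before the next starts.
      if (next < attempts.length) timer = setTimeout(startNext, delayMs)
    }

    startNext()
  })
}
```

The important case is the attempt that neither succeeds nor fails: a black-holed connect() gives the client nothing to react to, so the delay timer is the only thing that moves it on to the next address.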
A quick curl proved it cold:
# hangs: forced down IPv6, where there's an AAAA record and broken v6 routing
curl -6 -v https://api.example.com/
# instant — forced down IPv4
curl -4 -v https://api.example.com/
-6 hung. -4 flew. That’s your smoking gun.
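The same check can be run from inside Node, which is handy when curl is not available on the box. This is a rough sketch using `net.connect` with a pinned `family`; the helper name `probeConnect` and the 3-second timeout are assumptions, not part of the original bot.

```javascript
const net = require('node:net')

// Attempt a raw TCP connect pinned to one address family (4 or 6) and report
// whether it succeeded, failed fast, or timed out.
function probeConnect(host, port, family, timeoutMs = 3000) {
  return new Promise((resolve) => {
    const started = process.hrtime.bigint()
    let settled = false
    const socket = net.connect({ host, port, family })

    const finish = (ok, error) => {
      if (settled) return
      settled = true
      clearTimeout(timer)
      socket.destroy()
      const ms = Number(process.hrtime.bigint() - started) / 1e6
      resolve({ family, ok, ms, error })
    }

    // No reply at all (the black-hole case) ends up here.
    const timer = setTimeout(() => finish(false, 'timeout'), timeoutMs)
    socket.on('connect', () => finish(true, null))
    // Refusals, unreachable networks, and DNS failures end up here, usually quickly.
    socket.on('error', (err) => finish(false, err.code || err.message))
  })
}

// Usage (api.example.com is the placeholder host from the curl commands):
// Promise.all([
//   probeConnect('api.example.com', 443, 4),
//   probeConnect('api.example.com', 443, 6),
// ]).then(console.log)
```

A result of `timeout` on family 6 next to a fast success on family 4 is the same signal as the curl pair: the v6 path is swallowing packets rather than rejecting them.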
The fix
Two layers. First, kill IPv6 on the one NIC that had it broken — while leaving it alone on the VPN and VM interfaces that genuinely use it.
# disable IPv6 on the offending interface only (e.g. eth0), keep it elsewhere
sudo sysctl -w net.ipv6.conf.eth0.disable_ipv6=1
# make it stick across reboots
echo 'net.ipv6.conf.eth0.disable_ipv6 = 1' | sudo tee /etc/sysctl.d/99-disable-ipv6-eth0.conf
sudo sysctl --system
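To confirm the setting actually took effect, the live value can be read back from procfs. A small sketch, Linux only; the helper names `parseDisableFlag` and `isIPv6Disabled` are hypothetical.

```javascript
const fs = require('node:fs')

// Interpret the contents of a disable_ipv6 proc file: "1" means disabled.
function parseDisableFlag(raw) {
  return String(raw).trim() === '1'
}

// Linux only: read the live per-interface setting from procfs.
function isIPv6Disabled(iface) {
  const path = `/proc/sys/net/ipv6/conf/${iface}/disable_ipv6`
  return parseDisableFlag(fs.readFileSync(path, 'utf8'))
}

// Usage, assuming the interface is eth0 as in the commands above:
// console.log(isIPv6Disabled('eth0'))
```

Running the same check against the VPN and VM interfaces is a quick way to verify they were left alone.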
Second, belt and suspenders at the app layer — force the resolver IPv4-first so even a half-broken host can’t bite us again:
// top of the entrypoint, before any outbound calls
const dns = require('node:dns')
dns.setDefaultResultOrder('ipv4first')
curl -4 already proved the IPv4 path was healthy, so this was a safe, deterministic pin. Restarted the bot, hammered it with calls for ten minutes. Zero hangs. Every request back under 200ms.
Why it happened
“Broken but present” IPv6 is worse than no IPv6 at all. If the host had no IPv6 address at all, a v6 connect() would fail immediately (or never be attempted) and you’d never notice. But give it IPv6 addresses with no real route, and a client that prefers IPv6 cheerfully tries the dead path first, then waits on a connect() whose packets are silently dropped instead of refused.
No RST means no fast failure. No fast failure means a stall. And because Happy Eyeballs only sometimes loses the race badly, the stall is intermittent. That’s why it looked like a flaky upstream when it was a flaky local NIC the whole time.
Takeaways
- If something hangs only sometimes on outbound HTTP, suspect IPv6. Test it in one command: curl -4 vs curl -6. If -6 hangs, you found it.
- Addresses ≠ connectivity. A NIC can hold a perfectly valid IPv6 address and still have zero working route to the internet.
- Disable IPv6 per-interface, not globally. Kill it on the broken NIC; leave it on VPN/VM links that actually need it.
- Pin IPv4-first at the app layer too (dns.setDefaultResultOrder('ipv4first')): defense in depth for hosts you don’t fully control.
- A connect() that never returns is a routing problem, not a server problem. strace the hang before you blame the API.