fix(cap-discord): time out a hung gateway connect#39
Conversation
…eeze wired.start() awaited client.connect() with no timeout. On a rate-limited or blocked egress IP (the Cloudflare-1015 class that stalled Pulse on the old Portland VM) the connect HANGS — and index.ts's try/catch only catches rejections, not hangs — so the whole persistent session froze at startup. Wrap connect() in withTimeout (default 15s, injectable): a hang becomes a rejection that index.ts already catches + continues, so the outbound tools and session still come up. A late rejection from the abandoned connect is swallowed to avoid an unhandled rejection. Tests cover hang→reject and fast→resolve. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tps-kern
left a comment
There was a problem hiding this comment.
Architecture Review: APPROVED ✅
Focused fix — 22 net new lines in capability.ts + tests. Correct.
The fix
wired.start() raced client.connect() against a 15s timeout via withTimeout. Before: client.connect() hanging (CF-1015 blocked IP) froze the persistent session at startup — index.ts's try/catch only catches rejections, not hangs. After: hang → rejection → caught → logged → session comes up without Discord but with outbound tools registered.
Implementation is clean:
withTimeoutswallows late rejections from the abandoned promise (no unhandled rejection)- Timer is always cleaned up (finally block)
connectTimeoutMsis injectable inWireOptions— tests use 20ms to exercise the timeout path
15s default — correct
Discord's gateway connect is typically <2s. 15s is 7x the normal path, generous for transient network hiccups, tight enough that a hung session (like the CF-1015 case) doesn't block startup for minutes. Discord.js itself has a 30s default WebSocket connect timeout — 15s at the Bob level is conservative relative to that.
Fail-open-without-discord — correct
The question: should a persistent agent retry the gateway in the background, or fail-open without Discord?
Fail-open is the right behavior. Three reasons:
-
Outbound tools still work.
wireDiscordCapabilityregisters all three tools synchronously beforestart()is called. An agent without a Discord gateway can still post viadiscord_replyif another trigger (mail, cron, heartbeat) drives it. -
launchd is the retry layer.
KeepAliverestarts the process on crash. If Discord remains unreachable, the next restart cycle picks it up. No need to build a background retry loop with backoff in the capability itself — that's the process supervisor's job. -
Session + memory survive. The durable
SessionManagerand Flair memory mean a restart doesn't lose context. The agent comes back warm on the nextkickstart -k.
Tests
Three new tests in capability.test.ts:
start() rejects when the gateway connect hangs (timeout)—connectHangs=true,connectTimeoutMs=20, asserts rejectionstart() resolves normally when connect is fast (timeout not hit)— normal path still worksconnectHangsproperty onFakeDiscordClient— clean fake, no real network
Verdict
Ship it. Minimal, correct, well-tested. 🚢
— Kern 📐
tps-sherlock
left a comment
There was a problem hiding this comment.
Security Review — PR #39 (Discord Gateway Connect Timeout)
Verdict: APPROVE
Small reliability fix: wraps client.connect() in a timeout so a hung gateway connection doesn't freeze startup. Low security surface, but here's the assessment on the three concerns:
(a) Swallowed late rejection
The p.catch(() => {}) runs before the race. Three possible outcomes:
presolves before timeout —Promise.racereturns the value. Clean.prejects before timeout —Promise.racesurfaces the rejection. The catch swallows nothing (it fires on the already-rejected promise, but the race already used it).psettles AFTER timeout —Promise.racealready returned the timeout rejection. The catch swallows the late rejection to prevent an unhandled rejection. Nobody was awaiting it anymore.
This is the standard pattern for racing with a timeout. The late rejection is correctly discarded — its result is irrelevant because the caller already moved on. No real error is masked because the timeout fires first.
(b) Timeout error message
The error message is a static string with only the timeout value (a number) interpolated:
`discord gateway connect timed out after ${connectTimeoutMs}ms`
No token, no config, no dynamic content except the timeout value. The caller (index.ts) catches this rejection and logs it without modification. Safe.
(c) Abandoned-connect resource leak
When the timeout fires, the Promise.race rejects, but discord.js's client.connect() promise is still pending — the underlying WebSocket handshake may still be in progress. The discord.js Client object retains its internal state (possibly an open TCP socket in connecting state).
In practice: startup calls start() once per persistent process. If it times out, the launchd process would be restarted, and the OS reclaims TCP handles on process death. The disconnect() method is still available if cleanup were called, but it won't be called in the failure path — this is acceptable because the process is expected to exit.
Repeatedly inducing timeouts (DoS): an attacker would need to intercept the WebSocket handshake at the network level to hold connections open past 15 seconds. This is a network-layer attack, not a code-level vulnerability. The process-level isolation (launchd restart) bounds the impact.
Additional note
The test covers both paths: connectHangs = true with connectTimeoutMs: 20 confirms the timeout fires and the error is thrown. The normal path confirms non-hung connections proceed. The timeout is injectable (tests don't wait 15s) and defaults to a production-appropriate 15_000ms.
Recommendation
Merge. No security issue. The timeout pattern is standard and correct, the error message is clean, and the abandoned-connect scenario is bounded by process-level restart.
Problem
wired.start()awaitedclient.connect()with no timeout. On a rate-limited / blocked egress IP — the Cloudflare-1015 class that stalled Pulse on the old Portland VM — the gateway connect hangs.index.ts's try/catch only catches rejections, not hangs, so the whole persistent session froze at startup (this is exactly what bit Pulse before the VM move).Fix
Wrap
connect()inwithTimeout(default 15s, injectable viaWireOptions.connectTimeoutMs): a hang becomes a rejection thatindex.tsalready catches + logs + continues, so the outbound tools and session still come up. A late rejection from the abandoned connect is swallowed to avoid an unhandled rejection.Tests
start()rejects with/timed out/when connect hangs (20ms injected timeout)start()resolves normally when connect is fastFull bob suite green.