GUACAMOLE-2265: Fix worker process leak after client disconnect.#662
GUACAMOLE-2265: Fix worker process leak after client disconnect.#662escra wants to merge 1 commit into
Conversation
|
@escra I believe the issue you're describing here may be a duplicate of GUACAMOLE-2118, and your fix may address that issue. Please check on that issue and see if it matches, and re-tag your issue as GUACAMOLE-2118. |
|
@escra FYI - the original reporter of GUACAMOLE-2118 tried out your patch, and got continuously disconnected from RDP sessions after applying it. I haven't tried it myself or reviewed it in detail, but it seems like it is breaking something else. |
c2ace0e to
acb1b81
Compare
|
Thanks for the feedback, @necouchman. We identified the problem: the initial version of the patch was missing a reconnect grace period. When a user disconnects (e.g. browser refresh, tab switch), the worker detected zero connected users and exited immediately, which terminated the remote session before the user could rejoin. We overlooked this when creating the PR. The updated patch now includes:
We also renamed the JIRA issue to GUACAMOLE-2118 as you suggested, since it addresses the same root cause. We have automated tests covering both the leak scenario and the reconnect grace period. Happy to share details if that helps with review. |
acb1b81 to
7ca5fa3
Compare
|
Nevermind..this is true of my testing of the main |
| /* Absolute safety net: if we've been idle for too long after | ||
| * having a user, force exit regardless of client state. This | ||
| * catches cases where the client free handler is blocked | ||
| * (e.g. guac_argv_await in FreeRDP authenticate callback). */ | ||
| if (had_user | ||
| && idle_cycles * GUACD_PROC_IDLE_TIMEOUT_MS | ||
| >= GUACD_PROC_MAX_IDLE_MS) { | ||
| guacd_log(GUAC_LOG_WARNING, | ||
| "Worker exceeded maximum idle time (%ds) after user " | ||
| "disconnected. Force exiting.", | ||
| GUACD_PROC_MAX_IDLE_MS / 1000); | ||
| break; | ||
| } |
There was a problem hiding this comment.
During testing, I'm seeing active connections always hit this condition, resulting in connections getting terminated after 2 minutes.
Worker processes are never terminated after a client disconnects when the normal exit path is blocked, leaking one guacd child per affected connection until the container is restarted. The worker's main loop blocked indefinitely in recvmsg() waiting for new user file descriptors on fd_socket and only woke up when a new user joined, so it could not notice that the client had stopped and no further users would ever arrive. Replace the blocking receive with a poll() that wakes on a fixed idle interval (GUACD_PROC_IDLE_TIMEOUT_MS) and decides whether to keep waiting or exit based on client state and connected user count. After the last user disconnects a short reconnect grace period (GUACD_PROC_RECONNECT_GRACE_MS) is honoured so a browser refresh or tab re-open can rejoin the existing session, and an absolute safety net (GUACD_PROC_MAX_IDLE_MS) force-exits a worker whose teardown is wedged (e.g. guac_argv_await stuck in the FreeRDP authenticate callback). The absolute safety net is gated on connected_users == 0 and idle_cycles is reset while users are connected. The poll loop only observes new user FDs on fd_socket; an already-connected single-user session generates no new FDs, so without this gating a healthy active session accumulated idle cycles and was force-terminated at the GUACD_PROC_MAX_IDLE_MS mark. The safety net now measures true post-disconnect idle time and can never fire on a live session, while still catching the blocked-teardown case. If pthread_create() for a user thread fails, the connection is closed and the worker is stopped so a failed thread launch cannot leak the worker. Verified with a k6 soak (143s) and an interactive session (285s) that an actively connected session survives well past GUACD_PROC_MAX_IDLE_MS with no premature disconnect and no 'exceeded maximum idle time' warning, and with a deterministic reproducer that holds one active single-user session past the safety-net threshold.
7ca5fa3 to
fc4dea4
Compare
|
Thanks for relaying the reporter's feedback, and thanks to them for actually testing the patch. The continuous RDP disconnects are a real regression in the original version of this change, and we have tracked down the cause. The problem was the absolute safety net The fix is small and keeps the original intent of the change:
The leak and reconnect-grace behavior is preserved: the worker still shuts down cleanly once no users are connected, including the blocked-teardown case the PR originally targeted. We have also dropped the unrelated rdp.c We have built this and verified it. Our image CI is green (reconnect-grace, bug-reproducer, and watchdog jobs). We also verified it live on our integration environment: an actively connected RDP session stayed up well past 120 seconds (a 143 second soak, plus 285 seconds of interactive use with mouse, keyboard, and clipboard) with no premature disconnect, and the guacd log no longer shows the "Worker exceeded maximum idle time" warning. One apparent counter-example in a browser test turned out, per the logs, to be a Windows single-session takeover ( We have since re-verified both halves against a real Windows Server 2025 RDP target (not only our integration environment): a single actively connected session held cleanly past the 120 second threshold (180 seconds, no watchdog warning, clean teardown), and a multi-user churn run — concurrent sessions from distinct accounts plus dozens of rapid connect/disconnect cycles — drove the guacd worker count up and then cleanly back to baseline, with every connection terminated and no lingering workers. That last part directly covers the original report behind this change (worker processes not being reaped under many user connections): the leak fix still holds under real multi-user load, while active sessions are no longer force-terminated. We have regression tests for this case (a k6 soak and an interactive Guacamole WebSocket test, plus a deterministic image test, Sorry for the effort this took on your side — if anything still looks off, I'll gladly adjust, and this time right away. |
|
Thanks @escra - I've tested, again, and can confirm that the connections stay running, which is good. I'll review, again, and see if we can get some other folks to test it and see if it takes care of the leftover process issues. |
Summary
Worker processes may remain alive indefinitely after all users have
disconnected, particularly under resource pressure (e.g. cgroup pids-limit).
Over days/weeks this leads to steady PID accumulation until no new connections
can be created.
Root causes
1.
pthread_create()failure ignored inguacd_proc_add_user()When thread creation fails with EAGAIN (common under cgroup pids-limit), the
user thread never runs, so
guacd_proc_stop()is never called and the workerblocks in
recvmsg()forever.2. Worker main loop blocks indefinitely in
recvmsg()The loop in
guacd_exec_proc()callsguacd_recv_fd()with no timeout orhealth check. If
guacd_proc_stop()is not triggered for any reason, theworker hangs at 0% CPU consuming a PID slot indefinitely.
3. RDP fail path does not clean up resources
guac_rdp_handle_connection()does not free the display, FreeRDP instance,keyboard, or SVC list on the fail path, leaking resources on every failed
connection attempt.
Fix
pthread_create()failure by closing the FD, freeing params, andcalling
guacd_proc_stop()recvmsg()loop with apoll()-based loop that checksclient state every 5 seconds (
GUACD_PROC_IDLE_TIMEOUT_MS). Workers exitwhen the client has stopped or all users have disconnected. An absolute
safety timeout (
GUACD_PROC_MAX_IDLE_MS= 30s) handles edge cases wherethe normal exit path is blocked
faillabel inguac_rdp_handle_connection()JIRA
GUACAMOLE-2265
Related
Sporadic hanging connections after upgrade
Improve process management to prevent zombie process accumulation
Test plan
pids.max=50, create 60 concurrentconnections, verify workers exit after disconnect
leak via
/proc/[pid]/fdcountreturns to baseline