GUACAMOLE-2265: Fix worker process leak after client disconnect. by escra · Pull Request #662 · apache/guacamole-server

escra · 2026-04-14T23:28:31Z

Summary

Worker processes may remain alive indefinitely after all users have
disconnected, particularly under resource pressure (e.g. cgroup pids-limit).
Over days/weeks this leads to steady PID accumulation until no new connections
can be created.

Root causes

1. pthread_create() failure ignored in guacd_proc_add_user()

When thread creation fails with EAGAIN (common under cgroup pids-limit), the
user thread never runs, so guacd_proc_stop() is never called and the worker
blocks in recvmsg() forever.

2. Worker main loop blocks indefinitely in recvmsg()

The loop in guacd_exec_proc() calls guacd_recv_fd() with no timeout or
health check. If guacd_proc_stop() is not triggered for any reason, the
worker hangs at 0% CPU consuming a PID slot indefinitely.

3. RDP fail path does not clean up resources

guac_rdp_handle_connection() does not free the display, FreeRDP instance,
keyboard, or SVC list on the fail path, leaking resources on every failed
connection attempt.

Fix

Handle pthread_create() failure by closing the FD, freeing params, and
calling guacd_proc_stop()
Replace the bare recvmsg() loop with a poll()-based loop that checks
client state every 5 seconds (GUACD_PROC_IDLE_TIMEOUT_MS). Workers exit
when the client has stopped or all users have disconnected. An absolute
safety timeout (GUACD_PROC_MAX_IDLE_MS = 30s) handles edge cases where
the normal exit path is blocked
Add resource cleanup to the fail label in
guac_rdp_handle_connection()

JIRA

GUACAMOLE-2265

Test plan

Run guacd under cgroup with pids.max=50, create 60 concurrent
connections, verify workers exit after disconnect
Kill the user's browser mid-session, verify worker exits within 30s
Trigger RDP connection failures (unreachable host), verify no resource
leak via /proc/[pid]/fd count
Stress test: 100 rapid connect/disconnect cycles, verify process count
returns to baseline
Verify normal operation is unaffected (long-running sessions, reconnect)

necouchman · 2026-04-14T23:32:24Z

@escra I believe the issue you're describing here may be a duplicate of GUACAMOLE-2118, and your fix may address that issue. Please check on that issue and see if it matches, and re-tag your issue as GUACAMOLE-2118.

necouchman · 2026-04-16T12:57:35Z

@escra FYI - the original reporter of GUACAMOLE-2118 tried out your patch, and got continuously disconnected from RDP sessions after applying it. I haven't tried it myself or reviewed it in detail, but it seems like it is breaking something else.

escra · 2026-04-16T19:15:01Z

Thanks for the feedback, @necouchman. We identified the problem: the initial version of the patch was missing a reconnect grace period. When a user disconnects (e.g. browser refresh, tab switch), the worker detected zero connected users and exited immediately, which terminated the remote session before the user could rejoin.

We overlooked this when creating the PR. The updated patch now includes:

A 60-second reconnect grace period (GUACD_PROC_RECONNECT_GRACE_MS) after the last user disconnects, allowing time for browser refreshes or tab re-opens to rejoin the session.
A 120-second absolute safety-net timeout (GUACD_PROC_MAX_IDLE_MS) that only triggers after the grace period, catching edge cases where the normal exit path is blocked.
The worker only exits immediately when client->state != GUAC_CLIENT_RUNNING (explicit disconnect), not on temporary zero-user conditions.

We also renamed the JIRA issue to GUACAMOLE-2118 as you suggested, since it addresses the same root cause.

We have automated tests covering both the leak scenario and the reconnect grace period. Happy to share details if that helps with review.

necouchman · 2026-04-21T00:53:12Z

@escra My testing indicates that this doesn't work at all - right now, after applying this patch to the staging/1.6.1 branch, I'm able to connect about once out of every 20 times to a RDP host - the other times I get various levels of guacd failures during the connection, often with this message:

guacd[327]: WARNING:	Client does not support the "required" instruction. No authentication parameters will be requested.

Nevermind..this is true of my testing of the main staging/1.6.1 branch at the moment, so it isn't your changes causing this...

necouchman · 2026-05-23T20:12:37Z

+            /* Absolute safety net: if we've been idle for too long after
+             * having a user, force exit regardless of client state. This
+             * catches cases where the client free handler is blocked
+             * (e.g. guac_argv_await in FreeRDP authenticate callback). */
+            if (had_user
+                    && idle_cycles * GUACD_PROC_IDLE_TIMEOUT_MS
+                        >= GUACD_PROC_MAX_IDLE_MS) {
+                guacd_log(GUAC_LOG_WARNING,
+                        "Worker exceeded maximum idle time (%ds) after user "
+                        "disconnected. Force exiting.",
+                        GUACD_PROC_MAX_IDLE_MS / 1000);
+                break;
+            }


During testing, I'm seeing active connections always hit this condition, resulting in connections getting terminated after 2 minutes.

Worker processes are never terminated after a client disconnects when the normal exit path is blocked, leaking one guacd child per affected connection until the container is restarted. The worker's main loop blocked indefinitely in recvmsg() waiting for new user file descriptors on fd_socket and only woke up when a new user joined, so it could not notice that the client had stopped and no further users would ever arrive. Replace the blocking receive with a poll() that wakes on a fixed idle interval (GUACD_PROC_IDLE_TIMEOUT_MS) and decides whether to keep waiting or exit based on client state and connected user count. After the last user disconnects a short reconnect grace period (GUACD_PROC_RECONNECT_GRACE_MS) is honoured so a browser refresh or tab re-open can rejoin the existing session, and an absolute safety net (GUACD_PROC_MAX_IDLE_MS) force-exits a worker whose teardown is wedged (e.g. guac_argv_await stuck in the FreeRDP authenticate callback). The absolute safety net is gated on connected_users == 0 and idle_cycles is reset while users are connected. The poll loop only observes new user FDs on fd_socket; an already-connected single-user session generates no new FDs, so without this gating a healthy active session accumulated idle cycles and was force-terminated at the GUACD_PROC_MAX_IDLE_MS mark. The safety net now measures true post-disconnect idle time and can never fire on a live session, while still catching the blocked-teardown case. If pthread_create() for a user thread fails, the connection is closed and the worker is stopped so a failed thread launch cannot leak the worker. Verified with a k6 soak (143s) and an interactive session (285s) that an actively connected session survives well past GUACD_PROC_MAX_IDLE_MS with no premature disconnect and no 'exceeded maximum idle time' warning, and with a deterministic reproducer that holds one active single-user session past the safety-net threshold.

escra · 2026-06-09T00:58:36Z

Thanks for relaying the reporter's feedback, and thanks to them for actually testing the patch. The continuous RDP disconnects are a real regression in the original version of this change, and we have tracked down the cause.

The problem was the absolute safety net GUACD_PROC_MAX_IDLE_MS. It counted plain wall-clock idle cycles and was not gated on whether any users were still connected. The worker only polls fd_socket for new user file descriptors, so ongoing session traffic of an already connected single-user session never shows up there. As a result a perfectly healthy, actively connected single-user session kept accumulating idle cycles and got force-terminated after about 120 seconds. That is exactly the "continuously disconnected" behavior the reporter saw.

The fix is small and keeps the original intent of the change:

idle_cycles is reset whenever client->connected_users > 0, so an active session is never counted as idle.
The absolute safety net only fires additionally when client->connected_users == 0.

The leak and reconnect-grace behavior is preserved: the worker still shuts down cleanly once no users are connected, including the blocked-teardown case the PR originally targeted.

We have also dropped the unrelated rdp.c fail:-path cleanup that shipped in the first version of this PR, since it freed a non-initialised GDI context on the pre-connect failure path; the leak fix itself is fully contained in proc.c/proc.h, and a correct cleanup can follow as a separate change.

We have built this and verified it. Our image CI is green (reconnect-grace, bug-reproducer, and watchdog jobs). We also verified it live on our integration environment: an actively connected RDP session stayed up well past 120 seconds (a 143 second soak, plus 285 seconds of interactive use with mouse, keyboard, and clipboard) with no premature disconnect, and the guacd log no longer shows the "Worker exceeded maximum idle time" warning. One apparent counter-example in a browser test turned out, per the logs, to be a Windows single-session takeover (ERRINFO_RPC_INITIATED_DISCONNECT), not a watchdog kill.

We have since re-verified both halves against a real Windows Server 2025 RDP target (not only our integration environment): a single actively connected session held cleanly past the 120 second threshold (180 seconds, no watchdog warning, clean teardown), and a multi-user churn run — concurrent sessions from distinct accounts plus dozens of rapid connect/disconnect cycles — drove the guacd worker count up and then cleanly back to baseline, with every connection terminated and no lingering workers. That last part directly covers the original report behind this change (worker processes not being reaped under many user connections): the leak fix still holds under real multi-user load, while active sessions are no longer force-terminated.

We have regression tests for this case (a k6 soak and an interactive Guacamole WebSocket test, plus a deterministic image test, tests/bugs/test_f157_active_session.sh, that holds one active single-user session past the safety-net threshold) and are happy to share them. If it helps, we can fold the fix into this PR as an update.

Sorry for the effort this took on your side — if anything still looks off, I'll gladly adjust, and this time right away.

necouchman · 2026-06-15T01:35:09Z

Thanks @escra - I've tested, again, and can confirm that the connections stay running, which is good. I'll review, again, and see if we can get some other folks to test it and see if it takes care of the leftover process issues.

escra force-pushed the fix/worker-process-leak branch from c2ace0e to acb1b81 Compare April 16, 2026 14:03

escra force-pushed the fix/worker-process-leak branch from acb1b81 to 7ca5fa3 Compare April 16, 2026 19:29

necouchman reviewed May 23, 2026

View reviewed changes

escra force-pushed the fix/worker-process-leak branch from 7ca5fa3 to fc4dea4 Compare June 9, 2026 00:57

escra mentioned this pull request Jun 10, 2026

GUACAMOLE-2265: Fix worker process leak after client disconnect. #677

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GUACAMOLE-2265: Fix worker process leak after client disconnect.#662

GUACAMOLE-2265: Fix worker process leak after client disconnect.#662
escra wants to merge 1 commit into
apache:staging/1.6.1from
ESCRA-GmbH:fix/worker-process-leak

escra commented Apr 14, 2026

Uh oh!

necouchman commented Apr 14, 2026

Uh oh!

necouchman commented Apr 16, 2026

Uh oh!

escra commented Apr 16, 2026

Uh oh!

necouchman commented Apr 21, 2026 •

edited

Loading

Uh oh!

necouchman May 23, 2026

Uh oh!

escra commented Jun 9, 2026

Uh oh!

necouchman commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

escra commented Apr 14, 2026

Summary

Root causes

Fix

JIRA

Related

Test plan

Uh oh!

necouchman commented Apr 14, 2026

Uh oh!

necouchman commented Apr 16, 2026

Uh oh!

escra commented Apr 16, 2026

Uh oh!

necouchman commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

necouchman May 23, 2026

Choose a reason for hiding this comment

Uh oh!

escra commented Jun 9, 2026

Uh oh!

necouchman commented Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

necouchman commented Apr 21, 2026 •

edited

Loading