Skip to content

GUACAMOLE-2265: Fix worker process leak after client disconnect.#662

Open
escra wants to merge 1 commit into
apache:staging/1.6.1from
ESCRA-GmbH:fix/worker-process-leak
Open

GUACAMOLE-2265: Fix worker process leak after client disconnect.#662
escra wants to merge 1 commit into
apache:staging/1.6.1from
ESCRA-GmbH:fix/worker-process-leak

Conversation

@escra

@escra escra commented Apr 14, 2026

Copy link
Copy Markdown

Summary

Worker processes may remain alive indefinitely after all users have
disconnected, particularly under resource pressure (e.g. cgroup pids-limit).
Over days/weeks this leads to steady PID accumulation until no new connections
can be created.

Root causes

1. pthread_create() failure ignored in guacd_proc_add_user()

When thread creation fails with EAGAIN (common under cgroup pids-limit), the
user thread never runs, so guacd_proc_stop() is never called and the worker
blocks in recvmsg() forever.

2. Worker main loop blocks indefinitely in recvmsg()

The loop in guacd_exec_proc() calls guacd_recv_fd() with no timeout or
health check. If guacd_proc_stop() is not triggered for any reason, the
worker hangs at 0% CPU consuming a PID slot indefinitely.

3. RDP fail path does not clean up resources

guac_rdp_handle_connection() does not free the display, FreeRDP instance,
keyboard, or SVC list on the fail path, leaking resources on every failed
connection attempt.

Fix

  • Handle pthread_create() failure by closing the FD, freeing params, and
    calling guacd_proc_stop()
  • Replace the bare recvmsg() loop with a poll()-based loop that checks
    client state every 5 seconds (GUACD_PROC_IDLE_TIMEOUT_MS). Workers exit
    when the client has stopped or all users have disconnected. An absolute
    safety timeout (GUACD_PROC_MAX_IDLE_MS = 30s) handles edge cases where
    the normal exit path is blocked
  • Add resource cleanup to the fail label in
    guac_rdp_handle_connection()

JIRA

GUACAMOLE-2265

Related

  • GUACAMOLE-2118 -
    Sporadic hanging connections after upgrade
  • GUACAMOLE-2143 -
    Improve process management to prevent zombie process accumulation

Test plan

  • Run guacd under cgroup with pids.max=50, create 60 concurrent
    connections, verify workers exit after disconnect
  • Kill the user's browser mid-session, verify worker exits within 30s
  • Trigger RDP connection failures (unreachable host), verify no resource
    leak via /proc/[pid]/fd count
  • Stress test: 100 rapid connect/disconnect cycles, verify process count
    returns to baseline
  • Verify normal operation is unaffected (long-running sessions, reconnect)

@necouchman

Copy link
Copy Markdown
Contributor

@escra I believe the issue you're describing here may be a duplicate of GUACAMOLE-2118, and your fix may address that issue. Please check on that issue and see if it matches, and re-tag your issue as GUACAMOLE-2118.

@necouchman

Copy link
Copy Markdown
Contributor

@escra FYI - the original reporter of GUACAMOLE-2118 tried out your patch, and got continuously disconnected from RDP sessions after applying it. I haven't tried it myself or reviewed it in detail, but it seems like it is breaking something else.

@escra escra force-pushed the fix/worker-process-leak branch from c2ace0e to acb1b81 Compare April 16, 2026 14:03
@escra

escra commented Apr 16, 2026

Copy link
Copy Markdown
Author

Thanks for the feedback, @necouchman. We identified the problem: the initial version of the patch was missing a reconnect grace period. When a user disconnects (e.g. browser refresh, tab switch), the worker detected zero connected users and exited immediately, which terminated the remote session before the user could rejoin.

We overlooked this when creating the PR. The updated patch now includes:

  • A 60-second reconnect grace period (GUACD_PROC_RECONNECT_GRACE_MS) after the last user disconnects, allowing time for browser refreshes or tab re-opens to rejoin the session.
  • A 120-second absolute safety-net timeout (GUACD_PROC_MAX_IDLE_MS) that only triggers after the grace period, catching edge cases where the normal exit path is blocked.
  • The worker only exits immediately when client->state != GUAC_CLIENT_RUNNING (explicit disconnect), not on temporary zero-user conditions.

We also renamed the JIRA issue to GUACAMOLE-2118 as you suggested, since it addresses the same root cause.

We have automated tests covering both the leak scenario and the reconnect grace period. Happy to share details if that helps with review.

@escra escra force-pushed the fix/worker-process-leak branch from acb1b81 to 7ca5fa3 Compare April 16, 2026 19:29
@necouchman

necouchman commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

@escra My testing indicates that this doesn't work at all - right now, after applying this patch to the staging/1.6.1 branch, I'm able to connect about once out of every 20 times to a RDP host - the other times I get various levels of guacd failures during the connection, often with this message:

guacd[327]: WARNING:	Client does not support the "required" instruction. No authentication parameters will be requested.

Nevermind..this is true of my testing of the main staging/1.6.1 branch at the moment, so it isn't your changes causing this...

Comment thread src/guacd/proc.c
Comment on lines +434 to +446
/* Absolute safety net: if we've been idle for too long after
* having a user, force exit regardless of client state. This
* catches cases where the client free handler is blocked
* (e.g. guac_argv_await in FreeRDP authenticate callback). */
if (had_user
&& idle_cycles * GUACD_PROC_IDLE_TIMEOUT_MS
>= GUACD_PROC_MAX_IDLE_MS) {
guacd_log(GUAC_LOG_WARNING,
"Worker exceeded maximum idle time (%ds) after user "
"disconnected. Force exiting.",
GUACD_PROC_MAX_IDLE_MS / 1000);
break;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During testing, I'm seeing active connections always hit this condition, resulting in connections getting terminated after 2 minutes.

Worker processes are never terminated after a client disconnects when
the normal exit path is blocked, leaking one guacd child per affected
connection until the container is restarted.

The worker's main loop blocked indefinitely in recvmsg() waiting for new
user file descriptors on fd_socket and only woke up when a new user
joined, so it could not notice that the client had stopped and no further
users would ever arrive. Replace the blocking receive with a poll() that
wakes on a fixed idle interval (GUACD_PROC_IDLE_TIMEOUT_MS) and decides
whether to keep waiting or exit based on client state and connected user
count. After the last user disconnects a short reconnect grace period
(GUACD_PROC_RECONNECT_GRACE_MS) is honoured so a browser refresh or tab
re-open can rejoin the existing session, and an absolute safety net
(GUACD_PROC_MAX_IDLE_MS) force-exits a worker whose teardown is wedged
(e.g. guac_argv_await stuck in the FreeRDP authenticate callback).

The absolute safety net is gated on connected_users == 0 and idle_cycles
is reset while users are connected. The poll loop only observes new user
FDs on fd_socket; an already-connected single-user session generates no
new FDs, so without this gating a healthy active session accumulated idle
cycles and was force-terminated at the GUACD_PROC_MAX_IDLE_MS mark. The
safety net now measures true post-disconnect idle time and can never fire
on a live session, while still catching the blocked-teardown case.

If pthread_create() for a user thread fails, the connection is closed and
the worker is stopped so a failed thread launch cannot leak the worker.

Verified with a k6 soak (143s) and an interactive session (285s) that an
actively connected session survives well past GUACD_PROC_MAX_IDLE_MS with
no premature disconnect and no 'exceeded maximum idle time' warning, and
with a deterministic reproducer that holds one active single-user session
past the safety-net threshold.
@escra escra force-pushed the fix/worker-process-leak branch from 7ca5fa3 to fc4dea4 Compare June 9, 2026 00:57
@escra

escra commented Jun 9, 2026

Copy link
Copy Markdown
Author

Thanks for relaying the reporter's feedback, and thanks to them for actually testing the patch. The continuous RDP disconnects are a real regression in the original version of this change, and we have tracked down the cause.

The problem was the absolute safety net GUACD_PROC_MAX_IDLE_MS. It counted plain wall-clock idle cycles and was not gated on whether any users were still connected. The worker only polls fd_socket for new user file descriptors, so ongoing session traffic of an already connected single-user session never shows up there. As a result a perfectly healthy, actively connected single-user session kept accumulating idle cycles and got force-terminated after about 120 seconds. That is exactly the "continuously disconnected" behavior the reporter saw.

The fix is small and keeps the original intent of the change:

  1. idle_cycles is reset whenever client->connected_users > 0, so an active session is never counted as idle.
  2. The absolute safety net only fires additionally when client->connected_users == 0.

The leak and reconnect-grace behavior is preserved: the worker still shuts down cleanly once no users are connected, including the blocked-teardown case the PR originally targeted.

We have also dropped the unrelated rdp.c fail:-path cleanup that shipped in the first version of this PR, since it freed a non-initialised GDI context on the pre-connect failure path; the leak fix itself is fully contained in proc.c/proc.h, and a correct cleanup can follow as a separate change.

We have built this and verified it. Our image CI is green (reconnect-grace, bug-reproducer, and watchdog jobs). We also verified it live on our integration environment: an actively connected RDP session stayed up well past 120 seconds (a 143 second soak, plus 285 seconds of interactive use with mouse, keyboard, and clipboard) with no premature disconnect, and the guacd log no longer shows the "Worker exceeded maximum idle time" warning. One apparent counter-example in a browser test turned out, per the logs, to be a Windows single-session takeover (ERRINFO_RPC_INITIATED_DISCONNECT), not a watchdog kill.

We have since re-verified both halves against a real Windows Server 2025 RDP target (not only our integration environment): a single actively connected session held cleanly past the 120 second threshold (180 seconds, no watchdog warning, clean teardown), and a multi-user churn run — concurrent sessions from distinct accounts plus dozens of rapid connect/disconnect cycles — drove the guacd worker count up and then cleanly back to baseline, with every connection terminated and no lingering workers. That last part directly covers the original report behind this change (worker processes not being reaped under many user connections): the leak fix still holds under real multi-user load, while active sessions are no longer force-terminated.

We have regression tests for this case (a k6 soak and an interactive Guacamole WebSocket test, plus a deterministic image test, tests/bugs/test_f157_active_session.sh, that holds one active single-user session past the safety-net threshold) and are happy to share them. If it helps, we can fold the fix into this PR as an update.

Sorry for the effort this took on your side — if anything still looks off, I'll gladly adjust, and this time right away.

@necouchman

Copy link
Copy Markdown
Contributor

Thanks @escra - I've tested, again, and can confirm that the connections stay running, which is good. I'll review, again, and see if we can get some other folks to test it and see if it takes care of the leftover process issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants