Fix race conditions in RunningEndpointInstance.Stop and BaseEndpointLifecycle where concurrent or early-cancelled Stop calls leave the endpoint in a broken state#7750

Merged
danielmarbach merged 8 commits into master from stopping-race
May 13, 2026

@danielmarbach danielmarbach commented May 12, 2026

Problem

When Stop is called with an already-cancelled token (e.g., during host shutdown when the cancellation token has already triggered), stopSemaphore.WaitAsync(cancelledToken) throws OperationCanceledException before entering the critical section. This leaves status as Running and endpointInstance non-null. The framework then calls DisposeAsync, which re-enters Stop(CancellationToken.None) and attempts a full shutdown against a DI container that the host is already tearing down. The result is an ObjectDisposedException on stoppingTokenSource or an access to a disposed ILoggerFactory.

The same race exists in BaseEndpointLifecycle.Stop: lifeCycleSemaphore.WaitAsync(cancelledToken) can throw before the critical section is entered, leaving endpointInstance non-null, and the subsequent DisposeAsync re-enters Stop against an already-disposed container.
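
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not code from this PR: `CancelledWaitDemo` and `RunAsync` are hypothetical names, and the only behaviour relied on is that `SemaphoreSlim.WaitAsync` with an already-cancelled token throws without acquiring the semaphore, so any state change guarded by it never happens.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical demo type; not part of NServiceBus.
public static class CancelledWaitDemo
{
    public static async Task<(bool Threw, int CurrentCount)> RunAsync()
    {
        using var semaphore = new SemaphoreSlim(1, 1);
        using var cts = new CancellationTokenSource();
        cts.Cancel(); // simulate a host token that has already fired

        var threw = false;
        try
        {
            // Throws before the critical section is ever entered.
            await semaphore.WaitAsync(cts.Token);
            semaphore.Release();
        }
        catch (OperationCanceledException)
        {
            threw = true;
        }

        // The semaphore was never taken, so its count is unchanged.
        return (threw, semaphore.CurrentCount);
    }
}
```

In this sketch the caller gets an exception and the guarded section never runs, which corresponds to status remaining Running in the description above.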

Solution

RunningEndpointInstance is refactored into three methods:

  • StopCore(CancellationToken) performs shutdown only (cancel stopping token, stop components/transport). It uses CancellationToken.None on the semaphore wait because the semaphore is an internal serialization mechanism, not a cancellation point. The first caller to enter StopCore owns shutdown; later callers that observe Stopping or Stopped return immediately without waiting.
  • Stop(CancellationToken) is the legacy IEndpointInstance API. It calls StopCore then DisposeAsync in a try/finally, so the public contract still covers full shutdown and cleanup.
  • DisposeAsync() handles cleanup only (unregister log slot, clear settings, dispose CTS/service provider lease). It uses Interlocked.Exchange for idempotency and calls StopCore as a safety net in case Stop was never called.
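
The split into three methods could look roughly like the sketch below. This is an illustration under assumptions, not the merged implementation: `SketchEndpointInstance`, the `Status` enum, and the `ShutdownCount`/`CleanupCount` counters are hypothetical stand-ins for the real shutdown and cleanup work, and the sketch ignores the caller token inside `StopCore` for brevity.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for RunningEndpointInstance; counters replace real work.
public sealed class SketchEndpointInstance : IAsyncDisposable
{
    enum Status { Running, Stopping, Stopped }

    readonly SemaphoreSlim stopSemaphore = new SemaphoreSlim(1, 1);
    readonly CancellationTokenSource stoppingTokenSource = new CancellationTokenSource();
    volatile Status status = Status.Running;
    int disposed;

    public int ShutdownCount { get; private set; }
    public int CleanupCount { get; private set; }

    // Shutdown only. Later callers that observe Stopping or Stopped return at once.
    public async Task StopCore(CancellationToken cancellationToken = default)
    {
        if (status != Status.Running) return;

        // The semaphore only serializes callers, so the caller token is not passed here.
        await stopSemaphore.WaitAsync(CancellationToken.None).ConfigureAwait(false);
        try
        {
            if (status != Status.Running) return;
            status = Status.Stopping;
            stoppingTokenSource.Cancel();
            ShutdownCount++; // placeholder for stopping components and transport
            status = Status.Stopped;
        }
        finally
        {
            stopSemaphore.Release();
        }
    }

    // Legacy-style API: shutdown followed by cleanup, even if shutdown throws.
    public async Task Stop(CancellationToken cancellationToken = default)
    {
        try { await StopCore(cancellationToken).ConfigureAwait(false); }
        finally { await DisposeAsync().ConfigureAwait(false); }
    }

    // Cleanup only; Interlocked.Exchange makes repeated calls no-ops.
    public async ValueTask DisposeAsync()
    {
        if (Interlocked.Exchange(ref disposed, 1) == 1) return;
        await StopCore(CancellationToken.None).ConfigureAwait(false); // safety net
        CleanupCount++; // placeholder for log slot, settings, and lease cleanup
        stoppingTokenSource.Dispose();
    }
}
```

With this shape, calling `Stop` with a cancelled token, then `DisposeAsync`, then `StopCore` again performs shutdown and cleanup once each, and the late calls never touch the disposed token source.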

BaseEndpointLifecycle changes:

  • Stop uses CancellationToken.None on lifeCycleSemaphore.WaitAsync and calls endpointInstance.StopCore(cancellationToken) (not Stop), so cleanup is left to the separate DisposeAsync call.
  • DisposeAsync uses Interlocked.Exchange on isDisposed for idempotency, reads and then nulls endpointInstance (safe because reads and writes of a reference are atomic), then calls instance.DisposeAsync() and providerLease.DisposeAsync(). No semaphore is needed because Interlocked.Exchange already prevents concurrent entry.
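
As a rough illustration of the lifecycle changes listed above, the sketch below separates Stop from DisposeAsync. All type names are hypothetical: `FakeInstance` records calls in place of the real endpoint instance, `SketchLifecycle` stands in for BaseEndpointLifecycle, and the provider lease is represented only by a comment.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the running endpoint instance; records calls only.
public sealed class FakeInstance : IAsyncDisposable
{
    public List<string> Calls { get; } = new List<string>();

    public Task StopCore(CancellationToken cancellationToken)
    {
        Calls.Add("StopCore");
        return Task.CompletedTask;
    }

    public ValueTask DisposeAsync()
    {
        Calls.Add("DisposeAsync");
        return default;
    }
}

// Hypothetical stand-in for BaseEndpointLifecycle.
public sealed class SketchLifecycle : IAsyncDisposable
{
    readonly SemaphoreSlim lifeCycleSemaphore = new SemaphoreSlim(1, 1);
    FakeInstance? endpointInstance;
    int isDisposed;

    public SketchLifecycle(FakeInstance instance) => endpointInstance = instance;

    public async Task Stop(CancellationToken cancellationToken = default)
    {
        // The wait cannot be aborted by the caller token; only the shutdown work sees it.
        await lifeCycleSemaphore.WaitAsync(CancellationToken.None).ConfigureAwait(false);
        try
        {
            var instance = endpointInstance;
            if (instance is not null)
            {
                // StopCore, not Stop: cleanup is left to DisposeAsync.
                await instance.StopCore(cancellationToken).ConfigureAwait(false);
            }
        }
        finally
        {
            lifeCycleSemaphore.Release();
        }
    }

    public async ValueTask DisposeAsync()
    {
        // Only the first caller gets past this line.
        if (Interlocked.Exchange(ref isDisposed, 1) == 1) return;

        var instance = endpointInstance;
        endpointInstance = null;
        if (instance is not null)
        {
            await instance.DisposeAsync().ConfigureAwait(false);
        }
        // Disposing the provider lease would follow here.
    }
}
```

Stopping with an already-cancelled token and then disposing twice yields exactly one `StopCore` call followed by one `DisposeAsync` call on the instance, which is the two-step flow the hosted service relies on.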

EndpointHostedService is unchanged. The .NET Generic Host calls StopAsync(token) then DisposeAsync(), which maps to lifecycle.Stop(token) then lifecycle.DisposeAsync(). This is the correct two-step flow.

Internally managed mode (legacy IEndpointInstance.Stop()): calls StopCore then DisposeAsync, maintaining the original "stop and clean up everything" contract. Double-dispose of the service provider is idempotent.

Key decisions

  • CancellationToken.None on semaphore waits: the semaphore is an internal serialization mechanism; the caller token must not abort the wait because a failed wait leaves state as Running, allowing a subsequent DisposeAsync re-entry to attempt full shutdown against an already-torn-down DI container.
  • StopCore early return: later callers that observe Stopping or Stopped return immediately. Only the first caller owns shutdown. This is intentional; waiting would add no value since the outcome is predetermined.
  • Separate Stop and DisposeAsync in BaseEndpointLifecycle: allows the hosted service to control the lifecycle with a clean two-step flow (stop, then dispose).
  • "Stopper" keyed singleton is intentionally a hidden backdoor for acceptance testing, not exposed in Core.

Acceptance test

When_stop_is_called_with_cancelled_token cancels the scenario token after the endpoint starts, causing StopEndpoints to pass an already-cancelled token to BaseEndpointLifecycle.Stop. Without the fix, this exposes ObjectDisposedException during disposal.

… callers return immediately; note BaseEndpointLifecycle.Stop calls StopCore not Stop
@danielmarbach danielmarbach changed the title Refactor and improve shutdown handling with cancelled tokens Fix race conditions in RunningEndpointInstance.Stop and BaseEndpointLifecycle where concurrent or early-cancelled Stop calls leave the endpoint in a broken state May 12, 2026
@danielmarbach danielmarbach merged commit 41e65c0 into master May 13, 2026
4 checks passed
@danielmarbach danielmarbach deleted the stopping-race branch May 13, 2026 14:00
@danielmarbach danielmarbach mentioned this pull request May 13, 2026
3 participants