[CELEBORN-2016] Fix the worker decommission and graceful shutdown condition #3727

Open
s0nskar wants to merge 1 commit into apache:main from s0nskar:shutdown_fix

@s0nskar s0nskar commented Jun 10, 2026

Contributor

What changes were proposed in this pull request?

In the current code, the shutdown hook is registered with a timeout equal to timeout, and the wait condition checks timeSpent < timeout. By the time this check ends the wait (that is, once timeSpent has reached timeout), the shutdown hook timer has already expired, so the VM exits without executing the code below this point.

The new condition is timeSpent + interval < timeout, so the wait ends one interval earlier and leaves a window of (0, interval] before the timeout in which to execute the code below.
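
The difference between the two conditions can be sketched with a small simulation. This is not the Celeborn worker code: the class name, the method remainingAfterWait, and the example values (timeout of 1000 ms, interval of 300 ms) are all hypothetical, and the sketch assumes the worst case in which the resources being waited on are never released, so only the time check can end the wait. Sleeping is replaced by adding interval to a counter.

```java
// Illustrative sketch only; names and values are assumptions, not Celeborn's API.
class ShutdownWaitSketch {

    // Simulates the wait loop without real sleeping and returns how much of the
    // shutdown-hook budget is left when the loop exits. A value <= 0 means the
    // hook timeout has already expired and no follow-up code can run.
    static long remainingAfterWait(long timeoutMs, long intervalMs, boolean useNewCondition) {
        long timeSpentMs = 0;
        while (useNewCondition
                ? timeSpentMs + intervalMs < timeoutMs   // proposed condition
                : timeSpentMs < timeoutMs) {             // current condition
            timeSpentMs += intervalMs; // stands in for sleeping one interval
        }
        return timeoutMs - timeSpentMs;
    }

    public static void main(String[] args) {
        long timeoutMs = 1000;
        long intervalMs = 300;
        System.out.println("old condition, remaining ms: "
                + remainingAfterWait(timeoutMs, intervalMs, false));
        System.out.println("new condition, remaining ms: "
                + remainingAfterWait(timeoutMs, intervalMs, true));
    }
}
```

With these example values the old condition exits the loop only after the budget is used up (remaining time is not positive), while the new condition exits with a positive remainder no larger than one interval, which matches the (0, interval] window described above.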

Why are the changes needed?

With the current shutdown logic, we have seen workers shut down abruptly with a timeout exception before the shutdown hook finished executing. As a result, Celeborn is unable to print the unreleased partition locations and unreleased shuffles on decommission and graceful shutdown.

Does this PR resolve a correctness bug?

  • Yes

Does this PR introduce any user-facing change?

  • Yes

How was this patch tested?

NA
