[CELEBORN-2166] Mark shuffle data lost and fast fail if allocated worker is lost #3726

Open
s0nskar wants to merge 3 commits into apache:main from s0nskar:fix_fetch_failures
s0nskar (Contributor) commented Jun 10, 2026

What changes were proposed in this pull request?

This PR enhances the logic introduced in #3496.

Why are the changes needed?

#3496 only handles the reduce-side flow, i.e. a GetReducerFileGroup request fails if the shuffle data is marked lost.

In this PR, we use the WorkerStatusListener to detect lost workers immediately, mark the affected stage's shuffle data as lost, and issue a stage end for that stage right away. This also lets the write stage fast-fail during revive and commit requests. Without this change, the write stage runs as usual and the reduce stage then fails at startup, which wastes a lot of resources and time.
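The flow described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `ShuffleLossTracker`, `onWorkersLost`, `handleRevive`, `handleCommit`, and `handleGetReducerFileGroup` are hypothetical names chosen for illustration, and the sketch assumes (as a simplification) that losing any allocated worker makes the shuffle's data unrecoverable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; none of these names are Celeborn's actual API.
class ShuffleLossTracker {
    // shuffleId -> workers allocated to hold that shuffle's data
    private final Map<Integer, Set<String>> allocatedWorkers = new HashMap<>();
    // shuffles whose data has been marked lost
    private final Set<Integer> lostShuffles = new HashSet<>();
    // shuffles for which a stage end has already been issued
    private final Set<Integer> endedStages = new HashSet<>();

    void registerShuffle(int shuffleId, Set<String> workers) {
        allocatedWorkers.put(shuffleId, new HashSet<>(workers));
    }

    // Stand-in for a worker status listener callback: invoked as soon as
    // workers are reported lost. Returns the shuffles newly marked lost.
    List<Integer> onWorkersLost(Set<String> lostWorkers) {
        List<Integer> newlyLost = new ArrayList<>();
        for (Map.Entry<Integer, Set<String>> entry : allocatedWorkers.entrySet()) {
            boolean affected = !Collections.disjoint(entry.getValue(), lostWorkers);
            // Set.add returns false if the shuffle was already marked lost.
            if (affected && lostShuffles.add(entry.getKey())) {
                endedStages.add(entry.getKey()); // issue stage end immediately
                newlyLost.add(entry.getKey());
            }
        }
        return newlyLost;
    }

    boolean isDataLost(int shuffleId) {
        return lostShuffles.contains(shuffleId);
    }

    boolean isStageEnded(int shuffleId) {
        return endedStages.contains(shuffleId);
    }

    // Write-side requests: fail fast instead of letting the stage finish.
    void handleRevive(int shuffleId) {
        failFastIfLost(shuffleId, "Revive");
    }

    void handleCommit(int shuffleId) {
        failFastIfLost(shuffleId, "Commit");
    }

    // Reduce-side request: the case described for #3496.
    void handleGetReducerFileGroup(int shuffleId) {
        failFastIfLost(shuffleId, "GetReducerFileGroup");
    }

    private void failFastIfLost(int shuffleId, String request) {
        if (lostShuffles.contains(shuffleId)) {
            throw new IllegalStateException(
                request + " rejected: data of shuffle " + shuffleId + " is marked lost");
        }
    }
}
```

In this sketch, once `onWorkersLost` has run, a later `handleRevive` or `handleCommit` for an affected shuffle throws immediately, so the write stage stops early rather than completing and leaving the failure to surface at reduce startup.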

Does this PR resolve a correctness bug?

  • Yes

Does this PR introduce any user-facing change?

  • Yes

How was this patch tested?

  • Added UTs
  • Staging testing is in progress.
