Skip to content

feat: Add slow sync strike out logic#2196

Open
sergerad wants to merge 15 commits into
nextfrom
sergerad-dc-slow-sync
Open

feat: Add slow sync strike out logic#2196
sergerad wants to merge 15 commits into
nextfrom
sergerad-dc-slow-sync

Conversation

@sergerad

@sergerad sergerad commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Relates to #2176.

Strikes out sync subscriptions / connections when they take longer than block_time seconds to read events. Relies on capacity of the watch channels which is 32 atm.

Changelog

changelog = "none"
reason    = "Internal change only."

@sergerad sergerad added the no changelog This PR does not require an entry in the `CHANGELOG.md` file label Jun 4, 2026
@sergerad

sergerad commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

@Mirko-von-Leipzig this seems like a sensible way of doing it? Might want to tweak the capacity from 32? To lower

@Mirko-von-Leipzig Mirko-von-Leipzig left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code itself looks good; but I think there is a problem with using the block interval conceptually.

Comment thread crates/store/src/state/subscription.rs Outdated
Comment thread crates/store/src/state/subscription.rs
@sergerad sergerad removed the no changelog This PR does not require an entry in the `CHANGELOG.md` file label Jun 5, 2026

@kkovaacs kkovaacs left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me % the Docker cache ID changes, which we should fix somewhere else?

proof,
proven_chain_tip: tip,
}))
.await

@Mirko-von-Leipzig Mirko-von-Leipzig Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we have a problem 😓 I think having this is good. It applies a maximum bound on handling a single block by the client so the loop doesn't get stuck.

But this also means a client can fall behind more and more so long as they are faster than the timeout, but slower than the block interval.

I also think my previous idea was too complex and had quite a few edge cases. I've got a new proposal that I hope is simpler.

  • We keep the send timeout. This keeps us from locking up the loop.
  • Then every N blocks that the chain tip advances:
// The current client gap
let current_gap = tip - next;

if current_gap > previous_gap {
    bad_windows += 1;
} else {
    bad_windows = bad_windows.saturating_sub(1);
}

// Give some grace when near the chain tip.
previous_gap = current_gap.min(10);

if bad_windows > 3 {
    // drop client;
}

Values are placeholders for some better constants of course.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this also means a client can fall behind more and more so long as they are faster than the timeout, but slower than the block interval.

I don't think this is a problem or at least not one we are solving by disconnecting the client. And I think this logic introduces a problem - we disconnect clients that could have otherwise bursted and caught up to the tip without the need to reconnect.

Disconnecting on the timeout makes sense on the basis that such clients are probably dead / dying and so the connection shouldn't exist anymore. I don't think the connection should cease to exist when clients experience temporary slowdown / lag.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the connection should cease to exist when clients experience temporary slowdown / lag.

Agreed; but I also don't want clients that will never catch up to the chain tip. And as it stands a single timeout allows for that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behavior of the client if they are disconnected? Basically, let's say we have a client that is slow and will never catch up. We give it some grace period, but let's say it is falling further and further behind, and we drop the connection. Would the client just try to connect again at that point? Or would we put this client on some kind of list of "bad" clients so that if it tries to connect again, we won't accept the connection in the first place?

@Mirko-von-Leipzig Mirko-von-Leipzig Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It tries to reconnect yeah.

Currently our only restriction here afaik is that we limit the total number of subscriptions available. So this at least gives opportunity for another client to take its place.

Better would be some sort of timeout e.g. client can't resubscribe for N hours or minutes. We could even include that information in the error and we can log it on the client.

Regardless though we need something like this PR first.

My main concern is that clients that aren't near the tip (and therefore load data from disk) are expensive. So if they're never going to be near the tip then I don't want them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Agreed that we should probably split this up into two parts:

  1. Detect and disconnect clients that are falling behind. I think the logic here could be something like you described here - i.e., if the client is far away from the tip, and the gap is not closing for some time, drop the client.
  2. Put such a client on a list that would prevent immediate re-connection. Otherwise, the effect of the first step is pretty limited.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added logic similar to the snippet above in this thread. Main difference is that instead of bad window count, we use the size (in blocks) of the gap to decide whether to disconnect or not. My thinking was that a series of small gaps (single block) could cause a client to disconnect which would be undesirable.

@sergerad sergerad added the no changelog This PR does not require an entry in the `CHANGELOG.md` file label Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no changelog This PR does not require an entry in the `CHANGELOG.md` file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants