feat: Add slow sync strike out logic#2196
Conversation
|
@Mirko-von-Leipzig this seems like a sensible way of doing it? Might want to tweak the capacity from 32? To lower |
Mirko-von-Leipzig
left a comment
There was a problem hiding this comment.
Code itself looks good; but I think there is a problem with using the block interval conceptually.
kkovaacs
left a comment
There was a problem hiding this comment.
Looks good to me % the Docker cache ID changes, which we should fix somewhere else?
| proof, | ||
| proven_chain_tip: tip, | ||
| })) | ||
| .await |
There was a problem hiding this comment.
I still think we have a problem 😓 I think having this is good. It applies a maximum bound on handling a single block by the client so the loop doesn't get stuck.
But this also means a client can fall behind more and more so long as they are faster than the timeout, but slower than the block interval.
I also think my previous idea was too complex and had quite a few edge cases. I've got a new proposal that I hope is simpler.
- We keep the send timeout. This keeps us from locking up the loop.
- Then every
Nblocks that the chain tip advances:
// The current client gap
let current_gap = tip - next;
if current_gap > previous_gap {
bad_windows += 1;
} else {
bad_windows = bad_windows.saturating_sub(1);
}
// Give some grace when near the chain tip.
previous_gap = current_gap.min(10);
if bad_windows > 3 {
// drop client;
}Values are placeholders for some better constants of course.
There was a problem hiding this comment.
But this also means a client can fall behind more and more so long as they are faster than the timeout, but slower than the block interval.
I don't think this is a problem or at least not one we are solving by disconnecting the client. And I think this logic introduces a problem - we disconnect clients that could have otherwise bursted and caught up to the tip without the need to reconnect.
Disconnecting on the timeout makes sense on the basis that such clients are probably dead / dying and so the connection shouldn't exist anymore. I don't think the connection should cease to exist when clients experience temporary slowdown / lag.
There was a problem hiding this comment.
I don't think the connection should cease to exist when clients experience temporary slowdown / lag.
Agreed; but I also don't want clients that will never catch up to the chain tip. And as it stands a single timeout allows for that.
There was a problem hiding this comment.
What is the behavior of the client if they are disconnected? Basically, let's say we have a client that is slow and will never catch up. We give it some grace period, but let's say it is falling further and further behind, and we drop the connection. Would the client just try to connect again at that point? Or would we put this client on some kind of list of "bad" clients so that if it tries to connect again, we won't accept the connection in the first place?
There was a problem hiding this comment.
It tries to reconnect yeah.
Currently our only restriction here afaik is that we limit the total number of subscriptions available. So this at least gives opportunity for another client to take its place.
Better would be some sort of timeout e.g. client can't resubscribe for N hours or minutes. We could even include that information in the error and we can log it on the client.
Regardless though we need something like this PR first.
My main concern is that clients that aren't near the tip (and therefore load data from disk) are expensive. So if they're never going to be near the tip then I don't want them.
There was a problem hiding this comment.
Makes sense. Agreed that we should probably split this up into two parts:
- Detect and disconnect clients that are falling behind. I think the logic here could be something like you described here - i.e., if the client is far away from the tip, and the gap is not closing for some time, drop the client.
- Put such a client on a list that would prevent immediate re-connection. Otherwise, the effect of the first step is pretty limited.
There was a problem hiding this comment.
I have added logic similar to the snippet above in this thread. Main difference is that instead of bad window count, we use the size (in blocks) of the gap to decide whether to disconnect or not. My thinking was that a series of small gaps (single block) could cause a client to disconnect which would be undesirable.
Relates to #2176.
Strikes out sync subscriptions / connections when they take longer than block_time seconds to read events. Relies on capacity of the watch channels which is 32 atm.
Changelog