Design proposal: Persistent Replica Runtime#69
Conversation
| ### Messaging | ||
|
|
||
| How do deltas get communicated to dependencies? especially if the in-memory | ||
| actors can crash and be unreliable for message delivery. | ||
|
|
||
| The deltas get written to the persistent store and a message is sent to the | ||
| other threads about the write so that the delta can be read back from the store | ||
| and propagated to dependencies. |
There was a problem hiding this comment.
After delta-write, we can just push the entire delta to the dependencies as part of the hint message. The delta has provancne data so each dependency will know if it has missed a delta from that and check. This saves us a little bit of latency.
|
|
||
| Most importantly, the matrix-protection-suite was designed to be used entirely | ||
| within one JS process. Using workers was never a consideration. This holds the | ||
| Draupnir project back from being able to use more centralized service | ||
| architectures. Which would be required to provide a popular service. | ||
|
|
There was a problem hiding this comment.
We should probably think a little bit more about this. Imagine a ban being added to CME and what that would look like for d4all. ROom state -> policy room -> and then an explosion of individual policy list revision issuers that need updating. Does it make sense for these to have individual replica actors just for the partition data? Or can we do something smarter to manage the workload?
There was a problem hiding this comment.
Hm this is actually quite terrible. to think about, it's mostly all the same computation, the same delta through the stack until the protection layer. but idk maybe it's ok
There was a problem hiding this comment.
I think ti's fine, we just probably don't want one replica actor per partition, just a minimum of one replica actor per replica. And they can just do all the partitions themselves. And if there is a high number of partitions we can just add more actors to deal with the workload.
| ### Garbage collection | ||
|
|
||
| Replica dependents will formally be tracked. Transitive dependencies will also | ||
| need to be considered for rollbacks to be successful. replicas will only be able | ||
| to discard an old revision when all transitive dependents have successfully | ||
| migrated to a superseding revision. |
There was a problem hiding this comment.
We need to make it clear that "revisions" are really materialized views of deltas. And that garbage collection involves a compaction of deltas into a revision and changing which revision the delta stream is now based on.
| @@ -0,0 +1,258 @@ | |||
| # Persistent Replica Runtime | |||
There was a problem hiding this comment.
Starting to think this is a pre-requisite to participation metric. Since it makes writing that much, much more egonomic, especially within the context of draupnir4all.
There was a problem hiding this comment.
The architecture is also superior to just about anything anyone else is using for services on matrix to the best of our knowledge. Using this for bridges would be insane.
| ### Scheduling | ||
|
|
||
| The scheduler is responsible for starting replica actors and migrating them when | ||
| CPU becomes exhausted in worker threads. The scheduler gives each actor a | ||
| controller that signals interrupts. The store is expected to also receive the | ||
| controller so that it can preempt on I/O. | ||
|
|
||
| The scheduler will have to be communicated `os.loadavg()` from each worker | ||
| thread and use this to trigger actor migration. Realistically we can probably | ||
| only move actors from high utilisation to low utilisation and have some strategy | ||
| that tries to identify and isolate the cpu intensive actors for reporting. |
There was a problem hiding this comment.
If we're moving to having the actors work per replica rather than replica partition. Then it probably makes more sense that when the system comes under excessive load, we bias towards latency and start batching reducer operations so that the IO from switching partitions eases.
However, because there are critical operations, this needs more thought. Particularly for e.g. spam joins. We can respond to that with application specific protections (join wave short circuit) or just by prioritising resource allocation somehow to bias things like member ban sync and the policy flow above all else.
There was a problem hiding this comment.
Idk if we really need to do this, it's kind of not our fault that matrix is stupid here. And we should instead have application specific protections that just neuter all new users as was planned with draupnir already and policy server.
There was a problem hiding this comment.
We still need to decide and formalise the way we do prioritise operations and also handle load though.
There was a problem hiding this comment.
We are partially guarded in that technically revision reducers can be run lazily.. but it's not clear how we define the circumstances that force them to run.... for policy application, membership changes and room activity are possible triggers for sure. So maybe optimisation includes figuring out the triggers for reducer evaluation so that everything can actually be lazy by derfault.
There was a problem hiding this comment.
This would work because it's not actually urgent to remove anyone from a room until they attempt activity in that room. I'm not sure how it'd apply to other things though... but if we can make defining triggers part of the replica description then it'll be ok.
There was a problem hiding this comment.
Nah, we can tell the difference by watching if the backlog size is going DOWN or it is stable or going up.
No we can't, because the window we get from sync to matrix is probably going to be constant
There was a problem hiding this comment.
Ok, so PolicyRoomReplica is very obviously giong to have a high fan out in draupnir4all to each PolicyListReplica. Which is going to cause a nightmare. I think .
The thinking is that we can make evaluation of reducers for PolicyListReplica partitions lazy if we can make a principle for draupnir4all that ALL protection activity is bounded by protected room activity. That's to say there are no protections that require policies to eagerly evaluate in the absence of room activity.
There was a problem hiding this comment.
ServerBanSync being reactive rather than proactive due to lazy evaluation will be ok with policy servers. Since we can direct policy check into room activity even before the event gets checked.
There was a problem hiding this comment.
Maybe we can make this more natural to think about if we can encode "when" evaluation should take place as part of the replica description/partition data. And to be honest, this needs to be defined on dependants rather than on the replica itself since otherwise it's anti modular. In this sense, replicas then don't run at all unless they have a downstream consumer that needs them now. I think this relationship is seperate to data flow.
There was a problem hiding this comment.
Scheduling evaluation is always provided by non deterministic sources and as far as i can tell this flow is actually backwards. But then i'm not sure how the scheduling works for "control" replicas that act on non determinstic sources get scheduled and trigger evaluation idk how this works we need to figure it out
| ### Garbage collection | ||
|
|
||
| Replica dependents will formally be tracked. Transitive dependencies will also | ||
| need to be considered for rollbacks to be successful. replicas will only be able | ||
| to discard an old revision when all transitive dependents have successfully | ||
| migrated to a superseding revision. |
There was a problem hiding this comment.
Hm, we need to also think about how some replicas interact with this. Like some protections and capabilities may no longer be relevant to keep around? Especially protections that have been instantiated with the intent to observe their effects and nothing more... it needs some thought.
| ### Revision reducer | ||
|
|
||
| A revision reducer is stateless code that takes a revision and an external delta | ||
| or data input and produces a new delta specific to the revision. |
There was a problem hiding this comment.
An issue i have discovered while working on intent projections for protections in draupnir is that this split step of delta + superseding revision is limiting and really is an implementation detail that makes some revisions easier to derive. Specifically, if i want to go from PolicyRuleChange[] to an intent delta such as { denied: StringServerName, recalled: StringServerName }, then the same transformation on the projection state would need to be applied just to calculate the delta. And then again to apply the delta in order to produce a revision. The API Should really unify the steps.
There was a problem hiding this comment.
I think the way that we handle this is that we just bury that it exists and can be used in any weird situations you find, specifically because not splitting the steps means that you now have a weird code path to apply a delta when you want to rebuild a revision. A code path that is only run when doing restore as opposed to it just being part of normal operation, as it is currently with the spilt model
| Revisions are produced by replica actors, and both are described by a replica | ||
| description. | ||
|
|
||
| ### Delta |
There was a problem hiding this comment.
how do we make sure that reducers do not end up in a situation where progress deltas are required from dependencies or transient dependencies because of data flow fork convergence. I'm pretty sure there's a simple rule to follow that means keeping progress ie unnecessary and I think we already follow it by only allowing input reducers to work on deltas and not reveisiins but it needs double checking and also checking for the rebuild reducer since that one does count
There was a problem hiding this comment.
So the problem occurs when dataflow forks and later reconverges in another replica. The issue is that handling a delta just from one input replica implies applying it to out of date state on the other side of the fork. An example of this is if you tried to use the PoliciesMatchingMembersRevision from Draupnir and combined it with a RoomMembershipRevision.
There was a problem hiding this comment.
So this already does happen with the MembershipPolicyRevision. But deltas are processed with the snapshot for the revision on the other branch. The order that you apply the deltas should not matter, what probably matters is that the deltas produced by each reducer are merged before continuing. Again, this only matters for upstream deltas that share the same source delta.
There was a problem hiding this comment.
So the scheduler needs to manage this by keeping track of source deltas as the progress marker and merging deltas that share identical source deltas where data flow converges
There was a problem hiding this comment.
This also means that the current snapshot of the input on either side can be used even when neither delta has been reduced yet which simplifies things. And deltas can already be safely squashed because of the properties of the system already accounts for it.
There was a problem hiding this comment.
Nah deltas cannot be safely squashed when they are derived from different input deltas, even from the same source. They have to be from identical input deltas. The input sources themselves can ofc squash deltas before publishing them.
So to summarise:
- Source revision deltas are special
- Each delta is linked to the source revision delta only as dependency information.
- When converging data flow from the same source, the scheduler has to wait for both branches to get up to date and merge the deltas derived from the same source revision deltas before publishing.
- Deltas can never be directly derived from multiple sources. Only transiently through the snapshot.
| ### Partition | ||
|
|
||
| replicas are partitioned by a key. This key is usually something like a Matrix | ||
| room ID, or the user identifier of a Draupnir instance. |
There was a problem hiding this comment.
"Replicas are partitioned by a key" what this means is there are in effect multiple instances of the same replica that are logically partitioned because e.g. they concern different data like different draupnir or different policy rooms. But the projection nodes within that partition can be partitioned further to keep reducers running and scaling efficiently. And this should really be something that we can optimize automatically under the rug. So we could automatically split projections into shards based on e.g. ordering the event id of input policies.
Introduce projection intent projections for member ban and server ban protections These projections are a generalisation of revision issuers used in an ad-hoc way throughout mps, see the-draupnir-project/planning#69 for related design work. This means we can always report on what a protection intends, rather than deriving it from what the protection has done. Which is what the current model does with capability renderers. So not only does this allow previews to be generated, but it also allows for better reporting on what protections are doing in UI. This PR will need follow up to tidy up protection logic and capabilities to directly depend on the projection. This PR will probably want follow up in MPS and Draupnir to create renderers for intent projection delta. This will need MPS support specifically to manage subscribing to the intent deltas and forwarding them to renderers, in the same way capability renderers work. the-draupnir-project/planning#2
| - The revision ID's of any revisions that are transitive dependencies, along | ||
| with the identities of the replicas that issued them. | ||
|
|
||
| ### Revision reducer |
There was a problem hiding this comment.
projection inputs need to declare whether they actually need their input projection nodes or just the delta. This will help the scheduler figure out if it needs to keep two projections allocated on the same shard
Rendered
the-draupnir-project/Draupnir#979. #68.