Skip to content

Design proposal: Persistent Replica Runtime#69

Open
Gnuxie wants to merge 7 commits into
mainfrom
gnuxie/persistent-replica-runtime
Open

Design proposal: Persistent Replica Runtime#69
Gnuxie wants to merge 7 commits into
mainfrom
gnuxie/persistent-replica-runtime

Conversation

@Gnuxie

@Gnuxie Gnuxie commented Oct 15, 2025

Copy link
Copy Markdown
Member

Comment on lines +225 to +232
### Messaging

How do deltas get communicated to dependencies? especially if the in-memory
actors can crash and be unreliable for message delivery.

The deltas get written to the persistent store and a message is sent to the
other threads about the write so that the delta can be read back from the store
and propagated to dependencies.

@Gnuxie Gnuxie Oct 16, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After delta-write, we can just push the entire delta to the dependencies as part of the hint message. The delta has provancne data so each dependency will know if it has missed a delta from that and check. This saves us a little bit of latency.

Comment on lines +32 to +37

Most importantly, the matrix-protection-suite was designed to be used entirely
within one JS process. Using workers was never a consideration. This holds the
Draupnir project back from being able to use more centralized service
architectures. Which would be required to provide a popular service.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably think a little bit more about this. Imagine a ban being added to CME and what that would look like for d4all. ROom state -> policy room -> and then an explosion of individual policy list revision issuers that need updating. Does it make sense for these to have individual replica actors just for the partition data? Or can we do something smarter to manage the workload?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm this is actually quite terrible. to think about, it's mostly all the same computation, the same delta through the stack until the protection layer. but idk maybe it's ok

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ti's fine, we just probably don't want one replica actor per partition, just a minimum of one replica actor per replica. And they can just do all the partitions themselves. And if there is a high number of partitions we can just add more actors to deal with the workload.

Comment on lines +188 to +193
### Garbage collection

Replica dependents will formally be tracked. Transitive dependencies will also
need to be considered for rollbacks to be successful. replicas will only be able
to discard an old revision when all transitive dependents have successfully
migrated to a superseding revision.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make it clear that "revisions" are really materialized views of deltas. And that garbage collection involves a compaction of deltas into a revision and changing which revision the delta stream is now based on.

@@ -0,0 +1,258 @@
# Persistent Replica Runtime

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to think this is a pre-requisite to participation metric. Since it makes writing that much, much more egonomic, especially within the context of draupnir4all.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The architecture is also superior to just about anything anyone else is using for services on matrix to the best of our knowledge. Using this for bridges would be insane.

Comment on lines +195 to +205
### Scheduling

The scheduler is responsible for starting replica actors and migrating them when
CPU becomes exhausted in worker threads. The scheduler gives each actor a
controller that signals interrupts. The store is expected to also receive the
controller so that it can preempt on I/O.

The scheduler will have to be communicated `os.loadavg()` from each worker
thread and use this to trigger actor migration. Realistically we can probably
only move actors from high utilisation to low utilisation and have some strategy
that tries to identify and isolate the cpu intensive actors for reporting.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're moving to having the actors work per replica rather than replica partition. Then it probably makes more sense that when the system comes under excessive load, we bias towards latency and start batching reducer operations so that the IO from switching partitions eases.

However, because there are critical operations, this needs more thought. Particularly for e.g. spam joins. We can respond to that with application specific protections (join wave short circuit) or just by prioritising resource allocation somehow to bias things like member ban sync and the policy flow above all else.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idk if we really need to do this, it's kind of not our fault that matrix is stupid here. And we should instead have application specific protections that just neuter all new users as was planned with draupnir already and policy server.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to decide and formalise the way we do prioritise operations and also handle load though.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are partially guarded in that technically revision reducers can be run lazily.. but it's not clear how we define the circumstances that force them to run.... for policy application, membership changes and room activity are possible triggers for sure. So maybe optimisation includes figuring out the triggers for reducer evaluation so that everything can actually be lazy by derfault.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would work because it's not actually urgent to remove anyone from a room until they attempt activity in that room. I'm not sure how it'd apply to other things though... but if we can make defining triggers part of the replica description then it'll be ok.

@Gnuxie Gnuxie Oct 18, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, we can tell the difference by watching if the backlog size is going DOWN or it is stable or going up.

No we can't, because the window we get from sync to matrix is probably going to be constant

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so PolicyRoomReplica is very obviously giong to have a high fan out in draupnir4all to each PolicyListReplica. Which is going to cause a nightmare. I think .

The thinking is that we can make evaluation of reducers for PolicyListReplica partitions lazy if we can make a principle for draupnir4all that ALL protection activity is bounded by protected room activity. That's to say there are no protections that require policies to eagerly evaluate in the absence of room activity.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServerBanSync being reactive rather than proactive due to lazy evaluation will be ok with policy servers. Since we can direct policy check into room activity even before the event gets checked.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make this more natural to think about if we can encode "when" evaluation should take place as part of the replica description/partition data. And to be honest, this needs to be defined on dependants rather than on the replica itself since otherwise it's anti modular. In this sense, replicas then don't run at all unless they have a downstream consumer that needs them now. I think this relationship is seperate to data flow.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scheduling evaluation is always provided by non deterministic sources and as far as i can tell this flow is actually backwards. But then i'm not sure how the scheduling works for "control" replicas that act on non determinstic sources get scheduled and trigger evaluation idk how this works we need to figure it out

Comment on lines +188 to +193
### Garbage collection

Replica dependents will formally be tracked. Transitive dependencies will also
need to be considered for rollbacks to be successful. replicas will only be able
to discard an old revision when all transitive dependents have successfully
migrated to a superseding revision.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, we need to also think about how some replicas interact with this. Like some protections and capabilities may no longer be relevant to keep around? Especially protections that have been instantiated with the intent to observe their effects and nothing more... it needs some thought.

Comment on lines +103 to +106
### Revision reducer

A revision reducer is stateless code that takes a revision and an external delta
or data input and produces a new delta specific to the revision.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An issue i have discovered while working on intent projections for protections in draupnir is that this split step of delta + superseding revision is limiting and really is an implementation detail that makes some revisions easier to derive. Specifically, if i want to go from PolicyRuleChange[] to an intent delta such as { denied: StringServerName, recalled: StringServerName }, then the same transformation on the projection state would need to be applied just to calculate the delta. And then again to apply the delta in order to produce a revision. The API Should really unify the steps.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way that we handle this is that we just bury that it exists and can be used in any weird situations you find, specifically because not splitting the steps means that you now have a weird code path to apply a delta when you want to rebuild a revision. A code path that is only run when doing restore as opposed to it just being part of normal operation, as it is currently with the spilt model

Revisions are produced by replica actors, and both are described by a replica
description.

### Delta

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we make sure that reducers do not end up in a situation where progress deltas are required from dependencies or transient dependencies because of data flow fork convergence. I'm pretty sure there's a simple rule to follow that means keeping progress ie unnecessary and I think we already follow it by only allowing input reducers to work on deltas and not reveisiins but it needs double checking and also checking for the rebuild reducer since that one does count

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the problem occurs when dataflow forks and later reconverges in another replica. The issue is that handling a delta just from one input replica implies applying it to out of date state on the other side of the fork. An example of this is if you tried to use the PoliciesMatchingMembersRevision from Draupnir and combined it with a RoomMembershipRevision.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this already does happen with the MembershipPolicyRevision. But deltas are processed with the snapshot for the revision on the other branch. The order that you apply the deltas should not matter, what probably matters is that the deltas produced by each reducer are merged before continuing. Again, this only matters for upstream deltas that share the same source delta.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the scheduler needs to manage this by keeping track of source deltas as the progress marker and merging deltas that share identical source deltas where data flow converges

@Gnuxie Gnuxie Dec 18, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also means that the current snapshot of the input on either side can be used even when neither delta has been reduced yet which simplifies things. And deltas can already be safely squashed because of the properties of the system already accounts for it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah deltas cannot be safely squashed when they are derived from different input deltas, even from the same source. They have to be from identical input deltas. The input sources themselves can ofc squash deltas before publishing them.

So to summarise:

  • Source revision deltas are special
  • Each delta is linked to the source revision delta only as dependency information.
  • When converging data flow from the same source, the scheduler has to wait for both branches to get up to date and merge the deltas derived from the same source revision deltas before publishing.
  • Deltas can never be directly derived from multiple sources. Only transiently through the snapshot.

Comment on lines +135 to +138
### Partition

replicas are partitioned by a key. This key is usually something like a Matrix
room ID, or the user identifier of a Draupnir instance.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Replicas are partitioned by a key" what this means is there are in effect multiple instances of the same replica that are logically partitioned because e.g. they concern different data like different draupnir or different policy rooms. But the projection nodes within that partition can be partitioned further to keep reducers running and scaling efficiently. And this should really be something that we can optimize automatically under the rug. So we could automatically split projections into shards based on e.g. ordering the event id of input policies.

Gnuxie added a commit to Gnuxie/matrix-protection-suite that referenced this pull request Nov 21, 2025
Introduce projection intent projections for member ban and server ban protections

These projections are a generalisation of revision issuers used in an ad-hoc way throughout mps, see the-draupnir-project/planning#69 for related design work.

This means we can always report on what a protection intends, rather than deriving it from what the protection has done. Which is what the current model does with capability renderers. So not only does this allow previews to be generated, but it also allows for better reporting on what protections are doing in UI.

This PR will need follow up to tidy up protection logic and capabilities to directly depend on the projection.
This PR will probably want follow up in MPS and Draupnir to create renderers for intent projection delta. This will need MPS support specifically to manage subscribing to the intent deltas and forwarding them to renderers, in the same way capability renderers work.    

the-draupnir-project/planning#2
- The revision ID's of any revisions that are transitive dependencies, along
with the identities of the replicas that issued them.

### Revision reducer

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

projection inputs need to declare whether they actually need their input projection nodes or just the delta. This will help the scheduler figure out if it needs to keep two projections allocated on the same shard

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant