fix(EN-1242): lazy gen0 tombstone + coverage-gate Delete#1455
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. 🗂️ Base branches to auto review (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🛑 Changes requested — automated reviewThis PR introduces a lazy gen0 tombstone mechanism, a MirrorTouch-based Delete primitive, and a coverage gate for Delete operations. Five confirmed issues — four blockers and one major — prevent safe merging. The most critical issue is protobuf field number reuse in AttributePlan: previously used field numbers (3, 4, 5) are reassigned to new AttributeCoverage fields, which will corrupt wire encoding during rolling upgrades and cause incorrect Raft log replay for any stored entries from the previous schema. On the revert path, two related failure modes allow zero-posting reversals to be committed: when FindTransactionCreationLog returns ErrNotFound for a purged log the code proceeds with nil OriginalPostings, and when reverting a pre-upgrade transaction whose TransactionState predates the new postings field the same nil path is followed. Both cases can corrupt transaction state or permanently break reverts for all pre-upgrade transactions. The new TransactionState.postings field is introduced without a schema migration. Existing Pebble rows from pre-upgrade deployments will lack this field, making those transactions permanently unrevertable after an in-place upgrade. The replay checker will also produce systematic false-positive mismatches on any upgraded store: live TransactionState rows written before this change have empty Postings, while the replayer now reconstructs non-empty postings from the audit log, causing proto.Equal comparisons to fail for every historical transaction. Finally, in key_store.go Merge, a |
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (75.84%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## release/v3.0 #1455 +/- ##
================================================
+ Coverage 73.60% 74.02% +0.41%
================================================
Files 395 395
Lines 39917 39833 -84
================================================
+ Hits 29382 29486 +104
- Misses 7686 7771 +85
+ Partials 2849 2576 -273
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
f8a103a to
0a0f111
Compare
0a0f111 to
666fe9a
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 1 new inline finding.
Summary: #1455 (comment)
0b6abb0 to
ba599e6
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 1 new inline finding.
Summary: #1455 (comment)
a122827 to
7226eb9
Compare
93d2d5e to
ece09cd
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 1 new inline finding.
Summary: #1455 (comment)
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 1 new inline finding.
Summary: #1455 (comment)
ece09cd to
f2fe7b2
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 1 new inline finding.
Summary: #1455 (comment)
f2fe7b2 to
bee3ed7
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 2 new inline findings.
Summary: #1455 (comment)
bee3ed7 to
08c2534
Compare
dc069d0 to
7eaa558
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 6 new inline findings.
Summary: #1455 (comment)
7eaa558 to
566f7e7
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 3 new inline findings.
Summary: #1455 (comment)
566f7e7 to
3c1e89c
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 2 new inline findings.
Summary: #1455 (comment)
3c1e89c to
2d8527b
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 2 new inline findings.
Summary: #1455 (comment)
2d8527b to
c9ae0c9
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 3 new inline findings.
Summary: #1455 (comment)
c9ae0c9 to
0dcfb10
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 4 new inline findings.
Summary: #1455 (comment)
7226eb9 to
5bb9f23
Compare
0dcfb10 to
6098058
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 4 new inline findings.
Summary: #1455 (comment)
6098058 to
50bf6b8
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 4 new inline findings.
Summary: #1455 (comment)
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 4 new inline findings.
Summary: #1455 (comment)
paul-nicolas
left a comment
There was a problem hiding this comment.
Multi-model review — PR #1455 (Claude + Codex)
Two reviewers (Claude, split across 4 subsystem passes, and Codex/gpt-5.4) reviewed the diff independently; findings were reconciled and each was adversarially verified against the code pinned at the PR head SHA before posting.
Verdict on the core mechanic: the lazy Gen0-tombstone + gen0-byte-only writeCacheTombstone + gen0→gen1 read fallback is sound. Mem/disk parity was traced through all three Del branches (Gen0-hit, Gen1-only fabrication, absent-in-both), tag-borrowing was proven correct (borrowed gen1Entry.Tag == deletion.Tag via KeyStore.Delete's pre-check), and the "at most one rotation" bound + coverage-gate horizon hold. Invariants #1/#2/#3/#5/#9 are respected on the cache path. The concern cluster is entirely around the TransactionState.Postings projection / upgrade compatibility and the collapsed AttributeCoverage proto, plus doc/test drift.
Counts: 2 Major, 6 Medium posted inline; 8 Minor/Nit listed below (not posted). 1 finding rejected in verification.
Notable reviewer conflicts, resolved:
- Proto wire-compat — Codex rated critical (old
AttributePlanentries in the Raft/WAL misparse across an upgrade); Claude rated medium (missingreservedis a version-independent hazard, tempered by pre-GA alpha). Reconciled to Major: field 3 is genuinely repurposed (Declare→AttributeValue) and fields 4/5 dropped withoutreserved, against this file's own universal convention — a real, actionable hazard, but cross-version Raft replay wasn't proven in practice, so short of critical. - Revert trusts
TransactionState.Postings(tampering) — Codex rated major. Rejected in verification:checker.compareTransactionscompares the full state (incl.Postings) viaproto.Equal, so the projection IS checker-verified (invariant #8 satisfied), and the pre-PR code also read a projection (the log). No new weakness introduced.
Root-cause note: findings [Major] #2, [Medium] #3 and [Medium] legacy-revert all share one fix — a CurrentStorageSchemaVersion bump (internal/bootstrap/config_validation.go, currently 1, untouched by this PR). That converts a silent in-place-upgrade data break into a non-bypassable, enforced clean-cluster requirement.
Minor / Nit (not posted inline)
internal/infra/attributes/key_store.go:298(minor) — silentcontinueonErrNotFoundskips the durable 0xF1 delete; relies entirely on invariant #6 with no loud signal. Consider a debug-mode sentinel/WARN so a future preload regression surfaces loudly instead of as latent store drift.internal/infra/plan/coverage.go:101(minor) —Coverage.Mergewrites to a possibly-nilIdempotencyKeysmap on a zero-value receiver (no live panic today; all callers useNewCoverage()). Lazily init the map likeset()does.internal/infra/cache/cache_test.go/cache_snapshotter_test.go(minor) —TestCacheSnapshotter_EN1242_DeleteAfterRotationCrashRestarthand-assembleswriteCacheRaw/Rotation/Tombstoneinstead of drivingmachine.Preload+WriteSet.Merge; a future Merge→writeCacheTombstone rewiring wouldn't be caught.internal/application/admission/admission_test.go(minor) — the compat break is only tested for the postings-present shape; no case exercises an existing TxState with emptyPostings→ emptyOriginalPostings→ErrTransactionNotFound.internal/pkg/kv/kv.go:13(nit) —KV.Deldoc says "removes K" but the production impl tombstones andGetstill surfaces the key; clarify the interface contract.internal/infra/cache/cache.go:133(nit) —AttributeCache.Del's ownErrNotFoundreturn is unreachable via its only production caller (KeyStore.Deletereturns first); the doc comment attributes the soft-no-op imprecisely.internal/infra/plan/builder_test.go:309(nit) — assertion messages still say "must emit Declare" afterDeclarewas removed; reword to "coverage-only entry (value nil)".internal/infra/state/scope.go:458(nit) —lsbIndexhand-rolls an 8-iteration loop;math/bits.TrailingZeros8(b)is clearer and branch-free.
| AttributeValue value = 5; | ||
| } | ||
| // Optional seed: nil = coverage-only; non-nil = seed into the FSM cache. | ||
| AttributeValue value = 3; |
There was a problem hiding this comment.
[Major] AttributeCoverage repurposes field number 3 and drops fields 4/5 without reserved
The old AttributePlan used a oneof at field 3 (Declare), field 4 (Touch), field 5 (AttributeValue value). The new AttributeCoverage puts AttributeValue value at field 3 and removes 4/5 with no reserved clause. Two problems:
- Field-3 semantic reuse. On the wire, old
Declare{}(empty length-delimited message at field 3) decodes cleanly as a newAttributeValue{}with emptyraw_value→ a non-nilvalue(a seed) where the proposer meant coverage-only. OldTouch(field 4) and old field-5valueare silently dropped as unknown fields. If any old-formatExecutionPlan.attributesentry is ever replayed from a Raft log / WAL by an upgraded binary, it misapplies (coverage-only read at apply → cache miss → node divergence). - Missing
reserved. Every other message in this file reserves removed field numbers/names (seeExecutionPlan reserved 2,3,5,8,TechnicalUpdate reserved 5,6,8..12, etc.). Omitting it here is a future field-reuse footgun.
Fix: give value a fresh, never-used field number instead of reusing 3, and add reserved 3, 4, 5; + reserved "declare", "touch"; inside AttributeCoverage. If old-format entries can survive an upgrade, gate activation behind a storage-schema-version bump that blocks mixed-version replay. (Reviewers split critical/medium on the replay risk; the missing reserved + field reuse is uncontroversial and should be fixed before merge regardless.)
| txCanonical := domain.TransactionKey{LedgerName: ledger, ID: tx.GetId()}.Bytes() | ||
|
|
||
| if err := w.CreateTransaction(txCanonical, seq, tx.GetTimestamp(), tx.GetMetadata()); err != nil { | ||
| if err := w.CreateTransaction(txCanonical, seq, tx.GetTimestamp(), tx.GetMetadata(), tx.GetPostings()); err != nil { |
There was a problem hiding this comment.
[Major] No storage-schema-version bump: an in-place upgrade with pre-EN-1242 data silently breaks the checker and can diverge mixed-version nodes
This line now feeds tx.GetPostings() into the checker's replay-derived expected TransactionState. Combined with processCreateTransaction now writing Postings into the live TxState (internal/domain/processing/processor_transaction.go), two upgrade hazards appear, and CurrentStorageSchemaVersion (internal/bootstrap/config_validation.go, = 1) is not bumped by this PR, so an in-place upgrade over existing data is silently permitted:
- Checker mass false-positive. A transaction created by a pre-EN-1242 binary persists a TxState with empty
Postings. OnCheck(),compareTransactions(checker.go:1435) replays the audit'sCreatedTransactionlog — which always carried postings — soexpected.Postingsis non-empty while the legacy live state's is empty, andproto.Equalfails, emittingCHECK_STORE_ERROR_TYPE_TRANSACTION_UPDATE_MISMATCHfor every historical transaction — a healthy cluster self-reporting corruption. - Rolling-upgrade cache divergence (inv Fix typo #1/update spec #2). During a mixed-version rollout, a new node writes
Postingsfor a command at applied index N and an old node does not, so the in-memory cache and the deterministic snapshot marshal diverge between nodes at the same index.
Fix: bump CurrentStorageSchemaVersion so an in-place upgrade with old data is hard-blocked (schema mismatches are non-bypassable per CLAUDE.md), converting a silent break into an enforced clean-cluster/coordinated upgrade. Add a checker test feeding a legacy (empty-Postings) live state against a replay-derived state to lock the intended behavior.
| // tx was not visible to admission at propose time — a business | ||
| // rejection that must appear in the audit chain (invariant #8). | ||
| originalPostings := order.GetOriginalPostings() | ||
| if len(originalPostings) == 0 { |
There was a problem hiding this comment.
[Medium] Legacy transaction is unrevertable and misreported as ErrTransactionNotFound
For a transaction created pre-EN-1242 (exists, txID < NextTransactionId, not reverted) whose persisted TransactionState.Postings is empty, admission's getTransactionPostings point-read (admission.go:1897) returns empty postings, so this guard returns ErrTransactionNotFound for a transaction that demonstrably exists. This is the intentional compat break, but the error is semantically wrong (the tx exists; only its postings projection is absent) and misleads clients/operators.
Empty postings can only occur for legacy data (create rejects ErrEmptyTransaction), so the storage-schema-version bump recommended on the checker finding removes this path entirely by blocking the in-place upgrade. Short of that, preserve a log-scan fallback when OriginalPostings is empty, or return a dedicated upgrade/incompatibility error so "legacy-empty" is distinguishable from "absent".
|
|
||
| origStateReader, err := s.TransactionStates().Get(txKey) | ||
| if errors.Is(err, domain.ErrNotFound) { | ||
| return nil, &domain.ErrTransactionNotFound{TransactionID: order.GetTransactionId()} |
There was a problem hiding this comment.
[Medium] Invariant #7: a should-not-happen desync was downgraded from ErrTransactionStateInconsistent to a soft ErrTransactionNotFound
Reaching this branch means the len(originalPostings) == 0 guard above already passed, i.e. admission read a present TxState.Postings (or a signed receipt) and the tx-exists / not-reverted checks passed. A TransactionStates().Get miss here therefore implies a real cache/Pebble desync — "impossible by design", not a routine not-found. The pre-PR code correctly returned &domain.ErrTransactionStateInconsistent{TransactionID: ..., Operation: "revert"}; this PR replaced it with ErrTransactionNotFound, which mislabels a desync as a business rejection and hides it (invariant #7: impossible-by-design branches must fail loudly).
Fix: restore ErrTransactionStateInconsistent (or an invariant:/assert.Unreachable hard-fail) for the errors.Is(err, domain.ErrNotFound) branch here, keeping the soft not-found only for the empty-OriginalPostings guard above.
| // → strict AttributeCache.Del. Preload the boundary attribute | ||
| // so Gen0 holds the entry at apply time (invariant #6). | ||
| // → AttributeCache.Del. Preload the boundary attribute so the | ||
| // coverage gate admits the delete; AttributeCache.Del itself |
There was a problem hiding this comment.
[Medium] DeleteLedger / RemoveEventsSink deletes bypass the coverage gate — the comment here overstates the guarantee
This comment states the boundary preload is declared "so the coverage gate admits the delete". But DeleteLedger (Boundaries) and RemoveEventsSink (SinkConfigs, ~L1072) cascade through WriteSet.Absorb → b.Derived.Boundaries.Delete(...) / SinkConfigs.Delete(...) → WriteSet.Merge → KeyStore.Delete → AttributeCache.Del. That path calls the concrete *DerivedKeyStore.Delete directly and never invokes CheckCoverage (unlike the account-metadata / ledger-metadata / prepared-query deletes, which do go through gatedAccessor.Delete). So the invariant-#9 Delete guarantee this PR advertises is not actually enforced for these two attribute kinds — an admission bug omitting the declaration would silently delete rather than surface *ErrCoverageMiss.
Under the new lazy Del, the declaration is no longer needed for correctness anyway. Fix: either route the Absorb-path deletes through the gated scope so the guarantee is uniform, or correct these two comments to state the gate is not consulted on the Absorb→Merge path and the declaration is retained only for plan parity/observability.
| return a.Accessor.Get(key) | ||
| } | ||
|
|
||
| func (a *gatedAccessor[K, V, R]) Delete(key K) error { |
There was a problem hiding this comment.
[Medium] The headline coverage-gated Delete path has no test
This PR's marquee change gates deletions: gatedAccessor.Delete now calls CheckCoverage before delegating. But no test asserts that deleting an undeclared key returns *ErrCoverageMiss. The existing coverage-miss tests (scope_test.go, scope_unit_test.go) only exercise the Get / TechnicalUpdate paths; write_set_test.go deletions all use declared keys. Delete shares the CheckCoverage code with Get, so the regression risk is bounded — but a change that drops the CheckCoverage call specifically from Delete would pass CI silently.
Fix: add a scope-level test that builds a scope whose coverage_bits omit a key, calls a gated ...Delete(key), and asserts require.ErrorAs(err, &ErrCoverageMiss{}).
| |---|---|---| | ||
| | 0 | `CacheGuaranteed` / `CacheNeedsTouch` / `CacheMiss` | `Declare` / `Touch` / `Preload` | | ||
| | 1 | `CacheGuaranteed` / `CacheMiss` | `Declare` / `Preload` | | ||
| | 0 | `CacheHit` / `CacheMiss` | `Declare` / `Value` | |
There was a problem hiding this comment.
[Medium] Doc drift: this table and its surrounding prose still name removed proto types, plus a dangling test reference
The "Plan emitted" column here lists Declare / Value, but this PR deleted the Declare message and the AttributePlan_Value oneof — the emitted entity is now AttributeCoverage (coverage-only when value is nil, seed when set). Two more stale spots in this same file (outside the diff hunk, so noted here):
- Line ~11 — the prose still says admission decides "whether to emit a
Declare, aTouch, or aPreload";Touchwas removed entirely in this PR. - Line ~95 — the Tests section references
TestAttributeCache_IsGuaranteedInCache_TwoGenerationsAhead, which this PR renamed toTestAttributeCache_CheckCache_TwoGenerationsAhead(cache_test.go:255); the link now dangles.
Fix: align this doc with plan-intent-verification.md's "coverage-only / seed" model, drop the Touch prose, and update the renamed test reference.
| - Verification: `internal/infra/state/machine.go` `Preload` — one dispatch per plan entry (skip when `value` is nil) | ||
| - Cache primitives: `internal/infra/cache/cache.go` — `AttributeCache.Get` (gen0→gen1 fallback), `AttributeCache.Del` (in-place tombstone + lazy Gen0 fabrication), `CheckCache` (returns `CacheUnreachable` for 2+ rotation prediction) | ||
| - Admission guard: `internal/infra/plan/planerr/errors.go` `ErrCacheHorizonExceeded` — the admission rejection sentinel | ||
| - Regression harness: `tests/antithesis/run_metadata_churn.sh` + `tests/antithesis/workload/bin/cmds/main/parallel_driver_metadata_churn/` |
There was a problem hiding this comment.
[Medium] Doc: cited regression harness does not exist
This line points readers to tests/antithesis/run_metadata_churn.sh and tests/antithesis/workload/bin/cmds/main/parallel_driver_metadata_churn/. Neither exists at this SHA (only tests/antithesis/run_model_test.sh is present; there is no *metadata_churn* script or driver). A reader following the cross-reference to reproduce the regression finds nothing.
Fix: add the harness, or point the reference at a driver that actually exercises this delete-after-rotation path (e.g. parallel_driver_concurrent_ledger_delete), or drop the bullet.
50bf6b8 to
2aded45
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 6 new inline findings.
Summary: #1455 (comment)
| // fabrication handle reads and deletes without a preemptive promote pass. | ||
| // attr_code (dal.SubAttrXxx) lives on the envelope so coverage indexing, | ||
| // dispatch, and unmarshal routing all read the same field. | ||
| message AttributeCoverage { |
There was a problem hiding this comment.
🔴 [blocker] Protobuf field number reuse in AttributePlan breaks wire compatibility
The previous AttributePlan encoding used field 3 for declare, field 4 for touch, and field 5 for value. The new schema reuses field 3 for AttributeCoverage.value. During a rolling upgrade or when replaying an old Raft log entry, old declare entries decode as non-nil empty seeds under the new schema, old value entries on field 5 are silently ignored, and old nodes reject new coverage-only entries because no oneof intent is set. This can cause mixed-version clusters to seed zero values, skip required preloads, or reject valid entries.
Suggestion: Reserve the previously used field numbers (3, 4, 5) as reserved in the proto file and assign new, unused field numbers to the new AttributeCoverage fields. Alternatively, provide a migration path and version-gate the new encoding so that in-flight and replayed proposals are not corrupted during the upgrade window.
Replaces the systematic MirrorTouch-at-Preload pass with two in-place
primitives on AttributeCache:
* Get: gen0 → gen1 fallback. Safe under the coverage_bits gate
(invariant #9) — the fallback can only surface keys the proposer
explicitly declared.
* Del: tombstones in Gen0 when Gen0 hits; otherwise lazy-fabricates
a Gen0 tombstone borrowing Gen1's tag. Gen1's live row stays
untouched (shadowed by the Gen0 tombstone, purged on the next
rotation). writeCacheTombstone still writes a single row to the
Gen0 byte — memory equals disk (invariant #1).
Preload skips coverage-only AttributeCoverage entries entirely (only
seed intents call MirrorPreload). The MirrorTouch method / plumbing
and the AttributeCache.Touch / Cache.TouchByType helpers are deleted
as dead code. The CacheHit verdict collapses "already in Gen0" and
"Gen1-only" into a single admission signal; the FSM's read horizon is
still bounded to admission's declared preload set by coverage_bits.
CacheUnreachable (rejects proposals with ≥2 predicted rotations)
keeps the propose→apply race window bounded to at most one rotation,
which is what the fallback + lazy fabrication rely on.
Doc: docs/technical/architecture/core/plan-intent-verification.md
Regression: internal/infra/cache/keystore_delete_test.go +
internal/infra/state/cache_snapshotter_test.go
(TestCacheSnapshotter_EN1242_DeleteAfterRotationCrashRestart)
2aded45 to
8dc6476
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 4 new inline findings.
Summary: #1455 (comment)
| log, err := query.FindTransactionCreationLog(context.Background(), a.store, a.attrs.Transaction, ledgerName, transactionID) | ||
| state, err := a.attrs.Transaction.Get(a.store, canonical) | ||
| if err != nil { | ||
| if errors.Is(err, domain.ErrNotFound) { |
There was a problem hiding this comment.
🔴 [blocker] Missing or empty creation-log postings on revert path produces zero-posting reversal, corrupting transaction state
reported by NumaryBot, codex
Two related failure modes exist: (1) When FindTransactionCreationLog returns ErrNotFound for a transaction whose log has been purged or archived, the code proceeds with OriginalPostings == nil, allowing the FSM to create a zero-posting reversal instead of rejecting. (2) When reverting a pre-upgrade transaction whose stored TransactionState has an empty Postings field (the field did not exist before this change), the same nil/empty postings path is followed. Both cases either corrupt transaction state by allowing a revert to succeed with no postings, or break all pre-upgrade reverts.
Suggestion: Do not convert a missing or empty creation log into a successful empty revert. If FindTransactionCreationLog returns ErrNotFound for a known-existing transaction, return an error to the caller. When state.GetPostings() is empty (pre-upgrade data), fall back to the previous log-based lookup rather than proceeding with nil postings.
| // creation). Required for at_effective_date reverts to be deterministic | ||
| // without re-reading the original log from Pebble. | ||
| Timestamp timestamp = 4; | ||
| // Postings captured at creation. Populated by processCreateTransaction (and |
There was a problem hiding this comment.
🔴 [blocker] New TransactionState.postings field added without migration, making pre-upgrade transactions unrevertable
Adding TransactionState.postings makes reverts depend on a field that existing Pebble rows do not contain. processRevertTransaction now returns ErrTransactionStateInconsistent when both the revert order and stored state have no postings. After upgrading an existing ledger, any pre-upgrade transaction without a receipt becomes permanently unrevertable, and checker replay will also fail for persisted projections that lack the new field.
Suggestion: Add a schema bump with a backfill migration, or introduce a compatibility fallback so that a missing postings field is tolerated for rows written before this change.
| return nil, nil, fmt.Errorf("unmarshaling create postings: %w", err) | ||
| } | ||
|
|
||
| state.Postings = container.GetPostings() |
There was a problem hiding this comment.
🔴 [blocker] Replay checker reports false-positive mismatches on legacy TransactionState rows lacking the postings field
After upgrading, replay reconstructs TransactionState.Postings from the audit log, but live rows written by older binaries have an empty Postings slice. The proto.Equal-based compareTransactions then systematically reports mismatches for every historical non-archived transaction, causing health checks to fail on any in-place-upgraded deployment even though the store was valid before the upgrade.
Suggestion: Either backfill existing TransactionState rows with postings before running the checker, or add an explicit compatibility check in compareTransactions that treats nil/empty live postings as a legacy shape and skips the postings comparison for rows predating the schema change.
| // ingest paths that apply Delete logs without a prior existence | ||
| // check; handlers that do a Get-first business check surface | ||
| // their own domain error before Delete is queued here. | ||
| if errors.Is(err, domain.ErrNotFound) { |
There was a problem hiding this comment.
🟠 [major] Gen1-only deletes silently dropped via continue in Merge, leaving stale cache state
When an unconditional delete path (e.g. DropIndex, RemoveEventsSink, or DeleteLedger boundary cleanup) reaches Merge for a key present only in Gen1 because the preload/touch contract was missed, writer.Delete returns ErrNotFound before AttributeCache.Del can report the invariant violation. The subsequent continue silently drops the delete and leaves the live Gen1 row available for a later MirrorTouch promotion, defeating the strict-delete guard this change relies on.
Suggestion: Do not silently swallow ErrNotFound from writer.Delete in Merge when the intent is a strict delete. Surface the error (or route through the GetAny/includeGen1 path) so that the invariant violation is loudly reported rather than skipped.
Problem
On
release/v3.0, aDeleteon a key that has migrated to Gen1 via cache rotation between propose and apply leaves cache and disk out of sync — a violation of invariant #1 (cache is the source of authority for a given applied index):Between T3 and the next rotation, a follower that has restored from store sees the tombstone while a node still holding the live Gen1 row surfaces the value — a live-vs-tombstone split on a hash-chained order.
Solution — lazy Del promote + writeCacheTombstone at gen0 byte only
Two in-place primitives on
AttributeCachemaintain cache-mirror parity without any preemptive promote pass:Deltombstones in Gen0 when Gen0 hits; otherwise lazy-fabricates a Gen0 tombstone borrowing Gen1's tag. Gen1's live row stays untouched (shadowed by the Gen0 tombstone on every read, purged on the next rotation).writeCacheTombstonewrites a single row to the Gen0 byte — mem and disk stay byte-equivalent (invariant Fix typo #1).Getunchanged — the Gen0 → Gen1 fallback that release/v3.0 already had, still safe under the coverage_bits gate (invariant add http addr flag #9).Preload now skips coverage-only
AttributeCoverageentries entirely — only seed intents (valueset) callMirrorPreload. TheMirrorTouchmechanism and its Preload dispatch are removed as dead code (also removesAttributeCache.Touch,Cache.TouchByType,CacheSnapshotter.MirrorTouch).Two out-of-band guardrails keep the model honest:
CacheUnreachablerejects any proposal predicting 2+ rotations between propose and apply — bounds the race to at most one rotation, so anything admission observed anywhere in cache is still reachable via Gen1 when apply runs.Deletein this PR (was:Getonly) — binds every FSM cache read and delete to admission's declared preload set. The Gen1 fallback inGetcan't widen the FSM read horizon beyond what admission declared.Applied to T3:
Del(k)sees Gen0 miss, Gen1 hit → fabricates a Gen0 tombstone borrowing Gen1's tag;writeCacheTombstonewrites a single row to the Gen0 byte. Mem Gen0 = tombstone = Disk Gen0 byte; Gen1 stays live in both mem and disk (purged together on next rotation).Delta vs
release/v3.0AttributeCache.Del: was — tombstones both gens if present; now — Gen0 in-place tombstone or lazy fabrication from Gen1's tag; returnsErrNotFoundwhen neither gen has the key (handled as a soft no-op byDerivedKeyStore.Merge, legitimate for mirror-ingest delete-on-absent).KeyStore.Delete: was —s.M.Put(entry{Deleted=true}); now —s.M.Del(id).writeCacheTombstone: was — writes both gen bytes; now — writes gen0 byte only.AttributeCache.Get: unchanged (gen0 → gen1 fallback).gatedAccessor.Delete— Delete now goes throughCheckCoveragejust like Get. Undeclared Delete surfaces*ErrCoverageMiss.MirrorTouchand friends: removed (dead code — the systematic touch pass this earlier iteration introduced is no longer needed under the lazy model, so is the pre-existingCache.Touchdispatch).AttributeCoverageproto:oneof intent {Declare, Touch, Value}collapsed to a single optionalvaluefield.valueset → seed viaMirrorPreload;valuenil → coverage-only (no cache mutation).TransactionState.postings: new proto field, populated byprocessCreateTransaction/processMirrorCreatedTransaction/ revert. Admission'sgetTransactionPostingsnow reads it via a directattrs.Transaction.Getpoint read instead of scanning the log throughFindTransactionCreationLog— faster and simpler. Pre-EN-1242 TransactionStates (without the field) will read back with empty Postings, so revert of a legacy transaction now returnsErrTransactionNotFoundvia the audit chain (invariant feat(CI): Add GoReleaser #8) — intentional compat break.CacheHitcollapses "already in Gen0" and "Gen1-only" — admission no longer needs to distinguish, and the intermediateCacheNeedsTouchverdict is gone.Regression tests
internal/infra/cache/keystore_delete_test.go— three unit tests onAttributeCache.Del: Gen0-only tombstoning, lazy Gen1→Gen0 fabrication, absent-in-both →ErrNotFound.internal/infra/state/cache_snapshotter_test.go::TestCacheSnapshotter_EN1242_DeleteAfterRotationCrashRestart— end-to-end delete-after-rotation cycle against real Pebble, including crash + restore.Docs
docs/technical/architecture/core/plan-intent-verification.md— rewritten for the lazy model + accurate pre-fix race trace.docs/technical/architecture/subsystems/fsm/preload.md+cache-layers.md— updated diagrams andCacheHitdescription.Test plan
just pre-commitcleango test ./... -short -timeout 600s)