Skip to content

fix(EN-1242): lazy gen0 tombstone + coverage-gate Delete#1455

Merged
gfyrag merged 1 commit into
release/v3.0from
fix/EN-1242-tombstone-gen0-only
Jul 3, 2026
Merged

fix(EN-1242): lazy gen0 tombstone + coverage-gate Delete#1455
gfyrag merged 1 commit into
release/v3.0from
fix/EN-1242-tombstone-gen0-only

Conversation

@gfyrag

@gfyrag gfyrag commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Problem

On release/v3.0, a Delete on a key that has migrated to Gen1 via cache rotation between propose and apply leaves cache and disk out of sync — a violation of invariant #1 (cache is the source of authority for a given applied index):

T0  Admission: CheckCache(k) → CacheMiss → resolver emits a coverage-only entry.
T1  Concurrent Save(k, v) applies                → Gen0[k] = v.
T2  Cache rotation                               → Gen1[k] = v, Gen0[k] = ∅.
T3  Delete(k) applies:
      KeyStore.Delete → s.M.Get(id) hits Gen1 via fallback → passes tag check
                     → s.M.Put(id, entry{Deleted=true}) writes to Gen0 only.
      Mem: Gen0 = fabricated tombstone, Gen1 = still live.
      Disk (via writeCacheTombstone): tombstone at BOTH gen bytes.
      → Mem Gen1 live vs Disk Gen1 byte tombstoned. Cache-mirror invariant
        broken until the next rotation / restart re-hydrates Gen1.

Between T3 and the next rotation, a follower that has restored from store sees the tombstone while a node still holding the live Gen1 row surfaces the value — a live-vs-tombstone split on a hash-chained order.

Solution — lazy Del promote + writeCacheTombstone at gen0 byte only

Two in-place primitives on AttributeCache maintain cache-mirror parity without any preemptive promote pass:

  • Del tombstones in Gen0 when Gen0 hits; otherwise lazy-fabricates a Gen0 tombstone borrowing Gen1's tag. Gen1's live row stays untouched (shadowed by the Gen0 tombstone on every read, purged on the next rotation). writeCacheTombstone writes a single row to the Gen0 byte — mem and disk stay byte-equivalent (invariant Fix typo #1).
  • Get unchanged — the Gen0 → Gen1 fallback that release/v3.0 already had, still safe under the coverage_bits gate (invariant add http addr flag #9).

Preload now skips coverage-only AttributeCoverage entries entirely — only seed intents (value set) call MirrorPreload. The MirrorTouch mechanism and its Preload dispatch are removed as dead code (also removes AttributeCache.Touch, Cache.TouchByType, CacheSnapshotter.MirrorTouch).

Two out-of-band guardrails keep the model honest:

  1. Admission-side CacheUnreachable rejects any proposal predicting 2+ rotations between propose and apply — bounds the race to at most one rotation, so anything admission observed anywhere in cache is still reachable via Gen1 when apply runs.
  2. Coverage gate (invariant add http addr flag #9) — extended to Delete in this PR (was: Get only) — binds every FSM cache read and delete to admission's declared preload set. The Gen1 fallback in Get can't widen the FSM read horizon beyond what admission declared.

Applied to T3: Del(k) sees Gen0 miss, Gen1 hit → fabricates a Gen0 tombstone borrowing Gen1's tag; writeCacheTombstone writes a single row to the Gen0 byte. Mem Gen0 = tombstone = Disk Gen0 byte; Gen1 stays live in both mem and disk (purged together on next rotation).

Delta vs release/v3.0

  • AttributeCache.Del: was — tombstones both gens if present; now — Gen0 in-place tombstone or lazy fabrication from Gen1's tag; returns ErrNotFound when neither gen has the key (handled as a soft no-op by DerivedKeyStore.Merge, legitimate for mirror-ingest delete-on-absent).
  • KeyStore.Delete: was — s.M.Put(entry{Deleted=true}); now — s.M.Del(id).
  • writeCacheTombstone: was — writes both gen bytes; now — writes gen0 byte only.
  • AttributeCache.Get: unchanged (gen0 → gen1 fallback).
  • Coverage gate: extended to gatedAccessor.Delete — Delete now goes through CheckCoverage just like Get. Undeclared Delete surfaces *ErrCoverageMiss.
  • MirrorTouch and friends: removed (dead code — the systematic touch pass this earlier iteration introduced is no longer needed under the lazy model, so is the pre-existing Cache.Touch dispatch).
  • AttributeCoverage proto: oneof intent {Declare, Touch, Value} collapsed to a single optional value field. value set → seed via MirrorPreload; value nil → coverage-only (no cache mutation).
  • TransactionState.postings: new proto field, populated by processCreateTransaction / processMirrorCreatedTransaction / revert. Admission's getTransactionPostings now reads it via a direct attrs.Transaction.Get point read instead of scanning the log through FindTransactionCreationLog — faster and simpler. Pre-EN-1242 TransactionStates (without the field) will read back with empty Postings, so revert of a legacy transaction now returns ErrTransactionNotFound via the audit chain (invariant feat(CI): Add GoReleaser #8) — intentional compat break.
  • CacheHit collapses "already in Gen0" and "Gen1-only" — admission no longer needs to distinguish, and the intermediate CacheNeedsTouch verdict is gone.

Regression tests

  • internal/infra/cache/keystore_delete_test.go — three unit tests on AttributeCache.Del: Gen0-only tombstoning, lazy Gen1→Gen0 fabrication, absent-in-both → ErrNotFound.
  • internal/infra/state/cache_snapshotter_test.go::TestCacheSnapshotter_EN1242_DeleteAfterRotationCrashRestart — end-to-end delete-after-rotation cycle against real Pebble, including crash + restore.

Docs

  • docs/technical/architecture/core/plan-intent-verification.md — rewritten for the lazy model + accurate pre-fix race trace.
  • docs/technical/architecture/subsystems/fsm/preload.md + cache-layers.md — updated diagrams and CacheHit description.

Test plan

  • just pre-commit clean
  • Full unit test suite passes (go test ./... -short -timeout 600s)
  • CI green (Tests / E2E / Scenarios / Model / Build / Schemathesis / Coverage)
  • Antithesis singleton_driver_model regression clean

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)
  • main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 89c65057-ebd1-4c3a-bba9-5834f5808e7e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/EN-1242-tombstone-gen0-only

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@NumaryBot

NumaryBot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

🛑 Changes requested — automated review

This PR introduces a lazy gen0 tombstone mechanism, a MirrorTouch-based Delete primitive, and a coverage gate for Delete operations. Five confirmed issues — four blockers and one major — prevent safe merging.

The most critical issue is protobuf field number reuse in AttributePlan: previously used field numbers (3, 4, 5) are reassigned to new AttributeCoverage fields, which will corrupt wire encoding during rolling upgrades and cause incorrect Raft log replay for any stored entries from the previous schema.

On the revert path, two related failure modes allow zero-posting reversals to be committed: when FindTransactionCreationLog returns ErrNotFound for a purged log the code proceeds with nil OriginalPostings, and when reverting a pre-upgrade transaction whose TransactionState predates the new postings field the same nil path is followed. Both cases can corrupt transaction state or permanently break reverts for all pre-upgrade transactions.

The new TransactionState.postings field is introduced without a schema migration. Existing Pebble rows from pre-upgrade deployments will lack this field, making those transactions permanently unrevertable after an in-place upgrade.

The replay checker will also produce systematic false-positive mismatches on any upgraded store: live TransactionState rows written before this change have empty Postings, while the replayer now reconstructs non-empty postings from the audit log, causing proto.Equal comparisons to fail for every historical transaction.

Finally, in key_store.go Merge, a continue statement silently swallows ErrNotFound from writer.Delete for keys that exist only in Gen1, dropping the delete without surfacing the invariant violation and leaving stale Gen1 rows eligible for MirrorTouch promotion.

@gfyrag gfyrag marked this pull request as draft June 30, 2026 11:14
@codecov

codecov Bot commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 75.84541% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.02%. Comparing base (bce53d4) to head (8dc6476).

Files with missing lines Patch % Lines
internal/application/check/replay_store.go 71.42% 4 Missing and 6 partials ⚠️
.../domain/processing/processor_revert_transaction.go 50.00% 3 Missing and 3 partials ⚠️
internal/infra/plan/coverage.go 84.21% 3 Missing and 3 partials ⚠️
internal/infra/state/cache_incremental.go 25.00% 1 Missing and 2 partials ⚠️
internal/application/admission/admission.go 88.88% 2 Missing ⚠️
internal/domain/processing/processor_index.go 0.00% 1 Missing and 1 partial ⚠️
...nal/domain/processing/processor_ledger_metadata.go 0.00% 1 Missing and 1 partial ⚠️
internal/domain/processing/processor_metadata.go 0.00% 1 Missing and 1 partial ⚠️
...nal/domain/processing/processor_metadata_schema.go 0.00% 2 Missing ⚠️
internal/domain/processing/processor_mirror.go 50.00% 2 Missing ⚠️
... and 8 more

❌ Your patch check has failed because the patch coverage (75.84%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@               Coverage Diff                @@
##           release/v3.0    #1455      +/-   ##
================================================
+ Coverage         73.60%   74.02%   +0.41%     
================================================
  Files               395      395              
  Lines             39917    39833      -84     
================================================
+ Hits              29382    29486     +104     
- Misses             7686     7771      +85     
+ Partials           2849     2576     -273     
Flag Coverage Δ
e2e 74.02% <75.84%> (+0.41%) ⬆️
scenario 74.02% <75.84%> (+0.41%) ⬆️
unit 74.02% <75.84%> (+0.41%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from f8a103a to 0a0f111 Compare June 30, 2026 12:17
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 0a0f111 to 666fe9a Compare July 1, 2026 13:55
@gfyrag gfyrag changed the title fix(EN-1242): write metadata-delete tombstone to gen0 only fix(EN-1242): systematic plan-intent verification (retry-safe strict Del) Jul 1, 2026
@gfyrag gfyrag marked this pull request as ready for review July 2, 2026 05:29
@gfyrag gfyrag changed the title fix(EN-1242): systematic plan-intent verification (retry-safe strict Del) fix(EN-1242): unified coverage model — MirrorTouch subsumes assertions Jul 2, 2026

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1455 (comment)

Comment thread internal/infra/cache/cache.go
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 0b6abb0 to ba599e6 Compare July 2, 2026 10:24
@gfyrag gfyrag changed the base branch from release/v3.0 to refactor/EN-1242-generic-needs July 2, 2026 10:25
@gfyrag gfyrag changed the title fix(EN-1242): unified coverage model — MirrorTouch subsumes assertions fix(EN-1242): unified coverage model — MirrorTouch as promote primitive Jul 2, 2026

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1455 (comment)

Comment thread internal/infra/cache/cache.go
@gfyrag gfyrag force-pushed the refactor/EN-1242-generic-needs branch from a122827 to 7226eb9 Compare July 2, 2026 10:44
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch 2 times, most recently from 93d2d5e to ece09cd Compare July 2, 2026 10:52

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1455 (comment)

Comment thread internal/infra/cache/cache.go Outdated

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1455 (comment)

Comment thread internal/infra/cache/cache.go
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from ece09cd to f2fe7b2 Compare July 2, 2026 11:04

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1455 (comment)

Comment thread internal/infra/cache/cache.go Outdated
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from f2fe7b2 to bee3ed7 Compare July 2, 2026 11:27

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 2 new inline findings.

Summary: #1455 (comment)

Comment thread internal/infra/cache/cache.go Outdated
Comment thread internal/infra/attributes/key_store.go
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from bee3ed7 to 08c2534 Compare July 2, 2026 11:51
@gfyrag gfyrag changed the title fix(EN-1242): unified coverage model — MirrorTouch as promote primitive fix(EN-1242): coverage-gate Delete + unified MirrorTouch promote Jul 2, 2026
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from dc069d0 to 7eaa558 Compare July 2, 2026 13:15

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 6 new inline findings.

Summary: #1455 (comment)

Comment thread internal/infra/cache/cache.go
Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go
Comment thread internal/infra/cache/cache.go Outdated
Comment thread internal/infra/attributes/key_store.go
Comment thread internal/infra/state/accessor.go
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 7eaa558 to 566f7e7 Compare July 2, 2026 14:19

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 3 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go Outdated
Comment thread misc/proto/common.proto
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 566f7e7 to 3c1e89c Compare July 2, 2026 14:51

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 2 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 3c1e89c to 2d8527b Compare July 2, 2026 15:54
@gfyrag gfyrag changed the title fix(EN-1242): coverage-gate Delete + unified MirrorTouch promote fix(EN-1242): lazy Del promote + gen0→gen1 read fallback Jul 2, 2026

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 2 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 2d8527b to c9ae0c9 Compare July 2, 2026 23:32

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 3 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go
Comment thread internal/infra/attributes/key_store.go
@gfyrag gfyrag changed the title fix(EN-1242): lazy Del promote + gen0→gen1 read fallback fix(EN-1242): lazy gen0 tombstone + coverage-gate Delete Jul 3, 2026
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from c9ae0c9 to 0dcfb10 Compare July 3, 2026 07:54

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 4 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go
Comment thread internal/application/check/replay_store.go
Comment thread internal/infra/attributes/key_store.go
@gfyrag gfyrag force-pushed the refactor/EN-1242-generic-needs branch from 7226eb9 to 5bb9f23 Compare July 3, 2026 08:11
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 0dcfb10 to 6098058 Compare July 3, 2026 08:11

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 4 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go
Comment thread internal/application/check/replay_store.go
Comment thread internal/infra/attributes/key_store.go
Base automatically changed from refactor/EN-1242-generic-needs to release/v3.0 July 3, 2026 08:21
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 6098058 to 50bf6b8 Compare July 3, 2026 08:40

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 4 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go
Comment thread internal/application/check/replay_store.go
Comment thread internal/infra/attributes/key_store.go

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 4 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
Comment thread internal/application/admission/admission.go
Comment thread internal/application/check/replay_store.go
Comment thread internal/infra/attributes/key_store.go

@paul-nicolas paul-nicolas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-model review — PR #1455 (Claude + Codex)

Two reviewers (Claude, split across 4 subsystem passes, and Codex/gpt-5.4) reviewed the diff independently; findings were reconciled and each was adversarially verified against the code pinned at the PR head SHA before posting.

Verdict on the core mechanic: the lazy Gen0-tombstone + gen0-byte-only writeCacheTombstone + gen0→gen1 read fallback is sound. Mem/disk parity was traced through all three Del branches (Gen0-hit, Gen1-only fabrication, absent-in-both), tag-borrowing was proven correct (borrowed gen1Entry.Tag == deletion.Tag via KeyStore.Delete's pre-check), and the "at most one rotation" bound + coverage-gate horizon hold. Invariants #1/#2/#3/#5/#9 are respected on the cache path. The concern cluster is entirely around the TransactionState.Postings projection / upgrade compatibility and the collapsed AttributeCoverage proto, plus doc/test drift.

Counts: 2 Major, 6 Medium posted inline; 8 Minor/Nit listed below (not posted). 1 finding rejected in verification.

Notable reviewer conflicts, resolved:

  • Proto wire-compat — Codex rated critical (old AttributePlan entries in the Raft/WAL misparse across an upgrade); Claude rated medium (missing reserved is a version-independent hazard, tempered by pre-GA alpha). Reconciled to Major: field 3 is genuinely repurposed (DeclareAttributeValue) and fields 4/5 dropped without reserved, against this file's own universal convention — a real, actionable hazard, but cross-version Raft replay wasn't proven in practice, so short of critical.
  • Revert trusts TransactionState.Postings (tampering) — Codex rated major. Rejected in verification: checker.compareTransactions compares the full state (incl. Postings) via proto.Equal, so the projection IS checker-verified (invariant #8 satisfied), and the pre-PR code also read a projection (the log). No new weakness introduced.

Root-cause note: findings [Major] #2, [Medium] #3 and [Medium] legacy-revert all share one fix — a CurrentStorageSchemaVersion bump (internal/bootstrap/config_validation.go, currently 1, untouched by this PR). That converts a silent in-place-upgrade data break into a non-bypassable, enforced clean-cluster requirement.


Minor / Nit (not posted inline)

  • internal/infra/attributes/key_store.go:298 (minor) — silent continue on ErrNotFound skips the durable 0xF1 delete; relies entirely on invariant #6 with no loud signal. Consider a debug-mode sentinel/WARN so a future preload regression surfaces loudly instead of as latent store drift.
  • internal/infra/plan/coverage.go:101 (minor) — Coverage.Merge writes to a possibly-nil IdempotencyKeys map on a zero-value receiver (no live panic today; all callers use NewCoverage()). Lazily init the map like set() does.
  • internal/infra/cache/cache_test.go / cache_snapshotter_test.go (minor) — TestCacheSnapshotter_EN1242_DeleteAfterRotationCrashRestart hand-assembles writeCacheRaw/Rotation/Tombstone instead of driving machine.Preload + WriteSet.Merge; a future Merge→writeCacheTombstone rewiring wouldn't be caught.
  • internal/application/admission/admission_test.go (minor) — the compat break is only tested for the postings-present shape; no case exercises an existing TxState with empty Postings → empty OriginalPostingsErrTransactionNotFound.
  • internal/pkg/kv/kv.go:13 (nit) — KV.Del doc says "removes K" but the production impl tombstones and Get still surfaces the key; clarify the interface contract.
  • internal/infra/cache/cache.go:133 (nit) — AttributeCache.Del's own ErrNotFound return is unreachable via its only production caller (KeyStore.Delete returns first); the doc comment attributes the soft-no-op imprecisely.
  • internal/infra/plan/builder_test.go:309 (nit) — assertion messages still say "must emit Declare" after Declare was removed; reword to "coverage-only entry (value nil)".
  • internal/infra/state/scope.go:458 (nit) — lsbIndex hand-rolls an 8-iteration loop; math/bits.TrailingZeros8(b) is clearer and branch-free.

Comment thread misc/proto/raft_cmd.proto
AttributeValue value = 5;
}
// Optional seed: nil = coverage-only; non-nil = seed into the FSM cache.
AttributeValue value = 3;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] AttributeCoverage repurposes field number 3 and drops fields 4/5 without reserved

The old AttributePlan used a oneof at field 3 (Declare), field 4 (Touch), field 5 (AttributeValue value). The new AttributeCoverage puts AttributeValue value at field 3 and removes 4/5 with no reserved clause. Two problems:

  1. Field-3 semantic reuse. On the wire, old Declare{} (empty length-delimited message at field 3) decodes cleanly as a new AttributeValue{} with empty raw_value → a non-nil value (a seed) where the proposer meant coverage-only. Old Touch (field 4) and old field-5 value are silently dropped as unknown fields. If any old-format ExecutionPlan.attributes entry is ever replayed from a Raft log / WAL by an upgraded binary, it misapplies (coverage-only read at apply → cache miss → node divergence).
  2. Missing reserved. Every other message in this file reserves removed field numbers/names (see ExecutionPlan reserved 2,3,5,8, TechnicalUpdate reserved 5,6,8..12, etc.). Omitting it here is a future field-reuse footgun.

Fix: give value a fresh, never-used field number instead of reusing 3, and add reserved 3, 4, 5; + reserved "declare", "touch"; inside AttributeCoverage. If old-format entries can survive an upgrade, gate activation behind a storage-schema-version bump that blocks mixed-version replay. (Reviewers split critical/medium on the replay risk; the missing reserved + field reuse is uncontroversial and should be fixed before merge regardless.)

txCanonical := domain.TransactionKey{LedgerName: ledger, ID: tx.GetId()}.Bytes()

if err := w.CreateTransaction(txCanonical, seq, tx.GetTimestamp(), tx.GetMetadata()); err != nil {
if err := w.CreateTransaction(txCanonical, seq, tx.GetTimestamp(), tx.GetMetadata(), tx.GetPostings()); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] No storage-schema-version bump: an in-place upgrade with pre-EN-1242 data silently breaks the checker and can diverge mixed-version nodes

This line now feeds tx.GetPostings() into the checker's replay-derived expected TransactionState. Combined with processCreateTransaction now writing Postings into the live TxState (internal/domain/processing/processor_transaction.go), two upgrade hazards appear, and CurrentStorageSchemaVersion (internal/bootstrap/config_validation.go, = 1) is not bumped by this PR, so an in-place upgrade over existing data is silently permitted:

  1. Checker mass false-positive. A transaction created by a pre-EN-1242 binary persists a TxState with empty Postings. On Check(), compareTransactions (checker.go:1435) replays the audit's CreatedTransaction log — which always carried postings — so expected.Postings is non-empty while the legacy live state's is empty, and proto.Equal fails, emitting CHECK_STORE_ERROR_TYPE_TRANSACTION_UPDATE_MISMATCH for every historical transaction — a healthy cluster self-reporting corruption.
  2. Rolling-upgrade cache divergence (inv Fix typo #1/update spec #2). During a mixed-version rollout, a new node writes Postings for a command at applied index N and an old node does not, so the in-memory cache and the deterministic snapshot marshal diverge between nodes at the same index.

Fix: bump CurrentStorageSchemaVersion so an in-place upgrade with old data is hard-blocked (schema mismatches are non-bypassable per CLAUDE.md), converting a silent break into an enforced clean-cluster/coordinated upgrade. Add a checker test feeding a legacy (empty-Postings) live state against a replay-derived state to lock the intended behavior.

// tx was not visible to admission at propose time — a business
// rejection that must appear in the audit chain (invariant #8).
originalPostings := order.GetOriginalPostings()
if len(originalPostings) == 0 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] Legacy transaction is unrevertable and misreported as ErrTransactionNotFound

For a transaction created pre-EN-1242 (exists, txID < NextTransactionId, not reverted) whose persisted TransactionState.Postings is empty, admission's getTransactionPostings point-read (admission.go:1897) returns empty postings, so this guard returns ErrTransactionNotFound for a transaction that demonstrably exists. This is the intentional compat break, but the error is semantically wrong (the tx exists; only its postings projection is absent) and misleads clients/operators.

Empty postings can only occur for legacy data (create rejects ErrEmptyTransaction), so the storage-schema-version bump recommended on the checker finding removes this path entirely by blocking the in-place upgrade. Short of that, preserve a log-scan fallback when OriginalPostings is empty, or return a dedicated upgrade/incompatibility error so "legacy-empty" is distinguishable from "absent".


origStateReader, err := s.TransactionStates().Get(txKey)
if errors.Is(err, domain.ErrNotFound) {
return nil, &domain.ErrTransactionNotFound{TransactionID: order.GetTransactionId()}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] Invariant #7: a should-not-happen desync was downgraded from ErrTransactionStateInconsistent to a soft ErrTransactionNotFound

Reaching this branch means the len(originalPostings) == 0 guard above already passed, i.e. admission read a present TxState.Postings (or a signed receipt) and the tx-exists / not-reverted checks passed. A TransactionStates().Get miss here therefore implies a real cache/Pebble desync — "impossible by design", not a routine not-found. The pre-PR code correctly returned &domain.ErrTransactionStateInconsistent{TransactionID: ..., Operation: "revert"}; this PR replaced it with ErrTransactionNotFound, which mislabels a desync as a business rejection and hides it (invariant #7: impossible-by-design branches must fail loudly).

Fix: restore ErrTransactionStateInconsistent (or an invariant:/assert.Unreachable hard-fail) for the errors.Is(err, domain.ErrNotFound) branch here, keeping the soft not-found only for the empty-OriginalPostings guard above.

// → strict AttributeCache.Del. Preload the boundary attribute
// so Gen0 holds the entry at apply time (invariant #6).
// → AttributeCache.Del. Preload the boundary attribute so the
// coverage gate admits the delete; AttributeCache.Del itself

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] DeleteLedger / RemoveEventsSink deletes bypass the coverage gate — the comment here overstates the guarantee

This comment states the boundary preload is declared "so the coverage gate admits the delete". But DeleteLedger (Boundaries) and RemoveEventsSink (SinkConfigs, ~L1072) cascade through WriteSet.Absorb → b.Derived.Boundaries.Delete(...) / SinkConfigs.Delete(...)WriteSet.Merge → KeyStore.Delete → AttributeCache.Del. That path calls the concrete *DerivedKeyStore.Delete directly and never invokes CheckCoverage (unlike the account-metadata / ledger-metadata / prepared-query deletes, which do go through gatedAccessor.Delete). So the invariant-#9 Delete guarantee this PR advertises is not actually enforced for these two attribute kinds — an admission bug omitting the declaration would silently delete rather than surface *ErrCoverageMiss.

Under the new lazy Del, the declaration is no longer needed for correctness anyway. Fix: either route the Absorb-path deletes through the gated scope so the guarantee is uniform, or correct these two comments to state the gate is not consulted on the Absorb→Merge path and the declaration is retained only for plan parity/observability.

return a.Accessor.Get(key)
}

func (a *gatedAccessor[K, V, R]) Delete(key K) error {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] The headline coverage-gated Delete path has no test

This PR's marquee change gates deletions: gatedAccessor.Delete now calls CheckCoverage before delegating. But no test asserts that deleting an undeclared key returns *ErrCoverageMiss. The existing coverage-miss tests (scope_test.go, scope_unit_test.go) only exercise the Get / TechnicalUpdate paths; write_set_test.go deletions all use declared keys. Delete shares the CheckCoverage code with Get, so the regression risk is bounded — but a change that drops the CheckCoverage call specifically from Delete would pass CI silently.

Fix: add a scope-level test that builds a scope whose coverage_bits omit a key, calls a gated ...Delete(key), and asserts require.ErrorAs(err, &ErrCoverageMiss{}).

|---|---|---|
| 0 | `CacheGuaranteed` / `CacheNeedsTouch` / `CacheMiss` | `Declare` / `Touch` / `Preload` |
| 1 | `CacheGuaranteed` / `CacheMiss` | `Declare` / `Preload` |
| 0 | `CacheHit` / `CacheMiss` | `Declare` / `Value` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] Doc drift: this table and its surrounding prose still name removed proto types, plus a dangling test reference

The "Plan emitted" column here lists Declare / Value, but this PR deleted the Declare message and the AttributePlan_Value oneof — the emitted entity is now AttributeCoverage (coverage-only when value is nil, seed when set). Two more stale spots in this same file (outside the diff hunk, so noted here):

  • Line ~11 — the prose still says admission decides "whether to emit a Declare, a Touch, or a Preload"; Touch was removed entirely in this PR.
  • Line ~95 — the Tests section references TestAttributeCache_IsGuaranteedInCache_TwoGenerationsAhead, which this PR renamed to TestAttributeCache_CheckCache_TwoGenerationsAhead (cache_test.go:255); the link now dangles.

Fix: align this doc with plan-intent-verification.md's "coverage-only / seed" model, drop the Touch prose, and update the renamed test reference.

- Verification: `internal/infra/state/machine.go` `Preload` — one dispatch per plan entry (skip when `value` is nil)
- Cache primitives: `internal/infra/cache/cache.go` — `AttributeCache.Get` (gen0→gen1 fallback), `AttributeCache.Del` (in-place tombstone + lazy Gen0 fabrication), `CheckCache` (returns `CacheUnreachable` for 2+ rotation prediction)
- Admission guard: `internal/infra/plan/planerr/errors.go` `ErrCacheHorizonExceeded` — the admission rejection sentinel
- Regression harness: `tests/antithesis/run_metadata_churn.sh` + `tests/antithesis/workload/bin/cmds/main/parallel_driver_metadata_churn/`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] Doc: cited regression harness does not exist

This line points readers to tests/antithesis/run_metadata_churn.sh and tests/antithesis/workload/bin/cmds/main/parallel_driver_metadata_churn/. Neither exists at this SHA (only tests/antithesis/run_model_test.sh is present; there is no *metadata_churn* script or driver). A reader following the cross-reference to reproduce the regression finds nothing.

Fix: add the harness, or point the reference at a driver that actually exercises this delete-after-rotation path (e.g. parallel_driver_concurrent_ledger_delete), or drop the bullet.

@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 50bf6b8 to 2aded45 Compare July 3, 2026 09:29

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 6 new inline findings.

Summary: #1455 (comment)

Comment thread misc/proto/raft_cmd.proto
// fabrication handle reads and deletes without a preemptive promote pass.
// attr_code (dal.SubAttrXxx) lives on the envelope so coverage indexing,
// dispatch, and unmarshal routing all read the same field.
message AttributeCoverage {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 [blocker] Protobuf field number reuse in AttributePlan breaks wire compatibility

The previous AttributePlan encoding used field 3 for declare, field 4 for touch, and field 5 for value. The new schema reuses field 3 for AttributeCoverage.value. During a rolling upgrade or when replaying an old Raft log entry, old declare entries decode as non-nil empty seeds under the new schema, old value entries on field 5 are silently ignored, and old nodes reject new coverage-only entries because no oneof intent is set. This can cause mixed-version clusters to seed zero values, skip required preloads, or reject valid entries.

Suggestion: Reserve the previously used field numbers (3, 4, 5) as reserved in the proto file and assign new, unused field numbers to the new AttributeCoverage fields. Alternatively, provide a migration path and version-gate the new encoding so that in-flight and replayed proposals are not corrupted during the upgrade window.

Comment thread internal/application/admission/admission.go
Comment thread misc/proto/common.proto
Comment thread internal/application/check/replay_store.go
Comment thread internal/infra/state/accessor.go
Comment thread internal/application/admission/admission.go
Replaces the systematic MirrorTouch-at-Preload pass with two in-place
primitives on AttributeCache:

  * Get: gen0 → gen1 fallback. Safe under the coverage_bits gate
    (invariant #9) — the fallback can only surface keys the proposer
    explicitly declared.
  * Del: tombstones in Gen0 when Gen0 hits; otherwise lazy-fabricates
    a Gen0 tombstone borrowing Gen1's tag. Gen1's live row stays
    untouched (shadowed by the Gen0 tombstone, purged on the next
    rotation). writeCacheTombstone still writes a single row to the
    Gen0 byte — memory equals disk (invariant #1).

Preload skips coverage-only AttributeCoverage entries entirely (only
seed intents call MirrorPreload). The MirrorTouch method / plumbing
and the AttributeCache.Touch / Cache.TouchByType helpers are deleted
as dead code. The CacheHit verdict collapses "already in Gen0" and
"Gen1-only" into a single admission signal; the FSM's read horizon is
still bounded to admission's declared preload set by coverage_bits.
CacheUnreachable (rejects proposals with ≥2 predicted rotations)
keeps the propose→apply race window bounded to at most one rotation,
which is what the fallback + lazy fabrication rely on.

Doc: docs/technical/architecture/core/plan-intent-verification.md
Regression: internal/infra/cache/keystore_delete_test.go +
internal/infra/state/cache_snapshotter_test.go
  (TestCacheSnapshotter_EN1242_DeleteAfterRotationCrashRestart)
@gfyrag gfyrag force-pushed the fix/EN-1242-tombstone-gen0-only branch from 2aded45 to 8dc6476 Compare July 3, 2026 12:31

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 4 new inline findings.

Summary: #1455 (comment)

log, err := query.FindTransactionCreationLog(context.Background(), a.store, a.attrs.Transaction, ledgerName, transactionID)
state, err := a.attrs.Transaction.Get(a.store, canonical)
if err != nil {
if errors.Is(err, domain.ErrNotFound) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 [blocker] Missing or empty creation-log postings on revert path produces zero-posting reversal, corrupting transaction state
reported by NumaryBot, codex

Two related failure modes exist: (1) When FindTransactionCreationLog returns ErrNotFound for a transaction whose log has been purged or archived, the code proceeds with OriginalPostings == nil, allowing the FSM to create a zero-posting reversal instead of rejecting. (2) When reverting a pre-upgrade transaction whose stored TransactionState has an empty Postings field (the field did not exist before this change), the same nil/empty postings path is followed. Both cases either corrupt transaction state by allowing a revert to succeed with no postings, or break all pre-upgrade reverts.

Suggestion: Do not convert a missing or empty creation log into a successful empty revert. If FindTransactionCreationLog returns ErrNotFound for a known-existing transaction, return an error to the caller. When state.GetPostings() is empty (pre-upgrade data), fall back to the previous log-based lookup rather than proceeding with nil postings.

Comment thread misc/proto/common.proto
// creation). Required for at_effective_date reverts to be deterministic
// without re-reading the original log from Pebble.
Timestamp timestamp = 4;
// Postings captured at creation. Populated by processCreateTransaction (and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 [blocker] New TransactionState.postings field added without migration, making pre-upgrade transactions unrevertable

Adding TransactionState.postings makes reverts depend on a field that existing Pebble rows do not contain. processRevertTransaction now returns ErrTransactionStateInconsistent when both the revert order and stored state have no postings. After upgrading an existing ledger, any pre-upgrade transaction without a receipt becomes permanently unrevertable, and checker replay will also fail for persisted projections that lack the new field.

Suggestion: Add a schema bump with a backfill migration, or introduce a compatibility fallback so that a missing postings field is tolerated for rows written before this change.

return nil, nil, fmt.Errorf("unmarshaling create postings: %w", err)
}

state.Postings = container.GetPostings()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 [blocker] Replay checker reports false-positive mismatches on legacy TransactionState rows lacking the postings field

After upgrading, replay reconstructs TransactionState.Postings from the audit log, but live rows written by older binaries have an empty Postings slice. The proto.Equal-based compareTransactions then systematically reports mismatches for every historical non-archived transaction, causing health checks to fail on any in-place-upgraded deployment even though the store was valid before the upgrade.

Suggestion: Either backfill existing TransactionState rows with postings before running the checker, or add an explicit compatibility check in compareTransactions that treats nil/empty live postings as a legacy shape and skips the postings comparison for rows predating the schema change.

// ingest paths that apply Delete logs without a prior existence
// check; handlers that do a Get-first business check surface
// their own domain error before Delete is queued here.
if errors.Is(err, domain.ErrNotFound) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 [major] Gen1-only deletes silently dropped via continue in Merge, leaving stale cache state

When an unconditional delete path (e.g. DropIndex, RemoveEventsSink, or DeleteLedger boundary cleanup) reaches Merge for a key present only in Gen1 because the preload/touch contract was missed, writer.Delete returns ErrNotFound before AttributeCache.Del can report the invariant violation. The subsequent continue silently drops the delete and leaves the live Gen1 row available for a later MirrorTouch promotion, defeating the strict-delete guard this change relies on.

Suggestion: Do not silently swallow ErrNotFound from writer.Delete in Merge when the intent is a strict delete. Surface the error (or route through the GetAny/includeGen1 path) so that the invariant violation is loudly reported rather than skipped.

@gfyrag gfyrag merged commit 4d55861 into release/v3.0 Jul 3, 2026
12 of 14 checks passed
@gfyrag gfyrag deleted the fix/EN-1242-tombstone-gen0-only branch July 3, 2026 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants