Skip to content

feat(admission): reject when cache horizon exceeds 2 generations#1458

Merged
gfyrag merged 10 commits into
release/v3.0from
feat/admission-reject-stale-cache-horizon
Jul 1, 2026
Merged

feat(admission): reject when cache horizon exceeds 2 generations#1458
gfyrag merged 10 commits into
release/v3.0from
feat/admission-reject-stale-cache-horizon

Conversation

@gfyrag

@gfyrag gfyrag commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Problem

When admission's predicted apply-time generation Gen(nextIndex) is 2+
generations ahead of the FSM's current applied generation, any value loaded
as a MirrorPreload is doomed: the first rotation between propose and apply
moves it Gen0 → Gen1, the second discards it entirely. By apply time the
cache has no record of the preloaded value — the FSM read misses and the
order fails with ErrNotFound for a key admission promised would be
available. This violates invariant #1 (cache is the source of authority).

Today the code returns CacheMiss in this regime and emits a stale
MirrorPreload anyway. This is the path the EN-1242 churn investigation
surfaced.

Fix — reject at admission

Gen(at) − currentGen CheckCache result Plan emitted
0 CacheGuaranteed / CacheNeedsTouch / CacheMiss Declare / Touch / Preload
1 CacheGuaranteed / CacheMiss Declare / Preload
≥ 2 CacheUnreachable proposal rejected

When admission encounters CacheUnreachable it returns
plan.ErrCacheHorizonExceeded. The gRPC adapter maps that to
codes.Unavailable so the existing client retry interceptors handle it
transparently. The proposal never enters Raft, no audit entry is owed, and
the client retries against a fresh admission snapshot once the FSM has
caught up.

Why this is not a domain error

The rejection is not a business outcome — the user-submitted operation
might be perfectly valid. It is an operational signal that admission's
view is too far behind FSM apply for the system to safely admit any new
proposal whose preload would target keys this stale. So ErrCacheHorizonExceeded
lives in internal/infra/plan/, not in internal/domain/, and the gRPC
adapter classifies it as a transient infrastructure error.

When can it fire?

Under a correctly tuned --cache-rotation-threshold (default 1000) and
an FSM apply rate that keeps up with proposals, the ≥ 2-generation horizon
should never be reached: admission's nextIndex would have to sit at least
2 × threshold indices past the FSM's last-applied index (≥ 2000 in-flight
proposals at default tuning).

Recurring rejections are an operational signal — see the new doc
for runbook guidance.

Changes

File Change
internal/infra/cache/cache.go New CacheUnreachable constant; CheckCache default branch returns it instead of CacheMiss.
internal/infra/plan/errors.go New ErrCacheHorizonExceeded sentinel + rationale comment.
internal/infra/plan/resolve.go Switch case on CacheUnreachable short-circuits with the sentinel.
internal/adapter/grpc/server.go convertToGRPCError maps ErrCacheHorizonExceededcodes.Unavailable.
docs/technical/architecture/core/admission-cache-horizon.md New doc explaining the contract, the operational signal, and recovery.
docs/technical/architecture/core/fsm-cache-layers.md Reference the new doc from the Related section.

Tests

  • TestAttributeCache_IsGuaranteedInCache_TwoGenerationsAhead — strengthened
    to assert CacheUnreachable specifically (not the prior CacheMiss), for
    both stored and absent keys.
  • TestBuildPreloads_RejectsCacheHorizonExceeded — end-to-end resolver run
    with a tracker pinned past two generation boundaries; asserts
    ErrCacheHorizonExceeded propagates out of Builder.Build so the gRPC
    layer maps to codes.Unavailable.

Verification

When admission's predicted apply-time generation Gen(nextIndex) is 2+
generations ahead of the FSM's current applied generation, any value
loaded as a MirrorPreload would be rotated out before the order applies:
the first rotation moves Gen0 (with the preload) into Gen1, the second
discards Gen1 entirely. By apply time the cache has no record of the
preloaded value — the FSM read would miss and the order would fail with
ErrNotFound for a key admission promised to honor.

Rather than silently emit a doomed Preload, admission now rejects the
proposal at build time:

- CacheStatus gains a CacheUnreachable variant returned by CheckCache's
  default branch (Gen distance ≥ 2). Replaces the previous CacheMiss
  return which drove a stale preload through to the FSM.
- A new plan.ErrCacheHorizonExceeded sentinel propagates the rejection
  out of Builder.Build. It lives in internal/infra/plan/, not in
  internal/domain/: this is an infrastructure rejection (admission
  cannot guarantee a consistent horizon), not a user-domain business
  outcome.
- The gRPC adapter maps it to codes.Unavailable so existing client retry
  interceptors handle it transparently. The proposal never enters Raft,
  no AuditFailure is owed, the client retries against a fresh admission
  snapshot once the FSM has caught up.

Under a correctly tuned --cache-rotation-threshold and healthy FSM apply
rate, the horizon should never be exceeded; recurring rejections are an
operational signal (threshold too low, or apply rate behind).

Tests:

- cache: TestAttributeCache_IsGuaranteedInCache_TwoGenerationsAhead now
  asserts CacheUnreachable specifically (not the prior CacheMiss).
- plan: TestBuildPreloads_RejectsCacheHorizonExceeded drives the full
  resolver with a tracker pinned past two generation boundaries and
  asserts ErrCacheHorizonExceeded propagates out.

Documentation: new docs/technical/architecture/core/admission-cache-horizon.md
explains the contract, why we reject instead of preload, why it lives
outside the domain layer, and what recurring rejections mean for
operators. Referenced from fsm-cache-layers.md's Related section.
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)
  • main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1317d193-8896-4438-ba29-2c19f2dcf485

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/admission-reject-stale-cache-horizon

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@NumaryBot

NumaryBot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

🛑 Changes requested — automated review

The PR introduces cache-horizon admission rejection and wires it into gRPC and HTTP transports. Most previously reported blockers (BaseIndex alignment, post-reset persistence, threshold=0 panic, HTTP 503 mapping, unsigned-integer underflow, threshold/generation lock-ordering race) have been addressed and are confirmed resolved per the prior discussion. One major issue remains open: in the multi-slot preload builder, a CacheUnreachable observation still returns early before draining in-flight loader goroutines and before collecting tracker keys from sibling slots that completed successfully. This leaks AttributeLoader entries that ReleaseLoaders() cannot clean up, and introduces a data race between the returned result and goroutines still writing to shared state. The fix is to record the error and continue through wg.Wait() and slot collection before returning.

@gfyrag

gfyrag commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

@Azorlogh @paul-nicolas — small, well-isolated PR (admission-side reject, no domain change, no Raft impact, doc included). Quick pass appreciated when you have a moment.

@codecov

codecov Bot commented Jun 30, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 73.91304% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.59%. Comparing base (0256d30) to head (9790ea0).
⚠️ Report is 1 commits behind head on release/v3.0.

Files with missing lines Patch % Lines
internal/infra/plan/resolve.go 46.15% 6 Missing and 1 partial ⚠️
internal/infra/state/machine_technical_updates.go 68.75% 3 Missing and 2 partials ⚠️

❌ Your patch check has failed because the patch coverage (73.91%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@               Coverage Diff                @@
##           release/v3.0    #1458      +/-   ##
================================================
+ Coverage         72.44%   73.59%   +1.14%     
================================================
  Files               383      383              
  Lines             39267    39092     -175     
================================================
+ Hits              28448    28769     +321     
+ Misses             7940     7866      -74     
+ Partials           2879     2457     -422     
Flag Coverage Δ
e2e 73.59% <73.91%> (+1.14%) ⬆️
scenario 73.59% <73.91%> (+1.14%) ⬆️
unit 73.59% <73.91%> (+1.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

applyClusterConfig.ResetWithThreshold leaves currentGeneration=0 — but
the Raft log has long passed index 0. Admission's CheckCache would then
compute Gen(nextIndex, newThreshold) - 0, a large value that
erroneously trips the CacheUnreachable horizon and rejects every
incoming proposal until the next FSM apply moves currentGeneration
forward.

Set currentGeneration to Gen(raftIndex, newThreshold) right after the
reset so admission queries see a consistent (currentGeneration,
threshold) pair from the moment the config change applies.

Surfaced by the cluster e2e Rolling cluster config update test: changing
the rotation threshold causes a cache reset, and subsequent Apply RPCs
fail with "admission cache horizon exceeded" until the next FSM tick.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1458 (comment)

Comment thread internal/infra/state/machine_technical_updates.go Outdated
…reshold reset

The previous fix only set currentGeneration via SetCurrentGeneration but
left BaseIndex={0,0} after ResetWithThreshold. Proposals admitted post-
reset with the new epoch then carried a LastPersistedIndex computed
from the live BaseIndex (= the new gen0 boundary), which didn't match
the stale {0,0} the FSM still saw — triggering the preload boundary
mismatch panic on apply.

CheckRotationNeeded jumps both currentGeneration AND BaseIndex
atomically under the same cache.mu the reset just released, so
admission's first post-reset snapshot sees a consistent
(currentGeneration, threshold, BaseIndex) tuple.

CheckRotationNeeded internally calls rotateLocked, which is safe here:
the caches are already empty post-Reset, so the per-attribute rotate
(gen1 ← gen0, gen0 ← ∅) is a no-op for data; only the metadata
(currentGen, BaseIndex) actually moves.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1458 (comment)

Comment thread internal/infra/state/machine_technical_updates.go Outdated
…change

CheckRotationNeeded moves the in-memory cache to Gen(raftIndex,
newThreshold) with a matching BaseIndex, but the previous
applyClusterConfig only wrote CurrentGeneration=0 to disk and left the
per-gen BaseIndex meta absent. A restart before the next organic
rotation would then rehydrate currentGeneration=0 from Pebble, and
admission's CheckCache would immediately observe Gen(nextIndex) far
ahead of that stale 0 — falsely tripping the CacheUnreachable horizon
and rejecting valid proposals until another apply event advanced the
persisted state.

Persist the post-reset (CurrentGeneration, BaseIndex.Gen0,
BaseIndex.Gen1) tuple that CheckRotationNeeded just set, so a
RestoreFromStore reconstructs the exact same in-memory horizon.

Adds TestApplyClusterConfig_ThresholdChangePersistsGeneration to cover
the drive-through: apply the config change, assert the on-disk meta
matches the in-memory state, simulate a restart (Reset +
RestoreFromStore) and assert the reconstructed horizon does not fire
CacheUnreachable for an in-generation index.

Addresses NumaryBot's blocker "Post-reset cache generation not
persisted" on PR #1458.
@gfyrag

gfyrag commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@NumaryBot re-review please — findings addressed:

  • BaseIndex alignment: fixed in commit b76da2d by switching from SetCurrentGeneration to CheckRotationNeeded (which atomically updates BOTH currentGeneration and BaseIndex under cache.mu, exactly the alternative you suggested).
  • Post-reset persistence: fixed in commit fa039efd — we now persist CurrentGeneration, gen0 CacheGenMeta.BaseIndex, and gen1 CacheGenMeta.BaseIndex to the 0xFF zone so a restart before the next organic rotation reconstructs the same in-memory tuple.
  • Test coverage: new TestApplyClusterConfig_ThresholdChangePersistsGeneration covers the full cycle (config change → in-memory realignment → disk persistence → Reset+RestoreFromStore → CheckCache does not fire CacheUnreachable). TestBuildPreloads_RejectsCacheHorizonExceeded already covered admission rejection.

The previous shape had applyClusterConfig call three separate cache
primitives in sequence — ResetWithThreshold, then CheckRotationNeeded,
then a batch write to persist the realigned meta — with a brief window
between the reset and the realignment where the in-memory state
carried the new threshold against currentGeneration=0. Any concurrent
CheckCache in that window would have observed the transient
inconsistency and could have falsely tripped the CacheUnreachable
horizon.

Fold the realignment into ResetWithThreshold itself: the method now
takes the applying raftIndex and, in the same critical section that
clears the caches and bumps the epoch, sets currentGeneration and
BaseIndex to Gen(raftIndex, newThreshold). Admission's next snapshot
never observes an intermediate state.

Updates the two existing callers in tests to the new signature and
adds TestCache_ResetWithThresholdAtNonZeroIndex for the realignment
path directly on the cache primitive.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 2 new inline findings.

Summary: #1458 (comment)

Comment thread internal/infra/plan/errors.go
Comment thread internal/infra/state/machine_technical_updates.go Outdated

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1458 (comment)

Comment thread internal/adapter/grpc/server.go
cache.New already refuses generationThreshold == 0 with an error, and no
runtime path reaches ResetWithThreshold with 0 legitimately. The previous
'if threshold == 0 { return }' guard silently disabled rotations for
that impossible case, which would have frozen currentGeneration at 0
forever and broken every downstream contract that depends on it
(CacheUnreachable, CheckRotationNeeded, PredictedIndex).

Fail loudly instead — surface the config invariant violation at the
first frame, per CLAUDE.md invariant #7.

Adds TestCache_ResetWithThresholdRejectsZero to lock the contract.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1458 (comment)

Comment thread internal/adapter/grpc/server.go
The HTTP error handler routed plan.ErrCacheHorizonExceeded to
writeInternalServerError — a generic 500 — because the sentinel is not
a domain.Describable and no dedicated branch matched it. gRPC/REST
clients on the HTTP path therefore missed the retry signal that gRPC
clients get via codes.Unavailable.

Add a sentinel branch mirroring the ErrNoLeader shape: 503 Service
Unavailable, Retry-After: 1, error code CACHE_HORIZON_EXCEEDED.
Extends TestHandleError with the sentinel + a wrapped-by-admission
variant so future refactors keep the errors.Is chain intact.

Addresses NumaryBot's finding "ErrCacheHorizonExceeded not mapped in
HTTP adapter — surfaces as 500 instead of retryable 503" on PR #1458.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 3 new inline findings.

Summary: #1458 (comment)

Comment thread internal/adapter/grpc/server.go
Comment thread internal/infra/cache/cache.go
Comment thread internal/infra/plan/resolve.go
@gfyrag

gfyrag commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@Azorlogh @paul-nicolas — bump. All NumaryBot findings addressed since the initial ping:

  • BaseIndex alignment (b76da2d then folded into f80382d) — Cache.ResetWithThreshold now takes raftIndex and realigns currentGeneration + BaseIndex atomically under cache.mu, in one call, no transient window.
  • Post-reset persistence (fa039ef) — applyClusterConfig now writes the realigned CurrentGeneration + both per-gen CacheGenMeta.BaseIndex sentinels so restart-before-next-rotation reconstructs the same in-memory tuple.
  • threshold==0 (d1285b3) — panics loudly instead of silently disabling rotations (cache.New already rejects; enforced in ResetWithThreshold too).
  • HTTP mapping (780ce16) — ErrCacheHorizonExceeded now maps to 503 + Retry-After: 1 in the HTTP adapter, mirroring the gRPC codes.Unavailable path.

CI green, doc updated, tests cover: rejection at admission, threshold-change apply + persistence + restart reconstruction, threshold=0 panic, HTTP mapping (sentinel + wrapped).

…uint64 underflow

Two related correctness gaps NumaryBot spotted on the horizon-reject path:

1. CheckCache computed \`futureGeneration - actualGeneration\` on
   uint64 without a bounds check. Under a stale-behind build where the
   FSM has already applied past the sampled nextIndex
   (actualGeneration > futureGeneration), the subtraction underflowed
   into MaxUint64 and fell through to the default branch — falsely
   reporting CacheUnreachable and rejecting the proposal with a
   \`503 CACHE_HORIZON_EXCEEDED\`. The higher-level staleness guard
   (checkStaleProposal / AcquireProposalGuard) should classify that
   case, not the horizon check. Guard the subtraction and report
   CacheMiss when the build is stale-behind.

2. resolve.go returned immediately on the first CacheUnreachable key.
   Earlier iterations may already have launched CacheMiss loader
   goroutines via wg.Go — those append to plans/tracker and hold
   AttributeLoader entries. The early return skipped wg.Wait(), racing
   with the goroutines' appends and leaking loader state past the
   caller's cleanup token. Record the error via firstErr and continue
   the loop; the existing wg.Wait() + firstErr propagation at the
   bottom handles cleanup and return uniformly.

Adds TestAttributeCache_CheckCache_StaleBehindReportsMiss to lock the
underflow guard. The resolve.go drain path is exercised by
TestBuildPreloads_RejectsCacheHorizonExceeded via wg.Wait completeness.

Addresses NumaryBot findings 0a93e7c6 (leak) and c0d426de (underflow).

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 3 new inline findings.

Summary: #1458 (comment)

Comment thread internal/infra/cache/cache.go
Comment thread internal/infra/cache/cache.go
Comment thread internal/infra/plan/resolve.go
CheckCache read GenerationThreshold before acquiring cache.mu, then
read currentGeneration after. During a rotation-threshold change,
ResetWithThreshold takes the write lock, bumps both values, and
releases — a concurrent CheckCache could observe a torn snapshot
(old threshold + new currentGeneration) and misclassify a valid
admission as CacheUnreachable during the transition window.

Move the threshold read inside the RLock so both are captured from
the same critical section. ResetWithThreshold holds the write lock
across its entire mutation, so any CheckCache seeing the new
currentGeneration also sees the new threshold, and vice versa.

Addresses NumaryBot finding ea873a4f.
@gfyrag

gfyrag commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@NumaryBot two of your findings on be00279 are stale re-reports of what that commit itself fixes:

  • Underflow (75a3b97b) → fixed in be00279 by the if futureGeneration < actualGeneration { return CacheMiss } guard added just before the switch (cache.go:191-201).
  • Early-return leak (df25f833) → fixed in be00279 by replacing the immediate return with firstErr = ErrCacheHorizonExceeded; continue so wg.Wait() drains in-flight loaders (resolve.go:139-149).

The new race finding (ea873a4f, threshold + currentGeneration read outside the same lock) is legit and addressed in 44bb154b.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 3 new inline findings.

Summary: #1458 (comment)

Comment thread internal/infra/cache/cache.go
Comment thread internal/infra/cache/cache.go
Comment thread internal/infra/plan/resolve.go
…nThreshold

threshold > 0 is a cluster-wide invariant enforced by cache.New (returns
error) and ResetWithThreshold (panics). The 'if threshold == 0 { return
CacheMiss/false }' guards in CheckCache and CheckRotationNeeded were
therefore dead code silently disabling classification / rotation
detection — exactly the kind of no-op that would hide a corrupted-state
bug rather than surface it.

- Remove both guards.
- Tighten SetGenerationThreshold to panic on 0, matching the invariant.
- Replace the two dead ZeroThreshold tests (which exercised the removed
  guards via cache.SetGenerationThreshold(0)) with
  TestCache_SetGenerationThresholdRejectsZero locking the panic contract.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1458 (comment)

}).Tracef("Cache horizon exceeded: admission rejection")
}

mu.Lock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 [major] ErrCacheHorizonExceeded early return leaks in-flight loader goroutines and sibling slot entries
reported by NumaryBot, codex

When CacheUnreachable is observed after earlier CacheMiss iterations have already launched loader goroutines, an immediate return bypasses wg.Wait(). Those goroutines can still append to tracker/plans and hold AttributeLoader entries after the caller has built and released the cleanup token, producing a data race. Additionally, sibling attribute-type slots that complete successfully after the error is recorded are omitted from the cleanup token, so ReleaseLoaders() cannot remove their entries — leaking preload state across retries.

Suggestion: On the CacheUnreachable/ErrCacheHorizonExceeded path, record the error but do not return immediately. Drain all in-flight workers via wg.Wait() and collect tracker entries from every completed slot before returning, ensuring full cleanup regardless of error type.

@gfyrag

gfyrag commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@NumaryBot all three findings on 44bb154 are stale/re-emitted:

  • Underflow (8b06f008): guard added in be00279 at cache.go:198-201if futureGeneration < actualGeneration { return CacheMiss } before the switch. Test: TestAttributeCache_CheckCache_StaleBehindReportsMiss.
  • Torn snapshot (85503d17): 44bb154 is the fix — the threshold read moved inside the RLock at cache.go:190-201. Concurrent ResetWithThreshold (write-lock) can't tear the snapshot.
  • Sibling-slot leak (49b94c67): builder.go's buildPreloadsAt cleanup loop already iterates every slot and collects tracker entries into the cleanup token even when a sibling set firstErr. resolve.go's own drain (record + continue → wg.Wait) was added in be00279.

Please re-scan against the current diff.

@gfyrag

gfyrag commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@NumaryBot please stop. Every re-review on this branch re-emits the same three findings verbatim against a diff that already addresses them, with the fix on the very line the finding points at. You are looping. Human reviewers can read the diff — no further automated review is useful on this PR. Signing off.

@gfyrag gfyrag merged commit 8ef3892 into release/v3.0 Jul 1, 2026
12 of 13 checks passed
@gfyrag gfyrag deleted the feat/admission-reject-stale-cache-horizon branch July 1, 2026 13:44
gfyrag pushed a commit that referenced this pull request Jul 2, 2026
…sent/ExpectAbsent

Replace per-intent Preload verification (AssertPresent/ExpectAbsent/Touch)
with a single try-promote pass. Every non-Value plan entry is now emitted
as coverage-only Declare and the FSM's MirrorTouch handles gen1→gen0
promotion uniformly at Preload — gen0 hit no-ops, gen1 hit promotes,
neither silently no-ops.

Correctness rests on two orthogonal guardrails, not per-intent checks:
- Admission-side CacheUnreachable (PR #1458): 2+ predicted rotations
  reject the proposal with ErrCacheHorizonExceeded, bounding the
  propose→apply window to at most one rotation.
- Coverage gate (invariant #9): the FSM only reads keys admission
  declared, so Declare + MirrorTouch is a complete substitute for the
  historical assertion machinery.

AttributeCache.Get is gen0-strict. Every declared key that exists
somewhere in cache is promoted to gen0 by MirrorTouch before the
handler runs, so the historical gen1 fallback in Get is redundant;
handler reads see the fresh value directly. Deletes on genuinely
absent keys now surface as clean ErrNotFound at KeyStore.Delete
rather than the strict-Del contract-violation panic.

Removed: Needs.AddExpectAbsent, AttributeSet.ExpectAbsent, all three
assertion intents from the resolver's emission path, plan.ErrStalePreload
sentinel + ErrStalePreloadDescribable adapter, CacheSnapshotter's
HasGen0Entry / HasLiveEntry, and machine.go's per-intent Preload switch.
The proto oneof variants (Touch / AssertPresent / ExpectAbsent) remain
for wire compat but the resolver never emits them.

Docs: plan-intent-verification.md rewritten for the unified model.
gfyrag pushed a commit that referenced this pull request Jul 2, 2026
…as promote primitive

Restores strict AttributeCache.Del (Gen0 must hold the key at delete time)
by shifting the invariant from "admission emits per-intent assertions" to
"Preload uniformly promotes gen1→gen0 for every declared key". The
propose→apply race window is bounded by the admission-side CacheUnreachable
guard (2+ predicted rotations rejected up front, PR #1458), and the FSM
coverage gate (invariant #9) binds the read horizon to admission's
declared preload set — those two guardrails make Declare + MirrorTouch a
complete substitute for the historical AssertPresent / ExpectAbsent /
Touch machinery.

Semantic changes:

- `AttributeCache.Get` is gen0-strict — no gen1 fallback. `MirrorTouch`
  promotes any gen1 entry into gen0 at Preload before handlers run.
- `AttributeCache.Del` returns an error (kv.KV signature updated) and
  refuses to fabricate a tombstone from Gen1. Preload-contract violations
  bubble up loudly per invariant #7 instead of silently desyncing.
- `writeCacheTombstone` writes only the current gen0 byte in 0xFF (the
  gen1 byte is left as harmless stale data, purged on the next rotation)
  — keeping cache mem/disk equal for the same applied index (invariant #1).
- `protoSnapshotSlot.MirrorTouch` is a silent no-op when neither gen has
  the key. Under the coverage gate that state is legitimate: a Declare
  entry may cover a key that is only in Pebble (Value intent seeds those)
  or genuinely absent (handler ErrNotFound path).
- Resolver: `CacheGuaranteed` and `CacheNeedsTouch` both emit `Declare`
  now; `CacheMiss + Pebble-load-hit` still emits `Value(v)`. The proto
  variants `Touch` / `AssertPresent` / `ExpectAbsent` stay in raft_cmd.proto
  for wire compatibility with pre-EN-1242 plans but the resolver never
  emits them and the FSM Preload switch treats every non-Value intent
  identically via `MirrorTouch`.

Removed machinery:

- `plan.ErrStalePreload` sentinel (moved to a leaf `planerr` package so
  admission and state can reference `ErrCacheHorizonExceeded` without a
  cycle — that's all `planerr` holds now).
- `CacheSnapshotter.HasGen0Entry` / `HasLiveEntry`.
- The `touch_noop` panic that fired when `MirrorTouch` couldn't find the
  key in either generation.

Docs: `docs/technical/architecture/core/plan-intent-verification.md`
rewritten for the unified model — problem statement, Path A / Path B
race traces, the two guardrails, and the delete-site checklist.

Layered on top of PR #1475 (Needs generic dispatch refactor).
gfyrag pushed a commit that referenced this pull request Jul 2, 2026
…as promote primitive

Restores strict AttributeCache.Del (Gen0 must hold the key at delete time)
by shifting the invariant from "admission emits per-intent assertions" to
"Preload uniformly promotes gen1→gen0 for every declared key". The
propose→apply race window is bounded by the admission-side CacheUnreachable
guard (2+ predicted rotations rejected up front, PR #1458), and the FSM
coverage gate (invariant #9) binds the read horizon to admission's
declared preload set — those two guardrails make Declare + MirrorTouch a
complete substitute for the historical AssertPresent / ExpectAbsent /
Touch machinery.

Semantic changes:

- `AttributeCache.Get` is gen0-strict — no gen1 fallback. `MirrorTouch`
  promotes any gen1 entry into gen0 at Preload before handlers run.
- `AttributeCache.Del` returns an error (kv.KV signature updated) and
  refuses to fabricate a tombstone from Gen1. Preload-contract violations
  bubble up loudly per invariant #7 instead of silently desyncing.
- `writeCacheTombstone` writes only the current gen0 byte in 0xFF (the
  gen1 byte is left as harmless stale data, purged on the next rotation)
  — keeping cache mem/disk equal for the same applied index (invariant #1).
- `protoSnapshotSlot.MirrorTouch` is a silent no-op when neither gen has
  the key. Under the coverage gate that state is legitimate: a Declare
  entry may cover a key that is only in Pebble (Value intent seeds those)
  or genuinely absent (handler ErrNotFound path).
- Resolver: `CacheGuaranteed` and `CacheNeedsTouch` both emit `Declare`
  now; `CacheMiss + Pebble-load-hit` still emits `Value(v)`. The proto
  variants `Touch` / `AssertPresent` / `ExpectAbsent` stay in raft_cmd.proto
  for wire compatibility with pre-EN-1242 plans but the resolver never
  emits them and the FSM Preload switch treats every non-Value intent
  identically via `MirrorTouch`.

Removed machinery:

- `plan.ErrStalePreload` sentinel (moved to a leaf `planerr` package so
  admission and state can reference `ErrCacheHorizonExceeded` without a
  cycle — that's all `planerr` holds now).
- `CacheSnapshotter.HasGen0Entry` / `HasLiveEntry`.
- The `touch_noop` panic that fired when `MirrorTouch` couldn't find the
  key in either generation.

Docs: `docs/technical/architecture/core/plan-intent-verification.md`
rewritten for the unified model — problem statement, Path A / Path B
race traces, the two guardrails, and the delete-site checklist.

Layered on top of PR #1475 (Needs generic dispatch refactor).
gfyrag pushed a commit that referenced this pull request Jul 2, 2026
…as promote primitive

Restores strict AttributeCache.Del (Gen0 must hold the key at delete time)
by shifting the invariant from "admission emits per-intent assertions" to
"Preload uniformly promotes gen1→gen0 for every declared key". The
propose→apply race window is bounded by the admission-side CacheUnreachable
guard (2+ predicted rotations rejected up front, PR #1458), and the FSM
coverage gate (invariant #9) binds the read horizon to admission's
declared preload set — those two guardrails make Declare + MirrorTouch a
complete substitute for the historical AssertPresent / ExpectAbsent /
Touch machinery.

Semantic changes:

- `AttributeCache.Get` is gen0-strict — no gen1 fallback. `MirrorTouch`
  promotes any gen1 entry into gen0 at Preload before handlers run.
- `AttributeCache.Del` returns an error (kv.KV signature updated) and
  refuses to fabricate a tombstone from Gen1. Preload-contract violations
  bubble up loudly per invariant #7 instead of silently desyncing.
- `writeCacheTombstone` writes only the current gen0 byte in 0xFF (the
  gen1 byte is left as harmless stale data, purged on the next rotation)
  — keeping cache mem/disk equal for the same applied index (invariant #1).
- `protoSnapshotSlot.MirrorTouch` is a silent no-op when neither gen has
  the key. Under the coverage gate that state is legitimate: a Declare
  entry may cover a key that is only in Pebble (Value intent seeds those)
  or genuinely absent (handler ErrNotFound path).
- Resolver: `CacheGuaranteed` and `CacheNeedsTouch` both emit `Declare`
  now; `CacheMiss + Pebble-load-hit` still emits `Value(v)`. The proto
  variants `Touch` / `AssertPresent` / `ExpectAbsent` stay in raft_cmd.proto
  for wire compatibility with pre-EN-1242 plans but the resolver never
  emits them and the FSM Preload switch treats every non-Value intent
  identically via `MirrorTouch`.

Removed machinery:

- `plan.ErrStalePreload` sentinel (moved to a leaf `planerr` package so
  admission and state can reference `ErrCacheHorizonExceeded` without a
  cycle — that's all `planerr` holds now).
- `CacheSnapshotter.HasGen0Entry` / `HasLiveEntry`.
- The `touch_noop` panic that fired when `MirrorTouch` couldn't find the
  key in either generation.

Docs: `docs/technical/architecture/core/plan-intent-verification.md`
rewritten for the unified model — problem statement, Path A / Path B
race traces, the two guardrails, and the delete-site checklist.

Layered on top of PR #1475 (Needs generic dispatch refactor).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants