Skip to content

fix(operator): preserve LedgerClusterAgent seed across LedgerService recreation#1488

Merged
gfyrag merged 4 commits into
release/v3.0from
fix/agent-seed-persistence
Jul 3, 2026
Merged

fix(operator): preserve LedgerClusterAgent seed across LedgerService recreation#1488
gfyrag merged 4 commits into
release/v3.0from
fix/agent-seed-persistence

Conversation

@gfyrag

@gfyrag gfyrag commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fix an issue where deleting and recreating a LedgerService invalidated every agent bundle handed out via kubectl-ledger agents get-key --bundle.
  • Root cause: the LedgerClusterAgent reconcile GC'd every replica of the agent Secret when desiredNamespaces became empty (transient absence of matched services). The next reconcile found no existing replica and generated a fresh Ed25519 keypair, so previously-issued JWTs stopped verifying against the operator-managed auth-keys.json ConfigMap.
  • Fix: skip the GC loop when desiredNamespaces is empty. Dormant replicas survive the transient state; the next matching reconcile reuses the existing seed via canonicalKeyData. Definitive cleanup remains on the finalizer (handleDeletion), which lists and deletes all replicas when the LedgerClusterAgent CR itself is deleted.

Test plan

  • go build ./...
  • go vet ./...
  • golangci-lint run — 0 issues
  • just operator-pre-commit (controller-gen + go mod tidy + go build)
  • Added TestReconcile_AgentSeedSurvivesLedgerServiceRecreation covering the full scenario: LedgerService created → seed generated → LedgerService deleted → dormant Secret preserved → LedgerService recreated → seed/keyID identical.
  • TestReconcile_AgentOrphanCleanup and TestReconcile_AgentNoTargets behavior verified unchanged (GC still runs when at least one target remains; fresh agents with no targets still don't produce a Secret).

…recreation

Deleting a LedgerService triggered a transient reconcile where the agent's
desiredNamespaces became empty, and the GC loop deleted every replica of
the agent Secret. The next reconcile (after LedgerService recreation)
found no existing replica and generated a fresh Ed25519 keypair, silently
invalidating every bundle already handed out via kubectl-ledger.

Skip the GC when desiredNamespaces is empty so dormant replicas survive
the transient absence of matching services. The finalizer still guarantees
definitive cleanup on LedgerClusterAgent deletion.
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)
  • main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6a28f251-ca77-4564-be0b-d168e95626c2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/agent-seed-persistence

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@NumaryBot

NumaryBot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

✅ Approve — automated review

The PR introduces a canonical Secret in the operator namespace to preserve the LedgerClusterAgent seed across LedgerService recreations. All three significant issues previously flagged (legacy replica label adoption on upgrade, operator namespace cache scoping, and no-target upgrade bootstrapping edge case) have been addressed in subsequent commits per the resolved inline threads. Code coverage improved slightly. The independent "codex" reviewer found no new correctness issues. There are no remaining blocker or major findings to block merging.

No findings.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1488 (comment)

Comment thread misc/operator/internal/controller/ledgerclusteragent_controller.go Outdated
@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.86%. Comparing base (f4430b6) to head (487bf92).
⚠️ Report is 16 commits behind head on release/v3.0.

Additional details and impacted files
@@               Coverage Diff                @@
##           release/v3.0    #1488      +/-   ##
================================================
- Coverage         73.96%   73.86%   -0.11%     
================================================
  Files               388      393       +5     
  Lines             39664    39748      +84     
================================================
+ Hits              29337    29359      +22     
- Misses             7815     7817       +2     
- Partials           2512     2572      +60     
Flag Coverage Δ
e2e 73.86% <ø> (-0.11%) ⬇️
scenario 73.86% <ø> (-0.11%) ⬇️
unit 73.86% <ø> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…ator namespace

Follow-up to review feedback: the previous fix skipped GC when
desiredNamespaces became empty, which preserved the seed across transient
LedgerService deletion but also left stale replicas in namespaces no
longer in scope when the user removed targets explicitly.

Introduce a canonical seed Secret per agent, living in the operator's
own namespace (discovered via POD_NAMESPACE downward API with fallback
to the in-cluster ServiceAccount namespace file). Per-target replicas
become pure projections of the canonical, and replica GC returns to its
aggressive form — deleting stale replicas can never destroy the seed
because the seed lives elsewhere.

The canonical Secret is created lazily on first target so an agent that
never matches anything never persists key material. Once created, it
survives until the LedgerClusterAgent itself is deleted, at which point
the finalizer cleans up both the canonical and every replica.

Helm chart injects POD_NAMESPACE via downward API on the operator
Deployment.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 2 new inline findings.

Summary: #1488 (comment)

Comment thread misc/operator/internal/controller/ledgerclusteragent_controller.go
Comment thread misc/operator/cmd/operator/main.go Outdated
…r ns in watch cache

Two review findings from the canonical-Secret refactor:

- Upgrade path: pre-canonical replicas had only agentNameLabel and no
  agent-role label, so the new list-by-labels missed them. The reconciler
  bootstrapped a fresh canonical and overwrote every legacy replica with
  new key material, silently invalidating bundles across the upgrade.
  Drop the role label entirely — identify the canonical by name+namespace
  and let ensureCanonicalSecret adopt a legacy replica's seed on first
  bootstrap. Sort adoption candidates deterministically so the choice is
  stable when multiple replicas exist.

- Watch-namespace cache: when --watch-namespace is set and the operator
  itself runs in a different namespace, the manager cache is limited to
  the watched namespace. Canonical Secret reads/writes in the operator
  namespace hit an uncached namespace and fail. Discover
  OperatorNamespace up-front and add it to Cache.DefaultNamespaces when
  --watch-namespace is used.

New integration test TestReconcile_AgentUpgradeAdoptsLegacyReplicaSeed
seeds a legacy replica by hand, brings up an agent, and verifies the
canonical Secret adopts the legacy seed (no rotation, no overwrite).

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 3 new inline findings.

Summary: #1488 (comment)

Comment thread misc/operator/internal/controller/ledgerclusteragent_controller.go
Comment thread misc/operator/cmd/operator/main.go Outdated
Comment thread misc/operator/internal/controller/ledgerclusteragent_controller.go Outdated
…option

Two more findings from the review round on 07bb1f1:

- Watch-namespace scoping (Cache.DefaultNamespaces widening): adding the
  operator namespace to DefaultNamespaces expanded the cache scope for
  every controller in the manager, not just canonical Secret ops. A
  --watch-namespace=tenant-a deployment where the operator lives in
  operator-system suddenly watched operator-system too, and any
  LedgerService created there would be reconciled. Route canonical
  Secret reads through an uncached APIReader (mgr.GetAPIReader()) and
  drop the DefaultNamespaces expansion — the confined widening is now
  a single Reader hitting the API server directly.

- No-target upgrade edge case: if the LedgerService was deleted while
  the old (pre-canonical) operator was stopped, the reconcile that
  runs on the new operator finds no matching targets, previously
  skipped ensureCanonicalSecret, and the aggressive GC deleted the
  orphan legacy replica along with its seed. Bootstrap the canonical
  whenever there is either a target OR an existing replica to adopt,
  so the seed is captured before GC touches anything.

ensureCanonicalSecret is rewritten around explicit Get-then-Create so
the caller can pick the right reader; the canonical Secret is never
mutated after creation, which fits the "seed is immutable identity"
contract.

New integration test TestReconcile_AgentUpgradeAdoptsLegacySeedWithoutTargets
seeds a legacy orphan replica with no matching service, brings up the
agent, and asserts the canonical is bootstrapped from the orphan's seed
before GC removes it.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot review complete: no remaining inline findings.

Resolved 3 stale NumaryBot review threads (1 fixed, 2 outdated).

Summary: #1488 (comment)

@gfyrag gfyrag merged commit dbc0ecb into release/v3.0 Jul 3, 2026
14 checks passed
@gfyrag gfyrag deleted the fix/agent-seed-persistence branch July 3, 2026 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants