fix(operator): preserve LedgerClusterAgent seed across LedgerService recreation#1488
Conversation
…recreation Deleting a LedgerService triggered a transient reconcile where the agent's desiredNamespaces became empty, and the GC loop deleted every replica of the agent Secret. The next reconcile (after LedgerService recreation) found no existing replica and generated a fresh Ed25519 keypair, silently invalidating every bundle already handed out via kubectl-ledger. Skip the GC when desiredNamespaces is empty so dormant replicas survive the transient absence of matching services. The finalizer still guarantees definitive cleanup on LedgerClusterAgent deletion.
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. 🗂️ Base branches to auto review (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
✅ Approve — automated reviewThe PR introduces a canonical Secret in the operator namespace to preserve the LedgerClusterAgent seed across LedgerService recreations. All three significant issues previously flagged (legacy replica label adoption on upgrade, operator namespace cache scoping, and no-target upgrade bootstrapping edge case) have been addressed in subsequent commits per the resolved inline threads. Code coverage improved slightly. The independent "codex" reviewer found no new correctness issues. There are no remaining blocker or major findings to block merging. No findings. |
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 1 new inline finding.
Summary: #1488 (comment)
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## release/v3.0 #1488 +/- ##
================================================
- Coverage 73.96% 73.86% -0.11%
================================================
Files 388 393 +5
Lines 39664 39748 +84
================================================
+ Hits 29337 29359 +22
- Misses 7815 7817 +2
- Partials 2512 2572 +60
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…ator namespace Follow-up to review feedback: the previous fix skipped GC when desiredNamespaces became empty, which preserved the seed across transient LedgerService deletion but also left stale replicas in namespaces no longer in scope when the user removed targets explicitly. Introduce a canonical seed Secret per agent, living in the operator's own namespace (discovered via POD_NAMESPACE downward API with fallback to the in-cluster ServiceAccount namespace file). Per-target replicas become pure projections of the canonical, and replica GC returns to its aggressive form — deleting stale replicas can never destroy the seed because the seed lives elsewhere. The canonical Secret is created lazily on first target so an agent that never matches anything never persists key material. Once created, it survives until the LedgerClusterAgent itself is deleted, at which point the finalizer cleans up both the canonical and every replica. Helm chart injects POD_NAMESPACE via downward API on the operator Deployment.
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 2 new inline findings.
Summary: #1488 (comment)
…r ns in watch cache Two review findings from the canonical-Secret refactor: - Upgrade path: pre-canonical replicas had only agentNameLabel and no agent-role label, so the new list-by-labels missed them. The reconciler bootstrapped a fresh canonical and overwrote every legacy replica with new key material, silently invalidating bundles across the upgrade. Drop the role label entirely — identify the canonical by name+namespace and let ensureCanonicalSecret adopt a legacy replica's seed on first bootstrap. Sort adoption candidates deterministically so the choice is stable when multiple replicas exist. - Watch-namespace cache: when --watch-namespace is set and the operator itself runs in a different namespace, the manager cache is limited to the watched namespace. Canonical Secret reads/writes in the operator namespace hit an uncached namespace and fail. Discover OperatorNamespace up-front and add it to Cache.DefaultNamespaces when --watch-namespace is used. New integration test TestReconcile_AgentUpgradeAdoptsLegacyReplicaSeed seeds a legacy replica by hand, brings up an agent, and verifies the canonical Secret adopts the legacy seed (no rotation, no overwrite).
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 3 new inline findings.
Summary: #1488 (comment)
…option Two more findings from the review round on 07bb1f1: - Watch-namespace scoping (Cache.DefaultNamespaces widening): adding the operator namespace to DefaultNamespaces expanded the cache scope for every controller in the manager, not just canonical Secret ops. A --watch-namespace=tenant-a deployment where the operator lives in operator-system suddenly watched operator-system too, and any LedgerService created there would be reconciled. Route canonical Secret reads through an uncached APIReader (mgr.GetAPIReader()) and drop the DefaultNamespaces expansion — the confined widening is now a single Reader hitting the API server directly. - No-target upgrade edge case: if the LedgerService was deleted while the old (pre-canonical) operator was stopped, the reconcile that runs on the new operator finds no matching targets, previously skipped ensureCanonicalSecret, and the aggressive GC deleted the orphan legacy replica along with its seed. Bootstrap the canonical whenever there is either a target OR an existing replica to adopt, so the seed is captured before GC touches anything. ensureCanonicalSecret is rewritten around explicit Get-then-Create so the caller can pick the right reader; the canonical Secret is never mutated after creation, which fits the "seed is immutable identity" contract. New integration test TestReconcile_AgentUpgradeAdoptsLegacySeedWithoutTargets seeds a legacy orphan replica with no matching service, brings up the agent, and asserts the canonical is bootstrapped from the orphan's seed before GC removes it.
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot review complete: no remaining inline findings.
Resolved 3 stale NumaryBot review threads (1 fixed, 2 outdated).
Summary: #1488 (comment)
Summary
LedgerServiceinvalidated every agent bundle handed out viakubectl-ledger agents get-key --bundle.LedgerClusterAgentreconcile GC'd every replica of the agent Secret whendesiredNamespacesbecame empty (transient absence of matched services). The next reconcile found no existing replica and generated a fresh Ed25519 keypair, so previously-issued JWTs stopped verifying against the operator-managedauth-keys.jsonConfigMap.desiredNamespacesis empty. Dormant replicas survive the transient state; the next matching reconcile reuses the existing seed viacanonicalKeyData. Definitive cleanup remains on the finalizer (handleDeletion), which lists and deletes all replicas when theLedgerClusterAgentCR itself is deleted.Test plan
go build ./...go vet ./...golangci-lint run— 0 issuesjust operator-pre-commit(controller-gen + go mod tidy + go build)TestReconcile_AgentSeedSurvivesLedgerServiceRecreationcovering the full scenario: LedgerService created → seed generated → LedgerService deleted → dormant Secret preserved → LedgerService recreated → seed/keyID identical.TestReconcile_AgentOrphanCleanupandTestReconcile_AgentNoTargetsbehavior verified unchanged (GC still runs when at least one target remains; fresh agents with no targets still don't produce a Secret).