refactor(operator): rename LedgerService to Cluster with transient compat#1481
refactor(operator): rename LedgerService to Cluster with transient compat#1481gfyrag wants to merge 3 commits into
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. 🗂️ Base branches to auto review (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🛑 Changes requested — automated reviewThis PR renames LedgerService to Cluster and introduces a migrator for in-place upgrades. The rename itself is structurally sound, but the migration path has several significant safety gaps that must be addressed before production use. The most critical issue is that the migrator creates the new Cluster object without reparenting child resources (ServiceAccount, Services, StatefulSet, etc.) that were previously owned by the LedgerService. Because Kubernetes enforces a single controller ownerReference per object, the Cluster reconciler will fail to adopt those children via SetControllerReference, leaving the workload unmanageable. Deleting the old LedgerService would also trigger cascading garbage collection of the live resources. Two additional gaps compound this: DNSEndpoint unstructured children are never included in the adoption loop, and the instance-label-based child selector silently matches nothing in installations where additionalLabels overrode the app.kubernetes.io/instance label. Two further migration-ordering races introduce upgrade regressions. The LedgerClusterAgent reconciler now lists only Cluster objects; if it runs before the migrator creates the corresponding Cluster, it treats all existing replicated agent Secrets as orphans and deletes them, unexpectedly rotating credentials mid-upgrade. Similarly, a non-terminal LedgerBackupRun reconcile that fires before its Cluster exists immediately calls setRunFailed, permanently terminating in-flight backup runs due solely to migration ordering. All five issues require resolution before this migration path can be considered safe for production upgrades. |
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 1 new inline finding.
Summary: #1481 (comment)
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## release/v3.0 #1481 +/- ##
================================================
+ Coverage 73.64% 73.95% +0.31%
================================================
Files 388 388
Lines 39745 39745
================================================
+ Hits 29269 29394 +125
+ Misses 7972 7834 -138
- Partials 2504 2517 +13
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…tion Addresses PR #1481 blocker review: creating a Cluster alongside an already-reconciled LedgerService leaves every child (StatefulSet, Services, ServiceAccount, ConfigMap, Secret, Ingress, NetworkPolicy) owned by the LedgerService. The Cluster reconciler's SetControllerReference would then reject them as already-controlled, and a later delete of the LedgerService would cascade-delete the live workload. The migrator now walks each of those kinds by app.kubernetes.io/instance=<name> and rewrites any ownerReference matching the LedgerService UID to a controller ref on the new Cluster. Idempotent: objects without a matching ownerRef are skipped, so the pass is cheap on already-migrated services. The chainsaw e2e now asserts the resulting StatefulSet is owned by Cluster, not LedgerService.
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 3 new inline findings.
Summary: #1481 (comment)
…mpat Renames the LedgerService Kind end-to-end (Go types, CRD, kubectl-ledger, Helm chart, e2e tests, Antithesis manifests, proto comment, docs) and adds a runtime migration path so existing LedgerService manifests keep working: - LedgerService compat shell reuses ClusterSpec/ClusterStatus verbatim (no schema duplication) and is marked deprecated at the CRD level. - New LedgerServiceMigrator reconciler creates a Cluster of the same name/namespace on each reconcile, annotates the LedgerService with ledger.formance.com/migrated-to and sets a Migrated=True condition. Never deletes the LS. Same names preserve derived StatefulSet/PVCs. - New chainsaw e2e (ledgerservice-migration) covers the migrator path.
…tion Addresses PR #1481 blocker review: creating a Cluster alongside an already-reconciled LedgerService leaves every child (StatefulSet, Services, ServiceAccount, ConfigMap, Secret, Ingress, NetworkPolicy) owned by the LedgerService. The Cluster reconciler's SetControllerReference would then reject them as already-controlled, and a later delete of the LedgerService would cascade-delete the live workload. The migrator now walks each of those kinds by app.kubernetes.io/instance=<name> and rewrites any ownerReference matching the LedgerService UID to a controller ref on the new Cluster. Idempotent: objects without a matching ownerRef are skipped, so the pass is cheap on already-migrated services. The chainsaw e2e now asserts the resulting StatefulSet is owned by Cluster, not LedgerService.
…nces Post-rebase fixup: aa373d5 (deletion protection default-on) landed LedgerService references in files that didn't exist when the rename commit was authored (deletion_protection_test.go, persistence_test.go, NOTES.txt, validatingadmissionpolicy.yaml, cluster_types.go comment, README.md prose).
431cf41 to
c784130
Compare
NumaryBot
left a comment
There was a problem hiding this comment.
NumaryBot posted 5 new inline findings.
Summary: #1481 (comment)
| err := r.Get(ctx, types.NamespacedName{Name: ls.Name, Namespace: ls.Namespace}, cluster) | ||
| switch { | ||
| case apierrors.IsNotFound(err): | ||
| cluster = &ledgerv1alpha1.Cluster{ |
There was a problem hiding this comment.
🔴 [blocker] Child resources retain LedgerService ownerReferences after migration
reported by NumaryBot, codex
For in-place upgrades where a LedgerService has already reconciled and created child resources (ServiceAccount, Services, StatefulSet, etc.), the migrator only creates a new Cluster object but never reparents the children. Those resources retain a controller ownerReference pointing to the old LedgerService. When the Cluster reconciler subsequently calls SetControllerReference on them it will fail because Kubernetes rejects a second controller owner, so the new Cluster cannot manage or update the existing deployment. Additionally, if the old LedgerService is later deleted, Kubernetes garbage-collects all its owned children, destroying the live workload.
Suggestion: During migration, enumerate all child resources owned by the LedgerService (using ownerReference lookup or the instance label selector) and patch their ownerReferences to replace the LedgerService controller reference with one pointing to the newly created Cluster, before marking the source as migrated.
| } | ||
|
|
||
| var services ledgerv1alpha1.LedgerServiceList | ||
| var services ledgerv1alpha1.ClusterList |
There was a problem hiding this comment.
🔴 [blocker] Agent secrets GC'd as orphans when Cluster not yet created during migration
The agent reconciler now lists only Cluster objects to find matched services. If it runs before the migrator has created the corresponding Cluster for a legacy LedgerService, matchedServices is empty and the garbage-collection path below deletes the existing replicated agent Secrets as orphans. On the next reconcile a new keypair is generated, unexpectedly rotating or removing credentials and causing the Cluster auth-keys ConfigMap to drop or change those keys mid-upgrade.
Suggestion: Treat matching legacy LedgerService objects as valid targets until migration converges (e.g., fall back to listing LedgerService when no Cluster match is found), or explicitly skip orphan GC while any LedgerService-to-Cluster migration is still pending.
| return ctrl.Result{}, nil | ||
| } | ||
|
|
||
| // rehomeChildren walks every child object kind the Cluster reconciler owns |
There was a problem hiding this comment.
🟠 [major] DNSEndpoint children not rehomed during migration
When a legacy LedgerService had spec.dnsEndpoint.enabled set and the ExternalDNS CRD was installed, the reconcileDNSEndpoint path sets a controller ownerReference on the DNSEndpoint object. The migrator never lists or reparents this unstructured child resource. The new Cluster reconciler will fail SetControllerReference on the existing DNSEndpoint (already controlled by LedgerService), and deletion of the LedgerService will garbage-collect the live DNS record.
Suggestion: Extend the child adoption loop to also enumerate DNSEndpoint unstructured objects owned by the LedgerService (by GVR or ownerReference scan) and patch their controller ownerReference to point to the new Cluster.
| func (r *LedgerServiceMigrator) rehomeChildren(ctx context.Context, ls *ledgerv1alpha1.LedgerService, cluster *ledgerv1alpha1.Cluster) error { | ||
| listOpts := []client.ListOption{ | ||
| client.InNamespace(ls.Namespace), | ||
| client.MatchingLabels{labelInstance: ls.Name}, |
There was a problem hiding this comment.
🟠 [major] Instance-label selector misses children when additionalLabels overrides app.kubernetes.io/instance
The migrator uses the app.kubernetes.io/instance label to select child resources for adoption. For LedgerService objects where spec.additionalLabels overrode this label, all children were stamped with the overridden value, so the selector matches nothing. The migrator then marks the source as migrated while all old ownerReferences remain in place, silently leaving the Cluster reconciler unable to adopt or manage those resources.
Suggestion: Instead of (or in addition to) the instance label selector, scan children by iterating over the LedgerService's ownerReference UID across relevant resource types, which is label-value agnostic and will correctly identify all owned objects regardless of label overrides.
| } | ||
|
|
||
| var ledgerService ledgerv1alpha1.LedgerService | ||
| var cluster ledgerv1alpha1.Cluster |
There was a problem hiding this comment.
🟠 [major] In-flight backup runs permanently failed due to migration ordering race
During an upgrade with an existing non-terminal LedgerBackupRun targeting a legacy LedgerService, there is no ordering guarantee that the migrator has already created the same-named Cluster before this reconcile runs. In that transient window the reconciler calls setRunFailed, permanently marking the run as failed even though the referenced workload may appear moments later. This turns a transient migration condition into a permanent terminal failure for in-flight or manual backup runs.
Suggestion: When the referenced Cluster is not found during active migration (i.e., a corresponding LedgerService still exists or migration is in progress), requeue with a backoff instead of marking the run as failed. Alternatively, fall back to resolving the legacy LedgerService resource as the target until the Cluster exists.
Summary
LedgerServiceKind end-to-end (Go types, CRD, kubectl-ledger, Helm chart, e2e tests, Antithesis manifests, proto comment, docs) toCluster.LedgerServiceKind (schema reusesClusterSpec/ClusterStatusverbatim, no duplication) plus aLedgerServiceMigratorreconciler that copies eachLedgerServiceinto a same-namedClusteron reconcile, annotates the source withledger.formance.com/migrated-toand sets aMigrated=Truecondition. The LS is never deleted — operators remove it manually once the migration has converged.Test plan
go build ./...clean in both root module andmisc/operator/go vet ./...clean in both modulesgo test ./... -shortpasses (root + operator)golangci-lint run ./...clean on operator; pre-existing unused-code warnings only in feature-flagged event sinks on the root modulechainsawsuites still green (need cluster)ledgerservice-migrationchainsaw test passes on kindjust generate(nix) regenerateszz_generated.deepcopy.goand both CRDs idempotently (should be a no-op diff)snouty launch) still boots with the renamed manifests