refactor(operator): rename LedgerService to Cluster with transient compat by gfyrag · Pull Request #1481 · formancehq/ledger

gfyrag · 2026-07-02T13:37:09Z

Summary

Renames the LedgerService Kind end-to-end (Go types, CRD, kubectl-ledger, Helm chart, e2e tests, Antithesis manifests, proto comment, docs) to Cluster.
Adds a transient compat layer: a deprecated LedgerService Kind (schema reuses ClusterSpec/ClusterStatus verbatim, no duplication) plus a LedgerServiceMigrator reconciler that copies each LedgerService into a same-named Cluster on reconcile, annotates the source with ledger.formance.com/migrated-to and sets a Migrated=True condition. The LS is never deleted — operators remove it manually once the migration has converged.
Same names preserve derived StatefulSet / Services / PVCs → no downtime for in-place upgrades.

Test plan

go build ./... clean in both root module and misc/operator/
go vet ./... clean in both modules
go test ./... -short passes (root + operator)
golangci-lint run ./... clean on operator; pre-existing unused-code warnings only in feature-flagged event sinks on the root module
E2E chainsaw suites still green (need cluster)
New ledgerservice-migration chainsaw test passes on kind
just generate (nix) regenerates zz_generated.deepcopy.go and both CRDs idempotently (should be a no-op diff)
Antithesis smoke run (snouty launch) still boots with the renamed manifests

coderabbitai · 2026-07-02T13:37:22Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)

main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9be6e24-31cd-411f-b33f-b547ff999093

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor/rename-ledgerservice-cluster

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

NumaryBot · 2026-07-02T13:39:19Z

🛑 Changes requested — automated review

This PR renames LedgerService to Cluster and introduces a migrator for in-place upgrades. The rename itself is structurally sound, but the migration path has several significant safety gaps that must be addressed before production use.

The most critical issue is that the migrator creates the new Cluster object without reparenting child resources (ServiceAccount, Services, StatefulSet, etc.) that were previously owned by the LedgerService. Because Kubernetes enforces a single controller ownerReference per object, the Cluster reconciler will fail to adopt those children via SetControllerReference, leaving the workload unmanageable. Deleting the old LedgerService would also trigger cascading garbage collection of the live resources. Two additional gaps compound this: DNSEndpoint unstructured children are never included in the adoption loop, and the instance-label-based child selector silently matches nothing in installations where additionalLabels overrode the app.kubernetes.io/instance label.

Two further migration-ordering races introduce upgrade regressions. The LedgerClusterAgent reconciler now lists only Cluster objects; if it runs before the migrator creates the corresponding Cluster, it treats all existing replicated agent Secrets as orphans and deletes them, unexpectedly rotating credentials mid-upgrade. Similarly, a non-terminal LedgerBackupRun reconcile that fires before its Cluster exists immediately calls setRunFailed, permanently terminating in-flight backup runs due solely to migration ordering.

All five issues require resolution before this migration path can be considered safe for production upgrades.

NumaryBot

NumaryBot posted 1 new inline finding.

Summary: #1481 (comment)

codecov · 2026-07-02T13:45:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.95%. Comparing base (e9d5ca1) to head (c784130).

Additional details and impacted files

@@               Coverage Diff                @@
##           release/v3.0    #1481      +/-   ##
================================================
+ Coverage         73.64%   73.95%   +0.31%     
================================================
  Files               388      388              
  Lines             39745    39745              
================================================
+ Hits              29269    29394     +125     
+ Misses             7972     7834     -138     
- Partials           2504     2517      +13

Flag	Coverage Δ
e2e	`73.95% <ø> (+0.31%)`	⬆️
scenario	`73.95% <ø> (+0.31%)`	⬆️
unit	`73.95% <ø> (+0.31%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…tion Addresses PR #1481 blocker review: creating a Cluster alongside an already-reconciled LedgerService leaves every child (StatefulSet, Services, ServiceAccount, ConfigMap, Secret, Ingress, NetworkPolicy) owned by the LedgerService. The Cluster reconciler's SetControllerReference would then reject them as already-controlled, and a later delete of the LedgerService would cascade-delete the live workload. The migrator now walks each of those kinds by app.kubernetes.io/instance=<name> and rewrites any ownerReference matching the LedgerService UID to a controller ref on the new Cluster. Idempotent: objects without a matching ownerRef are skipped, so the pass is cheap on already-migrated services. The chainsaw e2e now asserts the resulting StatefulSet is owned by Cluster, not LedgerService.

NumaryBot

NumaryBot posted 3 new inline findings.

Summary: #1481 (comment)

…mpat Renames the LedgerService Kind end-to-end (Go types, CRD, kubectl-ledger, Helm chart, e2e tests, Antithesis manifests, proto comment, docs) and adds a runtime migration path so existing LedgerService manifests keep working: - LedgerService compat shell reuses ClusterSpec/ClusterStatus verbatim (no schema duplication) and is marked deprecated at the CRD level. - New LedgerServiceMigrator reconciler creates a Cluster of the same name/namespace on each reconcile, annotates the LedgerService with ledger.formance.com/migrated-to and sets a Migrated=True condition. Never deletes the LS. Same names preserve derived StatefulSet/PVCs. - New chainsaw e2e (ledgerservice-migration) covers the migrator path.

…tion Addresses PR #1481 blocker review: creating a Cluster alongside an already-reconciled LedgerService leaves every child (StatefulSet, Services, ServiceAccount, ConfigMap, Secret, Ingress, NetworkPolicy) owned by the LedgerService. The Cluster reconciler's SetControllerReference would then reject them as already-controlled, and a later delete of the LedgerService would cascade-delete the live workload. The migrator now walks each of those kinds by app.kubernetes.io/instance=<name> and rewrites any ownerReference matching the LedgerService UID to a controller ref on the new Cluster. Idempotent: objects without a matching ownerRef are skipped, so the pass is cheap on already-migrated services. The chainsaw e2e now asserts the resulting StatefulSet is owned by Cluster, not LedgerService.

…nces Post-rebase fixup: aa373d5 (deletion protection default-on) landed LedgerService references in files that didn't exist when the rename commit was authored (deletion_protection_test.go, persistence_test.go, NOTES.txt, validatingadmissionpolicy.yaml, cluster_types.go comment, README.md prose).

NumaryBot

NumaryBot posted 5 new inline findings.

Summary: #1481 (comment)

NumaryBot · 2026-07-02T15:14:11Z

+	err := r.Get(ctx, types.NamespacedName{Name: ls.Name, Namespace: ls.Namespace}, cluster)
+	switch {
+	case apierrors.IsNotFound(err):
+		cluster = &ledgerv1alpha1.Cluster{


🔴 [blocker] Child resources retain LedgerService ownerReferences after migration
_{reported by NumaryBot, codex}

For in-place upgrades where a LedgerService has already reconciled and created child resources (ServiceAccount, Services, StatefulSet, etc.), the migrator only creates a new Cluster object but never reparents the children. Those resources retain a controller ownerReference pointing to the old LedgerService. When the Cluster reconciler subsequently calls SetControllerReference on them it will fail because Kubernetes rejects a second controller owner, so the new Cluster cannot manage or update the existing deployment. Additionally, if the old LedgerService is later deleted, Kubernetes garbage-collects all its owned children, destroying the live workload.

Suggestion: During migration, enumerate all child resources owned by the LedgerService (using ownerReference lookup or the instance label selector) and patch their ownerReferences to replace the LedgerService controller reference with one pointing to the newly created Cluster, before marking the source as migrated.

NumaryBot · 2026-07-02T15:14:11Z

 	}

-	var services ledgerv1alpha1.LedgerServiceList
+	var services ledgerv1alpha1.ClusterList


🔴 [blocker] Agent secrets GC'd as orphans when Cluster not yet created during migration

The agent reconciler now lists only Cluster objects to find matched services. If it runs before the migrator has created the corresponding Cluster for a legacy LedgerService, matchedServices is empty and the garbage-collection path below deletes the existing replicated agent Secrets as orphans. On the next reconcile a new keypair is generated, unexpectedly rotating or removing credentials and causing the Cluster auth-keys ConfigMap to drop or change those keys mid-upgrade.

Suggestion: Treat matching legacy LedgerService objects as valid targets until migration converges (e.g., fall back to listing LedgerService when no Cluster match is found), or explicitly skip orphan GC while any LedgerService-to-Cluster migration is still pending.

NumaryBot · 2026-07-02T15:14:11Z

+	return ctrl.Result{}, nil
+}
+
+// rehomeChildren walks every child object kind the Cluster reconciler owns


🟠 [major] DNSEndpoint children not rehomed during migration

When a legacy LedgerService had spec.dnsEndpoint.enabled set and the ExternalDNS CRD was installed, the reconcileDNSEndpoint path sets a controller ownerReference on the DNSEndpoint object. The migrator never lists or reparents this unstructured child resource. The new Cluster reconciler will fail SetControllerReference on the existing DNSEndpoint (already controlled by LedgerService), and deletion of the LedgerService will garbage-collect the live DNS record.

Suggestion: Extend the child adoption loop to also enumerate DNSEndpoint unstructured objects owned by the LedgerService (by GVR or ownerReference scan) and patch their controller ownerReference to point to the new Cluster.

NumaryBot · 2026-07-02T15:14:11Z

+func (r *LedgerServiceMigrator) rehomeChildren(ctx context.Context, ls *ledgerv1alpha1.LedgerService, cluster *ledgerv1alpha1.Cluster) error {
+	listOpts := []client.ListOption{
+		client.InNamespace(ls.Namespace),
+		client.MatchingLabels{labelInstance: ls.Name},


🟠 [major] Instance-label selector misses children when additionalLabels overrides app.kubernetes.io/instance

The migrator uses the app.kubernetes.io/instance label to select child resources for adoption. For LedgerService objects where spec.additionalLabels overrode this label, all children were stamped with the overridden value, so the selector matches nothing. The migrator then marks the source as migrated while all old ownerReferences remain in place, silently leaving the Cluster reconciler unable to adopt or manage those resources.

Suggestion: Instead of (or in addition to) the instance label selector, scan children by iterating over the LedgerService's ownerReference UID across relevant resource types, which is label-value agnostic and will correctly identify all owned objects regardless of label overrides.

NumaryBot · 2026-07-02T15:14:11Z

 	}

-	var ledgerService ledgerv1alpha1.LedgerService
+	var cluster ledgerv1alpha1.Cluster


🟠 [major] In-flight backup runs permanently failed due to migration ordering race

During an upgrade with an existing non-terminal LedgerBackupRun targeting a legacy LedgerService, there is no ordering guarantee that the migrator has already created the same-named Cluster before this reconcile runs. In that transient window the reconciler calls setRunFailed, permanently marking the run as failed even though the referenced workload may appear moments later. This turns a transient migration condition into a permanent terminal failure for in-flight or manual backup runs.

Suggestion: When the referenced Cluster is not found during active migration (i.e., a corresponding LedgerService still exists or migration is in progress), requeue with a backoff instead of marking the run as failed. Alternatively, fall back to resolving the legacy LedgerService resource as the target until the Cluster exists.

gfyrag requested review from Azorlogh and paul-nicolas as code owners July 2, 2026 13:37

NumaryBot reviewed Jul 2, 2026

View reviewed changes

Comment thread misc/operator/internal/controller/ledgerservice_migrator.go

NumaryBot reviewed Jul 2, 2026

View reviewed changes

Comment thread misc/operator/internal/controller/ledgerservice_migrator.go

Comment thread misc/operator/internal/controller/ledgerservice_migrator.go

Comment thread misc/operator/internal/controller/ledgerservice_migrator.go

Geoffrey Ragot added 3 commits July 2, 2026 17:07

gfyrag force-pushed the refactor/rename-ledgerservice-cluster branch from 431cf41 to c784130 Compare July 2, 2026 15:09

NumaryBot reviewed Jul 2, 2026

View reviewed changes

Uh oh!

Conversation

gfyrag commented Jul 2, 2026

Summary

Test plan

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

NumaryBot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🛑 Changes requested — automated review

Uh oh!

NumaryBot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

NumaryBot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NumaryBot left a comment

Choose a reason for hiding this comment

Uh oh!

NumaryBot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

NumaryBot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

NumaryBot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

NumaryBot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

NumaryBot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

NumaryBot commented Jul 2, 2026 •

edited

Loading

codecov Bot commented Jul 2, 2026 •

edited

Loading