Skip to content

refactor(operator): rename LedgerService to Cluster with transient compat#1481

Open
gfyrag wants to merge 3 commits into
release/v3.0from
refactor/rename-ledgerservice-cluster
Open

refactor(operator): rename LedgerService to Cluster with transient compat#1481
gfyrag wants to merge 3 commits into
release/v3.0from
refactor/rename-ledgerservice-cluster

Conversation

@gfyrag

@gfyrag gfyrag commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Renames the LedgerService Kind end-to-end (Go types, CRD, kubectl-ledger, Helm chart, e2e tests, Antithesis manifests, proto comment, docs) to Cluster.
  • Adds a transient compat layer: a deprecated LedgerService Kind (schema reuses ClusterSpec/ClusterStatus verbatim, no duplication) plus a LedgerServiceMigrator reconciler that copies each LedgerService into a same-named Cluster on reconcile, annotates the source with ledger.formance.com/migrated-to and sets a Migrated=True condition. The LS is never deleted — operators remove it manually once the migration has converged.
  • Same names preserve derived StatefulSet / Services / PVCs → no downtime for in-place upgrades.

Test plan

  • go build ./... clean in both root module and misc/operator/
  • go vet ./... clean in both modules
  • go test ./... -short passes (root + operator)
  • golangci-lint run ./... clean on operator; pre-existing unused-code warnings only in feature-flagged event sinks on the root module
  • E2E chainsaw suites still green (need cluster)
  • New ledgerservice-migration chainsaw test passes on kind
  • just generate (nix) regenerates zz_generated.deepcopy.go and both CRDs idempotently (should be a no-op diff)
  • Antithesis smoke run (snouty launch) still boots with the renamed manifests

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)
  • main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9be6e24-31cd-411f-b33f-b547ff999093

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/rename-ledgerservice-cluster

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@NumaryBot

NumaryBot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

🛑 Changes requested — automated review

This PR renames LedgerService to Cluster and introduces a migrator for in-place upgrades. The rename itself is structurally sound, but the migration path has several significant safety gaps that must be addressed before production use.

The most critical issue is that the migrator creates the new Cluster object without reparenting child resources (ServiceAccount, Services, StatefulSet, etc.) that were previously owned by the LedgerService. Because Kubernetes enforces a single controller ownerReference per object, the Cluster reconciler will fail to adopt those children via SetControllerReference, leaving the workload unmanageable. Deleting the old LedgerService would also trigger cascading garbage collection of the live resources. Two additional gaps compound this: DNSEndpoint unstructured children are never included in the adoption loop, and the instance-label-based child selector silently matches nothing in installations where additionalLabels overrode the app.kubernetes.io/instance label.

Two further migration-ordering races introduce upgrade regressions. The LedgerClusterAgent reconciler now lists only Cluster objects; if it runs before the migrator creates the corresponding Cluster, it treats all existing replicated agent Secrets as orphans and deletes them, unexpectedly rotating credentials mid-upgrade. Similarly, a non-terminal LedgerBackupRun reconcile that fires before its Cluster exists immediately calls setRunFailed, permanently terminating in-flight backup runs due solely to migration ordering.

All five issues require resolution before this migration path can be considered safe for production upgrades.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 1 new inline finding.

Summary: #1481 (comment)

Comment thread misc/operator/internal/controller/ledgerservice_migrator.go
@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.95%. Comparing base (e9d5ca1) to head (c784130).

Additional details and impacted files
@@               Coverage Diff                @@
##           release/v3.0    #1481      +/-   ##
================================================
+ Coverage         73.64%   73.95%   +0.31%     
================================================
  Files               388      388              
  Lines             39745    39745              
================================================
+ Hits              29269    29394     +125     
+ Misses             7972     7834     -138     
- Partials           2504     2517      +13     
Flag Coverage Δ
e2e 73.95% <ø> (+0.31%) ⬆️
scenario 73.95% <ø> (+0.31%) ⬆️
unit 73.95% <ø> (+0.31%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

gfyrag pushed a commit that referenced this pull request Jul 2, 2026
…tion

Addresses PR #1481 blocker review: creating a Cluster alongside an
already-reconciled LedgerService leaves every child (StatefulSet,
Services, ServiceAccount, ConfigMap, Secret, Ingress, NetworkPolicy)
owned by the LedgerService. The Cluster reconciler's
SetControllerReference would then reject them as already-controlled,
and a later delete of the LedgerService would cascade-delete the live
workload.

The migrator now walks each of those kinds by
app.kubernetes.io/instance=<name> and rewrites any ownerReference
matching the LedgerService UID to a controller ref on the new Cluster.
Idempotent: objects without a matching ownerRef are skipped, so the
pass is cheap on already-migrated services.

The chainsaw e2e now asserts the resulting StatefulSet is owned by
Cluster, not LedgerService.

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 3 new inline findings.

Summary: #1481 (comment)

Comment thread misc/operator/internal/controller/ledgerservice_migrator.go
Comment thread misc/operator/internal/controller/ledgerservice_migrator.go
Comment thread misc/operator/internal/controller/ledgerservice_migrator.go
Geoffrey Ragot added 3 commits July 2, 2026 17:07
…mpat

Renames the LedgerService Kind end-to-end (Go types, CRD, kubectl-ledger,
Helm chart, e2e tests, Antithesis manifests, proto comment, docs) and
adds a runtime migration path so existing LedgerService manifests keep
working:

- LedgerService compat shell reuses ClusterSpec/ClusterStatus verbatim
  (no schema duplication) and is marked deprecated at the CRD level.
- New LedgerServiceMigrator reconciler creates a Cluster of the same
  name/namespace on each reconcile, annotates the LedgerService with
  ledger.formance.com/migrated-to and sets a Migrated=True condition.
  Never deletes the LS. Same names preserve derived StatefulSet/PVCs.
- New chainsaw e2e (ledgerservice-migration) covers the migrator path.
…tion

Addresses PR #1481 blocker review: creating a Cluster alongside an
already-reconciled LedgerService leaves every child (StatefulSet,
Services, ServiceAccount, ConfigMap, Secret, Ingress, NetworkPolicy)
owned by the LedgerService. The Cluster reconciler's
SetControllerReference would then reject them as already-controlled,
and a later delete of the LedgerService would cascade-delete the live
workload.

The migrator now walks each of those kinds by
app.kubernetes.io/instance=<name> and rewrites any ownerReference
matching the LedgerService UID to a controller ref on the new Cluster.
Idempotent: objects without a matching ownerRef are skipped, so the
pass is cheap on already-migrated services.

The chainsaw e2e now asserts the resulting StatefulSet is owned by
Cluster, not LedgerService.
…nces

Post-rebase fixup: aa373d5 (deletion protection default-on) landed
LedgerService references in files that didn't exist when the rename
commit was authored (deletion_protection_test.go, persistence_test.go,
NOTES.txt, validatingadmissionpolicy.yaml, cluster_types.go comment,
README.md prose).
@gfyrag gfyrag force-pushed the refactor/rename-ledgerservice-cluster branch from 431cf41 to c784130 Compare July 2, 2026 15:09

@NumaryBot NumaryBot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumaryBot posted 5 new inline findings.

Summary: #1481 (comment)

err := r.Get(ctx, types.NamespacedName{Name: ls.Name, Namespace: ls.Namespace}, cluster)
switch {
case apierrors.IsNotFound(err):
cluster = &ledgerv1alpha1.Cluster{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 [blocker] Child resources retain LedgerService ownerReferences after migration
reported by NumaryBot, codex

For in-place upgrades where a LedgerService has already reconciled and created child resources (ServiceAccount, Services, StatefulSet, etc.), the migrator only creates a new Cluster object but never reparents the children. Those resources retain a controller ownerReference pointing to the old LedgerService. When the Cluster reconciler subsequently calls SetControllerReference on them it will fail because Kubernetes rejects a second controller owner, so the new Cluster cannot manage or update the existing deployment. Additionally, if the old LedgerService is later deleted, Kubernetes garbage-collects all its owned children, destroying the live workload.

Suggestion: During migration, enumerate all child resources owned by the LedgerService (using ownerReference lookup or the instance label selector) and patch their ownerReferences to replace the LedgerService controller reference with one pointing to the newly created Cluster, before marking the source as migrated.

}

var services ledgerv1alpha1.LedgerServiceList
var services ledgerv1alpha1.ClusterList

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 [blocker] Agent secrets GC'd as orphans when Cluster not yet created during migration

The agent reconciler now lists only Cluster objects to find matched services. If it runs before the migrator has created the corresponding Cluster for a legacy LedgerService, matchedServices is empty and the garbage-collection path below deletes the existing replicated agent Secrets as orphans. On the next reconcile a new keypair is generated, unexpectedly rotating or removing credentials and causing the Cluster auth-keys ConfigMap to drop or change those keys mid-upgrade.

Suggestion: Treat matching legacy LedgerService objects as valid targets until migration converges (e.g., fall back to listing LedgerService when no Cluster match is found), or explicitly skip orphan GC while any LedgerService-to-Cluster migration is still pending.

return ctrl.Result{}, nil
}

// rehomeChildren walks every child object kind the Cluster reconciler owns

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 [major] DNSEndpoint children not rehomed during migration

When a legacy LedgerService had spec.dnsEndpoint.enabled set and the ExternalDNS CRD was installed, the reconcileDNSEndpoint path sets a controller ownerReference on the DNSEndpoint object. The migrator never lists or reparents this unstructured child resource. The new Cluster reconciler will fail SetControllerReference on the existing DNSEndpoint (already controlled by LedgerService), and deletion of the LedgerService will garbage-collect the live DNS record.

Suggestion: Extend the child adoption loop to also enumerate DNSEndpoint unstructured objects owned by the LedgerService (by GVR or ownerReference scan) and patch their controller ownerReference to point to the new Cluster.

func (r *LedgerServiceMigrator) rehomeChildren(ctx context.Context, ls *ledgerv1alpha1.LedgerService, cluster *ledgerv1alpha1.Cluster) error {
listOpts := []client.ListOption{
client.InNamespace(ls.Namespace),
client.MatchingLabels{labelInstance: ls.Name},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 [major] Instance-label selector misses children when additionalLabels overrides app.kubernetes.io/instance

The migrator uses the app.kubernetes.io/instance label to select child resources for adoption. For LedgerService objects where spec.additionalLabels overrode this label, all children were stamped with the overridden value, so the selector matches nothing. The migrator then marks the source as migrated while all old ownerReferences remain in place, silently leaving the Cluster reconciler unable to adopt or manage those resources.

Suggestion: Instead of (or in addition to) the instance label selector, scan children by iterating over the LedgerService's ownerReference UID across relevant resource types, which is label-value agnostic and will correctly identify all owned objects regardless of label overrides.

}

var ledgerService ledgerv1alpha1.LedgerService
var cluster ledgerv1alpha1.Cluster

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 [major] In-flight backup runs permanently failed due to migration ordering race

During an upgrade with an existing non-terminal LedgerBackupRun targeting a legacy LedgerService, there is no ordering guarantee that the migrator has already created the same-named Cluster before this reconcile runs. In that transient window the reconciler calls setRunFailed, permanently marking the run as failed even though the referenced workload may appear moments later. This turns a transient migration condition into a permanent terminal failure for in-flight or manual backup runs.

Suggestion: When the referenced Cluster is not found during active migration (i.e., a corresponding LedgerService still exists or migration is in progress), requeue with a backoff instead of marking the run as failed. Alternatively, fall back to resolving the legacy LedgerService resource as the target until the Cluster exists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants