monitor lifecycle conductor by benzekrimaha · Pull Request #2723 · scality/backbeat

benzekrimaha · 2026-03-02T16:23:04Z

Until now there was no good way to tell, from monitoring alone, why lifecycle wasn't making progress. A silent conductor (no scans being scheduled) and a stuck/failed scan (started but never completed) looked the same, and there was no signal for the bucket processor falling behind or for two scans running concurrently. This PR adds conductor- and bucket-processor-level observability so each of those situations is distinguishable and alertable, and it threads a scan id through the pipeline so logs and metrics for a given scan can be correlated end to end.

What it adds

Conductor scan lifecycle metrics
s3_lifecycle_latest_batch_start_time a scheduling heartbeat, set when a scan starts. Lets us detect "the conductor stopped scheduling scans."
s3_lifecycle_latest_batch_end_time + s3_lifecycle_latest_batch_bucket_count set only on successful completion (end timestamp + number of buckets listed). A scan that fails after starting deliberately does not update these, so the failure stays visible instead of looking complete.
Bucket-processor scan metrics
s3_lifecycle_bucket_processor_scan_messages_total bucket-task messages picked up, labelled by conductor_scan_id (used to spot overlapping/concurrent scans).
s3_lifecycle_bucket_processor_scan_message_age_seconds (histogram) wall-clock time from conductor scan start to bucket-processor pickup. This is a dequeue/backlog signal, not task-processing duration.
Scan-context propagation conductorScanId and conductorScanStartTimestamp are carried from the conductor through bucket-task messages, continuation entries, action queue entries, and task logs => so all work for a scan is attributable to it without putting bucket names (high cardinality) into metrics.
Alerts (monitoring/lifecycle/alerts.yaml)
LifecycleLateScan — the conductor stopped scheduling scans (based on the start-time heartbeat).
LifecycleStalledScan — a scan started but its completion metric didn't catch up (start_time > end_time), i.e. it's stalled or failed after starting. Uses dedicated thresholds and auto-resolves on the next successful scan.
Dashboard :
New "Lifecycle Conductor" panels (scheduling heartbeat, last batch duration, buckets listed, stalled state) and bucket-processor panels (recent conductor scan ids, message-pickup age).

Through the added metrics we have now a bounded label cardinality. The per-conductor_scan_id series are cleaned up by per-scan timers after a fixed retention, so the label can't grow unbounded in prom-client; the metric help text points readers to that timer.
Failure ≠ completion => Failed scans reset local conductor state but don't publish completion metrics, which is what makes LifecycleStalledScan fire correctly (and resolve on the next good scan).
Rolling-upgrade safe => Messages from an older conductor without a scan id don't create synthetic undefined series.

=> No high-cardinality data in metrics as bucket-level detail stays in logs (correlated by scan id), not in label values.

=> Dashboards were verified on a live preview deployment of this branch the conductor panels populate with real scan data. (Bucket-processor throughput panels read "No data" on a small test platform because each scan completes within a single Prometheus scrape; they populate under sustained lifecycle load.)

bert-e · 2026-03-02T16:23:08Z

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options

name	description	privileged	authored
`/after_pull_request`	Wait for the given pull request id to be merged before continuing with the current one.
`/bypass_author_approval`	Bypass the pull request author's approval	⭐
`/bypass_build_status`	Bypass the build and test status	⭐
`/bypass_commit_size`	Bypass the check on the size of the changeset `TBA`	⭐
`/bypass_incompatible_branch`	Bypass the check on the source branch prefix	⭐
`/bypass_jira_check`	Bypass the Jira issue check	⭐
`/bypass_peer_approval`	Bypass the pull request peers' approval	⭐
`/bypass_leader_approval`	Bypass the pull request leaders' approval	⭐
`/approve`	Instruct Bert-E that the author has approved the pull request.		✍️
`/create_pull_requests`	Allow the creation of integration pull requests.
`/create_integration_branches`	Allow the creation of integration branches.
`/no_octopus`	Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
`/unanimity`	Change review acceptance criteria from `one reviewer at least` to `all reviewers`
`/wait`	Instruct Bert-E not to run until further notice.

Available commands

name	description	privileged
`/help`	Print Bert-E's manual in the pull request.
`/status`	Print Bert-E's current status in the pull request `TBA`
`/clear`	Remove all comments from Bert-E from the history `TBA`
`/retry`	Re-start a fresh build `TBA`
`/build`	Re-start a fresh build `TBA`
`/force_reset`	Delete integration branches & pull requests, and restart merge process from the beginning.
`/reset`	Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

codecov · 2026-03-02T16:48:01Z

Codecov Report

❌ Patch coverage is 96.19048% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.74%. Comparing base (584ef18) to head (a2584af).
⚠️ Report is 43 commits behind head on development/9.5.

Files with missing lines	Patch %	Lines
...ecycle/bucketProcessor/LifecycleBucketProcessor.js	76.92%	3 Missing ⚠️
extensions/lifecycle/LifecycleMetrics.js	97.77%	1 Missing ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
...tensions/lifecycle/conductor/LifecycleConductor.js	`87.34% <100.00%> (+0.24%)`	⬆️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js	`92.81% <100.00%> (+0.04%)`	⬆️
extensions/lifecycle/tasks/LifecycleTask.js	`91.63% <100.00%> (+0.08%)`	⬆️
extensions/lifecycle/tasks/LifecycleTaskV2.js	`88.99% <100.00%> (+0.10%)`	⬆️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js	`81.33% <100.00%> (+0.25%)`	⬆️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js	`92.00% <100.00%> (+0.08%)`	⬆️
lib/models/ActionQueueEntry.js	`96.29% <ø> (ø)`
extensions/lifecycle/LifecycleMetrics.js	`97.95% <97.77%> (-0.26%)`	⬇️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js	`80.60% <76.92%> (+0.73%)`	⬆️

... and 3 files with indirect coverage changes

Components	Coverage Δ
Bucket Notification	`80.22% <ø> (ø)`
Core Library	`80.62% <ø> (-0.61%)`	⬇️
Ingestion	`70.63% <ø> (ø)`
Lifecycle	`79.97% <96.19%> (+0.44%)`	⬆️
Oplog Populator	`85.83% <ø> (ø)`
Replication	`59.70% <ø> (ø)`
Bucket Scanner	`85.76% <ø> (ø)`

@@                 Coverage Diff                 @@
##           development/9.5    #2723      +/-   ##
===================================================
- Coverage            74.87%   74.74%   -0.13%     
===================================================
  Files                  200      200              
  Lines                13670    13745      +75     
===================================================
+ Hits                 10235    10274      +39     
- Misses                3425     3461      +36     
  Partials                10       10

Flag	Coverage Δ
api:retry	`9.10% <0.00%> (-0.06%)`	⬇️
api:routes	`8.92% <0.00%> (-0.06%)`	⬇️
bucket-scanner	`85.76% <ø> (ø)`
ft_test:queuepopulator	`9.21% <12.38%> (-1.17%)`	⬇️
ingestion	`12.49% <0.00%> (-0.08%)`	⬇️
lib	`7.71% <0.00%> (-0.06%)`	⬇️
lifecycle	`19.16% <68.57%> (+0.24%)`	⬆️
notification	`1.01% <0.00%> (-0.01%)`	⬇️
oplogPopulator	`0.14% <0.00%> (-0.01%)`	⬇️
replication	`18.64% <12.38%> (-0.02%)`	⬇️
unit	`51.77% <88.57%> (+0.37%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

bert-e · 2026-06-24T05:32:58Z

Waiting for approval

The following approvals are needed before I can proceed with the merge:

the author
2 peers

The following reviewers are expecting changes from the author, or must review again:

…ocessor. Issue: BB-740

Issue: BB-740

…nductor. Issue: BB-740

SylvainSenechal · 2026-06-29T08:42:30Z

Side note : the top pull request overview comment may need to be updated as it was openend a long time ago : Not sure if it was updated recently, but the screenshots probably shows older grafana panels, as I see duplicated graphs

Also wished this top level comment could tell us a bit about all the prs that will be needed for the feature : I see there is another backbeat pr, but I don't know what each one focus on, also I'm surprised there is no zenko/zenko-operator but maybe this is ok

If its too annoying to open/setup a lab again just to take a fresh screenshot dont bother, but at least indicate its out of date

benzekrimaha · 2026-06-29T09:04:05Z

@SylvainSenechal , screenshots are recent and there are no duplicates I just tried to show the whole pannel and some dashboards are on both screenshots. For zenko operator there is no need , and zenko as well these are metrics that are proper to backbeat and the conductor. I am updating the summary , and the commits have informations as well to make the review easier.

bert-e · 2026-06-29T11:46:52Z

Waiting for approval

The following approvals are needed before I can proceed with the merge:

the author
2 peers

The following reviewers are expecting changes from the author, or must review again:

Issue: BB-740

SylvainSenechal

I think it's ok, I don't know if 100% of the edge cases are handled in case of failure/unexpected situations but I think for metrics its alright

Some comments from Thomas and François are still opened
It would be easier if you closed the resolve issues progressively, lot's of messages that are already fixed, but this is moslty claude repeating itself

edit: One more question, not sure if this was discussed before : we are adding some data to kafka messages, data which is only used for metrics (like 'conductorScanId') : Is this ok ? is this gonna increase the message size/perf significantly ? Do we want a mode to disable that ?

benzekrimaha · 2026-06-29T15:20:58Z

Thank you for the review @SylvainSenechal
On the Kafka message size/perf: the only fields added to each bucket-task message are conductorScanId (a UUID, ~36 chars) and conductorScanStartTimestamp (a number) under contextInfo — roughly ~100 bytes including the JSON keys. Bucket-task messages already carry bucket / owner (canonical id) / accountId / action / reqId / taskVersion (several hundred bytes to ~1 KB), and there's one message per bucket per scan, so it's a small relative bump with no extra processing or round-trips no measurable size or throughput impact.
One clarification: conductorScanId isn't metrics-only. It's the scan-correlation id and is also added to the request-logger default fields, so it ties together all logs across conductor → bucket-processor → lifecycle tasks for a given scan. It follows the same pattern as the existing contextInfo.reqId (also always included for request correlation) — so it's general observability (logs + metrics), not a metric-only payload.
Given that, I'd rather not add a toggle to disable it: the overhead is negligible and the id is foundational to the end-to-end correlation this PR delivers, a config flag would add a rarely-used code path for ~100 bytes. Happy to revisit if it ever shows up at very high message rates.
And agreed on the threads I'll resolve the ones already addressed so it's easier to see what's actually open.

…id age. Skip the observation and log a warning when the conductor scan-start timestamp is missing or the clocks are skewed, instead of recording a misleading 0. Drop the gauge comments that duplicated the metric help, align the unit-test metric name with the source, and remove the test for the no-longer-configurable scan-metric retention. Issue: BB-740

francoisferrand · 2026-06-30T09:13:05Z

                bucket,
                owner,
                accountId,


not needed, already set as default fields (same below)

Could you clarify which fields you mean here? The request logger's default fields are conductorScanId/conductorScanStartTimestamp (set via addDefaultFields(scanLogInfo)); bucket/owner/accountId in this log.error (and the log.debug below) are not default fields. Do you want those promoted to the logger default fields (set once after parsing) and removed from the two calls?

francoisferrand · 2026-06-30T10:03:56Z

+    ],
+)
+
+bucket_processor_active_scan_ids = TimeSeries(


these panels should not be empty, even on a lab : the metric is a Counter (ever increasing), and cleared on a timer which should be long enough for prometheus to scrap. Even if it completes shortly, the metrics will stay long after the next scrap.

→ something really wrong here. Maybe you need to setup lifecycle rules on a bucket (so one is actually processed), or there is a bucket somewhere in the chain

Re-validated on the platform with live lifecycle workload (a bb740-lc bucket with a lifecycle rule, plus the existing bench buckets). The panels were empty in the earlier screenshot simply because no buckets were being processed at that moment - it's not a metric/scrape bug. With workload:

The bucket-processor exposes the counter per scan, e.g. s3_lifecycle_bucket_processor_scan_messages_total{origin="bucket_processor",conductor_scan_id="..."} 133.

Prometheus is scraping it: sum(s3_lifecycle_bucket_processor_scan_messages_total) = 13428 across 124 lifecycle series.

The panel's own query is non-zero: sum(rate(s3_lifecycle_bucket_processor_scan_messages_total[5m])) by (pod) > 0 -> ~0.032/s for the bucket-processor pod, so Bucket Tasks Picked Up by Pod draws a line.

Conductor-side metrics (s3_lifecycle_latest_batch_start_time, etc.) are present as well.

So the section populates as soon as there is lifecycle work. Note the per-conductor_scan_id panel is naturally sparse between scans (each scan id only increments during its short scan window), which is the expected shape for a "detect parallel scans" view.

bert-e · 2026-06-30T10:04:15Z

Waiting for approval

The following approvals are needed before I can proceed with the merge:

the author
2 peers

The following reviewers are expecting changes from the author, or must review again:

Drop the redundant `_currentScanId !== scanId` throttle guard: Throttling only originates from the backlog-control gate, before the scan id is set, so a Throttling error always means the scan never started. Move the unknown-canonical-id read to where it is used (after the throttle return) and put the batch-start log on a single line. Issue: BB-740

`_getTransitionActionEntry` now builds the action context from `params.bucketData` via `_getActionContext`, so callers pass `bucketData` instead of a precomputed `actionContext`. Also fit the continuation-entry call on a single line. Issue: BB-740

The Last Conductor Batch Duration and Buckets Listed stat panels were 6 units wide, leaving the right half of their row empty. Widen them to fill the row so the Lifecycle Conductor section has no layout holes. Issue: BB-740

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 Compare March 2, 2026 16:32

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c Compare March 11, 2026 16:03

benzekrimaha marked this pull request as ready for review March 11, 2026 16:35

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 Compare March 11, 2026 16:48

benzekrimaha requested review from a team, SylvainSenechal and francoisferrand March 13, 2026 08:49

francoisferrand requested a review from delthas March 18, 2026 09:19

benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 Compare March 18, 2026 10:04

benzekrimaha changed the title ~~Improvement/bb 740 monitor lifecycle conductor~~ Improvement/BB-740 monitor lifecycle conductor Mar 18, 2026

francoisferrand changed the title ~~Improvement/BB-740 monitor lifecycle conductor~~ monitor lifecycle conductor Mar 18, 2026

delthas reviewed Mar 18, 2026

View reviewed changes

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated