Skip to content

monitor lifecycle conductor#2723

Open
benzekrimaha wants to merge 24 commits into
development/9.5from
improvement/BB-740-monitor-lifecycle-conductor
Open

monitor lifecycle conductor#2723
benzekrimaha wants to merge 24 commits into
development/9.5from
improvement/BB-740-monitor-lifecycle-conductor

Conversation

@benzekrimaha

@benzekrimaha benzekrimaha commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Issue: BB-740

Until now there was no good way to tell, from monitoring alone, why lifecycle wasn't making progress. A silent conductor (no scans being scheduled) and a stuck/failed scan (started but never completed) looked the same, and there was no signal for the bucket processor falling behind or for two scans running concurrently. This PR adds conductor- and bucket-processor-level observability so each of those situations is distinguishable and alertable, and it threads a scan id through the pipeline so logs and metrics for a given scan can be correlated end to end.

What it adds

  1. Conductor scan lifecycle metrics
    s3_lifecycle_latest_batch_start_time a scheduling heartbeat, set when a scan starts. Lets us detect "the conductor stopped scheduling scans."
    s3_lifecycle_latest_batch_end_time + s3_lifecycle_latest_batch_bucket_count set only on successful completion (end timestamp + number of buckets listed). A scan that fails after starting deliberately does not update these, so the failure stays visible instead of looking complete.

  2. Bucket-processor scan metrics
    s3_lifecycle_bucket_processor_scan_messages_total bucket-task messages picked up, labelled by conductor_scan_id (used to spot overlapping/concurrent scans).
    s3_lifecycle_bucket_processor_scan_message_age_seconds (histogram) wall-clock time from conductor scan start to bucket-processor pickup. This is a dequeue/backlog signal, not task-processing duration.

  3. Scan-context propagation conductorScanId and conductorScanStartTimestamp are carried from the conductor through bucket-task messages, continuation entries, action queue entries, and task logs => so all work for a scan is attributable to it without putting bucket names (high cardinality) into metrics.

  4. Alerts (monitoring/lifecycle/alerts.yaml)
    LifecycleLateScan — the conductor stopped scheduling scans (based on the start-time heartbeat).
    LifecycleStalledScan — a scan started but its completion metric didn't catch up (start_time > end_time), i.e. it's stalled or failed after starting. Uses dedicated thresholds and auto-resolves on the next successful scan.

  5. Dashboard :
    New "Lifecycle Conductor" panels (scheduling heartbeat, last batch duration, buckets listed, stalled state) and bucket-processor panels (recent conductor scan ids, message-pickup age).

Through the added metrics we have now a bounded label cardinality. The per-conductor_scan_id series are cleaned up by per-scan timers after a fixed retention, so the label can't grow unbounded in prom-client; the metric help text points readers to that timer.
Failure ≠ completion => Failed scans reset local conductor state but don't publish completion metrics, which is what makes LifecycleStalledScan fire correctly (and resolve on the next good scan).
Rolling-upgrade safe => Messages from an older conductor without a scan id don't create synthetic undefined series.

=> No high-cardinality data in metrics as bucket-level detail stays in logs (correlated by scan id), not in label values.

=> Dashboards were verified on a live preview deployment of this branch the conductor panels populate with real scan data. (Bucket-processor throughput panels read "No data" on a small test platform because each scan completes within a single Prometheus scrape; they populate under sustained lifecycle load.)

image image

@bert-e

bert-e commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 Compare March 2, 2026 16:32
@codecov

codecov Bot commented Mar 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.19048% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.74%. Comparing base (584ef18) to head (a2584af).
⚠️ Report is 43 commits behind head on development/9.5.

Files with missing lines Patch % Lines
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 76.92% 3 Missing ⚠️
extensions/lifecycle/LifecycleMetrics.js 97.77% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
...tensions/lifecycle/conductor/LifecycleConductor.js 87.34% <100.00%> (+0.24%) ⬆️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 92.81% <100.00%> (+0.04%) ⬆️
extensions/lifecycle/tasks/LifecycleTask.js 91.63% <100.00%> (+0.08%) ⬆️
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.99% <100.00%> (+0.10%) ⬆️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 81.33% <100.00%> (+0.25%) ⬆️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 92.00% <100.00%> (+0.08%) ⬆️
lib/models/ActionQueueEntry.js 96.29% <ø> (ø)
extensions/lifecycle/LifecycleMetrics.js 97.95% <97.77%> (-0.26%) ⬇️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 80.60% <76.92%> (+0.73%) ⬆️

... and 3 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.22% <ø> (ø)
Core Library 80.62% <ø> (-0.61%) ⬇️
Ingestion 70.63% <ø> (ø)
Lifecycle 79.97% <96.19%> (+0.44%) ⬆️
Oplog Populator 85.83% <ø> (ø)
Replication 59.70% <ø> (ø)
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.5    #2723      +/-   ##
===================================================
- Coverage            74.87%   74.74%   -0.13%     
===================================================
  Files                  200      200              
  Lines                13670    13745      +75     
===================================================
+ Hits                 10235    10274      +39     
- Misses                3425     3461      +36     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.10% <0.00%> (-0.06%) ⬇️
api:routes 8.92% <0.00%> (-0.06%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 9.21% <12.38%> (-1.17%) ⬇️
ingestion 12.49% <0.00%> (-0.08%) ⬇️
lib 7.71% <0.00%> (-0.06%) ⬇️
lifecycle 19.16% <68.57%> (+0.24%) ⬆️
notification 1.01% <0.00%> (-0.01%) ⬇️
oplogPopulator 0.14% <0.00%> (-0.01%) ⬇️
replication 18.64% <12.38%> (-0.02%) ⬇️
unit 51.77% <88.57%> (+0.37%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c Compare March 11, 2026 16:03
@benzekrimaha benzekrimaha marked this pull request as ready for review March 11, 2026 16:35
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 Compare March 11, 2026 16:48
@francoisferrand francoisferrand requested a review from delthas March 18, 2026 09:19
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 Compare March 18, 2026 10:04
@benzekrimaha benzekrimaha changed the title Improvement/bb 740 monitor lifecycle conductor Improvement/BB-740 monitor lifecycle conductor Mar 18, 2026
@francoisferrand francoisferrand changed the title Improvement/BB-740 monitor lifecycle conductor monitor lifecycle conductor Mar 18, 2026
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea Compare March 19, 2026 09:36
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 Compare March 19, 2026 09:43
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
@bert-e

bert-e commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following reviewers are expecting changes from the author, or must review again:

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/tasks/LifecycleTask.js
Comment thread extensions/lifecycle/LifecycleMetrics.js
Comment thread extensions/lifecycle/LifecycleMetrics.js Outdated
Comment thread monitoring/lifecycle/alerts.yaml
Comment thread monitoring/lifecycle/dashboard.py Outdated
Comment thread monitoring/lifecycle/dashboard.py Outdated
@SylvainSenechal

SylvainSenechal commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Side note : the top pull request overview comment may need to be updated as it was openend a long time ago : Not sure if it was updated recently, but the screenshots probably shows older grafana panels, as I see duplicated graphs

Also wished this top level comment could tell us a bit about all the prs that will be needed for the feature : I see there is another backbeat pr, but I don't know what each one focus on, also I'm surprised there is no zenko/zenko-operator but maybe this is ok

If its too annoying to open/setup a lab again just to take a fresh screenshot dont bother, but at least indicate its out of date

@benzekrimaha

benzekrimaha commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

@SylvainSenechal , screenshots are recent and there are no duplicates I just tried to show the whole pannel and some dashboards are on both screenshots. For zenko operator there is no need , and zenko as well these are metrics that are proper to backbeat and the conductor. I am updating the summary , and the commits have informations as well to make the review easier.

Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js
@bert-e

bert-e commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following reviewers are expecting changes from the author, or must review again:

Comment thread extensions/lifecycle/LifecycleMetrics.js
Comment thread extensions/lifecycle/LifecycleMetrics.js Outdated

@SylvainSenechal SylvainSenechal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok, I don't know if 100% of the edge cases are handled in case of failure/unexpected situations but I think for metrics its alright

Some comments from Thomas and François are still opened
It would be easier if you closed the resolve issues progressively, lot's of messages that are already fixed, but this is moslty claude repeating itself

edit: One more question, not sure if this was discussed before : we are adding some data to kafka messages, data which is only used for metrics (like 'conductorScanId') : Is this ok ? is this gonna increase the message size/perf significantly ? Do we want a mode to disable that ?

@benzekrimaha

Copy link
Copy Markdown
Contributor Author

Thank you for the review @SylvainSenechal
On the Kafka message size/perf: the only fields added to each bucket-task message are conductorScanId (a UUID, ~36 chars) and conductorScanStartTimestamp (a number) under contextInfo — roughly ~100 bytes including the JSON keys. Bucket-task messages already carry bucket / owner (canonical id) / accountId / action / reqId / taskVersion (several hundred bytes to ~1 KB), and there's one message per bucket per scan, so it's a small relative bump with no extra processing or round-trips no measurable size or throughput impact.
One clarification: conductorScanId isn't metrics-only. It's the scan-correlation id and is also added to the request-logger default fields, so it ties together all logs across conductor → bucket-processor → lifecycle tasks for a given scan. It follows the same pattern as the existing contextInfo.reqId (also always included for request correlation) — so it's general observability (logs + metrics), not a metric-only payload.
Given that, I'd rather not add a toggle to disable it: the overhead is negligible and the id is foundational to the end-to-end correlation this PR delivers, a config flag would add a rarely-used code path for ~100 bytes. Happy to revisit if it ever shows up at very high message rates.
And agreed on the threads I'll resolve the ones already addressed so it's easier to see what's actually open.

…id age.

Skip the observation and log a warning when the conductor scan-start
timestamp is missing or the clocks are skewed, instead of recording a
misleading 0. Drop the gauge comments that duplicated the metric help,
align the unit-test metric name with the source, and remove the test for
the no-longer-configurable scan-metric retention.

Issue: BB-740
Comment on lines 299 to 301
bucket,
owner,
accountId,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed, already set as default fields (same below)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify which fields you mean here? The request logger's default fields are conductorScanId/conductorScanStartTimestamp (set via addDefaultFields(scanLogInfo)); bucket/owner/accountId in this log.error (and the log.debug below) are not default fields. Do you want those promoted to the logger default fields (set once after parsing) and removed from the two calls?

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/tasks/LifecycleTask.js Outdated
Comment thread extensions/lifecycle/tasks/LifecycleTask.js Outdated
Comment thread monitoring/lifecycle/dashboard.py Outdated
],
)

bucket_processor_active_scan_ids = TimeSeries(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these panels should not be empty, even on a lab : the metric is a Counter (ever increasing), and cleared on a timer which should be long enough for prometheus to scrap. Even if it completes shortly, the metrics will stay long after the next scrap.

→ something really wrong here. Maybe you need to setup lifecycle rules on a bucket (so one is actually processed), or there is a bucket somewhere in the chain

@benzekrimaha benzekrimaha Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-validated on the platform with live lifecycle workload (a bb740-lc bucket with a lifecycle rule, plus the existing bench buckets). The panels were empty in the earlier screenshot simply because no buckets were being processed at that moment - it's not a metric/scrape bug. With workload:

  • The bucket-processor exposes the counter per scan, e.g. s3_lifecycle_bucket_processor_scan_messages_total{origin="bucket_processor",conductor_scan_id="..."} 133.
  • Prometheus is scraping it: sum(s3_lifecycle_bucket_processor_scan_messages_total) = 13428 across 124 lifecycle series.
  • The panel's own query is non-zero: sum(rate(s3_lifecycle_bucket_processor_scan_messages_total[5m])) by (pod) > 0 -> ~0.032/s for the bucket-processor pod, so Bucket Tasks Picked Up by Pod draws a line.
  • Conductor-side metrics (s3_lifecycle_latest_batch_start_time, etc.) are present as well.

So the section populates as soon as there is lifecycle work. Note the per-conductor_scan_id panel is naturally sparse between scans (each scan id only increments during its short scan window), which is the expected shape for a "detect parallel scans" view.

image image

@bert-e

bert-e commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

The following approvals are needed before I can proceed with the merge:

  • the author

  • 2 peers

The following reviewers are expecting changes from the author, or must review again:

Drop the redundant `_currentScanId !== scanId` throttle guard: Throttling
only originates from the backlog-control gate, before the scan id is set,
so a Throttling error always means the scan never started. Move the
unknown-canonical-id read to where it is used (after the throttle return)
and put the batch-start log on a single line.

Issue: BB-740
`_getTransitionActionEntry` now builds the action context from
`params.bucketData` via `_getActionContext`, so callers pass `bucketData`
instead of a precomputed `actionContext`. Also fit the continuation-entry
call on a single line.

Issue: BB-740
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
The Last Conductor Batch Duration and Buckets Listed stat panels were 6 units wide, leaving the right half of their row empty. Widen them to fill the row so the Lifecycle Conductor section has no layout holes.

Issue: BB-740
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants