monitor lifecycle conductor#2723
Conversation
Hello benzekrimaha,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
380069a to
25ea9d5
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files
... and 3 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.5 #2723 +/- ##
===================================================
- Coverage 74.87% 74.74% -0.13%
===================================================
Files 200 200
Lines 13670 13745 +75
===================================================
+ Hits 10235 10274 +39
- Misses 3425 3461 +36
Partials 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
8316f88 to
408c96c
Compare
408c96c to
e1c5b13
Compare
e1c5b13 to
aefb677
Compare
725c3df to
11a94ea
Compare
a2128cf to
a464b39
Compare
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
The following reviewers are expecting changes from the author, or must review again: |
|
Side note : the top pull request overview comment may need to be updated as it was openend a long time ago : Not sure if it was updated recently, but the screenshots probably shows older grafana panels, as I see duplicated graphs Also wished this top level comment could tell us a bit about all the prs that will be needed for the feature : I see there is another backbeat pr, but I don't know what each one focus on, also I'm surprised there is no zenko/zenko-operator but maybe this is ok If its too annoying to open/setup a lab again just to take a fresh screenshot dont bother, but at least indicate its out of date |
|
@SylvainSenechal , screenshots are recent and there are no duplicates I just tried to show the whole pannel and some dashboards are on both screenshots. For zenko operator there is no need , and zenko as well these are metrics that are proper to backbeat and the conductor. I am updating the summary , and the commits have informations as well to make the review easier. |
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
The following reviewers are expecting changes from the author, or must review again: |
There was a problem hiding this comment.
I think it's ok, I don't know if 100% of the edge cases are handled in case of failure/unexpected situations but I think for metrics its alright
Some comments from Thomas and François are still opened
It would be easier if you closed the resolve issues progressively, lot's of messages that are already fixed, but this is moslty claude repeating itself
edit: One more question, not sure if this was discussed before : we are adding some data to kafka messages, data which is only used for metrics (like 'conductorScanId') : Is this ok ? is this gonna increase the message size/perf significantly ? Do we want a mode to disable that ?
|
Thank you for the review @SylvainSenechal |
…id age. Skip the observation and log a warning when the conductor scan-start timestamp is missing or the clocks are skewed, instead of recording a misleading 0. Drop the gauge comments that duplicated the metric help, align the unit-test metric name with the source, and remove the test for the no-longer-configurable scan-metric retention. Issue: BB-740
| bucket, | ||
| owner, | ||
| accountId, |
There was a problem hiding this comment.
not needed, already set as default fields (same below)
There was a problem hiding this comment.
Could you clarify which fields you mean here? The request logger's default fields are conductorScanId/conductorScanStartTimestamp (set via addDefaultFields(scanLogInfo)); bucket/owner/accountId in this log.error (and the log.debug below) are not default fields. Do you want those promoted to the logger default fields (set once after parsing) and removed from the two calls?
| ], | ||
| ) | ||
|
|
||
| bucket_processor_active_scan_ids = TimeSeries( |
There was a problem hiding this comment.
these panels should not be empty, even on a lab : the metric is a Counter (ever increasing), and cleared on a timer which should be long enough for prometheus to scrap. Even if it completes shortly, the metrics will stay long after the next scrap.
→ something really wrong here. Maybe you need to setup lifecycle rules on a bucket (so one is actually processed), or there is a bucket somewhere in the chain
There was a problem hiding this comment.
Re-validated on the platform with live lifecycle workload (a bb740-lc bucket with a lifecycle rule, plus the existing bench buckets). The panels were empty in the earlier screenshot simply because no buckets were being processed at that moment - it's not a metric/scrape bug. With workload:
- The bucket-processor exposes the counter per scan, e.g.
s3_lifecycle_bucket_processor_scan_messages_total{origin="bucket_processor",conductor_scan_id="..."} 133. - Prometheus is scraping it:
sum(s3_lifecycle_bucket_processor_scan_messages_total) = 13428across 124 lifecycle series. - The panel's own query is non-zero:
sum(rate(s3_lifecycle_bucket_processor_scan_messages_total[5m])) by (pod) > 0-> ~0.032/s for the bucket-processor pod, soBucket Tasks Picked Up by Poddraws a line. - Conductor-side metrics (
s3_lifecycle_latest_batch_start_time, etc.) are present as well.
So the section populates as soon as there is lifecycle work. Note the per-conductor_scan_id panel is naturally sparse between scans (each scan id only increments during its short scan window), which is the expected shape for a "detect parallel scans" view.
Waiting for approvalThe following approvals are needed before I can proceed with the merge:
The following reviewers are expecting changes from the author, or must review again: |
Drop the redundant `_currentScanId !== scanId` throttle guard: Throttling only originates from the backlog-control gate, before the scan id is set, so a Throttling error always means the scan never started. Move the unknown-canonical-id read to where it is used (after the throttle return) and put the batch-start log on a single line. Issue: BB-740
`_getTransitionActionEntry` now builds the action context from `params.bucketData` via `_getActionContext`, so callers pass `bucketData` instead of a precomputed `actionContext`. Also fit the continuation-entry call on a single line. Issue: BB-740
The Last Conductor Batch Duration and Buckets Listed stat panels were 6 units wide, leaving the right half of their row empty. Widen them to fill the row so the Lifecycle Conductor section has no layout holes. Issue: BB-740
Issue: BB-740
Until now there was no good way to tell, from monitoring alone, why lifecycle wasn't making progress. A silent conductor (no scans being scheduled) and a stuck/failed scan (started but never completed) looked the same, and there was no signal for the bucket processor falling behind or for two scans running concurrently. This PR adds conductor- and bucket-processor-level observability so each of those situations is distinguishable and alertable, and it threads a scan id through the pipeline so logs and metrics for a given scan can be correlated end to end.
What it adds
Conductor scan lifecycle metrics
s3_lifecycle_latest_batch_start_timea scheduling heartbeat, set when a scan starts. Lets us detect "the conductor stopped scheduling scans."s3_lifecycle_latest_batch_end_time+s3_lifecycle_latest_batch_bucket_countset only on successful completion (end timestamp + number of buckets listed). A scan that fails after starting deliberately does not update these, so the failure stays visible instead of looking complete.Bucket-processor scan metrics
s3_lifecycle_bucket_processor_scan_messages_totalbucket-task messages picked up, labelled by conductor_scan_id (used to spot overlapping/concurrent scans).s3_lifecycle_bucket_processor_scan_message_age_seconds(histogram) wall-clock time from conductor scan start to bucket-processor pickup. This is a dequeue/backlog signal, not task-processing duration.Scan-context propagation conductorScanId and conductorScanStartTimestamp are carried from the conductor through bucket-task messages, continuation entries, action queue entries, and task logs => so all work for a scan is attributable to it without putting bucket names (high cardinality) into metrics.
Alerts (monitoring/lifecycle/alerts.yaml)
LifecycleLateScan — the conductor stopped scheduling scans (based on the start-time heartbeat).
LifecycleStalledScan — a scan started but its completion metric didn't catch up (start_time > end_time), i.e. it's stalled or failed after starting. Uses dedicated thresholds and auto-resolves on the next successful scan.
Dashboard :
New "Lifecycle Conductor" panels (scheduling heartbeat, last batch duration, buckets listed, stalled state) and bucket-processor panels (recent conductor scan ids, message-pickup age).
Through the added metrics we have now a bounded label cardinality. The per-conductor_scan_id series are cleaned up by per-scan timers after a fixed retention, so the label can't grow unbounded in prom-client; the metric help text points readers to that timer.
Failure ≠ completion => Failed scans reset local conductor state but don't publish completion metrics, which is what makes LifecycleStalledScan fire correctly (and resolve on the next good scan).
Rolling-upgrade safe => Messages from an older conductor without a scan id don't create synthetic undefined series.
=> No high-cardinality data in metrics as bucket-level detail stays in logs (correlated by scan id), not in label values.
=> Dashboards were verified on a live preview deployment of this branch the conductor panels populate with real scan data. (Bucket-processor throughput panels read "No data" on a small test platform because each scan completes within a single Prometheus scrape; they populate under sustained lifecycle load.)