Harden PR 101 WPCOM and streaming follow-ups #110

Merged: chrisbliss18 merged 2 commits into v2 from feature/pr-101-followups on May 13, 2026

chrisbliss18 commented May 13, 2026

Summary

This draft PR closes the useful follow-up items from superseded PR #101 that still apply to current v2. PR #101 targeted the older round/page scheduler path and was superseded by the streaming monitor engine from #104, so this branch keeps the applicable hardening without reviving stale scheduler code.

What Changed

  • Rebased feature/pr-101-followups on latest v2 after "Avoid repeated Veriflier confirmation for down sites" (#111) and resolved the roadmap conflict.
  • Added typed WPCOM HTTP status errors so 404/410 responses are treated as permanent per-notification failures.
  • Prevented permanent WPCOM 404/410 responses from opening the shared WPCOM circuit breaker, queueing, or triggering pointless immediate retries.
  • Added WPCOM permanent-failure metrics, HTTP-status-specific permanent-failure metrics, and wpcom_failure audit rows so operators still have a clear evidence trail.
  • Exposed wpcom.ErrCircuitOpen and updated the orchestrator to count already-queued circuit-open responses as queued instead of retrying into the open circuit.
  • Added bounded WPCOM queue-drop logging to avoid log floods during broad WPCOM outages.
  • Added scheduler.streaming.pressure_suppressed.count and streaming summary output for local timeout/connect failures that are suppressed under monitor-side pressure.
  • Confirmed current v2 does not add the old idx_monitor_blog_id legacy-table index; the roadmap now records that this should stay evidence-led and sidecar-first.

Validation

  • go test ./internal/wpcom ./internal/orchestrator
  • go test ./...

Chris Jean added 2 commits May 13, 2026 14:30
PR #101 is now stale because the merged streaming monitor engine replaced the older round/page scheduler path that branch optimized. Merging it directly would regress current v2 work, but the review identified a few useful ideas worth preserving.

Add roadmap entries for permanent WPCOM status handling, streaming-aware transport failure-storm suppression, and evidence-led evaluation of any remaining jetpack_monitor_sites blog_id indexing needs. These notes give the follow-up branch a scoped paper trail without keeping the superseded PR open.
Classify WPCOM 404 and 410 responses as permanent per-notification failures instead of transport failures. These errors now bypass the global WPCOM circuit breaker, skip pointless immediate retry pressure, emit permanent-failure metrics, and write an audit failure row so operators still have an evidence trail.

Also expose ErrCircuitOpen so the orchestrator can treat already-queued notifications as queued rather than retrying into an open circuit. Add bounded queue-drop logging to keep broad WPCOM outages from flooding logs.

Finally, make the streaming engine report pressure-suppressed local timeout/connect failures. This preserves the existing failure-storm guard while giving sysadmins a visible counter for monitor-side pressure suppression.
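
One way to picture the pressure-suppression counter is the sketch below. The types and the `underPressure` flag are hypothetical; only the metric name scheduler.streaming.pressure_suppressed.count comes from this PR, and how the streaming engine actually detects monitor-side pressure is not shown here.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FailureKind is a hypothetical classification of a failed check.
type FailureKind int

const (
	LocalTimeout FailureKind = iota
	LocalConnect
	HTTPStatus
)

// Stats holds the counter that would back a metric such as
// scheduler.streaming.pressure_suppressed.count.
type Stats struct {
	PressureSuppressed atomic.Int64 // requires Go 1.19 or later
}

// ShouldReport returns false, and bumps the counter, when a local
// timeout or connect failure happens while the monitor is under pressure.
// All other failures are reported as usual.
func (s *Stats) ShouldReport(kind FailureKind, underPressure bool) bool {
	local := kind == LocalTimeout || kind == LocalConnect
	if local && underPressure {
		s.PressureSuppressed.Add(1)
		return false
	}
	return true
}

func main() {
	s := &Stats{}
	fmt.Println(s.ShouldReport(LocalTimeout, true))  // false
	fmt.Println(s.ShouldReport(LocalConnect, true))  // false
	fmt.Println(s.ShouldReport(HTTPStatus, true))    // true
	fmt.Println(s.ShouldReport(LocalTimeout, false)) // true
	fmt.Println(s.PressureSuppressed.Load())         // 2
}
```

The guard itself still suppresses the failure; the counter only makes that suppression visible in the summary output.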

Tests: go test ./internal/wpcom ./internal/orchestrator; go test ./...
chrisbliss18 force-pushed the feature/pr-101-followups branch from b1ed7b9 to 8a98783 on May 13, 2026 19:37
chrisbliss18 changed the title from "Track follow-ups from superseded PR 101" to "Harden PR 101 WPCOM and streaming follow-ups" on May 13, 2026
chrisbliss18 marked this pull request as ready for review on May 13, 2026 19:46
chrisbliss18 merged commit 9a05c93 into v2 on May 13, 2026
2 checks passed