Harden PR 101 WPCOM and streaming follow-ups #110
Merged
added 2 commits
May 13, 2026 14:30
PR #101 is now stale because the merged streaming monitor engine replaced the older round/page scheduler path that branch optimized. Merging it directly would regress current v2 work, but the review identified a few useful ideas worth preserving. Add roadmap entries for permanent WPCOM status handling, streaming-aware transport failure-storm suppression, and evidence-led evaluation of any remaining jetpack_monitor_sites blog_id indexing needs. These notes give the follow-up branch a scoped paper trail without keeping the superseded PR open.
Classify WPCOM 404 and 410 responses as permanent per-notification failures instead of transport failures. These errors now bypass the global WPCOM circuit breaker, skip pointless immediate retry pressure, emit permanent-failure metrics, and write an audit failure row so operators still have an evidence trail.

Also expose ErrCircuitOpen so the orchestrator can treat already-queued notifications as queued rather than retrying into an open circuit. Add bounded queue-drop logging to keep broad WPCOM outages from flooding logs.

Finally, make the streaming engine report pressure-suppressed local timeout/connect failures. This preserves the existing failure-storm guard while giving sysadmins a visible counter for monitor-side pressure suppression.

Tests:
go test ./internal/wpcom ./internal/orchestrator
go test ./...
b1ed7b9 to
8a98783
Summary
This draft PR closes the useful follow-up items from superseded PR #101 that still apply to current v2. PR #101 targeted the older round/page scheduler path and was superseded by the streaming monitor engine from #104, so this branch keeps the applicable hardening without reviving stale scheduler code.

What Changed
- `feature/pr-101-followups` is on latest `v2` after "Avoid repeated Veriflier confirmation for down sites" #111, with the roadmap conflict resolved.
- WPCOM 404 and 410 responses are classified as permanent per-notification failures and write `wpcom_failure` audit rows so operators still have a clear evidence trail.
- Exposed `wpcom.ErrCircuitOpen` and updated the orchestrator to count already-queued circuit-open responses as queued instead of retrying into the open circuit.
- Added `scheduler.streaming.pressure_suppressed.count` and streaming summary output for local timeout/connect failures that are suppressed under monitor-side pressure.
- `v2` does not add the old `idx_monitor_blog_id` legacy-table index; the roadmap now records that this should stay evidence-led and sidecar-first.

Validation
- `go test ./internal/wpcom ./internal/orchestrator`
- `go test ./...`
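For reviewers who want a picture of the pressure-suppression counter mentioned in this PR, the idea can be sketched roughly as below. This is an assumption-laden illustration, not the engine's code: `FailureKind`, `StreamStats`, and `ShouldReport` are hypothetical names, and how the real engine detects monitor-side pressure is not described in this PR, so it appears here as a plain boolean.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FailureKind is a hypothetical classification of a check failure.
type FailureKind int

const (
	FailureTimeout    FailureKind = iota // local timeout
	FailureConnect                       // local connect failure
	FailureHTTPStatus                    // the site answered with a bad status
)

// StreamStats holds a counter comparable in spirit to
// scheduler.streaming.pressure_suppressed.count (field name assumed).
type StreamStats struct {
	PressureSuppressed atomic.Int64
}

// ShouldReport returns false, and bumps the counter, when a local
// timeout/connect failure happens while the monitor itself is under
// pressure. Every other failure is reported as usual.
func (s *StreamStats) ShouldReport(kind FailureKind, underPressure bool) bool {
	local := kind == FailureTimeout || kind == FailureConnect
	if local && underPressure {
		s.PressureSuppressed.Add(1)
		return false
	}
	return true
}

func main() {
	var stats StreamStats
	stats.ShouldReport(FailureTimeout, true)    // suppressed and counted
	stats.ShouldReport(FailureHTTPStatus, true) // reported: not a local failure
	stats.ShouldReport(FailureConnect, false)   // reported: no pressure
	fmt.Println("pressure_suppressed:", stats.PressureSuppressed.Load())
}
```

Counting the suppressed failures keeps the failure-storm guard in place while making its activity visible in metrics and summary output.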