Add alert contacts and managed delivery #75

Merged
chrisbliss18 merged 13 commits into v2 from stack-03-alert-contacts on Apr 28, 2026

Conversation

@chrisbliss18
Contributor

Stacked PR 3 of 9.

Base: stack-02-webhooks
Head: stack-03-alert-contacts
Previous PR: #74

Summary:

  • Adds alert-contact schema migrations and the internal/alerting package.
  • Adds email, PagerDuty, Slack, and Teams transports.
  • Adds alert contact CRUD, send-test, delivery listing, retry endpoints, retry behavior, and rate caps.
  • Wires the alert delivery worker into jetmon2.

Review notes:
This PR builds on the webhook delivery shape from #74 and applies the same transition-consumer pattern to managed alert contacts.

Chris Jean added 13 commits April 25, 2026 11:23

Replaces the Family 5 stub ("legacy bridge to WPCOM") with the actual
Phase 3.x design: managed channels for human destinations (email,
PagerDuty, Slack, Teams), parallel to the webhooks API rather than a
generalization of it.

Key design decisions captured:
- Boundary with webhooks: alert contacts deliver Jetmon-rendered
  notifications through Jetmon-owned transports; webhooks deliver
  the raw signed event stream for the consumer to render. Customers
  who want generic outbound webhooks register a webhook, not an
  alert contact — explicitly noted with deferral rationale.
- Filter shape: site_filter + min_severity (no event-type filter, no
  state filter). Severity gate is the primary control surface, with
  Critical default to avoid accidental noise.
- Per-contact max_per_hour cap as pager-storm insurance.
- Send-test endpoint (POST /alert-contacts/{id}/test) for verifying
  newly-configured contacts.
- Email delivery via swappable Sender interface with wpcom / smtp /
  stub implementations — wpcom for production, smtp for docker
  integration tests, stub for unit tests.
- Plaintext credential storage in destination JSON, same rationale as
  webhook secrets (outbound dispatch needs the raw value).
- Worker placement intentionally parallel (internal/alerting/) rather
  than extending internal/webhooks/.

ROADMAP additions under "Deferred from Phase 3.x":
- SMS, OpsGenie, quiet hours, alert ack, alert grouping, generic
  webhook transport — each with deferral rationale and revisit trigger.
- WPCOM notification migration: documented as the cleanup that pays
  off the deliverer-binary extraction, gated on alert contacts proving
  out in production.

Multi-binary-split section: explicit revisit point added for unifying
internal/alerting/ and internal/webhooks/ at deliverer-binary extraction
time, with rationale for why the split is right today (no shared
abstraction with only one known concrete user) and what enables the
later unification (third transport, two known shapes).

The first draft used placeholder severity names (Info/Notice/Warning/
Critical). The actual codebase severity ladder is Up < Warning <
Degraded < SeemsDown < Down (uint8 0-4 in internal/eventstore).
Aligned the JSON string names, default min_severity, severity gate
description, PagerDuty mapping table, and schema column type to match
what the code actually uses.

jetmon_alert_contacts mirrors jetmon_webhooks but with a simpler filter
model: site_filter (JSON, empty=all sites) + min_severity (TINYINT
matching internal/eventstore.Severity*, default 4=Down) + max_per_hour
rate cap. destination is JSON because each transport has a different
shape (email address, PagerDuty integration key, Slack/Teams webhook
URL); plaintext storage rationale matches jetmon_webhooks.secret —
outbound dispatch needs the raw value at every send.

jetmon_alert_deliveries is a per-fire record table mirroring
jetmon_webhook_deliveries, with the same fan-in shape (one transition
→ many deliveries, one contact gets at most one delivery per transition
via uk_alert_transition). Adds a severity column so the dispatcher can
render transport-specific severity (PagerDuty critical/warning/info)
without re-deriving from the payload.

jetmon_alert_dispatch_progress is a high-water-mark table for the alert
delivery worker, identical shape to jetmon_webhook_dispatch_progress.

Establishes the package boundary parallel to internal/webhooks: types,
sentinel errors, severity name <-> uint8 helpers, transport identifier
enum, and the Dispatcher interface that every concrete transport
implements. No DB or transport code yet — that's the next two tasks.

AlertContact.Matches encodes the full filter rule:
  site_id ∈ SiteFilter (or empty = match all sites)
  AND (
    new_severity >= MinSeverity            // escalation / sustained
    OR (prev_severity >= MinSeverity       // recovery from a
        AND new_severity == SeverityUp)    //   previously-paging state
  )

Recovery firing requires both prev and new severity because Matches
doesn't see the transition reason. The unit tests exercise both halves
of the gate, the recovery edge cases, the empty-site-filter "match all"
semantic, and the AND across all dimensions.
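The gate above can be sketched directly in Go. Field and type names here are assumptions; the real method lives on the alerting package's contact type:

```go
package main

import "fmt"

// Minimal sketch of the Matches filter rule quoted above.
type Severity uint8

const (
	SeverityUp   Severity = 0
	SeverityDown Severity = 4
)

type AlertContact struct {
	SiteFilter  []int64  // empty = match all sites
	MinSeverity Severity // default 4 = Down
}

// Matches implements: site filter AND (escalation OR recovery).
func (c *AlertContact) Matches(prev, next Severity, siteID int64) bool {
	if len(c.SiteFilter) > 0 {
		found := false
		for _, id := range c.SiteFilter {
			if id == siteID {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	if next >= c.MinSeverity {
		return true // escalation / sustained paging state
	}
	// Recovery: a previously-paging severity returning to Up.
	return prev >= c.MinSeverity && next == SeverityUp
}

func main() {
	c := &AlertContact{MinSeverity: SeverityDown}
	fmt.Println(c.Matches(SeverityDown, SeverityUp, 1)) // recovery fires: true
}
```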

Notification is the rendered shape passed to Dispatcher.Send — the
worker builds it once per delivery from the frozen-at-fire-time
payload, transports translate it into channel-specific representations
(PagerDuty events-v2, Slack Block Kit, Teams Adaptive Card, email
subject + body).

Email is unique among alert-contact transports in that there is no
"post to this URL" — it requires a sender. emailDispatcher owns the
rendering (subject + plain + HTML body) and delegates the actual send
to a Sender, so swapping transports is a config change and rendering
logic stays in one place.

Three Sender implementations:
- StubSender — records messages in memory + log line; default for
  EMAIL_TRANSPORT="stub" and the basis of unit tests.
- SMTPSender — net/smtp send; for dev/staging with MailHog.
- WPCOMSender — Bearer-token POST to a WPCOM-owned email endpoint;
  same shape as the existing internal/wpcom client.

Rendering covers severity name, site URL, event id/type, state,
reason, and timestamp. Subject prefixes signal recovery and test
sends. HTML output escapes untrusted fields (site URL, reason)
against XSS in HTML email clients.

Config additions: EMAIL_TRANSPORT (default "stub"), EMAIL_FROM,
WPCOM_EMAIL_ENDPOINT/AUTH_TOKEN, SMTP_HOST/PORT/USERNAME/PASSWORD/
USE_TLS. Validation is lazy — happens at Dispatcher construction
time, not at config load — so a misconfigured email block doesn't
prevent the binary from starting if alerting is disabled.

Tests cover subject variants, plain/HTML body rendering, HTML escaping,
recovery and test banners, dispatcher destination parsing (happy path
and three rejection cases), MIME multipart shape, WPCOM HTTP request
shape (Bearer auth, JSON body, status surfacing), and StubSender
record/reset behavior.

Three concrete Dispatcher implementations, each rendering the
Notification into the channel-specific request shape:

PagerDuty (Events API v2): event_action of "trigger" or "resolve"
based on the recovery flag, dedup_key derived from the Jetmon event
id so all transitions of the same incident group under one alert.
Severity mapping: Down/SeemsDown → critical, Degraded → warning,
Warning → info. Test sends use a distinct dedup_key so they never
accidentally resolve a real alert.
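The PagerDuty mapping above reduces to a few small pure functions. The severity and action values come from the commit text; the dedup-key format is an assumption:

```go
package main

import "fmt"

// Severity ladder per the migrations (uint8 0-4); names assumed.
type Severity uint8

const (
	SeverityUp        Severity = 0
	SeverityWarning   Severity = 1
	SeverityDegraded  Severity = 2
	SeveritySeemsDown Severity = 3
	SeverityDown      Severity = 4
)

// pagerdutySeverity: Down/SeemsDown → critical, Degraded → warning,
// Warning → info.
func pagerdutySeverity(s Severity) string {
	switch s {
	case SeverityDown, SeveritySeemsDown:
		return "critical"
	case SeverityDegraded:
		return "warning"
	default:
		return "info"
	}
}

// eventAction picks trigger vs resolve from the recovery flag.
func eventAction(isRecovery bool) string {
	if isRecovery {
		return "resolve"
	}
	return "trigger"
}

// dedupKey groups all transitions of one incident under one alert;
// test sends get a distinct key so they never resolve a real alert.
func dedupKey(eventID string, isTest bool) string {
	if isTest {
		return "jetmon-test-" + eventID
	}
	return "jetmon-" + eventID
}

func main() {
	fmt.Println(pagerdutySeverity(SeverityDown), eventAction(true)) // critical resolve
}
```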

Slack: Block Kit message with severity-keyed emoji header, fallback
text for old clients / mobile previews, and a fields section with
site id, event id, state, reason, time. Recovery and test variants
swap the header.

Teams: Adaptive Card 1.4 wrapped in the message envelope expected by
incoming-webhook URLs. FactSet for the same fields Slack uses.

Shared postJSON helper handles serialization, request building, and
status interpretation: 4xx/5xx returns an error so the worker
schedules a retry. Response bodies are read with a 4 KB cap and
truncated to last_response's column width before being returned.
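A sketch of that shared helper, under assumed names — serialize, POST, read the response with a 4 KB cap, and turn 4xx/5xx into an error so the worker schedules a retry:

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

const maxResponseBytes = 4096 // cap on how much of the body we keep

// interpretStatus maps an HTTP status to a retryable error.
func interpretStatus(code int) error {
	if code >= 400 {
		return fmt.Errorf("upstream status %d", code)
	}
	return nil
}

// postJSON serializes payload, POSTs it, and returns the status code
// plus a capped response body for last_response.
func postJSON(ctx context.Context, client *http.Client, url string, payload any) (int, string, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return 0, "", err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	respBody, _ := io.ReadAll(io.LimitReader(resp.Body, maxResponseBytes))
	return resp.StatusCode, string(respBody), interpretStatus(resp.StatusCode)
}

func main() {
	fmt.Println(interpretStatus(502) != nil) // true: worker will retry
}
```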

24 tests cover happy paths, recovery action mapping (PagerDuty
resolve), test-send dedup isolation, severity ordering, bad
destination rejection (empty fields, malformed JSON), and upstream
error surfacing.

internal/alerting/contacts.go provides Create/Get/List/ListActive/
Update/Delete and a separate LoadDestination helper for the worker.
Mirrors the webhooks CRUD shape but with the simpler filter model
(site_filter + min_severity, no event-type/state filter) and a
transport-aware destination validator: each transport requires its
own field shape (email→address, pagerduty→integration_key,
slack/teams→webhook_url).

destination_preview is the last 4 chars of the credential field for
the contact's transport, so operators can identify a contact without
exposing the full credential. The destination itself is loaded
separately by the worker via LoadDestination, kept off the
AlertContact struct so it can't leak through serialization.
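The preview rule is small enough to sketch; whether the real code prefixes a mask character is an assumption:

```go
package main

import "fmt"

// destinationPreview exposes only the last 4 characters of the
// credential so operators can identify a contact without seeing the
// full value. The "…" mask prefix is an assumption.
func destinationPreview(credential string) string {
	const keep = 4
	r := []rune(credential)
	if len(r) <= keep {
		return credential
	}
	return "…" + string(r[len(r)-keep:])
}

func main() {
	fmt.Println(destinationPreview("R03ABCDEFWXYZ")) // …WXYZ
}
```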

Update intentionally cannot change the transport — the destination
shape is transport-specific and validating cross-transport changes
is brittle. Operators who want to switch transports delete and
re-create.

API surface in internal/api/handlers_alerts.go: POST/GET/LIST/PATCH/
DELETE on /api/v1/alert-contacts plus POST /alert-contacts/{id}/test.
The send-test endpoint loads the contact, fetches the destination,
and dispatches a synthetic IsTest=true notification through the
configured Dispatcher with a 15s timeout. Returns the transport's
status_code and response_body so operators can verify connectivity;
transport errors surface as 502.

Server gains an alertDispatchers map (alerting.Transport → Dispatcher)
populated via SetAlertDispatchers — this is the same map the worker
will use, so a successful send-test exercises the real production
code path. nil map (alerting disabled) → 503 transport_not_configured.

12 new sqlmock tests cover create happy-path with default min_severity,
transport rejection (sms), missing-field rejection per transport,
invalid severity name (Critical instead of Down), get/update/delete
happy + 404, and send-test happy / transport-error / no-dispatcher
cases.

internal/alerting/deliveries.go provides the per-fire delivery store
mirroring internal/webhooks/deliveries.go: Enqueue (INSERT IGNORE on
uk_alert_transition), ClaimReady, MarkDelivered, ScheduleRetry,
GetDelivery, ListDeliveries, RetryDelivery for the operator manual-
retry path. Adds MarkSuppressed for the rate-cap exit — terminal,
status='abandoned', last_status_code=429, last_response identifies
why so deliveries don't look like normal abandonments in the audit
trail.

internal/alerting/worker.go runs two background goroutines:

  - dispatchTick polls jetmon_event_transitions on a high-water mark
    in jetmon_alert_dispatch_progress, then for each new transition
    matches every active contact via Matches(prev_severity,
    next_severity, site_id) — the recovery edge fires correctly
    because we read both severity columns from the transition row.

  - deliverTick claims pending deliveries respecting the per-contact
    in-flight cap, looks up the contact + destination, builds a
    Notification (frozen-payload-derived; site URL looked up live for
    display only), and dispatches through the configured Dispatcher
    map. Same retry ladder as webhooks (1m/5m/30m/1h/6h then abandon).

Per-contact rate cap enforced via an in-memory sliding window
(rateLimitWindow). When the cap is hit, the delivery is marked
suppressed rather than retried — by the time the cap reopens the
alert is stale, so retrying just delays the inevitable. Single-instance
caveat documented inline; matches the existing multi-instance row-claim
caveat in webhooks.

14 unit tests cover the retry schedule, defaults, in-flight cap (per-
contact and isolation), zero-count cleanup, rate-window admission and
expiry, contact isolation in the rate window, and the transition reason
to alert event type mapping. Race-clean.

GET /api/v1/alert-contacts/{id}/deliveries with status filter and
cursor pagination; POST /api/v1/alert-contacts/{id}/deliveries/
{delivery_id}/retry to reset abandoned rows back to pending. Same
shapes as the webhook delivery endpoints.

cmd/jetmon2/main.go now boots the alerting worker alongside the
webhooks worker (gated on API_PORT > 0). buildAlertDispatchers
constructs the per-transport map from runtime config: the three
HTTP-based transports (PagerDuty, Slack, Teams) always wire up;
email picks wpcom/smtp/stub via EMAIL_TRANSPORT and falls back to
stub if unset. The same dispatcher map is shared with the API
server via SetAlertDispatchers, so a successful send-test exercises
the real production code path.

5 sqlmock tests cover list happy path, bad status rejection, retry
happy path, wrong-contact 404 cross-check, and not-abandoned 409.

Verified end-to-end in docker: registered an email + a slack alert
contact, ran the send-test endpoint (Block Kit message landed at
the test target with the 🔍 test emoji), inserted a failing
test site, watched the worker enqueue + dispatch alert.opened ->
alert.state_changed -> alert.closed transitions through the slack
transport (status 200 round trip), stopped the target to exercise
the retry path (delivery rescheduled with last_response captured
the transport error), force-marked one delivery abandoned and
exercised the manual retry endpoint (200 with status=pending; second
call against pending row returns 409 delivery_not_retryable as
expected). Test data cleaned up.

ClaimReady now follows its SELECT with a per-row UPDATE pushing
next_attempt_at to NOW+60s before the dispatch goroutine even starts.
Without this, the 1-second deliver tick was re-claiming any still-
in-flight row up to the per-contact in-flight cap (default 3),
producing concurrent dispatches that all read d.Attempt at claim
time, computed retry delays from the same stale value, and ran
ScheduleRetry's `attempt = attempt + 1` SQL three times in a row.

The visible effect: the documented 1m / 5m / 30m / 1h / 6h ladder
was collapsing to roughly 1m → 1h → abandoned, ~1h total instead of
the intended 7h36m. A transient consumer outage during a deploy
window could exhaust all retries before the consumer was back.

The dispatch goroutine still owns the real next_attempt_at: it
overwrites the soft lock with NULL on success or the actual retry
time on failure. If a goroutine panics without recovery (or the
process is OOM-killed mid-dispatch), the 60s soft lock expires and
the row becomes claimable again — natural recovery without operator
intervention.

Same fix in both internal/webhooks and internal/alerting (they share
the bug because alerting's worker was modeled on webhooks). Tests
verify the contract: SELECT + N per-row UPDATEs, and an idle tick
with no candidates issues no UPDATE traffic.

Multi-instance row-claim (SELECT ... FOR UPDATE SKIP LOCKED) is
still tracked alongside the deliverer-binary extraction in ROADMAP.

CHANGELOG gets a comprehensive "v2 branch — site health platform"
section ahead of the rewrite section, covering everything that has
landed on this branch: event sourcing, the internal REST API,
Phase 3 webhooks, Phase 3.x alert contacts, verifier hardening, and
the soft-lock worker fix from the previous commit. Marked clearly
as not drop-in with Jetmon 1 (separate from the rewrite branch).

AGENTS.md's Key Files table picks up internal/api, internal/apikeys,
internal/webhooks, internal/alerting, plus references to API.md and
ROADMAP.md so future agents have a route to the design docs.
Architecture diagram intentionally left as-is — it reflects the
drop-in rewrite, and the v2 architecture is still mid-flight pending
the deliverer-binary extraction.
…otency

Three small hardenings caught while reviewing the recently-added code:

1) alerting.Update now validates label (must be non-empty) and
   max_per_hour (must be >= 0) at input time, before the DB lookup.
   Previously an empty label PATCH would silently persist and a
   negative max_per_hour would surface as a generic 500 from MySQL's
   INT UNSIGNED constraint instead of a clean 422. Validations that
   don't need the existing row run first so obviously bad bodies
   don't pay for a round-trip.

2) buildMIMEMessage and renderEmailSubject now strip CR/LF from
   anything that becomes a MIME header value (From, To, Subject,
   site URL in the subject). Defense-in-depth: monitor_url is
   operator-controlled but the column doesn't enforce CRLF-free,
   so a malicious or accidental URL with embedded CRLF could have
   added new header lines (Bcc:, X-headers, etc.) in outbound email.
   Body content with newlines is intentionally unaffected.

3) POST /api/v1/alert-contacts/{id}/test now goes through
   withIdempotency like the other write POSTs (and like
   webhooks/{id}/rotate-secret). A retried "click to test" during a
   network blip no longer double-pages the destination.

Tests: 2 new for the Update validation rejections (no DB hit because
validation fires first), 2 new for the MIME header strip (subject
strip + asserting no new header lines are created when the input
contains CRLF; body CRLF passes through unchanged).

API.md picks up a one-paragraph note on the send-test endpoint
explaining its Idempotency-Key support — same shape as the rest of
the write surface, called out specifically because operators are the
typical caller (a "click to test" UX expectation).

CHANGELOG gets a "Polish" subsection under the v2 branch entry
covering the three hardenings just landed: Update input validation,
MIME header CRLF stripping, and idempotent send-test.
@chrisbliss18 chrisbliss18 changed the base branch from stack-02-webhooks to v2 April 28, 2026 14:53
@chrisbliss18 chrisbliss18 merged commit b83bee8 into v2 Apr 28, 2026
@chrisbliss18 chrisbliss18 deleted the stack-03-alert-contacts branch April 28, 2026 15:08