Add alert contacts and managed delivery #75

Merged
chrisbliss18 merged 13 commits into v2 from stack-03-alert-contacts on Apr 28, 2026

Conversation

@chrisbliss18
Contributor

Stacked PR 3 of 9.

Base: stack-02-webhooks
Head: stack-03-alert-contacts
Previous PR: #74

Summary:

  • Adds alert-contact schema migrations and the internal/alerting package.
  • Adds email, PagerDuty, Slack, and Teams transports.
  • Adds alert contact CRUD, send-test, delivery listing, retry endpoints, retry behavior, and rate caps.
  • Wires the alert delivery worker into jetmon2.

Review notes:
This PR builds on the webhook delivery shape from #74 and applies the same transition-consumer pattern to managed alert contacts.

Chris Jean added 13 commits April 25, 2026 11:23

Replaces the Family 5 stub ("legacy bridge to WPCOM") with the actual
Phase 3.x design: managed channels for human destinations (email,
PagerDuty, Slack, Teams), parallel to the webhooks API rather than a
generalization of it.

Key design decisions captured:
- Boundary with webhooks: alert contacts deliver Jetmon-rendered
  notifications through Jetmon-owned transports; webhooks deliver
  the raw signed event stream for the consumer to render. Customers
  who want generic outbound webhooks register a webhook, not an
  alert contact — explicitly noted with deferral rationale.
- Filter shape: site_filter + min_severity (no event-type filter, no
  state filter). Severity gate is the primary control surface, with
  Critical default to avoid accidental noise.
- Per-contact max_per_hour cap as pager-storm insurance.
- Send-test endpoint (POST /alert-contacts/{id}/test) for verifying
  newly-configured contacts.
- Email delivery via swappable Sender interface with wpcom / smtp /
  stub implementations — wpcom for production, smtp for docker
  integration tests, stub for unit tests.
- Plaintext credential storage in destination JSON, same rationale as
  webhook secrets (outbound dispatch needs the raw value).
- Worker placement intentionally parallel (internal/alerting/) rather
  than extending internal/webhooks/.

ROADMAP additions under "Deferred from Phase 3.x":
- SMS, OpsGenie, quiet hours, alert ack, alert grouping, generic
  webhook transport — each with deferral rationale and revisit trigger.
- WPCOM notification migration: documented as the cleanup that pays
  off the deliverer-binary extraction, gated on alert contacts proving
  out in production.

Multi-binary-split section: explicit revisit point added for unifying
internal/alerting/ and internal/webhooks/ at deliverer-binary extraction
time, with rationale for why the split is right today (no shared
abstraction with only one known concrete user) and what enables the
later unification (third transport, two known shapes).

The first draft used placeholder severity names (Info/Notice/Warning/
Critical). The actual codebase severity ladder is Up < Warning <
Degraded < SeemsDown < Down (uint8 0-4 in internal/eventstore).
Aligned the JSON string names, default min_severity, severity gate
description, PagerDuty mapping table, and schema column type to match
what the code actually uses.

jetmon_alert_contacts mirrors jetmon_webhooks but with a simpler filter
model: site_filter (JSON, empty=all sites) + min_severity (TINYINT
matching internal/eventstore.Severity*, default 4=Down) + max_per_hour
rate cap. destination is JSON because each transport has a different
shape (email address, PagerDuty integration key, Slack/Teams webhook
URL); plaintext storage rationale matches jetmon_webhooks.secret —
outbound dispatch needs the raw value at every send.

jetmon_alert_deliveries is a per-fire record table mirroring
jetmon_webhook_deliveries, with the same fan-in shape (one transition
→ many deliveries, one contact gets at most one delivery per transition
via uk_alert_transition). Adds a severity column so the dispatcher can
render transport-specific severity (PagerDuty critical/warning/info)
without re-deriving from the payload.

jetmon_alert_dispatch_progress is a high-water-mark table for the alert
delivery worker, identical shape to jetmon_webhook_dispatch_progress.

Establishes the package boundary parallel to internal/webhooks: types,
sentinel errors, severity name <-> uint8 helpers, transport identifier
enum, and the Dispatcher interface that every concrete transport
implements. No DB or transport code yet — that's the next two tasks.

AlertContact.Matches encodes the full filter rule:
  site_id ∈ SiteFilter (or empty = match all sites)
  AND (
    new_severity >= MinSeverity            // escalation / sustained
    OR (prev_severity >= MinSeverity       // recovery from a
        AND new_severity == SeverityUp)    //   previously-paging state
  )

Recovery firing requires both prev and new severity because Matches
doesn't see the transition reason. The unit tests exercise both halves
of the gate, the recovery edge cases, the empty-site-filter "match all"
semantic, and the AND across all dimensions.
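The gate above can be sketched directly in Go. Field and type names here are assumptions; the real method lives on the alerting package's contact type:

```go
package main

import "fmt"

// Minimal sketch of the Matches filter rule quoted above.
type Severity uint8

const (
	SeverityUp   Severity = 0
	SeverityDown Severity = 4
)

type AlertContact struct {
	SiteFilter  []int64  // empty = match all sites
	MinSeverity Severity // default 4 = Down
}

// Matches implements: site filter AND (escalation OR recovery).
func (c *AlertContact) Matches(prev, next Severity, siteID int64) bool {
	if len(c.SiteFilter) > 0 {
		found := false
		for _, id := range c.SiteFilter {
			if id == siteID {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	if next >= c.MinSeverity {
		return true // escalation / sustained paging state
	}
	// Recovery: a previously-paging severity returning to Up.
	return prev >= c.MinSeverity && next == SeverityUp
}

func main() {
	c := &AlertContact{MinSeverity: SeverityDown}
	fmt.Println(c.Matches(SeverityDown, SeverityUp, 1)) // recovery fires: true
}
```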

Notification is the rendered shape passed to Dispatcher.Send — the
worker builds it once per delivery from the frozen-at-fire-time
payload, transports translate it into channel-specific representations
(PagerDuty events-v2, Slack Block Kit, Teams Adaptive Card, email
subject + body).

Email is unique among alert-contact transports in that there is no
"post to this URL" — it requires a sender. emailDispatcher owns the
rendering (subject + plain + HTML body) and delegates the actual send
to a Sender, so swapping transports is a config change and rendering
logic stays in one place.

Three Sender implementations:
- StubSender — records messages in memory + log line; default for
  EMAIL_TRANSPORT="stub" and the basis of unit tests.
- SMTPSender — net/smtp send; for dev/staging with MailHog.
- WPCOMSender — Bearer-token POST to a WPCOM-owned email endpoint;
  same shape as the existing internal/wpcom client.

Rendering covers severity name, site URL, event id/type, state,
reason, and timestamp. Subject prefixes signal recovery and test
sends. HTML output escapes untrusted fields (site URL, reason)
against XSS in HTML email clients.

Config additions: EMAIL_TRANSPORT (default "stub"), EMAIL_FROM,
WPCOM_EMAIL_ENDPOINT/AUTH_TOKEN, SMTP_HOST/PORT/USERNAME/PASSWORD/
USE_TLS. Validation is lazy — happens at Dispatcher construction
time, not at config load — so a misconfigured email block doesn't
prevent the binary from starting if alerting is disabled.

Tests cover subject variants, plain/HTML body rendering, HTML escaping,
recovery and test banners, dispatcher destination parsing (happy path
and three rejection cases), MIME multipart shape, WPCOM HTTP request
shape (Bearer auth, JSON body, status surfacing), and StubSender
record/reset behavior.

Three concrete Dispatcher implementations, each rendering the
Notification into the channel-specific request shape:

PagerDuty (Events API v2): event_action of "trigger" or "resolve"
based on the recovery flag, dedup_key derived from the Jetmon event
id so all transitions of the same incident group under one alert.
Severity mapping: Down/SeemsDown → critical, Degraded → warning,
Warning → info. Test sends use a distinct dedup_key so they never
accidentally resolve a real alert.
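The PagerDuty mapping above reduces to a few small pure functions. The severity and action values come from the commit text; the dedup-key format is an assumption:

```go
package main

import "fmt"

// Severity ladder per the migrations (uint8 0-4); names assumed.
type Severity uint8

const (
	SeverityUp        Severity = 0
	SeverityWarning   Severity = 1
	SeverityDegraded  Severity = 2
	SeveritySeemsDown Severity = 3
	SeverityDown      Severity = 4
)

// pagerdutySeverity: Down/SeemsDown → critical, Degraded → warning,
// Warning → info.
func pagerdutySeverity(s Severity) string {
	switch s {
	case SeverityDown, SeveritySeemsDown:
		return "critical"
	case SeverityDegraded:
		return "warning"
	default:
		return "info"
	}
}

// eventAction picks trigger vs resolve from the recovery flag.
func eventAction(isRecovery bool) string {
	if isRecovery {
		return "resolve"
	}
	return "trigger"
}

// dedupKey groups all transitions of one incident under one alert;
// test sends get a distinct key so they never resolve a real alert.
func dedupKey(eventID string, isTest bool) string {
	if isTest {
		return "jetmon-test-" + eventID
	}
	return "jetmon-" + eventID
}

func main() {
	fmt.Println(pagerdutySeverity(SeverityDown), eventAction(true)) // critical resolve
}
```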

Slack: Block Kit message with severity-keyed emoji header, fallback
text for old clients / mobile previews, and a fields section with
site id, event id, state, reason, time. Recovery and test variants
swap the header.

Teams: Adaptive Card 1.4 wrapped in the message envelope expected by
incoming-webhook URLs. FactSet for the same fields Slack uses.

Shared postJSON helper handles serialization, request building, and
status interpretation: 4xx/5xx returns an error so the worker
schedules a retry. Response bodies are read with a 4 KB cap and
truncated to last_response's column width before being returned.
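A sketch of that shared helper, under assumed names — serialize, POST, read the response with a 4 KB cap, and turn 4xx/5xx into an error so the worker schedules a retry:

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

const maxResponseBytes = 4096 // cap on how much of the body we keep

// interpretStatus maps an HTTP status to a retryable error.
func interpretStatus(code int) error {
	if code >= 400 {
		return fmt.Errorf("upstream status %d", code)
	}
	return nil
}

// postJSON serializes payload, POSTs it, and returns the status code
// plus a capped response body for last_response.
func postJSON(ctx context.Context, client *http.Client, url string, payload any) (int, string, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return 0, "", err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	respBody, _ := io.ReadAll(io.LimitReader(resp.Body, maxResponseBytes))
	return resp.StatusCode, string(respBody), interpretStatus(resp.StatusCode)
}

func main() {
	fmt.Println(interpretStatus(502) != nil) // true: worker will retry
}
```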

24 tests cover happy paths, recovery action mapping (PagerDuty
resolve), test-send dedup isolation, severity ordering, bad
destination rejection (empty fields, malformed JSON), and upstream
error surfacing.

internal/alerting/contacts.go provides Create/Get/List/ListActive/
Update/Delete and a separate LoadDestination helper for the worker.
Mirrors the webhooks CRUD shape but with the simpler filter model
(site_filter + min_severity, no event-type/state filter) and a
transport-aware destination validator: each transport requires its
own field shape (email→address, pagerduty→integration_key,
slack/teams→webhook_url).

destination_preview is the last 4 chars of the credential field for
the contact's transport, so operators can identify a contact without
exposing the full credential. The destination itself is loaded
separately by the worker via LoadDestination, kept off the
AlertContact struct so it can't leak through serialization.
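The preview rule is small enough to sketch; whether the real code prefixes a mask character is an assumption:

```go
package main

import "fmt"

// destinationPreview exposes only the last 4 characters of the
// credential so operators can identify a contact without seeing the
// full value. The "…" mask prefix is an assumption.
func destinationPreview(credential string) string {
	const keep = 4
	r := []rune(credential)
	if len(r) <= keep {
		return credential
	}
	return "…" + string(r[len(r)-keep:])
}

func main() {
	fmt.Println(destinationPreview("R03ABCDEFWXYZ")) // …WXYZ
}
```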

Update intentionally cannot change the transport — the destination
shape is transport-specific and validating cross-transport changes
is brittle. Operators who want to switch transports delete and
re-create.

API surface in internal/api/handlers_alerts.go: POST/GET/LIST/PATCH/
DELETE on /api/v1/alert-contacts plus POST /alert-contacts/{id}/test.
The send-test endpoint loads the contact, fetches the destination,
and dispatches a synthetic IsTest=true notification through the
configured Dispatcher with a 15s timeout. Returns the transport's
status_code and response_body so operators can verify connectivity;
transport errors surface as 502.

Server gains an alertDispatchers map (alerting.Transport → Dispatcher)
populated via SetAlertDispatchers — this is the same map the worker
will use, so a successful send-test exercises the real production
code path. nil map (alerting disabled) → 503 transport_not_configured.

12 new sqlmock tests cover create happy-path with default min_severity,
transport rejection (sms), missing-field rejection per transport,
invalid severity name (Critical instead of Down), get/update/delete
happy + 404, and send-test happy / transport-error / no-dispatcher
cases.

internal/alerting/deliveries.go provides the per-fire delivery store
mirroring internal/webhooks/deliveries.go: Enqueue (INSERT IGNORE on
uk_alert_transition), ClaimReady, MarkDelivered, ScheduleRetry,
GetDelivery, ListDeliveries, RetryDelivery for the operator manual-
retry path. Adds MarkSuppressed for the rate-cap exit — terminal,
status='abandoned', last_status_code=429, last_response identifies
why so deliveries don't look like normal abandonments in the audit
trail.

internal/alerting/worker.go runs two background goroutines:

  - dispatchTick polls jetmon_event_transitions on a high-water mark
    in jetmon_alert_dispatch_progress, then for each new transition
    matches every active contact via Matches(prev_severity,
    next_severity, site_id) — the recovery edge fires correctly
    because we read both severity columns from the transition row.

  - deliverTick claims pending deliveries respecting the per-contact
    in-flight cap, looks up the contact + destination, builds a
    Notification (frozen-payload-derived; site URL looked up live for
    display only), and dispatches through the configured Dispatcher
    map. Same retry ladder as webhooks (1m/5m/30m/1h/6h then abandon).

Per-contact rate cap enforced via an in-memory sliding window
(rateLimitWindow). When the cap is hit, the delivery is marked
suppressed rather than retried — by the time the cap reopens the
alert is stale, so retrying just delays the inevitable. Single-instance
caveat documented inline; matches the existing multi-instance row-claim
caveat in webhooks.

14 unit tests cover the retry schedule, defaults, in-flight cap (per-
contact and isolation), zero-count cleanup, rate-window admission and
expiry, contact isolation in the rate window, and the transition reason
to alert event type mapping. Race-clean.

GET /api/v1/alert-contacts/{id}/deliveries with status filter and
cursor pagination; POST /api/v1/alert-contacts/{id}/deliveries/
{delivery_id}/retry to reset abandoned rows back to pending. Same
shapes as the webhook delivery endpoints.

cmd/jetmon2/main.go now boots the alerting worker alongside the
webhooks worker (gated on API_PORT > 0). buildAlertDispatchers
constructs the per-transport map from runtime config: the three
HTTP-based transports (PagerDuty, Slack, Teams) always wire up;
email picks wpcom/smtp/stub via EMAIL_TRANSPORT and falls back to
stub if unset. The same dispatcher map is shared with the API
server via SetAlertDispatchers, so a successful send-test exercises
the real production code path.

5 sqlmock tests cover list happy path, bad status rejection, retry
happy path, wrong-contact 404 cross-check, and not-abandoned 409.

Verified end-to-end in docker: registered an email + a slack alert
contact, ran the send-test endpoint (Block Kit message landed at
the test target with the 🔍 test emoji), inserted a failing
test site, watched the worker enqueue + dispatch alert.opened ->
alert.state_changed -> alert.closed transitions through the slack
transport (status 200 round trip), stopped the target to exercise
the retry path (delivery rescheduled with last_response captured
the transport error), force-marked one delivery abandoned and
exercised the manual retry endpoint (200 with status=pending; second
call against pending row returns 409 delivery_not_retryable as
expected). Test data cleaned up.

ClaimReady now follows its SELECT with a per-row UPDATE pushing
next_attempt_at to NOW+60s before the dispatch goroutine even starts.
Without this, the 1-second deliver tick was re-claiming any still-
in-flight row up to the per-contact in-flight cap (default 3),
producing concurrent dispatches that all read d.Attempt at claim
time, computed retry delays from the same stale value, and ran
ScheduleRetry's `attempt = attempt + 1` SQL three times in a row.

The visible effect: the documented 1m / 5m / 30m / 1h / 6h ladder
was collapsing to roughly 1m → 1h → abandoned, ~1h total instead of
the intended 7h36m. A transient consumer outage during a deploy
window could exhaust all retries before the consumer was back.

The dispatch goroutine still owns the real next_attempt_at: it
overwrites the soft lock with NULL on success or the actual retry
time on failure. If a goroutine panics without recovery (or the
process is OOM-killed mid-dispatch), the 60s soft lock expires and
the row becomes claimable again — natural recovery without operator
intervention.

Same fix in both internal/webhooks and internal/alerting (they share
the bug because alerting's worker was modeled on webhooks). Tests
verify the contract: SELECT + N per-row UPDATEs, and an idle tick
with no candidates issues no UPDATE traffic.

Multi-instance row-claim (SELECT ... FOR UPDATE SKIP LOCKED) is
still tracked alongside the deliverer-binary extraction in ROADMAP.

CHANGELOG gets a comprehensive "v2 branch — site health platform"
section ahead of the rewrite section, covering everything that has
landed on this branch: event sourcing, the internal REST API,
Phase 3 webhooks, Phase 3.x alert contacts, verifier hardening, and
the soft-lock worker fix from the previous commit. Marked clearly
as not drop-in with Jetmon 1 (separate from the rewrite branch).

AGENTS.md's Key Files table picks up internal/api, internal/apikeys,
internal/webhooks, internal/alerting, plus references to API.md and
ROADMAP.md so future agents have a route to the design docs.
Architecture diagram intentionally left as-is — it reflects the
drop-in rewrite, and the v2 architecture is still mid-flight pending
the deliverer-binary extraction.
…otency

Three small hardenings caught while reviewing the recently-added code:

1) alerting.Update now validates label (must be non-empty) and
   max_per_hour (must be >= 0) at input time, before the DB lookup.
   Previously an empty label PATCH would silently persist and a
   negative max_per_hour would surface as a generic 500 from MySQL's
   INT UNSIGNED constraint instead of a clean 422. Validations that
   don't need the existing row run first so obviously bad bodies
   don't pay for a round-trip.

2) buildMIMEMessage and renderEmailSubject now strip CR/LF from
   anything that becomes a MIME header value (From, To, Subject,
   site URL in the subject). Defense-in-depth: monitor_url is
   operator-controlled but the column doesn't enforce CRLF-free,
   so a malicious or accidental URL with embedded CRLF could have
   added new header lines (Bcc:, X-headers, etc.) in outbound email.
   Body content with newlines is intentionally unaffected.

3) POST /api/v1/alert-contacts/{id}/test now goes through
   withIdempotency like the other write POSTs (and like
   webhooks/{id}/rotate-secret). A retried "click to test" during a
   network blip no longer double-pages the destination.

Tests: 2 new for the Update validation rejections (no DB hit because
validation fires first), 2 new for the MIME header strip (subject
strip + asserting no new header lines are created when the input
contains CRLF; body CRLF passes through unchanged).

API.md picks up a one-paragraph note on the send-test endpoint
explaining its Idempotency-Key support — same shape as the rest of
the write surface, called out specifically because operators are the
typical caller (a "click to test" UX expectation).

CHANGELOG gets a "Polish" subsection under the v2 branch entry
covering the three hardenings just landed: Update input validation,
MIME header CRLF stripping, and idempotent send-test.
@chrisbliss18 chrisbliss18 changed the base branch from stack-02-webhooks to v2 April 28, 2026 14:53
@chrisbliss18 chrisbliss18 merged commit b83bee8 into v2 Apr 28, 2026
@chrisbliss18 chrisbliss18 deleted the stack-03-alert-contacts branch April 28, 2026 15:08