Jetmon 2 — Site health platform #61

Open
chrisbliss18 wants to merge 299 commits into master from v2

Conversation

@chrisbliss18
Contributor

Work in progress. This branch (v2) is the ambitious successor to the Go rewrite started in refactor/jetmon2 (PR #60). It includes everything from that branch and extends it with a new architecture and direction.


What changed from PR #60

PR #60 scoped Jetmon 2 as a drop-in replacement: same interfaces, same schema, same behaviour — just Go instead of Node.js + C++. That work is complete and forms the base of this branch.

This branch pivots to a larger goal: Jetmon 2 as a full site health monitoring platform, not just an uptime tracker. The key additions:

  • Event-sourced architecture. Site state is derived from an event log, not a mutable status column. The event log is the source of truth; the site row carries a denormalized projection for fast reads. Full design in EVENTS.md.
  • Five-layer test taxonomy. Reachability → Transport & Security → Infrastructure & Edge → Application Response → Content Integrity, plus Reverse Checks (agent-reported signals from inside WordPress). ~55 v1 items, ~55 v2, ~40 v3. Full taxonomy in TAXONOMY.md.
  • Site → Endpoint → Check hierarchy. Sites have multiple endpoints; each endpoint has multiple checks of different types. Site state rolls up from endpoint state, which rolls up from check results. Rollup rules are explicit and configurable per site.
  • Multi-state vocabulary. Up, Warning, Degraded, Seems Down, Down, Paused, Maintenance, Unknown. Unknown is not downtime — monitor-side failures never inflate customer downtime figures.
  • Competitor-parity public REST API. Five capability groups: status and state, events and history, SLA statistics (uptime %, response time p95/p99, MTTR), monitor management (CRUD, pause, resume, trigger-now), and alert contacts with outbound webhooks. Full design in ROADMAP.md.
  • Gradual rollout with back-compat. The existing site_status column keeps receiving derived writes so current consumers are not broken. New capabilities are additive; consumers adopt progressively.
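The endpoint-to-site rollup can be sketched in a few lines. This is an illustration only — the state names come from the vocabulary above, but `rollUp`, the severity ordering, and the worst-child-wins rule are hypothetical stand-ins for the explicit, per-site-configurable rollup rules:

```go
package main

import "fmt"

// Hypothetical severity ordering for the multi-state vocabulary.
// Unknown ranks lowest on purpose: monitor-side failures must never
// roll up as downtime.
type State int

const (
	Unknown State = iota
	Up
	Warning
	Degraded
	SeemsDown
	Down
)

func (s State) String() string {
	return [...]string{"Unknown", "Up", "Warning", "Degraded", "SeemsDown", "Down"}[s]
}

// rollUp reduces child states to one parent state: the worst non-Unknown
// child wins; all-Unknown stays Unknown rather than becoming Down.
func rollUp(children []State) State {
	worst := Unknown
	for _, c := range children {
		if c != Unknown && (worst == Unknown || c > worst) {
			worst = c
		}
	}
	return worst
}

func main() {
	endpoint1 := rollUp([]State{Up, Warning})      // checks -> endpoint
	endpoint2 := rollUp([]State{Unknown, Unknown}) // monitor-side failures only
	site := rollUp([]State{endpoint1, endpoint2})  // endpoints -> site
	fmt.Println(endpoint1, endpoint2, site)
}
```

The same reducer runs at both levels: check results roll up to endpoint state, endpoint states roll up to site state.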

Architectural decisions are locked in AGENTS.md so they are enforced consistently across all changes.


Why Go

The current architecture uses forked Node.js processes (8–16MB RSS each at startup, 53MB limit before recycling) as workers, plus a compiled C++ addon to escape Node's event loop for blocking network I/O. Go eliminates both constraints:

  • Goroutines start at ~4KB of stack and grow on demand, making 50,000 concurrent checks on a single host practical without the memory overhead of forked processes or libuv thread pools
  • net/http and crypto/tls are first-class stdlib packages — no native addon, no node-gyp, no compilation step during deployment
  • net/http/httptrace provides DNS, TCP, TLS, and TTFB timing hooks as separate measurements within each check, for free
  • Single static binary deployment with no runtime dependencies, no node_modules, and no addon rebuild on Node.js version upgrades
  • Built-in profiling via pprof, race detector via go test -race, and a mature testing ecosystem
  • Graceful goroutine lifecycle management replaces the fragile worker spawn/recycle/evaporate lifecycle
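The httptrace point can be made concrete. A minimal sketch — the `demo` helper and the choice of hooks are illustrative, not Jetmon code — that records TCP connect time and TTFB for a single GET against a local test server:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// demo runs one traced GET and returns the per-phase numbers a check
// would record; DNSStart/DNSDone and TLS hooks attach the same way.
func demo() (status int, tcp, ttfb time.Duration, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	var connStart, connDone, firstByte time.Time
	trace := &httptrace.ClientTrace{
		ConnectStart:         func(network, addr string) { connStart = time.Now() },
		ConnectDone:          func(network, addr string, err error) { connDone = time.Now() },
		GotFirstResponseByte: func() { firstByte = time.Now() },
	}

	req, err := http.NewRequest("GET", srv.URL, nil)
	if err != nil {
		return 0, 0, 0, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	start := time.Now()
	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		return 0, 0, 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, connDone.Sub(connStart), firstByte.Sub(start), nil
}

func main() {
	status, tcp, ttfb, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("status=%d tcp=%v ttfb=%v\n", status, tcp, ttfb)
}
```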

The Veriflier is rewritten in Go as well, replacing the Qt C++ dependency with a lightweight Go HTTP service. The protocol between Monitor and Verifliers moves from custom HTTPS to gRPC, providing type-safe contracts, built-in retries, and bidirectional streaming for future use.

Benefits of the Rewrite

Memory

The current architecture forks Node.js worker processes that start at 8–16MB RSS and are recycled once they reach 53MB. With a typical deployment of 8–16 workers, the process tree consumes 240–850MB of resident memory just for worker overhead, before any check data is counted.

Jetmon 2 runs as a single process. Go goroutines start at 4KB of stack and grow on demand. A pool of 1,000 concurrent goroutines costs roughly 4MB of stack. Total process RSS for an equivalent workload is estimated at 50–150MB — a 75–90% reduction in memory consumption per host.

Concurrent Checks

Current concurrency is bounded by the number of worker processes. Each worker is a single-threaded Node.js process; practical concurrency per host is in the low hundreds.

Go's goroutine scheduler makes 10,000+ concurrent in-flight checks on a single host practical with no additional configuration. At a conservative network timeout of 10 seconds and average site response time of 200ms, a pool of 1,000 goroutines sustains approximately 5,000 check completions per second — an estimated 10–50× increase in concurrent checks per host.
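The pool described above can be sketched with stdlib primitives only. `checkSite`, `runPool`, and the numbers here are hypothetical; the shape — a bounded worker count draining a jobs channel — is the point:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// checkSite stands in for one HTTP check; the real version would use
// net/http with a per-site context deadline.
func checkSite(id int) string {
	time.Sleep(time.Millisecond) // simulate a fast responder
	return fmt.Sprintf("site-%d: up", id)
}

// runPool fans nSites units of work across at most nWorkers goroutines,
// the same shape a 1,000-goroutine check pool would take.
func runPool(nSites, nWorkers int) int {
	jobs := make(chan int)
	results := make(chan string, nSites)
	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				results <- checkSite(id)
			}
		}()
	}
	for i := 0; i < nSites; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	close(results)
	n := 0
	for range results {
		n++
	}
	return n
}

func main() {
	fmt.Println(runPool(200, 50)) // 200 checks through 50 workers
}
```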

Throughput

The current architecture crosses a process boundary on every unit of work: master dispatches via IPC, worker receives and processes, replies via IPC, master aggregates. Each crossing involves serialisation, a context switch, and V8 event loop scheduling on both ends.

Jetmon 2 replaces all IPC with Go channel sends, which are in-process and order-of-magnitude cheaper. Estimated throughput improvement: 3–10× more sites checked per second per host under equivalent conditions.

Check Scheduling Accuracy

The current system uses setTimeout and setInterval for round scheduling, subject to V8 event loop delay — a busy loop can delay a callback by tens to hundreds of milliseconds, introducing jitter into RTT measurements.

Go's time.Ticker fires with OS-level timer precision. RTT measurements from net/http/httptrace are taken inside the HTTP stack with no event loop between the measurement point and the timer.

Deployment Speed

Current deployment requires npm install, a node-gyp rebuild of the native C++ addon, and a coordinated process restart. A failed addon compilation blocks deployment entirely.

Jetmon 2 deploys as a single static binary. Deployment is: copy binary, systemctl restart jetmon2. Total deployment time drops from several minutes to under 30 seconds.

Mean Time to Recovery

A worker process crash requires the master to detect the exit, spawn a replacement, and wait for initialisation — several seconds, with in-flight checks unresolved.

In Jetmon 2, a panicking goroutine is recovered by a deferred handler, the result counted as an error, and a replacement goroutine immediately spawned — recovery in the low milliseconds. For a full process crash, systemd restarts the binary; with Go's fast startup, the process is accepting work again in under 2 seconds.
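The recover-and-count-as-error pattern can be sketched as follows; `check` and `result` are illustrative names, not the actual Jetmon code:

```go
package main

import "fmt"

type result struct {
	siteID int
	err    error
}

// check wraps one unit of work with a deferred recover, so a panic in a
// single check is recorded as an error result instead of crashing the
// whole process.
func check(siteID int, fn func() error, out chan<- result) {
	defer func() {
		if r := recover(); r != nil {
			out <- result{siteID, fmt.Errorf("check panicked: %v", r)}
		}
	}()
	out <- result{siteID, fn()}
}

func main() {
	out := make(chan result, 2)
	go check(1, func() error { return nil }, out)
	go check(2, func() error { panic("boom") }, out)
	for i := 0; i < 2; i++ {
		r := <-out
		fmt.Printf("site %d err=%v\n", r.siteID, r.err)
	}
}
```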

Operational Complexity

The current system requires managing Node.js version compatibility, native addon compilation, npm dependency trees, and the fragile worker spawn/recycle lifecycle.

Jetmon 2 eliminates all of this. One artifact to manage: the Go binary. No node-gyp, no npm, no Node.js version management.


Build order

  1. Schema — jetmon_endpoints, jetmon_events, updated site row projection, back-compat site_status derived write
  2. Probe runner — replaces per-site HTTP check loop; iterates endpoints; owns idempotent event dedup
  3. Check types — DNS, TLS cert expiry, redirect chain, keyword, TTFB (v1 taxonomy)
  4. Public REST API — status/state, events/history, SLA statistics, monitor management, alert contacts/webhooks

🤖 Generated with Claude Code

Chris Jean added 5 commits April 25, 2026 12:48
…otency

Three small hardenings caught while reviewing the recently-added code:

1) alerting.Update now validates label (must be non-empty) and
   max_per_hour (must be >= 0) at input time, before the DB lookup.
   Previously an empty label PATCH would silently persist and a
   negative max_per_hour would surface as a generic 500 from MySQL's
   INT UNSIGNED constraint instead of a clean 422. Validations that
   don't need the existing row run first so obviously bad bodies
   don't pay for a round-trip.

2) buildMIMEMessage and renderEmailSubject now strip CR/LF from
   anything that becomes a MIME header value (From, To, Subject,
   site URL in the subject). Defense-in-depth: monitor_url is
   operator-controlled but the column doesn't enforce CRLF-free,
   so a malicious or accidental URL with embedded CRLF could have
   added new header lines (Bcc:, X-headers, etc.) in outbound email.
   Body content with newlines is intentionally unaffected.

3) POST /api/v1/alert-contacts/{id}/test now goes through
   withIdempotency like the other write POSTs (and like
   webhooks/{id}/rotate-secret). A retried "click to test" during a
   network blip no longer double-pages the destination.
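The header stripping in item 2 amounts to a one-line sanitizer applied at every point where input becomes a MIME header value. A sketch (`sanitizeHeaderValue` is a hypothetical name):

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeHeaderValue strips CR and LF from anything destined for a MIME
// header line, so a value with embedded CRLF cannot inject extra headers
// (Bcc:, X-headers, etc.). Body content is deliberately left untouched.
func sanitizeHeaderValue(v string) string {
	return strings.NewReplacer("\r", "", "\n", "").Replace(v)
}

func main() {
	evil := "https://example.com\r\nBcc: attacker@example.com"
	// Still exactly one header line after sanitizing:
	fmt.Printf("Subject: Site down: %s\r\n", sanitizeHeaderValue(evil))
}
```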

Tests: 2 new for the Update validation rejections (no DB hit because
validation fires first), 2 new for the MIME header strip (subject
strip + asserting no new header lines are created when the input
contains CRLF; body CRLF passes through unchanged).
API.md picks up a one-paragraph note on the send-test endpoint
explaining its Idempotency-Key support — same shape as the rest of
the write surface, called out specifically because operators are the
typical caller (a "click to test" UX expectation).

CHANGELOG gets a "Polish" subsection under the v2 branch entry
covering the three hardenings just landed: Update input validation,
MIME header CRLF stripping, and idempotent send-test.
Each httptrace phase has a Start hook and a Done hook. When the
connection fails mid-way (TCP refused, TLS handshake failure,
hostname resolution then disconnect, etc.), the Start hook fires but
the matching Done hook never does — leaving its *End timestamp as
the zero time.Time. The recording code then computed
zero_time.Sub(real_time), which is roughly the negative of the
original timestamp's nanoseconds — a huge negative duration that
overflows the INT NULL columns in jetmon_check_history.

The visible symptom was a flood of repeat log lines on every check
round whenever a monitored site refused TCP:

  orchestrator: record history blog_id=N: Error 1264 (22003):
    Out of range value for column 'dns_ms' at row 1

The fix is one line per phase: only record a duration if BOTH the
Start and Done hooks fired. A failed phase reports zero rather than
a misleading negative value. Zero is the right reporting because
the phase didn't successfully complete — there is no "duration" to
report.

Regression test added to TestCheckConnectionRefused asserting all
three phase durations are non-negative on a connection-refused
target. Without the fix, the TCP duration would be roughly the
negative of the current Unix time in nanoseconds.
The previous diagram described the original drop-in rewrite — three
boxes (orchestrator, check pool, gRPC server) talking to the same
internal channels. The v2 branch has three more independently-scaling
concerns the diagram didn't show: the REST API server, the webhook
delivery worker, and the alerting delivery worker — all consumers
of jetmon_event_transitions via the eventstore.

Updated diagram shows the layered shape:
  - Top tier: orchestrator + check pool + veriflier transport (the
    "monitor what's out there" half)
  - Middle: eventstore as the single writer for events / transitions
  - Bottom tier: API + webhook worker + alerting worker (the
    "tell the world about it" half)

Plus inline component descriptions for each new package and an
explicit forward-looking note about the deliverer-binary split
tracked in ROADMAP.md, so future agents have a route from "I see
two delivery workers" to "yes, intentionally — they unify when the
binary extracts."

WPCOM is shown as the legacy notification path coexisting with
alert contacts, matching the design decision documented in API.md
"Family 5 → Relationship to legacy WPCOM notifications."
Adds docs/adr/ with a README explaining the format and seven ADRs
covering the load-bearing decisions on this branch — the kind of
"why is X like this" question that has been re-explained more than
once in code review, commit messages, and inline design rationale.

Each ADR is short (Status / Context / Decision / Consequences, plus
Alternatives where useful) and cross-links to the others and to the
relevant code paths.

  0001 — Event-sourced state model with dedicated transitions table
  0002 — Internal-only API behind a gateway
  0003 — Plaintext credential storage for outbound dispatch
  0004 — Stripe-style HMAC-SHA256 webhook signatures
  0005 — Pull-only webhook and alerting delivery
  0006 — Separate internal/alerting and internal/webhooks packages
  0007 — Soft-lock claim vs SELECT ... FOR UPDATE SKIP LOCKED

AGENTS.md picks up a Key Files row pointing future agents at
docs/adr/. The README explains the conventions: append-only after
acceptance, one decision per ADR, cross-link generously, don't
backfill speculatively.

No code changes; pure docs.
Chris Jean added 22 commits April 27, 2026 11:56
API key rotation uses revoked_at for two different operational shapes: immediate revocation sets it to now, while a rolling rotation may set it in the future so the old token keeps working until consumers have deployed the replacement.

Lookup treated any non-NULL revoked_at as terminal. That made the documented grace-window path impossible because the old key was rejected immediately instead of at the configured cutoff time.

Change Lookup to reject only when revoked_at is now or in the past. Add sqlmock coverage for both sides of the boundary: a future revoked_at remains valid and updates last_used_at, while a past revoked_at returns ErrKeyRevoked.
The shadow-v2-state migration needs Jetmon 2 to keep writing the v1 site_status / last_status_change projection while legacy readers are still in flight. The old DB_UPDATES_ENABLE name and JETMON_UNSAFE_DB_UPDATES startup guard described a local-only safety valve, which is the opposite of the production bridge we now need.

Add LEGACY_STATUS_PROJECTION_ENABLE as the explicit migration switch, default it on, and keep DB_UPDATES_ENABLE as a deprecated alias so older configs keep their behavior until they are rewritten. If both keys are present, the new name wins.

The orchestrator now uses config.LegacyStatusProjectionEnabled() for every compatibility projection write, and the fatal unsafe-env guard is removed from startup. Config tests cover the default, alias behavior, and new-key precedence; orchestrator tests opt out via the new field.
During the shadow-v2-state migration, jetpack_monitor_sites remains the table we page through for site identity and configuration. It must not remain the API's source of truth for current incident state, especially once the legacy site_status projection is intentionally disabled.

List responses now reflect the worst open jetmon_events row for each returned site before state/severity filtering runs. Single-site scans use the legacy site_status fallback only while LEGACY_STATUS_PROJECTION_ENABLE is true; with projection disabled, no open v2 event means the API reports Up even if the old v1 column is stale.

Administrative event closes now project site_status back to running inside the same close transaction when the compatibility projection is enabled and no active events remain. That removes the separate maybeProjectRunning follow-up write and keeps manual close / trigger-now behavior aligned with the eventstore invariant.

Tests cover active-event list rollups, stale legacy status when projection is disabled, and the revised transaction shape for close paths.
The code now supports the migration shape we settled on: v2 event tables own incident state, while jetpack_monitor_sites stays in place as the legacy site/config table and temporary compatibility projection. The docs needed to say that directly instead of carrying forward the old DB_UPDATES_ENABLE / unsafe-local-test framing.

Add ADR 0008 for the shadow-v2-state decision, including the context, rejected alternatives, rollout consequences, and the rule that legacy site_status is a projection rather than source of truth. README, ARCHITECTURE, EVENTS, CHANGELOG, AGENTS.md, ROADMAP, config.readme, and config-sample.json now use LEGACY_STATUS_PROJECTION_ENABLE and describe when it can be disabled.

API.md also documents two adjacent migration details that matter operationally: future revoked_at values are key-rotation grace windows, and the current single-binary deployment should run API/webhook/alert delivery workers on only one active instance per DB cluster until transactional row claiming or a deliverer split lands.

No behavior changes in this commit; it is the written contract for operators, reviewers, and future agents working on the v2 branch.
Three follow-ups from a review of the morning's shadow-v2-state work.

Active-event rollup (handlers_sites.go) used a ROW_NUMBER() OVER
(PARTITION BY ...) window function. Window functions require MySQL 8.0+,
which conflicts with the documented "MySQL 5.7+ in production" target in
.claude/rules/general-guidelines.md. Replace the query with a flat
SELECT over open events for the bounded blog_id set and reduce to the
worst (highest severity, earliest started_at) per blog_id in Go. The IN
list is capped by the API's max page size (200) and a site rarely has
more than one open event, so the in-Go reduction is cheap. Adds tests
covering severity-wins and severity-tie-with-earlier-started-at picks.

API key cutoffs (apikeys.go) now share half-open semantics: a key is
valid for times strictly before the cutoff and rejected at or after it.
Previously revoked_at used !Before(now) and expires_at used After(now),
so a key was rejected at exactly revoked_at but still valid at exactly
expires_at. Pick the strict-less-than convention to match JWT/OAuth `exp`
behavior and document the invariant in the code and API.md. Adds tests
for the boundary on both sides of expires_at.

Projection state visibility (cmd/jetmon2/main.go) now logs
`config: legacy_status_projection=enabled|disabled` at startup and adds
the same line to `./jetmon2 validate-config` output. The fatal
JETMON_UNSAFE_DB_UPDATES guard is intentionally not reinstated; the
ADR-0008 default is projection-enabled in production, and the guard
fights that default. Loud observability is the right shape here.
Housekeeping pass over LSP-flagged diagnostics that were unrelated to
today's earlier shadow-state work but worth a sweep while the analysis
tooling was attached.

- internal/api/handlers_events.go: drop the unused handleListAllEvents
  handler. Not routed and the 'future admin tooling' rationale would
  better belong with the actual route when one is added.
- internal/api/middleware.go: remove the unused scopeAdmin alias and
  the unused ctx parameter from (*Server).audit; audit.Log doesn't take
  context today, so threading it was misleading. All five call sites
  updated.
- internal/api/testhelp_test.go: remove unused invokeWithMux helper and
  rename the unused server arg in invokeAuthed to _ so the signature is
  honest without churning every call site.
- internal/api/handlers_sites_test.go: drop the no-op trimSQL helper.
- cmd/jetmon2/main.go: writePIDFile now uses fmt.Appendf to build the
  pid bytes directly, and the local repeat helper is replaced with
  strings.Repeat. main_test.go drops the now-redundant TestRepeat that
  covered the removed helper.
- go.mod: promote github.com/DATA-DOG/go-sqlmock to a direct dependency
  (it is imported by package tests, not just transitively).

go test ./... and go vet ./... both clean.
Whitespace alignment only — no functional change. The set was flagged
by `gofmt -l .` and consisted entirely of struct-field and var-block
alignment differences in files that hadn't been re-run through gofmt
since they last grew a wider field. Repo is now `gofmt -l` clean.
The previous housekeeping commit deleted handleListAllEvents, which was
the only caller that passed nil to listEvents. The function still
carried a `siteID *int64` parameter and a `WHERE 1=1` shape so the
optional `AND blog_id = ?` clause could be appended conditionally.

With every caller now guaranteed to pass a real blog id, change the
parameter to `siteID int64`, start the WHERE with `blog_id = ?`, and
seed the args slice with siteID directly. Tests updated to match the
new SQL shape.
Several top-level docs still described the Veriflier path as a live
gRPC server and API.md as a not-yet-implemented proposal. That no
longer matches the current packages: internal/veriflier is
JSON-over-HTTP, the API/webhook/alerting surfaces exist, and eventstore
owns event transitions while audit remains operational context.

Update README, PROJECT, ARCHITECTURE, and API to describe the current
transport, package map, table set, and coarse read/write scopes. Also
fix the developer build instructions for veriflier2 so they build into
bin/ rather than targeting the parent directory, which fails from the
repo root.

Verified with:
- go test ./...
- go vet ./...
- go build -o bin/jetmon2 ./cmd/jetmon2/
- go build -o bin/veriflier2 ./veriflier2/cmd/
- git diff --check
API.md still carried a few design-era details after the architecture docs
cleanup. Some of those details described behavior that is not routed today
or response shapes that no longer match the handler structs, which makes the
file risky as an implementation reference for internal consumers.

Update the API reference to match internal/api and the related webhook and
alerting packages: one per-key rate-limit bucket, POST-only idempotency,
current site create and trigger-now payloads, current SLA/stat response
shapes, implemented webhook and alert-contact route lists, and the actual
webhook/alert delivery retry and suppression behavior. Also mark the
OpenAPI endpoint as planned rather than live.

Verified with:
- go test ./...
- go vet ./...
- git diff --check
EMAIL_TRANSPORT previously accepted any string and the alerting path fell through to the stub sender for unrecognized values. That made a typo look like a working configuration while silently avoiding the intended email delivery path.

Teach config validation which email transports are supported and enforce the required SMTP or WPCOM settings for the active mode. Document the transport keys in the sample config and operator docs, and cover the default stub behavior, invalid values, required transport fields, and sample config loading in tests.

Verified with:

- go test ./internal/config ./cmd/jetmon2 ./internal/alerting ./internal/api

- go test ./...

- go vet ./...

- git diff --check

- make build

- make build-veriflier
The webhook and alerting delivery queues already pushed next_attempt_at forward after selecting ready rows, but ClaimReady still returned every selected row even when the soft-lock UPDATE did not affect anything. If another claimant moved the row first, the stale row could still be dispatched by this worker.

Repeat the readiness predicate in the soft-lock UPDATE and check RowsAffected before returning a claimed delivery. This keeps overlapping claim attempts from doing duplicate dispatch work while preserving the existing soft-lock model until the future SELECT FOR UPDATE SKIP LOCKED path lands.

Add matching regression coverage for webhook and alert-contact deliveries so a row that loses the soft-lock race is skipped rather than returned to the deliver loop.

Verified with:

- go test ./internal/webhooks ./internal/alerting

- go test ./...

- go vet ./...

- git diff --check

- make build

- make build-veriflier
EMAIL_TRANSPORT validation already rejects unknown values and enforces
the required SMTP / WPCOM settings, but it accepts "stub" (and the
empty-string compatibility alias) silently. An operator who configures
alert contacts with transport="email" while leaving EMAIL_TRANSPORT as
the default ends up with rendered messages that go to the log instead
of the destination. There is no startup signal that this is happening.

Emit `email_transport=<mode>` at startup alongside the existing
legacy_status_projection line, and add a WARN line when the mode is
"stub" (or empty) so the missing delivery path is visible in logs.
`./jetmon2 validate-config` surfaces the same INFO + WARN pair so the
warning is also visible during pre-deploy validation.

The emailTransportLabel helper canonicalizes the empty-string alias to
"stub" so logs and validate-config show one name regardless of which
form an operator wrote.
ClaimReady reuses candidates' backing array via `claimed := candidates[:0]`
to drop rows that lost the soft-lock race without a second allocation.
The pattern is safe here because the write index is always <= the read
index and the candidates slice is not returned in its original form, but
that contract is not obvious to a reader skimming the loop. Add a
two-line comment in both webhook and alert-contact deliveries.go so the
next reader does not relax the invariant by accident.
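The slice-reuse invariant that comment protects looks like this in isolation (`keepClaimed` is an illustrative name):

```go
package main

import "fmt"

// keepClaimed filters in place: claimed shares candidates' backing array
// via candidates[:0]. Safe because the write index never passes the read
// index and the original slice is not used again in its original form.
func keepClaimed(candidates []int64, won func(int64) bool) []int64 {
	claimed := candidates[:0]
	for _, id := range candidates {
		if won(id) {
			claimed = append(claimed, id)
		}
	}
	return claimed
}

func main() {
	ids := []int64{1, 2, 3, 4}
	// Rows that lost the soft-lock race are dropped, with no second allocation.
	fmt.Println(keepClaimed(ids, func(id int64) bool { return id%2 == 0 }))
}
```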
The startup warning added for stub email transport depends on small command-package helpers and dispatcher wiring. That behavior was covered by package compilation but not by focused tests, so a future cleanup could accidentally change the operator-facing mode labels or drop the email dispatcher without a narrow failure.

Add tests for the EMAIL_TRANSPORT label and delivery classification helpers, including the empty-string compatibility alias. Also verify that buildAlertDispatchers wires every managed alert transport and that the stub email dispatcher can render and send successfully.

Verified with:
- go test ./cmd/jetmon2
- go test ./...
- go vet ./...
- git diff --check
- make build
- make build-veriflier
The recent EMAIL_TRANSPORT startup warning made stub email mode visible in logs and validate-config output, but the operator docs still described stub mostly as a unit-test sender and README/PROJECT still claimed validate-config checked Veriflier reachability and WPCOM certificate state.

Update the API reference, README, config reference, PROJECT, and changelog so the documented operator behavior matches the current code: stub is the default log-only sender, startup and validate-config warn when email contacts will not deliver mail, validate-config reports Veriflier configuration as context rather than failing on reachability, and the delivery soft-lock changelog now includes the stale-claim RowsAffected fix.

Verified with:
- go test ./...
- go vet ./...
- git diff --check
- make build
- make build-veriflier
README and PROJECT still described a broader operator dashboard than the code currently serves, including a live System Health Map, Veriflier status, slowest-site lists, and dependency health details that are not wired into the dashboard publisher today.

Narrow the operator-facing docs to the metrics and WPCOM circuit state that the dashboard actually renders, and recast the broader health grid as future work with the existing /api/health shape but no live publisher yet. Also update the validate-config inline comment so it says the Veriflier list is operator context rather than a reachability check.

Verified with:
- go test ./cmd/jetmon2 ./internal/dashboard
- go test ./...
- go vet ./...
- git diff --check
- make build
- make build-veriflier
The API health handler assumed Server.db was always populated and called PingContext directly. That is true for normal startup, but the server constructor accepts a nil DB and the existing nil-DB test only verified that the route compiled rather than exercising the handler behavior it described.

Return the same db_unavailable 503 envelope when the database handle is missing, and turn the no-op test into a real regression test for that path. This keeps /api/v1/health predictable even if a caller builds the API server before wiring storage.

Verified with:
- go test ./internal/api
- go test ./...
- go vet ./...
- git diff --check
- make build
- make build-veriflier
The v3 probe-agent discussion should be preserved without changing the v2 production target. The important constraint is sequencing: deploy and stabilize v2 first, gather production data, then revisit whether Jetmon should evolve beyond the current main-server-plus-Veriflier confirmation model.

Add a planning note that records the v1 -> v2 -> v3 migration framing, the production data to collect during v2, the current v2 baseline, candidate future architectures, and the current recommendation to treat a central scheduler plus regional probe agents as the leading v3 option. Link the note from ROADMAP so it stays visible but clearly parked until after v2 is stable.

Verified with:
- git diff --check
- rg -n '[[:blank:]]$' ROADMAP.md docs/v3-probe-agent-architecture-options.md
The docs directory now contains both accepted ADRs and future-facing planning notes. Without a top-level index, the new v3 probe-agent architecture note is only discoverable through ROADMAP or by listing files.

Add docs/README.md to explain the distinction between architecture decisions and planning notes, point readers at the ADR index, and link the v3 probe-agent architecture options note.

Verified with:
- git diff --check
- rg -n '[[:blank:]]$' docs/README.md
- go test ./...
- go vet ./...
The roadmap still described the public REST API as if Jetmon had no API surface at all, which now conflicts with the internal /api/v1 work.

Reframe the item around the remaining customer-facing contract: gateway semantics, tenant ownership, sanitized errors, public rate limits, and payloads safe for external tooling.

Verified with:
- git diff --check
- go test ./...
- go vet ./...
Several docs and comments still implied that Monitor-to-Veriflier traffic had already moved to gRPC. The implementation is intentionally JSON-over-HTTP today, with the proto contract kept for a future generated gRPC swap.

Clarify the wording around the existing transport while leaving the legacy config names alone for compatibility.

Verified with:
- git diff --check
- go test ./...
- go vet ./...
- make build
- make build-veriflier
Chris Jean and others added 7 commits May 3, 2026 00:39
Move the exact due-count and legacy projection-drift diagnostic queries off the every-round hot path. Variable-interval scheduling still fetches due pages until drained, but the broad reporting queries now run at a one-minute cadence instead of every five-second idle poll.

Emit a scheduler.round.due_count_sampled gauge so operators can distinguish a fresh due-count sample from a round that intentionally skipped the broad read. Add an orchestrator test that proves due-count and projection-drift queries are sampled, skipped before the cadence elapses, and sampled again after the interval.
Add a next_check_at column and scheduler index so variable-interval site selection can use a simple indexed due predicate instead of recalculating DATE_ADD(last_checked_at, check_interval) across active rows on every fetch.

Maintain next_check_at in both single-row and batched freshness writes using each site's check interval. Existing NULL values remain due until checked, which preserves safe catch-up behavior after the additive migration without requiring a production-wide backfill update.

Update the schema reference docs and tests so the new column is scanned, written, and used by due-count queries.
Bring in the scheduler diagnostics, event mutation retry handling, and capacity test documentation updates so the integration branch reflects the latest 1,000-site stability fixes before layering on the newer scheduler and transport optimizations.
Add the shared bounded checker transport to reduce per-check allocation and connection setup overhead during larger capacity runs while preserving per-site timeout behavior.

# Conflicts:
#	docs/roadmap.md
#	internal/checker/checker.go
Add sampled broad scheduler reports so variable-interval rounds do not perform exact due-count and projection-drift scans on every short poll during high-volume capacity runs.

# Conflicts:
#	docs/roadmap.md
#	internal/orchestrator/orchestrator.go
#	internal/orchestrator/orchestrator_test.go
Bring in the next_check_at migration, indexed variable-interval due query, and write-path maintenance so the stress-test branch can exercise the lower-CPU scheduler path alongside the other efficiency changes.

# Conflicts:
#	AGENTS.md
#	docs/architecture.md
#	docs/data-model.md
#	internal/db/migrations.go
#	internal/db/queries.go
#	internal/db/queries_test.go
…-stress-20260503

Improve Jetmon v2 capacity scaling and scheduler efficiency

This merges the capacity-stress integration branch after the successful Jetmon v2 capacity scout through 10,000 active monitors. The branch materializes scheduler due times, reuses checker HTTP transports, reduces broad scheduler report overhead, batches changed SSL expiry writes, and adds capacity diagnostics/documentation for the next scaling passes.

Validation included go test ./... and the capacity-scout-20260503-134229Z run, where Jetmon v2 passed 1,000, 5,000, and 10,000 active-monitor batches with zero stale active rows and zero missed checks. Follow-up scaling work should push past 10,000 carefully: the 10,000-site p95 check age was 267s and the oldest age was 289s, still inside the 5-minute freshness window but close enough that MySQL load and the check-age distribution should be watched.
chrisbliss18 and others added 22 commits May 4, 2026 14:56
* Detect truncated GET response bodies

Jetmon v2 previously treated a successful HTTP status as enough evidence for a healthy GET unless a keyword check was configured. That left a gap where a server could return 200 OK, close before satisfying Content-Length, and still be marked healthy.

Always perform a bounded response-body integrity read for successful checks, surface early body close/read failures as a hard intermittent checker result, and keep keyword matching inside the existing larger bounded body window. This keeps the runtime cost controlled while letting both the local checker and Veriflier catch partial GET responses.

Document the new checker error code and classify it explicitly in telemetry reports so body-read failures are visible as transport-style intermittent failures.

* Swallow downtime failures during maintenance

Jetmon v2 previously applied maintenance only after local failures could open Seems Down events or after verifier confirmation could promote them to Down. That still surfaced covered maintenance as downtime in event/API consumers even though WPCOM notifications were suppressed.

Move maintenance handling ahead of retry and event mutation for failing checks. Covered failures now record operational audit context, emit maintenance-swallowed metrics, avoid verifier escalation, clear any pending local retry, and close an already-open HTTP event with maintenance_swallowed while projecting the legacy site status back to running.

Update the maintenance docs to distinguish incident suppression from notification suppression, and make the maintenance-window check use the orchestrator time source so tests and rollout simulations can exercise it deterministically.

* Report deprecated TLS as an advisory event

The benchmark showed TLS 1.0/1.1 sites being treated as hard TLS failures instead of advisory findings. The checker already had an advisory error code, but the Go client could reject deprecated handshakes before Jetmon observed the negotiated version, and the orchestrator did not persist a dedicated advisory event.

Permit TLS 1.0/1.1 negotiation long enough to classify the site, capture the negotiated cipher suite, and open a site-level tls_deprecated warning event that does not enter the downtime retry pipeline or project legacy site status down. Close that warning when a later check negotiates TLS 1.2 or newer.

Document the advisory behavior and the custom-header security caveat so operators know deprecated TLS is not downtime, but should still be remediated before using sensitive per-site headers.

* Guard GET-based uptime checks against HEAD regressions

The remaining uptime-bench HEAD-failure cases should not be fixed by changing Jetmon v2 back toward HEAD semantics. Jetmon v2 intentionally checks sites with GET because HEAD-only probing was a core source of v1 false positives and false negatives.

Record the probe method in HTTP failure metadata and add regression tests for targets where HEAD returns 405 or would time out while GET remains healthy. These tests make the intended method behavior explicit and give future incident metadata enough context to prove which request method opened a Seems Down event.

* Add forbidden response keyword checks

Introduce a separate forbidden_keyword site setting so Jetmon can fail checks when known bad response-body text appears without overloading the existing required check_keyword behavior.

Propagate the setting through the checker, orchestrator, Veriflier transport, internal API, API CLI, schema migrations, and documentation. Forbidden matches now return ErrorKeyword with keyword_rule=forbidden metadata so operators can distinguish them from missing required keywords during incident review.

Covered by checker regressions, trigger-now API coverage, API/DB fixture updates, and a full go test ./... run.
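A sketch of the two keyword rules side by side, assuming invented function names: the forbidden rule fires when known-bad text appears, the required rule fires when expected text is missing, and the returned rule tag mirrors the `keyword_rule=forbidden` metadata the commit describes.

```go
package main

import (
	"fmt"
	"strings"
)

// checkKeywords evaluates the required check_keyword and the separate
// forbidden_keyword setting; a failure is tagged with which rule tripped
// so operators can distinguish the cases during incident review.
func checkKeywords(body, required, forbidden string) (ok bool, rule string) {
	if forbidden != "" && strings.Contains(body, forbidden) {
		return false, "forbidden" // known-bad content appeared
	}
	if required != "" && !strings.Contains(body, required) {
		return false, "required" // expected content missing
	}
	return true, ""
}

func main() {
	ok, rule := checkKeywords("<html>hacked by example</html>", "my-site", "hacked by")
	fmt.Println(ok, rule) // false forbidden
}
```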

* Record check request methods in history

Add a compact request_method column to jetmon_check_history so operators can verify which HTTP method was used for each timing sample during v2 rollout and uptime-bench review.

The orchestrator now writes checker.Result.Method into each check-history row, db.RecordCheckHistory normalizes empty or oddly-cased method values to a bounded uppercase value, and the schema/migration reference documents request_method as defaulting to GET for existing rows.

This intentionally avoids storing the full monitor URL in the high-volume history table; failed incidents already carry URL and reason metadata on the event row.

Covered by db normalization tests plus go test ./internal/db ./internal/orchestrator and go test ./....
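The normalization step can be sketched as below. The function name, the 8-character bound, and the fallback are assumptions for illustration; the `GET` default matches the documented default for existing rows.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeMethod turns empty or oddly-cased method values into a bounded
// uppercase value so the compact request_method history column stays
// predictable and small.
func normalizeMethod(m string) string {
	m = strings.ToUpper(strings.TrimSpace(m))
	if m == "" {
		return "GET" // documented default for pre-migration rows
	}
	if len(m) > 8 {
		m = m[:8] // keep the high-volume column compact
	}
	return m
}

func main() {
	fmt.Println(normalizeMethod("get"), normalizeMethod(""), normalizeMethod(" head "))
}
```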

* Document uptime-bench Jetmon v2 follow-up

* Preserve forbidden keywords in API bulk adds

Bulk site sources could parse forbidden_keyword from JSON and CSV input, but planAPIBulkSiteCreates did not copy it into the generated create request. That meant bulk-add silently dropped the forbidden-content rule before the request reached the API.

Forward ForbiddenKeyword through the planner and expand JSON, CSV, cycling, and marshal coverage so bulk-created sites retain the same forbidden-content checks as typed single-site creates.

Validated with go test ./cmd/jetmon2 ./internal/api ./internal/checker ./internal/orchestrator and go test ./....

* Avoid duplicate check-history method migration

The request_method branch added the column to both the original jetmon_check_history create-table migration and the later ALTER migration. Fresh databases would create the column in migration 6, then fail when migration 28 attempted to add it again.

Keep migration 6 stable and let migration 28 own the additive request_method schema change. Add migration-table coverage so future edits do not reintroduce the duplicate-column path.

Validated with go test ./internal/db ./internal/orchestrator and go test ./....

* Add vet alias for validation workflow

Expose make vet as an explicit alias for the existing lint target so branch and handoff verification steps can ask for vet directly without changing the underlying go vet command.

Validated with make vet and go test ./... on the updated uptime-bench integration branch.

* Document uptime-bench scenario coverage gaps

Add a dedicated roadmap section for scenario classes found during the read-only uptime-bench review that Jetmon v2 does not fully support yet.

Separate adapter-level inverted keyword wiring from larger production capabilities such as multi-pattern body rules, content baselining, geo-scoped vantage semantics, and explicit DNS monitors so each follow-up can be implemented on its own branch.

* Add multi-pattern forbidden content checks

Extend Jetmon v2 content checking beyond the single forbidden_keyword string by adding a forbidden_keywords JSON array on jetpack_monitor_sites.

The checker now scans both the legacy single forbidden keyword and the new explicit bad-content list, and the orchestrator, trigger-now path, and Veriflier transport all preserve those rules during local and remote confirmation checks.

Expose the new field through the internal sites API, API CLI create/update/bulk-add flows, schema migrations, tests, and operator docs so uptime-bench and production operators can model injected scripts, SEO spam links, and other known-bad body markers without relying on broad content baselining.
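The dual scan (legacy single keyword plus the new list) can be sketched as one matcher. Names and the return shape are illustrative; returning the matched pattern lets incident metadata name the bad marker that tripped the check.

```go
package main

import (
	"fmt"
	"strings"
)

// matchForbidden scans the body for the legacy single forbidden_keyword
// and every entry in the forbidden_keywords list, reporting the first
// pattern that appears.
func matchForbidden(body, legacy string, list []string) (string, bool) {
	patterns := list
	if legacy != "" {
		patterns = append([]string{legacy}, list...)
	}
	for _, p := range patterns {
		if p != "" && strings.Contains(body, p) {
			return p, true
		}
	}
	return "", false
}

func main() {
	body := `<script src="//evil.example/inject.js"></script>`
	hit, found := matchForbidden(body, "", []string{"casino-spam", "evil.example"})
	fmt.Println(hit, found) // evil.example true
}
```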

* Document deferred uptime-bench capability decisions

Expand the uptime-bench scenario coverage roadmap with the decisions to defer full content baselining, customer-visible regional classifications, and explicit DNS monitor types.

The roadmap now records the smaller near-term work that is still appropriate for v2: conservative empty-body detection, richer DNS diagnostics on HTTP lookup failures, and Veriflier vote evidence for operator diagnostics.

Each deferred item explains why it is parked for now so future branches do not accidentally turn benchmark gaps into poorly scoped production behavior.

* Add agent playbooks for Jetmon test work

Document the guardrails agents should follow while uptime-bench or Jetmon capacity tests are active. The notes make it explicit that deployed services, support hosts, databases, provider state, fleet config, and runtime config should not be changed during active tests without approval.

Add project-local agent skills for handoff preparation, safe background work, and Jetmon test fleet handling so future sessions can preserve the same safety model when switching branches, repos, or test environments.

---------

Co-authored-by: Chris Jean <chris.jean@automattic.com>
Bring feature/service-handoff-prelaunch-recommendations forward after the uptime-bench v2 feature fixes landed on v2.

The conflict resolution keeps the branch's prelaunch documentation and rollout guidance while preserving the merged v2 detection work, including forbidden-content checks, request-method history, maintenance suppression, TLS advisory events, and the API/docs updates.
Capture bounded checker diagnostics on failed probes, including redirect chains, final URLs, TLS version/cipher details, and truncated transport errors. This gives operators more evidence about what Jetmon saw without storing response bodies.

Schedule failed variable-interval checks for a bounded one-minute follow-up when the site's normal check interval is longer, so transient incidents are observed again sooner while successful checks keep their normal cadence.

Document the updated scheduler semantics and event metadata shape, and track the remaining verifier-confirmed Down benchmark coverage gap in the roadmap.
…tion (#87)

* config: add BODY_READ_MAX_BYTES and BODY_READ_MAX_MS

* docs: document EOF policy scope, exemptions, and body-read budgets

* Add dedicated keyword read budget configuration

* Propagate body-read budgets through verifier checks

* Add KEYWORD_READ_MAX_MS config and document timeout semantics

* Propagate keyword read timeout budget to veriflier checks

* checker/config: finalize body-read alias and config validation cleanup

* checker: raise body-read default and harden 1MiB boundaries
The latest all-services gapfill report exposed two useful follow-ups for Jetmon v2. TLS 1.1 scenarios were being treated as SSL outages because the checker refused the deprecated handshake before the orchestrator could classify the site as advisory-only tls_deprecated. Lower the checker client minimum TLS version so the handshake can complete, while still recording the negotiated protocol and keeping certificate validation enabled.

Also preserve resolver-visible DNS failure details in checker results and event metadata. NXDOMAIN, SERVFAIL-like resolver failures, DNS timeouts, queried name, and resolver server are now available to operators when Go exposes that data. This does not claim to solve short authoritative DNS outages hidden by recursive cache TTLs; the roadmap now tracks that as a separate explicit-DNS-monitor design decision.

Validation: go test ./...
Certificate-expiry and deprecated-TLS events are local advisory observations. When later local probes show the condition has cleared, the resolved event should say probe_cleared rather than verifier_cleared so reports and operator history do not imply a verifier made the recovery decision.

Add regression coverage for both advisory cleanup paths and update the event reason documentation to describe advisory churn explicitly.
Bring feature/jetmon-v2-incident-observability forward to origin/v2 after the strict EOF and truncated-response work landed there.

Resolve the checker overlap by keeping both behaviors: the new body-read integrity limits from v2 and the DNS/deprecated-TLS evidence fields from this branch. go test ./internal/checker and go test ./... both pass after the merge.
Record body-read mode, bytes read, expected content length, read limit, and bounded read errors in checker results so partial and truncated responses carry enough operator evidence for support investigations.

Add detector_class and legacy_status_type to HTTP incident metadata. This keeps failure_class compatible with the old WPCOM status vocabulary while exposing the actual detector path, such as partial_response, content_failure, timeout, or dns_nxdomain.

Update event and operations docs and extend tests for checker body-read diagnostics and orchestrator metadata serialization.
Split WPCOM suppression accounting by the actual audit detail instead of treating every alert_suppressed row as a confirmed-down suppression. Recovery cooldown rows were previously eligible to count both as recovery suppression and down suppression, which could hide parity gaps during rollout review.

Add a focused regression test for the down/recovery suppression classifier so mixed maintenance and cooldown windows remain readable in telemetry reports.
Add the repo-owned prelaunch readiness tracker, rollout/support/operations guidance for launch posture and WPCOM parity, telemetry evidence in guided rollout flows, VM lab rehearsal updates, and a final telemetry fix so down and recovery suppressions are accounted independently during rollout review.
Add bounded event metadata for failed and advisory checks, including redirect chains, final URLs, DNS error classification, TLS version/cipher details, and body-read evidence for partial responses. Treat deprecated TLS and certificate-expiry advisories as locally observed probe-cleared conditions, keep hard TLS failures as outages, and schedule failed checks for a bounded fast follow-up so operators get fresher evidence without changing legacy WPCOM status semantics.
Merge the streaming monitor engine foundation for Jetmon v2. This adds the memory-backed time-wheel scheduler behind SCHEDULER_ENGINE=streaming, reduces healthy-check database writes, documents the validated 2M-site capacity evidence, and tracks the next scaling follow-ups for latency-aware concurrency and worker-scaler hardening.
* Add GitHub Actions workflow to publish Docker images to GHCR

Publishes ghcr.io/automattic/jetmon and ghcr.io/automattic/veriflier
on pushes to the v2 branch (tagged latest) and on pull requests
labeled "Docker Build" (tagged with the PR head short SHA).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document running Jetmon and Veriflier from GHCR images

Adds docs/docker-images.md covering pull, tag scheme, env vars, ports,
volume mounts, validate-config, reload/drain, and PR-build pinning.
Links the new doc from the README documentation table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Post PR comment with docker pull commands after image build

Adds a comment-pr job that runs after build-and-push on pull_request
events and upserts a sticky PR comment containing docker pull commands
for the freshly built jetmon and veriflier images. Uses an HTML marker
to identify the comment so subsequent runs update it in place rather
than appending duplicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Harden Jetmon v2 so local DNS-shaped failures after recovery do not create customer-visible HTTP downtime unless Verifliers confirm the outage.

This preserves real downtime detection while treating monitor-local NXDOMAIN, timeout, and resolver failures as low-confidence signals until independent confirmation is available. The branch also records resolver-source metadata, extends false-alarm dampening across unstable recovery periods, and adds regression coverage for the new retry and confirmation paths.

Validated with go test ./... and the focused uptime-bench post-recovery false-positive suite, which passed 4/4 scenarios with no false positives and preserved TLS advisory detection.
Introduce staged Jetmon v2 rollout policy so production can first replace v1 using legacy-compatible HEAD checks, then migrate selected cohorts to GET with simple HTTP behavior, and finally enable the full v2 detection profile.

Move v2-only site configuration and runtime state out of jetpack_monitor_sites into Jetmon-owned sidecar tables, keeping the legacy site table v1-shaped for safer rollout, rollback, and compatibility testing. Update monitor checks, Veriflier requests, API site flows, bulk import, rollout checks, VM lab helpers, and docs to honor default and per-site check policies.

Add low-cardinality StatsD counters for live rollout cohort visibility by effective request method and detection profile, so operators can confirm HEAD/legacy, GET/simple_http, and GET/full traffic without querying MySQL.

Validated with local tests and uptime-bench focused plus 3-hour soak runs against the stripped legacy schema.
Treat failed local checks for already confirmed-down sites as ongoing incident observations instead of sending them back through the retry and Veriflier-confirmation path. This keeps recovery checks active while avoiding duplicate verifier traffic, repeated confirmed-down logs, and repeated WPCOM notification attempts during persistent outages.

Add still-down StatsD counters so operators can see the continuing failed observations without creating duplicate confirmation work. Document the behavior in the operations guide and cover the guard with an orchestrator unit test.
PR #101 is now stale because the merged streaming monitor engine replaced the older round/page scheduler path that branch optimized. Merging it directly would regress current v2 work, but the review identified a few useful ideas worth preserving.

Add roadmap entries for permanent WPCOM status handling, streaming-aware transport failure-storm suppression, and evidence-led evaluation of any remaining jetpack_monitor_sites blog_id indexing needs. These notes give the follow-up branch a scoped paper trail without keeping the superseded PR open.
Classify WPCOM 404 and 410 responses as permanent per-notification failures instead of transport failures. These errors now bypass the global WPCOM circuit breaker, skip pointless immediate retry pressure, emit permanent-failure metrics, and write an audit failure row so operators still have an evidence trail.

Also expose ErrCircuitOpen so the orchestrator can treat already-queued notifications as queued rather than retrying into an open circuit. Add bounded queue-drop logging to keep broad WPCOM outages from flooding logs.

Finally, make the streaming engine report pressure-suppressed local timeout/connect failures. This preserves the existing failure-storm guard while giving sysadmins a visible counter for monitor-side pressure suppression.

Tests: go test ./internal/wpcom ./internal/orchestrator; go test ./...
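The permanent-vs-transport split reduces to a status-code classification. The 404/410 grouping is from the commit; the function name and the exact treatment of other codes are assumptions for illustration.

```go
package main

import "fmt"

// classifyNotifyStatus decides how a WPCOM notification response feeds the
// delivery machinery: 404/410 are permanent per-notification failures that
// bypass the circuit breaker and skip retries, while other non-2xx codes
// stay transport-style failures that count against the breaker.
func classifyNotifyStatus(code int) string {
	switch {
	case code == 404 || code == 410:
		return "permanent" // audit row + permanent-failure metric, no retry
	case code >= 200 && code < 300:
		return "delivered"
	default:
		return "transport" // feeds the global WPCOM circuit breaker
	}
}

func main() {
	fmt.Println(classifyNotifyStatus(410), classifyNotifyStatus(503), classifyNotifyStatus(200))
}
```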
Merge PR #110 to preserve and implement the still-relevant follow-ups from the superseded PR #101 work. This adds typed WPCOM status errors, treats 404/410 responses as permanent per-notification failures rather than global circuit-breaker failures, records queued/permanent-failure metrics and audit rows, bounds WPCOM queue-drop logging, and exposes streaming pressure-suppression metrics so operators can distinguish monitor-side transport pressure from site incidents.

Validation passed locally with go test ./internal/wpcom ./internal/orchestrator and go test ./....
Add production-data audit and legacy status bootstrap commands so rollout operators can evaluate a copied v1 site table before cutover.

The commands default to dry-run behavior, refuse unsafe duplicate active blog_id rows unless explicitly allowed, and bootstrap v2 events from legacy non-running status rows when executed. The rollout docs now point operators at the audit and bootstrap steps before starting migration.

Tests: go test ./...
Use jetpack_monitor_site_id as the endpoint identity for HTTP monitor execution so active duplicate blog_id rows with distinct monitor URLs are checked and tracked independently.

The checker, Veriflier transport, scheduler maps, streaming planner, retry state, HTTP event identity, and legacy projection writes now carry the monitor row id when available while preserving blog_id as the WPCOM/site identity.

Tests: go test ./...
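The identity rule can be sketched as a simple fallback. The struct and function are illustrative; only the column name `jetpack_monitor_site_id` and the blog_id fallback behavior come from the commit.

```go
package main

import "fmt"

type site struct {
	MonitorRowID int64 // jetpack_monitor_site_id, when available
	BlogID       int64 // WPCOM/site identity
}

// endpointIdentity keys HTTP monitor execution on the monitor row id so
// duplicate active blog_id rows with distinct monitor URLs are checked and
// tracked independently; rows without a row id fall back to blog_id.
func endpointIdentity(s site) int64 {
	if s.MonitorRowID > 0 {
		return s.MonitorRowID
	}
	return s.BlogID
}

func main() {
	a := site{MonitorRowID: 9001, BlogID: 42}
	b := site{MonitorRowID: 9002, BlogID: 42}
	fmt.Println(endpointIdentity(a) != endpointIdentity(b)) // true: same blog, distinct endpoints
}
```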