
Add staged HEAD/GET rollout check modes #109

Merged

chrisbliss18 merged 3 commits into v2 from feature/rollout-staged-check-modes on May 13, 2026

Conversation

@chrisbliss18 (Contributor) commented May 13, 2026

Summary

This PR adds the staged check-mode rollout path for Jetmon v2 so production can replace v1 without immediately changing probe semantics. The rollout can start with v1-compatible HEAD checks, then move controlled cohorts to GET checks with the same basic HTTP interpretation, and finally enable the richer v2 detection set after the GET rollout is stable.

The branch also keeps jetpack_monitor_sites v1-shaped by moving v2-only site configuration and runtime freshness state into Jetmon-owned sidecar tables. This lets the test DB act as a compatibility canary: the old v2-added columns and scheduler indexes have been removed from jetpack_monitor_sites, so accidental references fail quickly.

Check Methods

| Method | Purpose | Behavior |
| --- | --- | --- |
| HEAD | Legacy-compatible rollout mode | Sends HEAD requests like Jetmon v1. This is intended for the initial v2 host replacement, when we want the new service to behave as close to v1 as possible. Because HEAD responses do not include a response body, body-based detections cannot run. |
| GET | Jetmon v2 steady-state mode | Sends GET requests so Jetmon sees the response path customers actually use. This is the long-term default after rollout, but it can first be enabled with simple detections before turning on the full v2 detection set. |

Detection Profiles

| Profile | Purpose | Behavior |
| --- | --- | --- |
| legacy | Initial replacement semantics | Preserves the v1-style interpretation as much as possible: basic request success/failure and HTTP status behavior without new rich detections. This pairs with HEAD for the first rollout phase. |
| simple_http | GET transport without rich v2 detections | Uses the selected request method, normally GET, but suppresses body-based and richer v2 detections. This lets operators validate that switching from HEAD to GET is stable before adding more alert classes. |
| full | Full Jetmon v2 detection mode | Enables the richer v2 checks that depend on GET/body data and expanded metadata, including required/forbidden body content, strict/body-read evidence, redirect policy behavior, and richer TLS/HTTP diagnostic signals. |

HEAD automatically caps the effective profile to simple_http when full is requested because HEAD cannot support body-based checks. This keeps the configuration safe: a site cannot silently claim full body validation while still using HEAD.

Three-Stage Rollout Plan

| Stage | Default policy | Goal | What operators watch |
| --- | --- | --- | --- |
| 1. Replace v1 with v2 | DEFAULT_CHECK_METHOD=HEAD, DEFAULT_DETECTION_PROFILE=legacy | Swap the service implementation while preserving v1 probe behavior. This isolates binary/runtime/schema risk from probe-semantics risk. | Missed checks, projection drift, verifier behavior, WPCOM parity, runtime health, and any unexpected difference from v1 behavior. |
| 2. Move cohorts to GET | Per-site or cohort policy moves to GET + simple_http | Validate GET as the transport path without introducing the full v2 detection set at the same time. This directly addresses the v1 HEAD-only limitation while limiting false-positive risk. | Cohort-specific incident volume, blocked/WAF behavior, response-code differences between HEAD and GET, verifier confirmation, support explanations, and rollback readiness. |
| 3. Enable full v2 detections | Per-site or cohort policy moves to GET + full | Turn on the richer v2 detection set once GET behavior is proven stable. This is the long-term feature posture. | New detection classes, keyword/forbidden-content results, redirect-policy findings, TLS/body-read evidence, false positives, and support/customer explanation quality. |

Process defaults come from DEFAULT_CHECK_METHOD and DEFAULT_DETECTION_PROFILE. Per-site policy lives in jetmon_site_check_config.request_method and jetmon_site_check_config.detection_profile, so operators can migrate cohorts gradually without changing the legacy jetpack_monitor_sites table.

What Changed

  • Added process defaults DEFAULT_CHECK_METHOD and DEFAULT_DETECTION_PROFILE.
  • Added per-site staged policy via jetmon_site_check_config.request_method and detection_profile.
  • Added low-cardinality StatsD counters for live rollout cohort visibility under scheduler.{page,round,streaming}.check.method.<method>.profile.<profile>.count, using the effective runtime method/profile.
  • Added sidecar-owned v2 config fields to jetmon_site_check_config: keyword rules, forbidden-content rules, maintenance windows, custom headers, timeout override, redirect policy, and alert cooldown.
  • Added jetmon_site_runtime for runtime freshness, due time, last alert time, and SSL expiry observation.
  • Updated monitor checks, Veriflier requests, API site create/update/list/get, trigger-now, bulk import, rollout activity checks, and rollout VM lab helpers to use sidecar tables.
  • Kept prior legacy-table migration IDs as no-op compatibility entries so migration ordering remains stable while fresh installs avoid hot ALTERs on jetpack_monitor_sites.
  • Updated rollout, API, data-model, operations, architecture, roadmap, support, and migration docs.

Validation

Local/repo verification completed:

  • go test ./internal/orchestrator
  • go test ./...
  • make all
  • make rollout-docs-verify

Test fleet deployment completed:

  • jetmon-service-host-2: jetmon2 8e9c2bf
  • jetmon-vm-host-2 / jetmon-v2-veriflier: veriflier2 8e9c2bf
  • Migrations 37 and 38 applied successfully.
  • The test DB now has a v1-shaped jetpack_monitor_sites table only: jetpack_monitor_site_id, blog_id, bucket_no, monitor_url, monitor_active, site_status, last_status_change, and check_interval.
  • Old v2-added jetpack_monitor_sites columns and scheduler indexes were dropped after backfilling existing test data into sidecars.
  • validate-config, API health, rollout dynamic-check, and jetmon2 migrate pass after the cleanup.

StatsD cohort counter smoke coverage:

  • Unit coverage verifies completed checks are aggregated into HEAD / legacy, GET / simple_http, and GET / full cohorts.
  • Unit coverage verifies metric names such as scheduler.streaming.check.method.get.profile.full.count and scheduler.streaming.check.method.head.profile.simple_http.count.
  • The counters are emitted as aggregated deltas per page, round, or streaming report interval rather than once per site check.

Uptime-bench focused validation completed in reports/20260513T100519Z-3h-jetmon-v2-legacy-sidecar:

  • Jetmon v2 operated against the stripped legacy table without Unknown column errors.
  • HEAD + legacy stayed up for HEAD-good/GET-bad targets.
  • GET + simple_http suppressed rich detections as expected.
  • GET + full detected keyword, forbidden content, redirect, timeout, and HTTP failure cases.
  • Veriflier confirmed expected internal failure cases.
  • Sidecar runtime/config writes were observed.

Uptime-bench 3-hour soak validation completed in reports/20260513T142404Z-3h-jetmon-v2-legacy-sidecar-soak:

  • The soak completed the full requested window from 2026-05-13T14:30:08Z to 2026-05-13T17:30:18Z.
  • Hourly rollout checks passed, including validate-config, rollout dynamic-check, and activity checks.
  • All 24 active benchmark sites were checked during the soak.
  • jetpack_monitor_sites remained v1-shaped and the removed v2 columns stayed absent (old_columns_present=0).
  • Default and explicit HEAD / legacy sites stayed up where HEAD succeeded and GET failed.
  • GET / simple_http produced expected basic HTTP failures while suppressing rich body/keyword detections.
  • GET / full produced expected keyword, forbidden-content, redirect, timeout, and HTTP failures.
  • Maintenance-window failure cases remained suppressed.
  • API cleanup soft-deleted the 24 benchmark sites, and benchmark-only cleanup purged the test sidecar/runtime rows.
  • Strict log review found no schema errors, SQL errors, runtime panics, DNS failures, or infrastructure timeout patterns.
  • Scheduler health remained stable: max lag was about 1.87s, with no backpressure waits, no stale results, and empty queues in steady state.

Non-Blocking Follow-ups

  • Uptime-bench cleanup wording still says API-created sites were deleted even though Jetmon's API first soft-deletes by setting monitor_active=0. That is a benchmark/reporting wording issue rather than a Jetmon runtime failure.
  • The broad uptime-bench schema-error scan is noisy because it matches normal metric names such as error_keyword and error_timeout. The stricter review did not find real schema or runtime errors.
  • Persistent down sites produce repeated Veriflier confirmation/log entries over a long soak. Event transitions are not duplicated, so correctness is intact, but a later efficiency/log-noise pass could reduce repeated confirmation traffic for already-confirmed incidents if we want that behavior.

Chris Jean added 2 commits May 12, 2026 22:55
Introduce a shared checkmode package, process defaults, and per-site policy overrides so v2 can first replace v1 with HEAD plus legacy semantics, then migrate cohorts through simple GET and full GET detections.

Store those overrides in a Jetmon-owned sidecar table instead of adding more rollout-policy columns to jetpack_monitor_sites. Wire the selected method and detection profile through the monitor, streaming scheduler, API, CLI, check history metadata, and Veriflier JSON transport.

Update rollout checks and migration docs so sysadmins can see when a pinned replacement host is still configured for the initial HEAD/legacy phase, and cover the behavior with checker, orchestrator, API, DB, and rollout tests.
Move the remaining v2-only site config and runtime freshness fields into Jetmon-owned side tables so rollout no longer requires adding columns or scheduler indexes to jetpack_monitor_sites. The legacy table now stays v1-shaped for production cutover: v2 reads identity, bucket, URL, active state, interval, and projection fields, while advanced check settings live in jetmon_site_check_config and runtime observations live in jetmon_site_runtime.

Update API reads and writes, scheduler queries, activity checks, SSL/freshness writes, rollout lab helpers, migrations, schema reference docs, and operator docs to use the sidecar tables consistently. Keep the old migration IDs as no-op compatibility entries so existing migration ordering remains stable while new databases avoid hot ALTERs on the legacy table.

Verification: go test ./...; make all; make rollout-docs-verify
@chrisbliss18 chrisbliss18 marked this pull request as ready for review May 13, 2026 18:06
@heydemoura (Contributor) left a comment


One non-blocking gap: there is no dedicated StatsD counter per (method, profile) cohort. Observability lives in jetmon_check_history.request_method (migration 32) instead. That's queryable in MySQL but doesn't give the rollout dashboard a real-time cohort signal. Worth a follow-up, not a blocker.

Emit low-cardinality StatsD counters for the effective check method and detection profile used by each completed check. The counters are aggregated per page, round, and streaming report interval so rollout dashboards can confirm HEAD/legacy, GET/simple_http, and GET/full cohort traffic without querying MySQL check history.

Also document the metric shape in the operations guide and add unit coverage for cohort aggregation and metric naming.
@chrisbliss18 (Contributor, Author)

@heydemoura Good feedback. I updated the branch to add the new counters:

  • New counters emit the effective runtime cohort:
    • scheduler.page.check.method.<method>.profile.<profile>.count
    • scheduler.round.check.method.<method>.profile.<profile>.count
    • scheduler.streaming.check.method.<method>.profile.<profile>.count
  • Example:
    • scheduler.streaming.check.method.head.profile.legacy.count
    • scheduler.streaming.check.method.get.profile.simple_http.count
    • scheduler.streaming.check.method.get.profile.full.count
  • Counters are aggregated per page/round/streaming report interval, not emitted per individual site check.

Does this fit what you're looking for?

@chrisbliss18 chrisbliss18 merged commit f316f57 into v2 May 13, 2026
2 checks passed
@chrisbliss18 chrisbliss18 deleted the feature/rollout-staged-check-modes branch May 13, 2026 18:55
chrisbliss18 pushed a commit that referenced this pull request May 13, 2026
Keep the effective HEAD/GET method and detection profile on the /v2/check request path after rebasing the Veriflier rebuild onto the staged rollout work from PR #109.

This lets v2 monitor cohorts use HEAD plus legacy, GET plus simple_http, and GET plus full checks through remote Verifliers without silently converting them to full-profile GET probes. It also updates the transport docs and fixes the ADR reference now that the streaming engine owns ADR-0009.