Add staged HEAD/GET rollout check modes#109
Merged
added 2 commits
May 12, 2026 22:55
Introduce a shared checkmode package, process defaults, and per-site policy overrides so v2 can first replace v1 with HEAD plus legacy semantics, then migrate cohorts through simple GET and full GET detections. Store those overrides in a Jetmon-owned sidecar table instead of adding more rollout-policy columns to jetpack_monitor_sites. Wire the selected method and detection profile through the monitor, streaming scheduler, API, CLI, check history metadata, and Veriflier JSON transport. Update rollout checks and migration docs so sysadmins can see when a pinned replacement host is still configured for the initial HEAD/legacy phase, and cover the behavior with checker, orchestrator, API, DB, and rollout tests.
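The override behaviour described in this commit message can be sketched as a small resolution step: start from the process defaults and let a per-site override replace whichever fields it sets. This is a minimal illustration only. The `Policy` type, the `Resolve` function, and the convention that an empty field means "inherit" are assumptions made for the example, not the checkmode package's actual API.

```go
package main

import "fmt"

// Policy pairs a request method with a detection profile.
// Hypothetical type: names are illustrative, not taken from the checkmode package.
type Policy struct {
	Method  string // "HEAD" or "GET"
	Profile string // "legacy", "simple_http", or "full"
}

// Resolve overlays a per-site override on the process defaults.
// Assumption: an empty field in the override means "inherit the default".
func Resolve(defaults, override Policy) Policy {
	out := defaults
	if override.Method != "" {
		out.Method = override.Method
	}
	if override.Profile != "" {
		out.Profile = override.Profile
	}
	return out
}

func main() {
	// Process defaults for the first rollout phase.
	defaults := Policy{Method: "HEAD", Profile: "legacy"}

	// A site with no override row keeps the defaults.
	fmt.Println(Resolve(defaults, Policy{}))

	// A site migrated to the second cohort overrides both fields.
	fmt.Println(Resolve(defaults, Policy{Method: "GET", Profile: "simple_http"}))
}
```

Keeping the override sparse means a change to the process defaults reaches every site that has not been explicitly pinned.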
Move the remaining v2-only site config and runtime freshness fields into Jetmon-owned side tables so rollout no longer requires adding columns or scheduler indexes to jetpack_monitor_sites. The legacy table now stays v1-shaped for production cutover: v2 reads identity, bucket, URL, active state, interval, and projection fields, while advanced check settings live in jetmon_site_check_config and runtime observations live in jetmon_site_runtime. Update API reads and writes, scheduler queries, activity checks, SSL/freshness writes, rollout lab helpers, migrations, schema reference docs, and operator docs to use the sidecar tables consistently. Keep the old migration IDs as no-op compatibility entries so existing migration ordering remains stable while new databases avoid hot ALTERs on the legacy table. Verification: go test ./...; make all; make rollout-docs-verify
heydemoura (Contributor) approved these changes on May 13, 2026 and left a comment:
One non-blocking gap. No dedicated StatsD counter per (method, profile) cohort. Observability lives in jetmon_check_history.request_method (migration 32) instead. That's queryable in MySQL but doesn't give the rollout dashboard a real-time cohort signal. Worth a follow-up, not a blocker.
Emit low-cardinality StatsD counters for the effective check method and detection profile used by each completed check. The counters are aggregated per page, round, and streaming report interval so rollout dashboards can confirm HEAD/legacy, GET/simple_http, and GET/full cohort traffic without querying MySQL check history. Also document the metric shape in the operations guide and add unit coverage for cohort aggregation and metric naming.
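A rough sketch of the aggregation described here: completed checks are tallied per (method, profile) cohort, and each report interval flushes one counter per cohort. The metric name format follows the `scheduler.{page,round,streaming}.check.method.<method>.profile.<profile>.count` shape given in the PR description. The `CohortCounts` type and its `Add` and `Flush` methods are hypothetical names for illustration and are not the PR's code.

```go
package main

import (
	"fmt"
	"strings"
)

// cohortKey identifies one (method, profile) cohort.
type cohortKey struct{ method, profile string }

// CohortCounts accumulates completed checks per cohort for one report interval.
// Hypothetical type, not the scheduler's real aggregator.
type CohortCounts map[cohortKey]int64

// Add records one completed check using its effective method and profile.
func (c CohortCounts) Add(method, profile string) {
	c[cohortKey{strings.ToLower(method), strings.ToLower(profile)}]++
}

// MetricName builds the counter name for a scope: "page", "round", or "streaming".
func MetricName(scope, method, profile string) string {
	return fmt.Sprintf("scheduler.%s.check.method.%s.profile.%s.count",
		scope, strings.ToLower(method), strings.ToLower(profile))
}

// Flush returns metric name -> count for the interval and resets the accumulator.
func (c CohortCounts) Flush(scope string) map[string]int64 {
	out := make(map[string]int64, len(c))
	for k, n := range c {
		out[MetricName(scope, k.method, k.profile)] = n
		delete(c, k) // deleting during range is safe in Go
	}
	return out
}

func main() {
	counts := CohortCounts{}
	counts.Add("GET", "full")
	counts.Add("GET", "full")
	counts.Add("HEAD", "legacy")

	// One counter per cohort is emitted for the interval; map order is not fixed.
	for name, n := range counts.Flush("streaming") {
		fmt.Println(name, n)
	}
}
```

Because there are only two methods and three profiles, the number of distinct metric names stays small regardless of how many sites are checked.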
Contributor
Author
@heydemoura Good feedback. I updated to add new counters:
Does this fit what you're looking for?
chrisbliss18
pushed a commit
that referenced
this pull request
May 13, 2026
Keep the effective HEAD/GET method and detection profile on the /v2/check request path after rebasing the Veriflier rebuild onto the staged rollout work from PR #109. This lets v2 monitor cohorts use HEAD plus legacy, GET plus simple_http, and GET plus full checks through remote Verifliers without silently converting them to full-profile GET probes. It also updates the transport docs and fixes the ADR reference now that the streaming engine owns ADR-0009.
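To illustrate what carrying the method and profile on the request path might look like, here is a sketch of a JSON payload and a decoder. The `CheckRequest` struct, its field names, and the choice to reject payloads that omit either value are all assumptions for the example. The real /v2/check schema and its handling of missing fields are not shown in this PR page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CheckRequest is a hypothetical /v2/check payload; the real field names may differ.
type CheckRequest struct {
	URL     string `json:"url"`
	Method  string `json:"method"`
	Profile string `json:"profile"`
}

// Decode parses a request body. As an assumed design choice, it rejects payloads
// that omit method or profile instead of falling back to a full-profile GET.
func Decode(data []byte) (CheckRequest, error) {
	var req CheckRequest
	if err := json.Unmarshal(data, &req); err != nil {
		return CheckRequest{}, err
	}
	if req.Method == "" || req.Profile == "" {
		return CheckRequest{}, fmt.Errorf("check request missing method or profile")
	}
	return req, nil
}

func main() {
	body, err := json.Marshal(CheckRequest{
		URL:     "https://example.com",
		Method:  "HEAD",
		Profile: "legacy",
	})
	if err != nil {
		panic(err)
	}
	req, err := Decode(body)
	fmt.Println(req, err)
}
```

Failing loudly on a missing field is one way to avoid the silent conversion the commit message warns about; an implementation could equally apply explicit defaults as long as they are not full-profile GET.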
chrisbliss18 pushed a commit that referenced this pull request on May 14, 2026, with the same commit message as above.
Summary
This PR adds the staged check-mode rollout path for Jetmon v2 so production can replace v1 without immediately changing probe semantics. The rollout can start with v1-compatible HEAD checks, then move controlled cohorts to GET checks with the same basic HTTP interpretation, and finally enable the richer v2 detection set after the GET rollout is stable.
The branch also keeps jetpack_monitor_sites v1-shaped by moving v2-only site configuration and runtime freshness state into Jetmon-owned sidecar tables. This lets the test DB act as a compatibility canary: the old v2-added columns and scheduler indexes have been removed from jetpack_monitor_sites, so accidental references fail quickly.
Check Methods
- HEAD
- GET
Detection Profiles
- legacy: HEAD for the first rollout phase.
- simple_http: GET, but suppresses body-based and richer v2 detections. This lets operators validate that switching from HEAD to GET is stable before adding more alert classes.
- full: HEAD automatically caps the effective profile to simple_http when full is requested because HEAD cannot support body-based checks. This keeps the configuration safe: a site cannot silently claim full body validation while still using HEAD.
Three-Stage Rollout Plan
1. DEFAULT_CHECK_METHOD=HEAD, DEFAULT_DETECTION_PROFILE=legacy
2. GET + simple_http
3. GET + full
Process defaults come from DEFAULT_CHECK_METHOD and DEFAULT_DETECTION_PROFILE. Per-site policy lives in jetmon_site_check_config.request_method and jetmon_site_check_config.detection_profile, so operators can migrate cohorts gradually without changing the legacy jetpack_monitor_sites table.
What Changed
- DEFAULT_CHECK_METHOD and DEFAULT_DETECTION_PROFILE.
- jetmon_site_check_config.request_method and detection_profile.
- scheduler.{page,round,streaming}.check.method.<method>.profile.<profile>.count, using the effective runtime method/profile.
- jetmon_site_check_config: keyword rules, forbidden-content rules, maintenance windows, custom headers, timeout override, redirect policy, and alert cooldown.
- jetmon_site_runtime for runtime freshness, due time, last alert time, and SSL expiry observation.
- jetpack_monitor_sites.
Validation
Local/repo verification completed:
- go test ./internal/orchestrator
- go test ./...
- make all
- make rollout-docs-verify
Test fleet deployment completed:
- jetmon-service-host-2: jetmon2 8e9c2bf
- jetmon-vm-host-2/jetmon-v2-veriflier: veriflier2 8e9c2bf
- jetpack_monitor_sites table only: jetpack_monitor_site_id, blog_id, bucket_no, monitor_url, monitor_active, site_status, last_status_change, and check_interval.
- jetpack_monitor_sites columns and scheduler indexes were dropped after backfilling existing test data into sidecars.
- validate-config, API health, rollout dynamic-check, and jetmon2 migrate pass after the cleanup.
StatsD cohort counter smoke coverage:
- HEAD/legacy, GET/simple_http, and GET/full cohorts.
- scheduler.streaming.check.method.get.profile.full.count and scheduler.streaming.check.method.head.profile.simple_http.count.
Uptime-bench focused validation completed in reports/20260513T100519Z-3h-jetmon-v2-legacy-sidecar:
- Unknown column errors.
- HEAD + legacy stayed up for HEAD-good/GET-bad targets.
- GET + simple_http suppressed rich detections as expected.
- GET + full detected keyword, forbidden content, redirect, timeout, and HTTP failure cases.
Uptime-bench 3-hour soak validation completed in reports/20260513T142404Z-3h-jetmon-v2-legacy-sidecar-soak:
- 2026-05-13T14:30:08Z to 2026-05-13T17:30:18Z.
- validate-config, rollout dynamic-check, and activity checks.
- jetpack_monitor_sites remained v1-shaped and the removed v2 columns stayed absent (old_columns_present=0).
- HEAD/legacy sites stayed up where HEAD succeeded and GET failed.
- GET/simple_http produced expected basic HTTP failures while suppressing rich body/keyword detections.
- GET/full produced expected keyword, forbidden-content, redirect, timeout, and HTTP failures.
- 1.87s, with no backpressure waits, no stale results, and empty queues in steady state.
Non-Blocking Follow-ups
- monitor_active=0. That is a benchmark/reporting wording issue rather than a Jetmon runtime failure.
- error_keyword and error_timeout. The stricter review did not find real schema or runtime errors.
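The capping rule stated under Detection Profiles (a HEAD check that requests full runs as simple_http) can be expressed as a single function. This is a minimal sketch of that rule only. The function name `EffectiveProfile` is hypothetical, and the real code may validate inputs or handle other combinations differently.

```go
package main

import "fmt"

// EffectiveProfile returns the detection profile that can actually run for a method.
// Hypothetical helper: HEAD responses carry no body, so a requested "full" profile
// is capped to "simple_http"; every other combination is returned unchanged.
func EffectiveProfile(method, requested string) string {
	if method == "HEAD" && requested == "full" {
		return "simple_http"
	}
	return requested
}

func main() {
	fmt.Println(EffectiveProfile("HEAD", "legacy")) // legacy
	fmt.Println(EffectiveProfile("HEAD", "full"))   // simple_http
	fmt.Println(EffectiveProfile("GET", "full"))    // full
}
```

This is consistent with the smoke coverage above, where a HEAD cohort is reported under scheduler.streaming.check.method.head.profile.simple_http.count.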