
Rebuild Veriflier v2 contract and discovery #105

Draft
chrisbliss18 wants to merge 8 commits into v2 from veriflier-go-rebuild

chrisbliss18 commented May 11, 2026

Summary

This draft PR rebuilds the Jetmon v2 Veriflier path around a versioned Go JSON-over-HTTP contract while preserving transition compatibility with the existing legacy Veriflier endpoints.

Included:

  • Adds the v2 Veriflier contract: /v2/check and /v2/status with request IDs, typed outcomes, timing metadata, vantage identity, agent identity, deadline propagation, and capacity reporting.
  • Carries the staged rollout check policy from PR #109 (Add staged HEAD/GET rollout check modes) through the Veriflier v2 transport, so remote checks preserve the HEAD + legacy, GET + simple_http, and GET + full cohorts instead of silently converting everything to full-profile GET probes.
  • Adds a bounded Veriflier executor with auto-sized concurrency, queue capacity, overload handling, and explicit agent_overloaded / HTTP 503 behavior when saturated.
  • Updates the Monitor client to prefer v2 and fall back to legacy JSON-over-HTTP only for transition-safe unsupported endpoint responses.
  • Counts downtime quorum by unique Veriflier vantage IDs, preserves duplicate replies in audit/metadata, and adds the multi-Veriflier quorum floor.
  • Adds trusted Veriflier auto-discovery: jetmon_veriflier_vantages, monitor-collected jetmon_veriflier_agents, VERIFLIER_DISCOVERY_MODE=static|shadow|active, and static fallback in active mode.
  • Adds jetmon2 verifliers discovery-report as a read-only shadow-mode gate.
  • Extends validate-config, telemetry report, host health, and the fleet dashboard with Veriflier v2/discovery evidence.
  • Updates migrations, config samples, Docker local Veriflier config, proto schema reference, operations docs, rollout docs, roadmap, and ADR-0010.
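For reviewers, the unique-vantage quorum rule can be sketched roughly as follows. This is a minimal illustration, not the code in this branch: the `Reply` and `QuorumResult` types, the `CountDownQuorum` function, and the `trusted` map are hypothetical names, and the sketch assumes that only operator-approved vantages count and that every reply is kept for audit.

```go
package main

import "fmt"

// Reply is a hypothetical, simplified Veriflier vote used only for this sketch.
type Reply struct {
	VantageID string // quorum-counted identity
	AgentID   string // diagnostic identity only, never counted
	Down      bool
}

// QuorumResult keeps every reply for audit while counting each vantage once.
type QuorumResult struct {
	DownVantages int
	Confirmed    bool
	Audit        []Reply
}

// CountDownQuorum counts distinct trusted vantage IDs that reported down.
// Duplicate replies from one vantage stay in Audit but add no extra vote.
func CountDownQuorum(replies []Reply, trusted map[string]bool, floor int) QuorumResult {
	seen := make(map[string]bool)
	res := QuorumResult{Audit: append([]Reply(nil), replies...)}
	for _, r := range replies {
		if !r.Down || !trusted[r.VantageID] || seen[r.VantageID] {
			continue
		}
		seen[r.VantageID] = true
		res.DownVantages++
	}
	res.Confirmed = res.DownVantages >= floor
	return res
}

func main() {
	replies := []Reply{
		{VantageID: "us-east", AgentID: "agent-1", Down: true},
		{VantageID: "us-east", AgentID: "agent-2", Down: true}, // duplicate vantage
		{VantageID: "eu-west", AgentID: "agent-3", Down: true},
	}
	trusted := map[string]bool{"us-east": true, "eu-west": true}
	res := CountDownQuorum(replies, trusted, 2)
	fmt.Printf("down vantages=%d confirmed=%t audited=%d\n",
		res.DownVantages, res.Confirmed, len(res.Audit))
	// prints: down vantages=2 confirmed=true audited=3
}
```

The two `us-east` replies count as one vote, so a misconfigured duplicate vantage cannot reach the floor by itself.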

Rebase Notes

Rebased on the latest v2 after PR #110. The branch now includes the staged-check sidecar work from #109, the repeated-confirmation guard from #111, and the WPCOM/streaming hardening from #110. The Veriflier discovery migrations remain after the staged-check sidecar migrations as 39 and 40, and fresh schema creation includes both sidecar tables and Veriflier discovery tables.

Validation

Latest local checks after the #110 rebase:

  • go test ./...
  • make test-veriflier-soak
  • git diff --check

Requested Uptime-Bench Coverage

The next runtime test should validate both transition compatibility and the new v2 contract:

  • Direct v1 Veriflier versus veriflier2 legacy-compatible endpoint comparison using the same request corpus, expected to produce equivalent up/down/HTTP/error outcomes for legacy-supported detections.
  • Jetmon v2 Monitor using new veriflier2 /v2/check and /v2/status endpoints, including staged HEAD/GET check policies.
  • Fallback behavior when a configured endpoint is legacy-only or returns unsupported for /v2/status or /v2/check.
  • Duplicate-vantage and mixed-vantage quorum behavior.
  • Executor overload behavior and recovery.
  • Long outage promotion and recovery with v2 vote evidence.

Remaining Rollout Gates

This is intentionally a draft PR. Do not merge until the remaining external/runtime gates are handled:

  • Production-like Veriflier soak for deployed-like network behavior, duplicate-vantage misconfiguration, mixed-vantage responses, overload, and longer outage promotion/recovery.
  • Veriflier auto-discovery shadow-mode rehearsal against production-like data before enabling active mode.
  • Uptime-bench direct v1-vs-v2 Veriflier compatibility comparison.
  • Uptime-bench Jetmon v2 end-to-end Veriflier v2 contract run.
  • WPCOM/Product/Support/Systems approval items tracked in the readiness docs.

Safety Notes

  • Veriflier hosts still do not need DB credentials.
  • Agent telemetry is not trust; only operator-approved enabled vantages count for quorum.
  • Active discovery falls back to static VERIFIERS if discovery is unavailable or empty during rollout.
  • Auth token values are not printed by the dashboard or discovery report.
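The active-mode fallback described above could be expressed along these lines. This is a sketch under stated assumptions: `Mode`, `EffectiveVerifliers`, and the `source` labels are hypothetical, and it assumes shadow mode keeps routing live checks through the static list while discovery is only observed and reported.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode mirrors the VERIFLIER_DISCOVERY_MODE values named in the PR.
type Mode string

const (
	ModeStatic Mode = "static"
	ModeShadow Mode = "shadow"
	ModeActive Mode = "active"
)

// errDiscoveryUnavailable is a placeholder error for the demo below.
var errDiscoveryUnavailable = errors.New("discovery unavailable")

// EffectiveVerifliers picks the endpoint list used for live checks and
// reports which source it came from.
func EffectiveVerifliers(mode Mode, static, discovered []string, discoveryErr error) ([]string, string) {
	if mode == ModeActive {
		if discoveryErr == nil && len(discovered) > 0 {
			return discovered, "discovery"
		}
		// Discovery failed or returned nothing: fall back to static config.
		return static, "static-fallback"
	}
	// Assumption: static and shadow both route checks via static config;
	// shadow mode only compares and reports discovery results.
	return static, "static"
}

func main() {
	static := []string{"https://veriflier-a.example.net"}
	found := []string{"https://veriflier-b.example.net", "https://veriflier-c.example.net"}

	_, src := EffectiveVerifliers(ModeActive, static, found, nil)
	fmt.Println(src) // discovery
	_, src = EffectiveVerifliers(ModeActive, static, nil, errDiscoveryUnavailable)
	fmt.Println(src) // static-fallback
	_, src = EffectiveVerifliers(ModeShadow, static, found, nil)
	fmt.Println(src) // static
}
```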

chrisbliss18 force-pushed the veriflier-go-rebuild branch from 717b746 to 7e24bad, May 13, 2026 19:17
Chris Jean added 4 commits May 13, 2026 14:50
Keep the effective HEAD/GET method and detection profile on the /v2/check request path after rebasing the Veriflier rebuild onto the staged rollout work from PR #109.

This lets v2 monitor cohorts use HEAD plus legacy, GET plus simple_http, and GET plus full checks through remote Verifliers without silently converting them to full-profile GET probes. It also updates the transport docs and fixes the ADR reference now that the streaming engine owns ADR-0009.
chrisbliss18 force-pushed the veriflier-go-rebuild branch from 7e24bad to fceba31, May 13, 2026 19:52
Chris Jean added 4 commits May 13, 2026 15:07
The Veriflier discovery rebuild added monitor-collected agent telemetry, but the streaming scheduler returned from the classic round loop before calling the existing telemetry sync path. That left the discovery report amber on streaming deployments even when the configured v2 Veriflier was healthy.

Start an asynchronous telemetry sync when the streaming engine boots and repeat it during the normal heartbeat cadence. The sync is guarded so slow or unreachable Verifliers cannot stack overlapping probes or block the scheduler hot path.
The v2 Veriflier contract now carries both quorum-counted vantage identity and diagnostic agent identity. Legacy Veriflier replies still have a vote identity, but treating that fallback as a vantage_id makes reports look like the legacy endpoint supplied trusted v2 metadata.

Keep vote_id and host for legacy evidence, but only write the vantage_id field when the Veriflier response explicitly supplied one. Update the summary test so legacy replies do not grow synthetic v2 fields.
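A minimal way to get this behavior in Go is an `omitempty` field that is populated only from the response, never from a fallback. The `VoteEvidence` struct, `EvidenceFor`, and `MarshalEvidence` below are hypothetical illustrations of that idea, not the types in this branch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VoteEvidence is a hypothetical audit record for one Veriflier reply.
type VoteEvidence struct {
	VoteID    string `json:"vote_id"`
	Host      string `json:"host"`
	VantageID string `json:"vantage_id,omitempty"` // omitted unless the response supplied it
}

// EvidenceFor copies the vantage ID only when the response carried one.
// It deliberately does not fall back to voteID for legacy replies.
func EvidenceFor(voteID, host, suppliedVantageID string) VoteEvidence {
	return VoteEvidence{VoteID: voteID, Host: host, VantageID: suppliedVantageID}
}

// MarshalEvidence returns the JSON form written to reports.
func MarshalEvidence(e VoteEvidence) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	legacy, _ := MarshalEvidence(EvidenceFor("veriflier-1", "v1.example.net", ""))
	v2, _ := MarshalEvidence(EvidenceFor("veriflier-2", "v2.example.net", "us-east"))
	fmt.Println(legacy) // {"vote_id":"veriflier-1","host":"v1.example.net"}
	fmt.Println(v2)     // {"vote_id":"veriflier-2","host":"v2.example.net","vantage_id":"us-east"}
}
```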