diff --git a/.agents/skills/handoff/SKILL.md b/.agents/skills/handoff/SKILL.md new file mode 100644 index 00000000..0100ece7 --- /dev/null +++ b/.agents/skills/handoff/SKILL.md @@ -0,0 +1,29 @@ +--- +name: handoff +description: Create a self-contained Jetmon handoff for another agent. +--- + +# Jetmon Handoff + +Use this when Chris asks for a handoff doc or wants another agent to continue a +Jetmon thread. + +## Include + +- Repo path, branch, and relevant commit IDs. +- Whether the work affects Jetmon v1, Jetmon v2, Veriflier, bridge, support + services, or uptime-bench. +- Active test locks and what must not be changed. +- Problem statement, evidence, and current hypothesis. +- Relevant logs, reports, metrics, PRs, and file paths. +- Commands already run and their outcome. +- Next recommended actions and approvals needed. + +## Placement + +During active tests, prefer `.agents` or global memory for agent-only handoffs. +Ask before editing non-agent project docs. + +## Secrets + +Do not include tokens, passwords, private keys, or unredacted service configs. diff --git a/.agents/skills/jetmon-test-fleet/SKILL.md b/.agents/skills/jetmon-test-fleet/SKILL.md new file mode 100644 index 00000000..d77e652e --- /dev/null +++ b/.agents/skills/jetmon-test-fleet/SKILL.md @@ -0,0 +1,44 @@ +--- +name: jetmon-test-fleet +description: Work safely with Jetmon services used by uptime-bench capacity tests. +--- + +# Jetmon Test Fleet + +Use this when Chris asks about Jetmon v1/v2 test services, Verifliers, support +services, Prometheus capacity data, or whether a Jetmon branch is ready for +uptime-bench tests. + +## Safety First + +- If tests are running, do not restart services, change config, move support + services, deploy binaries, mutate databases, or alter target/provider state + without explicit permission. +- Prefer read-only inspection and report analysis during active tests. +- State which repo is being acted on before making changes. 
+ +## Common Context + +- Uptime-bench canonical repo: + `/home/gaarai/code/uptime-bench`. +- Current Prometheus for Jetmon capacity work: + `http://10.0.0.67:9091`. +- Service hosts: + `jetmon-service-host-1`/`jetmon-v1`, + `jetmon-service-host-2`/`jetmon-v2`, + `jetmon-service-host-3`, + `jetmon-service-host-4`. +- Support/monitoring hosts: + `jetmon-vm-host-1`, + `jetmon-vm-host-2`, + `jetmon-vm-host-3`. + +## Output Expectations + +When answering readiness or risk questions, include: + +- Branch and commit under discussion. +- What is deployed versus only local. +- Which checks were read-only. +- Whether changes are safe during an active uptime-bench run. +- Recommended next action and any approval needed. diff --git a/.agents/skills/safe-background-work/SKILL.md b/.agents/skills/safe-background-work/SKILL.md new file mode 100644 index 00000000..087c70a7 --- /dev/null +++ b/.agents/skills/safe-background-work/SKILL.md @@ -0,0 +1,31 @@ +--- +name: safe-background-work +description: Pick useful Jetmon work that cannot affect active uptime-bench or Jetmon tests. +--- + +# Safe Background Work + +Use this when tests are running and Chris asks what can be done without +interrupting them. + +## Allowed By Default + +- Local code review and static analysis. +- Agent-specific files. +- Branch inspection and commit comparison. +- Handoff writing. +- Local-only planning for changes that will not be deployed. + +## Ask First + +- Deploying binaries or configs. +- Restarting `jetmon2`, Jetmon v1, bridge, Veriflier, database, StatsD, or + monitoring services. +- Moving support services between hosts. +- Changing bucket ownership, pinned bucket ranges, or test fleet data. +- Running smoke tests that create, delete, or modify sites/providers. + +## Blocker Policy + +If a safe task becomes blocked on approval, record the blocker and move to the +next safe task. 
diff --git a/.claude/commands/debug-memory.md b/.claude/commands/debug-memory.md index ddb502ad..0d6966cc 100644 --- a/.claude/commands/debug-memory.md +++ b/.claude/commands/debug-memory.md @@ -1,97 +1,66 @@ # Debug Memory Issues -Debug memory issues in Jetmon workers and identify leaks. +Debug memory growth and goroutine leaks in the Jetmon 2 Go binary. ## Instructions -Help the user diagnose memory problems in Jetmon workers. Memory leaks are a known pitfall because workers are long-running processes. +Help the user diagnose memory problems in Jetmon 2. Unlike the old Node.js/worker architecture, +Jetmon 2 is a single Go binary. Memory pressure does not cause worker crashes — instead the +orchestrator drains the goroutine pool when RSS exceeds `WORKER_MAX_MEM_MB`. ### 1. Check Current Memory Status -First, see current memory usage of all Jetmon processes: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon ps aux --sort=-%mem | grep -E '(node|PID)' | head -20 +cd docker && docker compose exec jetmon ps aux ``` -Check worker memory limits in config: +Check memory config: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon cat config/config.json | grep -E '(WORKER_MAX_MEM|WORKER_MAX_CHECK)' +docker compose exec jetmon cat config/config.json | grep -E '(WORKER_MAX_MEM|NUM_WORKERS)' ``` -### 2. Monitor Memory Over Time +### 2. Use pprof for Deep Analysis -Watch memory growth in real-time: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon bash -c 'while true; do echo "=== $(date) ==="; ps aux --sort=-%mem | grep node | head -10; sleep 10; done' -``` - -Let this run for a few minutes to observe trends. Look for: -- Workers steadily increasing memory without recycling -- Workers approaching or exceeding `WORKER_MAX_MEM_MB` (default 53MB) -- Memory not dropping after worker recycle - -### 3. 
Check Worker Recycling +The operator dashboard exposes pprof endpoints at http://localhost:8080/debug/pprof/ -Verify workers are being recycled when hitting limits: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose logs jetmon 2>&1 | grep -E '(memory|recycle|spawn|die|limit)' | tail -30 -``` - -### 4. Force Aggressive Recycling (Testing) +# Count goroutines +curl http://localhost:8080/debug/pprof/goroutine?debug=1 | grep -c "^goroutine" -To test worker recycling behavior, temporarily set low limits: - -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon bash -c 'cat > /tmp/test-config.json << EOF -{ - "WORKER_MAX_CHECKS": 50, - "WORKER_MAX_MEM_MB": 20 -} -EOF -cat /tmp/test-config.json' +# Heap profile +curl http://localhost:8080/debug/pprof/heap > heap.prof +go tool pprof heap.prof ``` -Tell the user to manually update `config/config.json` with these values, then reload: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon sh -c 'kill -HUP $(pgrep -f "node lib/jetmon.js" | head -1)' -``` +### 3. Monitor Memory Over Time -Watch for recycling: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose logs -f jetmon 2>&1 | grep -E '(spawn|die|recycle|memory|limit)' +docker compose exec jetmon bash -c 'while true; do ps -o pid,rss,vsz,comm -p $(pgrep jetmon2); sleep 10; done' ``` -### 5. Check for Known Memory Issues +Enable detailed StatsD metrics by setting `STATSD_SEND_MEM_USAGE: true` in `config/config.json`, +then reload config: `docker compose exec jetmon ./jetmon2 reload` -**Retry queue growth:** If retry queues aren't being processed, they can grow unbounded: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose logs jetmon 2>&1 | grep -i retry | tail -20 -``` +### 4. Check Retry Queue Size + +Large retry queues indicate many sites are down and being tracked. This is expected behaviour. 
-**StatsD buffer:** Check if metrics buffer is growing: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon bash -c 'cat stats/* 2>/dev/null' +curl http://localhost:8080/api/state | python3 -m json.tool ``` -### 6. Analyze with Node.js Tools (Advanced) - -If deeper analysis is needed, suggest: - -1. **Heap snapshots:** Would require code changes to expose `v8.writeHeapSnapshot()` -2. **--inspect flag:** Could attach Chrome DevTools, but requires exposing debug port -3. **Process stats:** Check `/proc//status` for detailed memory breakdown +Look at `RetryQueueSize`. -### 7. Common Memory Issues in Jetmon +### 5. Common Issues | Symptom | Likely Cause | Fix | |---------|--------------|-----| -| Workers never recycle | `WORKER_MAX_MEM_MB` set to 0 or very high | Set reasonable limit (53MB default) | -| Memory spikes during rounds | Too many concurrent checks | Reduce `NUM_TO_PROCESS` | -| Gradual leak over hours | Retry queue not draining | Check Veriflier connectivity | -| Sudden OOM | Node.js version regression | Test with previous Node version | +| Goroutine count grows | Context not cancelled on shutdown | Verify `orch.Stop()` called | +| Memory never drops | Pool drain not triggered | Check `WORKER_MAX_MEM_MB` value | +| Retry queue unbounded | Veriflier unreachable | Check veriflier connectivity | +| High allocations | Keyword-check body reads | Reduce `NUM_WORKERS` | -### 8. Restore Normal Settings +### 6. 
Restore Normal Settings -Remind user to restore normal config values after testing: -- `WORKER_MAX_MEM_MB`: 53 -- `WORKER_MAX_CHECKS`: 10000 +After testing, remind user to restore: +- `STATSD_SEND_MEM_USAGE`: false (avoid extra StatsD traffic in production) diff --git a/.claude/commands/docker-test.md b/.claude/commands/docker-test.md index 4c6b1d06..9a942f18 100644 --- a/.claude/commands/docker-test.md +++ b/.claude/commands/docker-test.md @@ -1,68 +1,80 @@ # Docker Test Environment -Run, debug, and test Jetmon using the Docker development environment. +Run, debug, and test Jetmon 2 using the Docker development environment. ## Instructions -Help the user test Jetmon in the Docker environment. Follow these steps: +Help the user test Jetmon 2 in the Docker environment. Follow these steps: ### 1. Check Docker Status First, check if the Docker environment is already running: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose ps +cd docker && docker compose ps ``` ### 2. Start Services (if needed) If services aren't running, start them: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose up -d +cd docker && docker compose up -d ``` Wait a few seconds for services to initialize, then verify: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose ps +docker compose ps ``` ### 3. 
Ask User What They Want to Test Present these options: - **View logs** - Watch Jetmon or Veriflier logs in real-time -- **Check worker status** - See worker activity and stats +- **Operator dashboard** - Open http://localhost:8080 in a browser - **Test with sample sites** - Insert test URLs into database -- **Test configuration reload** - Send SIGHUP to master process -- **Test graceful shutdown** - Verify shutdown behavior +- **Test configuration reload** - Send SIGHUP to reload config +- **Test graceful drain** - Verify drain/shutdown behaviour - **Test Veriflier connectivity** - Check Veriflier is responding +- **View audit log** - Query the audit log for a specific blog - **View metrics** - Check StatsD/Graphite dashboard ### 4. Execute Based on Selection **View logs:** ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose logs -f jetmon +docker compose logs -f jetmon ``` -**Check worker status:** +**Check process and stats:** ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon ps auxf -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon cat stats/sitespersec +docker compose exec jetmon ps aux +docker compose exec jetmon cat stats/sitespersec +docker compose exec jetmon cat stats/sitesqueue ``` **Test with sample sites:** First check if table exists and has data: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec mysqldb mysql -u root -p123456 jetmon_db -e "SELECT COUNT(*) as count FROM jetpack_monitor_sites;" 2>/dev/null +docker compose exec mysqldb mysql -u root -p123456 jetmon_db -e "SELECT COUNT(*) as count FROM jetpack_monitor_sites;" 2>/dev/null ``` If empty or table doesn't exist, offer to create test data per `running-tests.md`. 
**Test configuration reload:** ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon sh -c 'kill -HUP $(pgrep -f "node lib/jetmon.js" | head -1)' +docker compose exec jetmon ./jetmon2 reload +``` + +**Test drain/graceful shutdown:** +```bash +docker compose exec jetmon ./jetmon2 drain ``` **Test Veriflier connectivity:** ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon curl -k https://veriflier:7801/get/status +docker compose exec jetmon curl http://veriflier:7803/status +``` + +**View audit log:** +```bash +docker compose exec jetmon ./jetmon2 audit --blog-id 1 --since 1h ``` **View metrics:** @@ -70,10 +82,10 @@ Tell user to open http://localhost:8088 and navigate to `Metrics > stats > com > ### 5. Cleanup (if requested) ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose down +docker compose down ``` Or to fully reset with fresh database: ```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose down -v +docker compose down -v ``` diff --git a/.claude/commands/rebuild-addon.md b/.claude/commands/rebuild-addon.md deleted file mode 100644 index 4110d733..00000000 --- a/.claude/commands/rebuild-addon.md +++ /dev/null @@ -1,92 +0,0 @@ -# Rebuild Native Addon - -Rebuild the C++ native addon after making changes to `src/http_checker.cpp` or related C++ files. - -## Instructions - -When the user has modified C++ code and needs to rebuild the native addon, follow these steps: - -### 1. Check What Changed -First, identify what C++ files were modified: -```bash -git -C /Users/rdcoll/Code/a8c/jetmon status --porcelain | grep -E '\.(cpp|h|gyp)$' -``` - -### 2. Determine Build Environment - -Ask the user: **Are you running in Docker or locally?** - -### 3a. 
Docker Build (Recommended) - -Check if Docker is running: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose ps -``` - -If not running, start it: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose up -d -``` - -Rebuild and restart inside container: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon npm run rebuild-run -``` - -Or if you want to rebuild without auto-running: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon sh -c 'node-gyp rebuild && cp build/Release/jetmon.node lib/' -``` - -Then restart Jetmon: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose restart jetmon -``` - -### 3b. Local Build - -Run the npm script: -```bash -cd /Users/rdcoll/Code/a8c/jetmon && npm run rebuild-run -``` - -Or manually: -```bash -cd /Users/rdcoll/Code/a8c/jetmon && node-gyp rebuild && cp build/Release/jetmon.node lib/ -``` - -### 4. Verify Build Success - -Check that the new `.node` file was created: -```bash -ls -la /Users/rdcoll/Code/a8c/jetmon/lib/jetmon.node -``` - -### 5. Test the Addon - -Create a quick test to verify the addon loads correctly: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon node -e "const c = require('./lib/jetmon.node'); console.log('Addon loaded successfully');" -``` - -Or run a simple HTTP check: -```bash -cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon node -e " -const checker = require('./lib/jetmon.node'); -checker.http_check('https://wordpress.com', 443, 0, function(idx, rtt, http, err) { - console.log('RTT:', rtt, 'HTTP:', http, 'Error:', err); - process.exit(0); -}); -" -``` - -### 6. 
Watch for Issues - -If the build fails, common issues include: -- Missing build tools: `node-gyp` requires Python and a C++ compiler -- Node.js version mismatch: Addon must be built for the running Node.js version -- OpenSSL issues: Check that OpenSSL dev headers are available - -If Jetmon crashes after rebuild: -- Check logs: `docker compose logs jetmon` -- Verify the addon API hasn't changed incompatibly diff --git a/.claude/rules/coding-standards.md b/.claude/rules/coding-standards.md index 22f306b0..6a63f7b3 100644 --- a/.claude/rules/coding-standards.md +++ b/.claude/rules/coding-standards.md @@ -5,8 +5,52 @@ Follow coding standards in this order: 1. Existing patterns in the codebase 2. Conventions documented in this file -3. Node.js best practices (for JavaScript) -4. Google C++ Style Guide (for C++, with local modifications) +3. Effective Go (https://go.dev/doc/effective_go) for Go code + +--- + +## Go + +### Formatting +- Run `gofmt` / `goimports` — all Go code must be formatted +- Tabs for indentation (enforced by gofmt) +- Line length: no hard limit; prefer readability over brevity + +### Naming Conventions +- **Packages**: lowercase, single word (e.g., `checker`, `wpcom`, `audit`) +- **Exported identifiers**: `PascalCase` +- **Unexported identifiers**: `camelCase` +- **Acronyms**: all-caps when exported (`HTTPCode`, `RTTMs`, `URL`), lowercase otherwise (`httpCode`) +- **Error variables**: `ErrFoo` for sentinel errors +- **Interfaces**: noun or `-er` suffix (`Checker`, `Client`) +- **Constants**: `PascalCase` for Go constants; config key strings use `SCREAMING_SNAKE_CASE` to match existing JSON keys + +### Error Handling +- Return errors; do not panic in library code +- Wrap with context: `fmt.Errorf("connect: %w", err)` +- Log and continue for non-fatal errors; `log.Fatalf` only at startup + +### Concurrency +- Pass `context.Context` as the first argument to any function that blocks or does I/O +- Use `sync/atomic` for hot-path counters; `sync.Mutex` for 
struct guards +- Never share mutable state across goroutines without synchronisation +- Prefer buffered channels sized to the expected burst; document the rationale + +### Imports +- Standard library first, then external, then internal — separated by blank lines +- Alias internal `grpc` package as `vgrpc` to avoid collision with `google.golang.org/grpc` +- Alias `"context"` as `stdctx` only when the local scope shadows the package name + +### Comments +- Package comment on every package (`// Package foo ...`) +- Exported symbol comments are required (`// Foo does ...`) +- Inline comments explain *why*, not *what* + +--- + +## Legacy (JavaScript / C++) + +The codebase was previously Node.js + C++. Those conventions are no longer relevant — the Go section above takes full precedence. The sections below are retained only as historical reference and should not be followed for new work. --- @@ -476,4 +520,4 @@ kill -HUP - Documentation standards: `.claude/rules/documentation.md` - Configuration options: `config/config.readme` - Docker setup: `docker/` directory -- Veriflier build: `veriflier/README.md` +- Veriflier binary: `veriflier2/cmd/main.go` diff --git a/.claude/rules/documentation.md b/.claude/rules/documentation.md index a51998d4..b73a869e 100644 --- a/.claude/rules/documentation.md +++ b/.claude/rules/documentation.md @@ -97,7 +97,7 @@ For main README, use this structure with `====` underlines (not `#` headers): 6. Running 7. 
Database (schema if applicable) -For component READMEs (e.g., `veriflier/README.md`), use minimal format: +For component READMEs (e.g., a future `veriflier2/README.md`), use minimal format: ```markdown component name ============== diff --git a/.claude/rules/general-guidelines.md b/.claude/rules/general-guidelines.md index 161d564f..7cf134f2 100644 --- a/.claude/rules/general-guidelines.md +++ b/.claude/rules/general-guidelines.md @@ -1,6 +1,6 @@ # General Guidelines for Jetmon Development -You are an expert in Node.js, C++, and high-performance systems programming. You have deep expertise in building scalable monitoring services, native Node.js addons, network programming, and multi-process architectures. You prioritize reliability and performance while delivering maintainable solutions for production infrastructure. +You are an expert in Go and high-performance systems programming. You have deep expertise in building scalable monitoring services, concurrent network programming, and production infrastructure. You prioritize reliability and performance while delivering maintainable solutions. ## Short Codes @@ -10,105 +10,115 @@ Check the start of any user message for the following short codes and act approp ## Key Principles -- Write concise, technical code with accurate JavaScript and C++ examples. +- Write idiomatic Go — prefer stdlib, use goroutines and channels correctly. - Follow the established code style conventions (see `coding-standards.md`). -- Use callback-based asynchronous patterns (not Promises/async-await) in JavaScript. +- No Promises/async patterns — Go uses goroutines, channels, and `context.Context` for concurrency. - Prefer modularization over duplication. 
-- Use descriptive function, variable, and file names following existing conventions: - - JavaScript: `camelCase` for functions, `SCREAMING_SNAKE_CASE` for constants - - C++: `snake_case` for methods, `m_` prefix for member variables +- Use descriptive names following existing conventions: + - Go packages: `lowercase`, single-word when possible + - Exported identifiers: `PascalCase` + - Unexported identifiers: `camelCase` + - Constants: `PascalCase` (Go-idiomatic) or `SCREAMING_SNAKE_CASE` for config keys - Use lowercase with hyphens for new directories. -- Favor IPC messaging for process communication over shared state. +- Pass `context.Context` as the first argument to functions that do I/O or may block. ## Analysis Process Before responding to any request, follow these steps: 1. **Request Analysis** - - Determine if task involves master process, worker process, native addon, or veriflier - Identify which component(s) need modification: - - `lib/jetmon.js` - Master process orchestration - - `lib/httpcheck.js` - Worker process logic - - `src/http_checker.cpp` - Native addon HTTP checking - - `veriflier/` - Geographic verification service + - `cmd/jetmon2/` - Main binary entry point (CLI subcommands, signal handling) + - `internal/orchestrator/` - Round loop, bucket coordination, retry queue + - `internal/checker/` - HTTP check logic (httptrace, SSL, keyword, redirect) + - `internal/checker/pool.go` - Auto-scaling goroutine pool + - `internal/db/` - MySQL queries and migrations + - `internal/config/` - Config loading, validation, hot reload + - `internal/veriflier/` - Veriflier client/server (JSON-over-HTTP; swap for true gRPC after `make generate`) + - `internal/wpcom/` - WPCOM notification client with circuit breaker + - `internal/audit/` - Audit log read/write + - `internal/metrics/` - StatsD UDP client, stats file writer + - `internal/dashboard/` - SSE operator dashboard + - `veriflier2/cmd/` - Standalone veriflier binary - Note compatibility requirements: - - 
Node.js version (currently v24) - - C++ compiler requirements for native addon - - Qt5 for veriflier builds + - Go 1.22 (uses range-over-integer, builtin `min`/`max`) + - MySQL 8.0 (Docker) / MySQL 5.7+ (production) - Define core functionality and reliability goals - - Consider memory usage implications (worker recycling thresholds) - - Consider observability requirements (StatsD metrics) + - Consider goroutine pool scaling implications + - Consider observability requirements (StatsD metrics, audit log) 2. **Solution Planning** - - Break into process-compatible components - - Identify required IPC message types - - Plan for configuration via `config.json` + - Break into package-compatible components + - Identify required channel/interface contracts + - Plan for configuration via `config/config.json` - Evaluate performance impact: - - Memory usage per worker + - Pool queue depth and goroutine count - Check throughput (sites per second) - Network timeout handling - - Consider horizontal scaling implications (bucket ranges) + - Consider horizontal scaling implications (bucket ranges, heartbeat) 3. **Implementation Strategy** - - Choose appropriate patterns for the target component - - Consider impact on worker lifecycle (memory limits, check counts) - - Plan for graceful error handling and logging - - Ensure metrics are emitted for observability + - Choose appropriate Go patterns for the target component + - Use `context.Context` for cancellation propagation + - Plan for graceful error handling and structured logging + - Ensure StatsD metrics are emitted for significant events - Verify changes work in Docker development environment - - After proposing any code change, always provide specific manual testing steps the user should follow. Jetmon has no automated test suite — manual verification is mandatory for every change. Reference `running-tests.md` for the Docker testing environment. 
+ - After proposing any code change, always provide specific manual testing steps the user should follow. Reference `running-tests.md` for the Docker testing environment. ## Architecture Awareness -### Process Boundaries -- Master process (`jetmon.js`): Orchestration only, no direct HTTP checks -- Worker processes (`httpcheck.js`): Disposable, recycled on limits -- SSL server (`server.js`): Receives veriflier responses only -- Veriflier: Independent Qt application, communicates via HTTPS +### Package Boundaries +- `cmd/jetmon2`: Entry point only; delegates to internal packages +- `internal/orchestrator`: Owns the round loop, retry state, and bucket leases +- `internal/checker`: Stateless HTTP check; no global state +- `internal/checker/pool`: Auto-scaling goroutine pool; driven by queue depth +- `internal/veriflier`: Thin transport layer; JSON-over-HTTP until protoc generates real stubs +- `internal/wpcom`: Owns WPCOM circuit breaker and notification queue ### Data Flow ``` -Database → Master → Workers → C++ Addon → HTTP Checks +Database → Orchestrator → Pool → checker.Check → Results ↓ - Verifliers (geo-distributed) + Veriflier gRPC clients (geo-distributed) ↓ - WordPress.com API + WPCOM API (circuit-broken notification queue) ``` ### Critical Constraints -- Workers must not exceed `WORKER_MAX_MEM_MB` (53MB default) -- Workers recycle after `WORKER_MAX_CHECKS` (10,000 default) -- Retry queues must persist between rounds (not flushed) -- Bucket ranges must not overlap between hosts +- Retry queue must persist between rounds (never flushed at round start) +- Bucket ranges must not overlap between hosts (MySQL `SELECT ... 
FOR UPDATE` enforces this) +- Heartbeat must fire every round; WatchdogSec=120s means missing two rounds triggers systemd restart +- Circuit breaker floor: at least 1 veriflier quorum, even if all verifliers are offline ## Production Considerations ### Before Modifying Code -- Test changes locally using Docker environment -- Verify memory usage patterns with extended runs +- Test changes locally using Docker environment (`docker compose up -d`) +- Verify goroutine count and memory do not grow unboundedly - Check that StatsD metrics are properly emitted -- Ensure graceful shutdown behavior is preserved +- Ensure graceful shutdown behaviour is preserved (SIGINT → `orch.Stop()`) ### Deployment Process - Changes require Systems team deployment - Create a Systems Request with PR links -- Test in Docker before requesting production deploy +- Run `./jetmon2 validate-config` before deploying ### Performance Sensitivity -- RTT (round-trip time) calculations affect timeout behavior -- Node.js version changes can impact performance characteristics -- Memory leaks compound over time due to long-running processes +- RTT calculations feed into timeout heuristics — don't add unnecessary latency +- Pool auto-scaling fires every 5 seconds; don't block the scale goroutine +- `runtime.ReadMemStats` is stop-the-world; call it infrequently ## Security Considerations -- Authentication tokens in config must not be logged -- SSL certificates are required for veriflier communication -- Database credentials are stored separately in `db-config.conf` +- Auth tokens in config must not be logged +- gRPC/HTTP veriflier auth token is validated per-request in `internal/veriflier/server.go` +- Database credentials are stored in `config/db-config.conf` (not committed) - Never commit secrets to the repository ## Testing Approach +- Use `go test ./...` for unit tests - Use Docker environment for integration testing - Enable `DB_UPDATES_ENABLE` only in local test environments -- Verify worker 
spawn/death cycle works correctly - Test graceful shutdown with SIGINT -- Monitor memory growth over extended runs +- Monitor goroutine count over extended runs (`/debug/pprof/goroutine`) diff --git a/.claude/rules/running-tests.md b/.claude/rules/running-tests.md index cd19b1ec..6a0223a9 100644 --- a/.claude/rules/running-tests.md +++ b/.claude/rules/running-tests.md @@ -1,6 +1,14 @@ # Running Tests -Jetmon does not have a formal automated test suite. Testing is performed manually using the Docker development environment. +Jetmon 2 has a Go test suite (`go test ./...`) and a Docker development environment for integration testing. + +## Automated Tests + +```bash +make test # go test ./... +make test-race # go test -race ./... +make lint # go vet ./... +``` ## Prerequisites @@ -22,45 +30,29 @@ docker compose down # Stop all services docker compose down -v # Stop and remove volumes (fresh start) ``` -Services started: `mysqldb` (MySQL 5.7), `jetmon` (master + workers), `veriflier`, `statsd` (Graphite) +Services started: `mysqldb` (MySQL 8.0), `jetmon` (single binary), `veriflier`, `statsd` (Graphite) ### View Logs ```bash docker compose logs -f jetmon # Follow Jetmon logs docker compose logs -f veriflier # Follow Veriflier logs -docker compose exec jetmon cat logs/jetmon.log -docker compose exec jetmon cat logs/status-change.log ``` ### Monitor Activity ```bash docker compose exec jetmon cat stats/sitespersec docker compose exec jetmon cat stats/sitesqueue -docker compose exec jetmon ps auxf # Process tree: master, workers, server +docker compose exec jetmon ps aux # Single process — no worker tree ``` ## Test Database Setup -### Create Table +The Docker entrypoint automatically runs `./jetmon2 migrate` on startup. 
For manual testing, connect to MySQL: + +```bash docker compose exec mysqldb mysql -u root -p123456 jetmon_db ``` -```sql -CREATE TABLE IF NOT EXISTS `jetpack_monitor_sites` ( - `jetpack_monitor_site_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY, - `blog_id` bigint(20) unsigned NOT NULL, - `bucket_no` smallint(2) unsigned NOT NULL, - `monitor_url` varchar(300) NOT NULL, - `monitor_active` tinyint(1) unsigned NOT NULL DEFAULT 1, - `site_status` tinyint(1) unsigned NOT NULL DEFAULT 1, - `last_status_change` timestamp NULL DEFAULT current_timestamp(), - `check_interval` tinyint(1) unsigned NOT NULL DEFAULT 5, - INDEX `blog_id_monitor_url` (`blog_id`, `monitor_url`), - INDEX `bucket_no_monitor_active_check_interval` (`bucket_no`, `monitor_active`, `check_interval`) -); -``` - ### Insert Test Sites ```sql INSERT INTO jetpack_monitor_sites (blog_id, bucket_no, monitor_url, monitor_active, site_status) @@ -88,52 +80,39 @@ Edit `config/config.json`: ### Configuration Reload ```bash -docker compose exec jetmon ps aux | grep jetmon-master # Find PID -docker compose exec jetmon kill -HUP # Reload config +docker compose exec jetmon ./jetmon2 reload # Sends SIGHUP via PID file +# Or manually: +docker compose exec jetmon sh -c 'kill -HUP $(pgrep jetmon2)' ``` -### Graceful Shutdown +### Graceful Shutdown / Drain ```bash -docker compose exec jetmon kill -INT # Or: docker compose restart jetmon +docker compose exec jetmon ./jetmon2 drain # Sends SIGINT via PID file +# Or: docker compose stop jetmon ``` -### Veriflier Connectivity +### Validate Config ```bash -docker compose exec jetmon curl -k https://veriflier:7801/get/status -# Should return: OK +docker compose exec jetmon ./jetmon2 validate-config ``` -### Native Addon Rebuild +### Veriflier Connectivity ```bash -docker compose exec jetmon npm run rebuild-run -# Or manually: -docker compose exec jetmon bash -c "node-gyp rebuild && cp build/Release/jetmon.node lib/ && node lib/jetmon.js" +docker compose exec jetmon curl 
http://veriflier:7803/status +# Should return: {"hostname":"...","version":"...","status":"ok"} ``` -### Test HTTP Checker Directly -Create `lib/test-addon.js`: -```javascript -var checker = require( './jetmon.node' ); -checker.http_check( 'https://wordpress.com', 80, 0, function( index, rtt, http_code, error_code ) { - console.log( 'RTT:', rtt, 'HTTP:', http_code, 'Error:', error_code ); - process.exit( 0 ); -}); -``` -Run: `docker compose exec jetmon node lib/test-addon.js` +### Operator Dashboard +- Open http://localhost:8080 in a browser after starting Docker services. -### Worker Recycling -Set low limits in `config/config.json`: -```json -{ - "WORKER_MAX_CHECKS": 100, - "WORKER_MAX_MEM_MB": 30 -} +### Audit Log +```bash +docker compose exec jetmon ./jetmon2 audit --blog-id 1 --since 1h ``` -Watch: `docker compose logs -f jetmon | grep -E "(spawn|die|recycle|limit)"` ### Memory Monitoring ```bash -docker compose exec jetmon bash -c 'while true; do ps aux --sort=-%mem | head -10; sleep 5; done' +docker compose exec jetmon bash -c 'while true; do ps aux --sort=-%mem | head -5; sleep 5; done' ``` ### StatsD Metrics @@ -159,9 +138,10 @@ Query database: `docker compose exec mysqldb mysql -u root -p123456 jetmon_db -e | Problem | Check | |---------|-------| | Jetmon not starting | `docker compose ps mysqldb`, verify `config/db-config.conf` | -| No sites being checked | Verify `BUCKET_NO_MIN/MAX` matches data, `monitor_active = 1` | -| Veriflier connection fails | `docker compose ps veriflier`, check auth tokens match, SSL certs exist | +| No sites being checked | Verify `BUCKET_TOTAL/TARGET` and that `monitor_active = 1` in DB | +| Veriflier connection fails | `docker compose ps veriflier`, check auth tokens match | | StatsD not receiving | `docker compose exec jetmon ping statsd`, check for UDP errors | +| Migration fails | Check MySQL is up: `docker compose ps mysqldb` | ## Cleanup diff --git a/.claude/skills/create-issue/SKILL.md 
b/.claude/skills/create-issue/SKILL.md index 195794ff..6ebdd3c1 100644 --- a/.claude/skills/create-issue/SKILL.md +++ b/.claude/skills/create-issue/SKILL.md @@ -66,14 +66,17 @@ Brief description of the issue or need. Include error messages, logs, or metrics ## Affected Component(s) -- [ ] Master Process (`lib/jetmon.js`) -- [ ] Worker Process (`lib/httpcheck.js`) -- [ ] C++ Native Addon (`src/http_checker.cpp`) -- [ ] Veriflier (`veriflier/`) -- [ ] Database (`lib/database.js`) -- [ ] Configuration +- [ ] CLI / Entry Point (`cmd/jetmon2/main.go`) +- [ ] Orchestrator (`internal/orchestrator/`) +- [ ] HTTP Checker (`internal/checker/`) +- [ ] Goroutine Pool (`internal/checker/pool.go`) +- [ ] Database / Migrations (`internal/db/`) +- [ ] Configuration (`internal/config/`) +- [ ] gRPC / Veriflier Transport (`internal/grpc/`) +- [ ] WPCOM Client (`internal/wpcom/`) +- [ ] Operator Dashboard (`internal/dashboard/`) +- [ ] Veriflier Binary (`veriflier2/`) - [ ] Docker/Infrastructure -- [ ] WPCOM Integration ## Steps to Reproduce (if applicable) @@ -116,12 +119,12 @@ Workers are hitting memory limits more frequently than expected... ## Affected Component(s) -- [x] Worker Process (`lib/httpcheck.js`) +- [x] Goroutine Pool (`internal/checker/pool.go`) ## Acceptance Criteria -- [ ] Workers stay under 53MB memory limit -- [ ] No increase in worker recycling frequency +- [ ] Goroutine count stays bounded under sustained load +- [ ] No goroutine leak after pool drain EOF )" ``` diff --git a/.claude/skills/create-pr/SKILL.md b/.claude/skills/create-pr/SKILL.md index a533c7fa..faf8b380 100644 --- a/.claude/skills/create-pr/SKILL.md +++ b/.claude/skills/create-pr/SKILL.md @@ -37,21 +37,26 @@ Create a PR for the current branch, targeting `master`. 
| Component | Key Files | |-----------|-----------| -| Master Process | `lib/jetmon.js` | -| Worker Process | `lib/httpcheck.js` | -| C++ Native Addon | `src/http_checker.cpp`, `src/http_checker.h`, `binding.gyp` | -| Veriflier | `veriflier/*.cpp`, `veriflier/*.h` | -| Database | `lib/database.js`, `lib/dbpools.js` | -| Configuration | `config/config.json`, `config/config.readme` | +| CLI / Entry Point | `cmd/jetmon2/main.go` | +| Orchestrator | `internal/orchestrator/` | +| HTTP Checker | `internal/checker/checker.go` | +| Goroutine Pool | `internal/checker/pool.go` | +| Database | `internal/db/` | +| Config | `internal/config/config.go`, `config/config.readme` | +| gRPC / Veriflier Transport | `internal/grpc/` | +| WPCOM Client | `internal/wpcom/client.go` | +| Audit Log | `internal/audit/audit.go` | +| Metrics | `internal/metrics/metrics.go` | +| Operator Dashboard | `internal/dashboard/dashboard.go` | +| Veriflier Binary | `veriflier2/cmd/main.go` | | Docker | `docker/docker-compose.yml`, `docker/Dockerfile*` | -| StatsD/Metrics | Look for `statsdClient` calls | -| WPCOM Integration | `lib/wpcom.js`, `lib/comms.js` | +| Migrations | `internal/db/migrations.go`, `migrations/001_jetmon2.sql` | 6. **Determine testing requirements**: - - C++ changes require `npm run rebuild-run` - - Config changes should be tested with Docker environment - - Worker changes should be tested with memory monitoring - - Database changes need schema verification + - Config changes: test with `./jetmon2 validate-config` + - DB/schema changes: test migration with `./jetmon2 migrate` + - All changes: test with Docker environment (`docker compose up --build`) + - Run `make test` to verify unit tests pass 7. **Create the PR** using `gh pr create --draft --assignee @me` with this format: @@ -72,10 +77,11 @@ Brief description of what this PR accomplishes and why. 
## Testing -- [ ] Tested locally with Docker environment -- [ ] Ran `npm run rebuild-run` (if C++ changes) -- [ ] Verified memory usage is within limits (if worker changes) -- [ ] Tested configuration reload via SIGHUP (if config changes) +- [ ] Tested locally with Docker environment (`docker compose up --build`) +- [ ] `make test` passes +- [ ] `./jetmon2 validate-config` passes (if config changes) +- [ ] Migration tested with `./jetmon2 migrate` (if schema changes) +- [ ] Tested configuration reload via `./jetmon2 reload` (if config changes) ## Deployment Notes diff --git a/.claude/skills/debug-memory/SKILL.md b/.claude/skills/debug-memory/SKILL.md index 42c54898..e97fee89 100644 --- a/.claude/skills/debug-memory/SKILL.md +++ b/.claude/skills/debug-memory/SKILL.md @@ -1,12 +1,12 @@ --- name: debug-memory -description: Debug memory issues in Jetmon workers and identify leaks -allowed-tools: Bash(docker*), Bash(ps*), Bash(top*), Bash(node*), Read, Glob, Grep +description: Debug memory and goroutine issues in Jetmon 2 +allowed-tools: Bash(docker*), Bash(ps*), Bash(curl*), Bash(go*), Read, Glob, Grep --- # Debug Memory Issues -Use this skill to investigate memory problems in Jetmon workers, identify leaks, and optimize memory usage. +Use this skill to investigate memory growth and goroutine leaks in the Jetmon 2 Go binary. ## Usage @@ -16,224 +16,122 @@ Use this skill to investigate memory problems in Jetmon workers, identify leaks, ## Memory Architecture -### Worker Memory Limits +Jetmon 2 is a single binary with an auto-scaling goroutine pool. Memory pressure does +not cause crashes; the orchestrator drains the pool when memory exceeds `WORKER_MAX_MEM_MB`. -| Setting | Default | Purpose | -|---------|---------|---------| -| `WORKER_MAX_MEM_MB` | 53 | Memory limit before worker recycles | -| `WORKER_MAX_CHECKS` | 10,000 | Check count before worker recycles | - -Workers are designed to be disposable. When hitting limits, they stop accepting work and exit gracefully. 
- -### Memory Flow - -``` -Worker Process -├── Node.js Heap (V8) -│ ├── HTTP check callbacks -│ ├── Retry queues (arrToRetry) -│ └── Active checks (arrCheck) -├── Native Addon (C++) -│ └── HTTP_Checker instances -└── Buffers (TCP/SSL) -``` +Key memory consumers: +- Goroutine pool (each goroutine ~8KB stack, grows on demand) +- Retry queue (in-memory map, bounded by number of monitored sites) +- WPCOM notification queue (bounded at 1000 entries) +- HTTP response bodies (read up to 1MB for keyword checks) ## Monitoring Commands ### Docker Environment ```bash -# Real-time memory monitoring -docker compose exec jetmon bash -c 'while true; do ps aux --sort=-%mem | head -15; sleep 5; done' - -# Memory usage by process -docker compose exec jetmon ps aux --sort=-%mem +# Real-time process memory (single Go process) +docker compose exec jetmon bash -c 'while true; do ps -o pid,rss,vsz,comm -p $(pgrep jetmon2); sleep 5; done' -# Specific worker memory -docker compose exec jetmon bash -c 'ps -o pid,rss,vsz,comm | grep jetmon' +# Goroutine count and heap via pprof +curl http://localhost:8080/debug/pprof/goroutine?debug=1 | head -30 ``` -### Process Details +### pprof Profiles (via Operator Dashboard) + +The dashboard exposes `/debug/pprof/` endpoints: ```bash -# View process tree -docker compose exec jetmon ps auxf +# Heap profile — shows allocations +curl http://localhost:8080/debug/pprof/heap > heap.prof +go tool pprof heap.prof -# Memory maps (detailed) -docker compose exec jetmon bash -c 'cat /proc/$(pgrep -f jetmon-master)/status | grep -E "Vm|Rss"' +# Goroutine profile — detect leaks +curl http://localhost:8080/debug/pprof/goroutine > goroutine.prof +go tool pprof goroutine.prof + +# CPU profile (30s) +curl "http://localhost:8080/debug/pprof/profile?seconds=30" > cpu.prof +go tool pprof cpu.prof ``` -### StatsD Metrics +### Metrics -Check Graphite (http://localhost:8088) for: -- `stats.workers.*.memory` - Per-worker memory usage -- `stats.workers.recycle.count` - Worker 
recycling frequency -- `stats.workers.free.count` - Available workers +Check Graphite (http://localhost:8088): +- `stats.goroutines.*` — goroutine count over time +- `stats.memory.*` — heap and RSS metrics (requires `STATSD_SEND_MEM_USAGE: true`) ## Common Memory Issues -### 1. Retry Queue Growth +### 1. Goroutine Leak -**Symptom:** Memory grows steadily, especially during site outages. +**Symptom:** Goroutine count grows unboundedly. **Diagnosis:** ```bash -docker compose exec jetmon cat stats/sitesqueue +curl http://localhost:8080/debug/pprof/goroutine?debug=2 | grep -c "^goroutine " ``` -**Cause:** Large numbers of sites in retry queue (`arrToRetry`). +**Cause:** A goroutine is blocked on a channel that is never read, or a context is never cancelled. -**Solution:** Check retry queue flush logic. Ensure retries are processed, not accumulated. +**Solution:** Check that all goroutines started in `orchestrator.go` and `pool.go` exit +when `ctx.Done()` fires. Ensure `orch.Stop()` is called on shutdown. -### 2. Native Addon Leak +### 2. Retry Queue Growth -**Symptom:** Memory grows even with low check counts. +**Symptom:** Memory grows during extended site outages. **Diagnosis:** ```bash -# Enable debug mode in http_checker.cpp -#define DEBUG_MODE 1 +docker compose exec jetmon ./jetmon2 status +# Check RetryQueueSize in API response +curl http://localhost:8080/api/state | python3 -m json.tool ``` -Watch for: -- Unfreed buffers -- Socket descriptor leaks -- SSL context accumulation - -**Solution:** Review C++ destructor cleanup in `HTTP_Checker::~HTTP_Checker()`. - -### 3. Event Loop Blocking +**Cause:** Retry queue entries accumulate when verifliers are unreachable. -**Symptom:** Workers become unresponsive, memory spikes. +**Solution:** Check veriflier connectivity. Retry queue is expected to hold state for down +sites — it is not a leak, but a design feature. 
If it grows without bound with no site +outages, check `retryQueue.clear()` is being called in `handleRecovery`. -**Diagnosis:** -```bash -docker compose exec jetmon node --trace-warnings lib/jetmon.js -``` +### 3. HTTP Response Body Accumulation -**Solution:** Ensure async operations complete and callbacks fire. +**Symptom:** Memory spikes correlate with keyword-check sites. -### 4. DNS Resolution Caching +**Cause:** Keyword checks read up to 1MB of response body per check. With many such sites +and a large pool, this can total significant memory. -**Symptom:** Memory grows with unique domains checked. +**Solution:** Reduce `NUM_WORKERS` if memory is constrained. The 1MB cap is hard-coded in +`internal/checker/checker.go`. -**Diagnosis:** Check if `USE_GETADDRINFO` is enabled in http_checker.cpp. - -**Solution:** `getaddrinfo` uses more memory than `gethostbyname`. Consider trade-offs. - -## Memory Profiling - -### Node.js Heap Snapshot - -```javascript -// Add to lib/httpcheck.js for debugging -const v8 = require('v8'); -const fs = require('fs'); - -// Trigger heap snapshot -function dumpHeap() { - const filename = `/tmp/heap-${process.pid}-${Date.now()}.heapsnapshot`; - const stream = fs.createWriteStream(filename); - v8.writeHeapSnapshot(filename); - console.log('Heap snapshot written to:', filename); -} - -// Call when memory is high -if (process.memoryUsage().rss > 50 * 1024 * 1024) { - dumpHeap(); -} -``` - -### Memory Usage Logging - -Add to worker process: - -```javascript -setInterval(function() { - const mem = process.memoryUsage(); - logger.debug('Memory: RSS=' + Math.round(mem.rss / 1024 / 1024) + 'MB, ' + - 'Heap=' + Math.round(mem.heapUsed / 1024 / 1024) + 'MB'); -}, 30000); -``` - -## Reducing Memory Usage - -### Configuration Tuning - -```json -{ - "NUM_WORKERS": 40, // Reduce from 60 if memory constrained - "NUM_TO_PROCESS": 30, // Reduce parallel checks per worker - "WORKER_MAX_MEM_MB": 40, // Lower threshold for faster recycling - 
"WORKER_MAX_CHECKS": 5000 // Recycle more frequently -} -``` - -### Code Patterns - -**DO:** -```javascript -// Release references when done -arrCheck.splice(index, 1); // Remove processed items - -// Use callbacks, don't hold references -checker.http_check(url, port, index, function(result) { - // Process result immediately - sendResult(result); - // Callback goes out of scope -}); -``` - -**DON'T:** -```javascript -// Accumulate data without bounds -allResults.push(result); // Unbounded growth - -// Hold references longer than needed -var savedChecker = checker; // Prevents GC -``` - -## Testing Memory Fixes - -### Set Low Limits +## Configuration Tuning ```json { - "WORKER_MAX_MEM_MB": 30, - "WORKER_MAX_CHECKS": 100 + "NUM_WORKERS": 40, + "WORKER_MAX_MEM_MB": 200, + "STATSD_SEND_MEM_USAGE": true } ``` -### Monitor Recycling - -```bash -docker compose logs -f jetmon | grep -E "(spawn|die|recycle|memory)" -``` - -### Extended Run Test - -```bash -# Run for extended period, monitor memory growth -docker compose up -d jetmon -watch -n 5 'docker compose exec jetmon ps aux --sort=-%mem | head -10' -``` +- `NUM_WORKERS`: Upper bound on pool goroutines +- `WORKER_MAX_MEM_MB`: Triggers pool drain when Go RSS exceeds this (MB) +- `STATSD_SEND_MEM_USAGE`: Emit `runtime.MemStats` to StatsD each interval -## Key Files for Memory Investigation +## Key Files for Investigation | File | Memory-Related Code | |------|---------------------| -| `lib/httpcheck.js` | Worker arrays: `arrCheck`, `arrToRetry` | -| `lib/jetmon.js` | Master arrays: `arrWorkers`, `gCountSuccess` | -| `src/http_checker.cpp` | Buffer allocation, SSL contexts | -| `lib/config.js` | Memory limit settings | +| `internal/checker/pool.go` | Pool scaling, goroutine lifecycle | +| `internal/orchestrator/orchestrator.go` | Round loop, retry queue, pool drain | +| `internal/orchestrator/retry.go` | Retry queue implementation | +| `internal/wpcom/client.go` | Notification queue (bounded at 1000) | ## Checklist for 
Memory Issues -- [ ] Check worker recycling frequency in metrics -- [ ] Monitor retry queue size (`stats/sitesqueue`) -- [ ] Review recent code changes affecting arrays -- [ ] Verify C++ cleanup in destructor -- [ ] Test with reduced memory limits -- [ ] Check for unclosed connections/sockets -- [ ] Review setTimeout/setInterval cleanup -- [ ] Confirm process.send() callbacks complete +- [ ] Check goroutine count via pprof (is it growing?) +- [ ] Check retry queue size via `/api/state` +- [ ] Enable `STATSD_SEND_MEM_USAGE` and observe Graphite +- [ ] Capture heap profile before and after a round +- [ ] Verify `orch.Stop()` fully drains the pool on shutdown +- [ ] Check for unbounded channel accumulation in pool.go diff --git a/.claude/skills/docker-test/SKILL.md b/.claude/skills/docker-test/SKILL.md index 7deef78b..f61c449c 100644 --- a/.claude/skills/docker-test/SKILL.md +++ b/.claude/skills/docker-test/SKILL.md @@ -1,12 +1,12 @@ --- name: docker-test -description: Run, debug, and test Jetmon using the Docker development environment +description: Run, debug, and test Jetmon 2 using the Docker development environment allowed-tools: Bash(docker*), Bash(cd docker*), Read, Glob, Grep --- # Docker Testing Skill -Use this skill for running, debugging, and testing Jetmon in the Docker development environment. +Use this skill for running, debugging, and testing Jetmon 2 in the Docker development environment. 
## Usage @@ -22,9 +22,9 @@ The docker-compose environment includes: | Service | Port | Purpose | |---------|------|---------| -| `mysqldb` | 3306 | MySQL 5.7 database | -| `jetmon` | 7800 | Main monitoring service | -| `veriflier` | 7801 | Geographic verification | +| `mysqldb` | 3306 | MySQL 8.0 database | +| `jetmon` | 8080 | Jetmon 2 + operator dashboard | +| `veriflier` | 7803 | Geographic verification (gRPC) | | `statsd` | 8125/8088 | Metrics (Graphite UI on 8088) | ## Common Commands @@ -49,7 +49,8 @@ docker compose logs --tail=100 jetmon # Last 100 lines ```bash docker compose ps # Service status -docker compose exec jetmon ps auxf # Process tree inside container +docker compose exec jetmon ps aux # Single process inside container +docker compose exec jetmon ./jetmon2 status # Internal status via API ``` ### Stopping Services @@ -69,43 +70,44 @@ docker compose exec jetmon cat stats/totals docker compose exec jetmon cat stats/sitespersec ``` -### 2. Check Worker Activity +### 2. Open Operator Dashboard -```bash -# View worker stats -docker compose exec jetmon cat stats/sitesqueue - -# Monitor worker memory -docker compose exec jetmon bash -c 'ps aux --sort=-%mem | head -10' -``` +Navigate to http://localhost:8080 in a browser. The dashboard shows: +- Worker/goroutine count +- Retry queue size +- WPCOM circuit breaker state +- Bucket range owned by this host ### 3. Test Configuration Reload ```bash -# Find master process PID -docker compose exec jetmon ps aux | grep jetmon-master - -# Send SIGHUP to reload config -docker compose exec jetmon kill -HUP +docker compose exec jetmon ./jetmon2 reload # Sends SIGHUP via PID file +# Watch logs for "config reloaded" +docker compose logs -f jetmon ``` -### 4. Test Graceful Shutdown +### 4. Test Graceful Drain/Shutdown ```bash -# Send SIGINT for graceful shutdown -docker compose exec jetmon kill -INT +docker compose exec jetmon ./jetmon2 drain # Sends SIGINT via PID file +# Or: +docker compose stop jetmon +``` + +### 5. 
View Audit Log -# Or restart the container -docker compose restart jetmon +```bash +docker compose exec jetmon ./jetmon2 audit --blog-id 1 --since 1h ``` -### 5. View Status Changes +### 6. Test Veriflier Connectivity ```bash -docker compose exec jetmon tail -f logs/status-change.log +docker compose exec jetmon curl http://veriflier:7803/status +# Should return: {"hostname":"...","version":"...","status":"ok"} ``` -### 6. Check Database +### 7. Check Database ```bash docker compose exec mysqldb mysql -u root -p123456 jetmon_db -e "SELECT COUNT(*) FROM jetpack_monitor_sites;" @@ -150,27 +152,29 @@ Ensure `config/config.json` has: } ``` +### Validate Config Before Restart + +```bash +docker compose exec jetmon ./jetmon2 validate-config +``` + ### Attach to Container ```bash docker compose exec jetmon bash ``` -### Test Native Addon Directly +### Profile Goroutines / Memory (pprof) -Create `lib/test-addon.js`: -```javascript -var checker = require( './jetmon.node' ); +The dashboard exposes pprof at http://localhost:8080/debug/pprof/ -checker.http_check( 'https://wordpress.com', 80, 0, function( index, rtt, http_code, error_code ) { - console.log( 'RTT:', rtt, 'HTTP:', http_code, 'Error:', error_code ); - process.exit( 0 ); -}); -``` - -Run it: ```bash -docker compose exec jetmon node lib/test-addon.js +# Goroutine dump +curl http://localhost:8080/debug/pprof/goroutine?debug=1 + +# Heap profile +curl http://localhost:8080/debug/pprof/heap > heap.prof +go tool pprof heap.prof ``` ### Check Metrics @@ -183,26 +187,25 @@ Open http://localhost:8088 for Graphite UI. 
Navigate to: ### Jetmon Not Starting - Check database: `docker compose ps mysqldb` -- Verify config: `docker compose exec jetmon cat config/db-config.conf` -- Check for port conflicts on 7800, 7801, 7802 +- Validate config: `docker compose exec jetmon ./jetmon2 validate-config` +- Check migration output: `docker compose logs jetmon | head -30` ### No Sites Being Checked -- Verify sites exist in database -- Check bucket range matches data: `BUCKET_NO_MIN`, `BUCKET_NO_MAX` -- Ensure `monitor_active = 1` for test sites +- Verify sites exist in database with `monitor_active = 1` +- Check bucket ownership: `docker compose exec jetmon ./jetmon2 status` ### Veriflier Connection Failures - Check veriflier is running: `docker compose ps veriflier` -- Test connectivity: `docker compose exec jetmon curl -k https://veriflier:7801/get/status` -- Verify SSL certificates exist in `veriflier/certs/` +- Test connectivity: `docker compose exec jetmon curl http://veriflier:7803/status` +- Verify `VERIFLIER_AUTH_TOKEN` matches in both containers ### Memory Issues ```bash -# Monitor memory over time -docker compose exec jetmon bash -c 'while true; do ps aux --sort=-%mem | head -10; sleep 5; done' +# Monitor goroutine count and memory via pprof +curl http://localhost:8080/debug/pprof/goroutine?debug=1 | head -20 ``` ## Cleanup diff --git a/.claude/skills/jetmon-pre-ship/SKILL.md b/.claude/skills/jetmon-pre-ship/SKILL.md new file mode 100644 index 00000000..cde74808 --- /dev/null +++ b/.claude/skills/jetmon-pre-ship/SKILL.md @@ -0,0 +1,27 @@ +--- +name: jetmon-pre-ship +description: Run jetmon v2 pre-ship checklist before opening a PR +allowed-tools: Bash(go *), Bash(grep *), Bash(git *) +--- + +## Changed files +!`git diff main...HEAD --name-only` + +## Race detector +!`go test -race ./... 
2>&1 | tail -30` + +## Known pitfall checks +Retry queue flush (must not happen at round start): +!`grep -rn "RetryQueue\|retryQueue" internal/orchestrator/ | grep -i "flush\|clear\|reset\|= \[\]" || echo "OK"` + +Bucket claim outside transaction (must use SELECT FOR UPDATE): +!`grep -rn "UPDATE jetmon_hosts\|INSERT.*jetmon_hosts" internal/ | grep -v "_test.go" || echo "OK"` + +Non-context DB calls: +!`grep -rn "\.Query\b\|\.QueryRow\b\|\.Exec\b" internal/ | grep -v "Context\|_test.go" || echo "OK"` + +Open maintenance window risk: +!`grep -rn "maintenance_end" internal/ | grep -v "test\|nil\|IsZero" | head -10` + +## Review +Work through each result above. Flag any violation. Then confirm the checklist from AGENTS.md is satisfied. diff --git a/.claude/skills/rebuild-addon/SKILL.md b/.claude/skills/rebuild-addon/SKILL.md deleted file mode 100644 index 8381597d..00000000 --- a/.claude/skills/rebuild-addon/SKILL.md +++ /dev/null @@ -1,189 +0,0 @@ ---- -name: rebuild-addon -description: Rebuild the C++ native addon after making changes to http_checker.cpp -allowed-tools: Bash(npm run*), Bash(node-gyp*), Bash(docker*), Bash(cp*), Bash(ls*), Read, Glob, Grep ---- - -# Rebuild Native Addon - -Use this skill after making changes to the C++ native addon (`src/http_checker.cpp` or `src/http_checker.h`). - -## Usage - -- `/rebuild-addon` - Rebuild the addon and restart Jetmon -- `/rebuild-addon docker` - Rebuild inside Docker container -- `/rebuild-addon test` - Rebuild and run a quick test - -## Quick Reference - -### Using npm Script (Recommended) - -```bash -npm run rebuild-run -``` - -This runs `node-gyp rebuild`, copies the addon to `lib/`, and starts Jetmon. 
- -### Manual Build - -```bash -node-gyp rebuild -cp build/Release/jetmon.node lib/ -node lib/jetmon.js -``` - -### Docker Build - -```bash -docker compose exec jetmon npm run rebuild-run -``` - -Or manually inside the container: - -```bash -docker compose exec jetmon bash -cd /jetmon -node-gyp rebuild -cp build/Release/jetmon.node lib/ -node lib/jetmon.js -``` - -## Build Verification - -After building, verify the addon loads correctly: - -```bash -node -e "require('./lib/jetmon.node'); console.log('Addon loaded successfully');" -``` - -## Testing the Addon - -### Quick HTTP Check Test - -Create a test script: - -```javascript -// lib/test-addon.js -var checker = require( './jetmon.node' ); - -checker.http_check( 'https://wordpress.com', 80, 0, function( index, rtt, http_code, error_code ) { - console.log( 'Index:', index ); - console.log( 'RTT (microseconds):', rtt ); - console.log( 'HTTP Code:', http_code ); - console.log( 'Error Code:', error_code ); - process.exit( 0 ); -}); -``` - -Run it: -```bash -node lib/test-addon.js -``` - -### Expected Output - -- `index`: The index passed to the check (0 in this case) -- `rtt`: Round-trip time in microseconds -- `http_code`: HTTP response code (200 for success) -- `error_code`: 0 for success, non-zero for errors - -### Error Codes - -| Code | Meaning | -|------|---------| -| 0 | Success | -| 1 | Connection failed | -| 2 | Timeout | -| 3 | SSL error | -| 4 | DNS resolution failed | -| 5 | Too many redirects | - -## C++ Source Files - -| File | Purpose | -|------|---------| -| `src/http_checker.cpp` | Main HTTP checking implementation | -| `src/http_checker.h` | Header with class definition | -| `binding.gyp` | Node-gyp build configuration | - -## Common Issues - -### Build Errors - -**Missing OpenSSL headers:** -``` -fatal error: openssl/ssl.h: No such file or directory -``` -Solution: Install OpenSSL development package: -```bash -# macOS -brew install openssl - -# Ubuntu/Debian -apt-get install libssl-dev -``` - 
-**Node version mismatch:** -If you see ABI version errors, clean and rebuild: -```bash -node-gyp clean -node-gyp rebuild -``` - -### Runtime Errors - -**Addon not found:** -``` -Error: Cannot find module './jetmon.node' -``` -Solution: Copy the built addon: -```bash -cp build/Release/jetmon.node lib/ -``` - -**Symbol errors:** -Usually indicates Node.js version changed. Rebuild the addon. - -## Debugging C++ Code - -### Enable Debug Output - -In `src/http_checker.cpp`, set: -```cpp -#define DEBUG_MODE 1 -``` - -Debug output goes to stderr. - -### Memory Debugging - -For memory leaks, use Valgrind (Linux): -```bash -valgrind --leak-check=full node lib/jetmon.js -``` - -## Build Configuration - -The `binding.gyp` file configures the build: - -```json -{ - "targets": [{ - "target_name": "jetmon", - "sources": ["src/http_checker.cpp"], - "include_dirs": ["'; + const body = [ + marker, + '### Docker images built for this PR', + '', + `Built from \`${sha}\`. Pull with:`, + '', + '```bash', + `docker pull ghcr.io/automattic/jetmon:${sha}`, + `docker pull ghcr.io/automattic/veriflier:${sha}`, + '```', + '', + 'Images are `linux/amd64` only. On Apple Silicon, add `--platform linux/amd64`. 
' + + 'See [docs/docker-images.md](https://github.com/Automattic/jetmon/blob/v2/docs/docker-images.md) for run examples.', + ].join('\n'); + + const comments = await github.paginate(github.rest.issues.listComments, { + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + per_page: 100, + }); + const existing = comments.find(c => c.body && c.body.includes(marker)); + + if (existing) { + await github.rest.issues.updateComment({ + owner: context.repo.owner, + repo: context.repo.repo, + comment_id: existing.id, + body, + }); + } else { + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + body, + }); + } diff --git a/.gitignore b/.gitignore index 01facc04..64c9bcad 100644 --- a/.gitignore +++ b/.gitignore @@ -1,6 +1,41 @@ -build/ -node_modules/ +# Compiled binaries +bin/ +/jetmon2 + +# Editor and OS files .DS_Store -lib/jetmon.node +.idea/ +*.swp +*.swo + +# Secrets and local config .env -.idea +config/config.json +config/db-config.conf + +# Generated TLS certificates +certs/*.crt +certs/*.key + +# Generated veriflier runtime config (veriflier-sample.json is tracked) +veriflier2/config/veriflier.json + +# Generated protobuf Go stubs (produced by `make generate`) +*.pb.go + +# Runtime output dirs +docker/volumes/ +logs/*.log +stats/* +!logs/.gitkeep +!stats/.gitkeep + +# Go test coverage output +coverage.out +coverage.html + +# AI tool directories +.codex + +# Local Claude settings (project settings.json is tracked) +.claude/settings.local.json diff --git a/.npmrc b/.npmrc deleted file mode 100644 index 4d936e8e..00000000 --- a/.npmrc +++ /dev/null @@ -1 +0,0 @@ -unsafe-perm=true diff --git a/AGENTS.md b/AGENTS.md index e41411c9..ae7bdd99 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,151 +1,340 @@ # Jetmon Development Guidelines -You are an expert Node.js/C++ developer with extensive knowledge about WordPress and enterprise-level web services. 
+You are an expert Go developer with extensive knowledge about WordPress, enterprise-level web services, and high-performance network programming. ## Project Overview -Jetmon is a parallel HTTP health monitoring service that monitors Jetpack website uptime at scale. It performs HEAD requests against sites, uses geographically distributed Veriflier services to confirm downtime, and notifies WordPress.com of status changes. +Jetmon is a parallel HTTP uptime monitoring service that checks Jetpack websites at scale. Jetmon 2 is a complete rewrite of the original Node.js + C++ native addon service into a single Go binary. It retains full drop-in compatibility with all external interfaces — MySQL schema, WPCOM API payload, StatsD metric names, and log file format — while dramatically increasing concurrency, reducing memory usage, and eliminating the native addon compilation dependency. + +The Veriflier is rewritten in Go as well, replacing the Qt C++ dependency. JSON-over-HTTP on the configured Veriflier port is the v2 production Monitor-to-Veriflier transport; the proto contract is retained only as a schema reference for a possible future transport. + +See `docs/project.md` for the full project description, feature list, and performance benefit estimates. 
## Architecture ``` -Database → Master Process → Worker Pool → C++ HTTP Checks - ↓ - Veriflier Services (geo-distributed) - ↓ - WordPress.com API ← Status Notifications +┌──────────────────────────────────────────────────────────────────────┐ +│ jetmon2 (single binary) │ +│ │ +│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ │ +│ │ Orchestrator│ │ Check Pool │ │ Veriflier │ │ +│ │ goroutine │ │ (goroutines)│ │ transport │ │ +│ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘ │ +│ │ │ │ │ +│ ┌──────┴────────────────┴────────────────┴───────┐ │ +│ │ Internal channels │ │ +│ └─────────────────────┬──────────────────────────┘ │ +│ │ │ +│ ┌────────────────────┴────────────────────┐ │ +│ │ eventstore (jetmon_events + │ │ +│ │ jetmon_event_transitions writes) │ │ +│ └────────────────────┬────────────────────┘ │ +│ │ │ +│ ┌────────────┐ ┌────┴────────────┐ ┌──────────────────────┐ │ +│ │ REST API │ │ Webhook │ │ Alerting │ │ +│ │ /api/v1/ │ │ delivery │ │ delivery │ │ +│ │ + auth + │ │ worker │ │ worker │ │ +│ │ ratelimit │ │ (HMAC POST) │ │ (email/PD/Slack/Tm) │ │ +│ └─────┬──────┘ └────────┬────────┘ └──────────┬───────────┘ │ +│ │ │ │ │ +│ ┌─────┴──────┐ ┌──────┴──────────┐ ┌────────┴──────────────┐ │ +│ │ Operator │ │ Webhook │ │ Alert contact │ │ +│ │ dashboard │ │ receivers │ │ destinations │ │ +│ │ (SSE) │ │ (HTTPS) │ │ (HTTPS / SMTP / API) │ │ +│ └────────────┘ └─────────────────┘ └───────────────────────┘ │ +└────────────┬──────────────────────────┬──────────────────────────────┘ + │ │ + MySQL WPCOM API + StatsD (legacy notification path, + Log files still active alongside + alert contacts) ``` -**Master Process** (`lib/jetmon.js`): Spawns workers, fetches site batches from database every 5 seconds, distributes work, and notifies WordPress.com of status changes. 
+**Orchestrator goroutine** (`internal/orchestrator/`): Fetches site batches from MySQL, dispatches work to the check pool via channels, processes results, manages the local retry queue, coordinates Veriflier confirmation requests, and emits WPCOM legacy notifications. Owns all DB access for site state and writes events through `eventstore`. -**Worker Processes** (`lib/httpcheck.js`): Forked child processes that perform HTTP checks via C++ native addon. Workers recycle when reaching memory limit (53MB) or check count (10,000). +**Check Pool** (`internal/checker/`): A bounded goroutine pool that performs HTTP checks using Go's `net/http` and `net/http/httptrace`. Records DNS, TCP connect, TLS handshake, and TTFB timings for every check. Pool size auto-scales against queue depth within configured min/max bounds. -**C++ Native Addon** (`src/http_checker.cpp`): High-performance HTTP checking with HEAD requests, 60-second timeout, OpenSSL support, and redirect handling. +**Eventstore** (`internal/eventstore/`): The single writer for `jetmon_events` and `jetmon_event_transitions`. Every status / severity / state change is written transactionally so the event row's projection and the transition log can never disagree. Both downstream workers (webhooks, alerting) consume `jetmon_event_transitions` via a high-water mark. -**Veriflier Services** (`veriflier/`): C++/Qt applications deployed globally to verify downtime before status changes are reported. +**REST API** (`internal/api/`): The internal API surface (`/api/v1/...`) used by the gateway, alerting workers, dashboards, and CI tooling. Per-consumer Bearer-token auth (`internal/apikeys/`), per-key rate limiting, Stripe-style idempotency keys on POSTs. Sites CRUD, events list / single / transitions, SLA stats, webhooks CRUD, alert-contacts CRUD, manual delivery retry. 
-## Build and Run Commands +**Webhook delivery worker** (`internal/webhooks/`): Polls `jetmon_event_transitions`, matches each new transition against active webhooks (event-type + site + state filters), and POSTs HMAC-signed payloads to consumer URLs. Retry ladder 1m / 5m / 30m / 1h / 6h then abandon. Per-webhook in-flight cap and shared dispatch pool. -```bash -# Docker development (recommended) -cd docker && docker compose up -d # Start all services -docker compose down # Stop services +**Alerting delivery worker** (`internal/alerting/`): Same shape as the webhook worker but for managed channels — email (via `wpcom`/`smtp`/`stub` senders), PagerDuty Events API v2, Slack incoming webhooks, Microsoft Teams. Filter is simpler (`site_filter` + `min_severity`); per-contact `max_per_hour` rate cap absorbs pager storms. Send-test endpoint exercises the same dispatch path without requiring a real event. -# Manual build and run -npm install -node-gyp rebuild -cp build/Release/jetmon.node lib/ -node lib/jetmon.js +**Current delivery-owner constraint:** In the single-binary v2 deployment, `API_PORT > 0` starts the API server and makes webhook / alert-contact delivery workers eligible to run. Delivery rows are claimed transactionally, so multiple active delivery workers do not claim the same pending row. Use `DELIVERY_OWNER_HOST` as a rollout guard when intentionally keeping delivery single-owner during migration from embedded to standalone delivery. -# Rebuild and run (npm script) -npm run rebuild-run -``` +**Veriflier transport** (`internal/veriflier/`): JSON-over-HTTP client/server for Monitor↔Veriflier communication. Replaces the previous SSL server and custom HTTPS protocol. This is the v2 production transport. -## Configuration +**Veriflier** (`veriflier2/`): Standalone Go binary deployed at remote locations. Receives check batches from the Monitor, performs HTTP checks, and returns results. Replaces the Qt C++ Veriflier. 
-Copy `config/config-sample.json` to `config/config.json`. Key settings: +**Future shape:** the API server, webhook worker, and alerting worker are independently scalable concerns and the natural target for the multi-binary split tracked in `docs/roadmap.md`. Today they coexist in `jetmon2` and the MySQL schema is the bus between them; tomorrow the deliverer becomes its own binary handling all outbound dispatch (webhooks + alerting + WPCOM legacy migrated behind it). -- `NUM_WORKERS`: Worker process count (default 60) -- `NUM_TO_PROCESS`: Parallel checks per worker (default 40) -- `BUCKET_NO_MIN/MAX`: Database bucket range for horizontal scaling (0-511 total) -- `MIN_TIME_BETWEEN_ROUNDS_SEC`: Check interval (300 seconds default) -- `PEER_OFFLINE_LIMIT`: Verifliers required to confirm downtime (3) +## Key Files -**Variable Check Intervals:** Sites can be configured for 1-5 minute check intervals via the `check_interval` database field. The default is 5 minutes. One-minute intervals require sufficient host capacity. 
+| Path | Purpose | +|------|---------| +| `cmd/jetmon2/main.go` | Binary entry point, signal handling, startup | +| `internal/orchestrator/` | Round scheduling, DB fetch, work dispatch, WPCOM notifications | +| `internal/checker/` | Goroutine pool, HTTP checks, httptrace timing | +| `internal/veriflier/` | JSON-over-HTTP client/server for Veriflier communication | +| `internal/db/` | MySQL access, `jetmon_hosts` heartbeat, connection pooling | +| `internal/config/` | Config loading, SIGHUP hot-reload | +| `internal/metrics/` | StatsD client, stats file writer | +| `internal/wpcom/` | WPCOM API client, circuit breaker | +| `internal/audit/` | Operational log writes to `jetmon_audit_log` (WPCOM, retries, verifier RPCs, config reloads) | +| `internal/eventstore/` | Event-sourced site state — manages `jetmon_events` + `jetmon_event_transitions` writes in single transactions | +| `internal/api/` | Internal REST API server (`/api/v1/...`) — auth, rate limiting, idempotency, sites/events/SLA/webhooks/alert-contacts handlers | +| `internal/apikeys/` | API key registry, sha256-hashed at rest; `./jetmon2 keys` CLI | +| `internal/webhooks/` | Webhook registry + delivery worker — outbound HMAC-signed POSTs of event transitions, retry ladder 1m/5m/30m/1h/6h | +| `internal/alerting/` | Alert contact registry + delivery worker — managed channels (email/PagerDuty/Slack/Teams) with site_filter + severity gate + per-hour rate cap | +| `internal/dashboard/` | Operator dashboard, SSE handler | +| `veriflier2/` | Go Veriflier binary | +| `docs/internal-api-reference.md` | Internal REST API reference (auth, all endpoints, payload shapes) | +| `docs/roadmap.md` | Deferred features and architectural roadmap (multi-binary split, public-API path) | +| `docs/adr/` | Architecture Decision Records — load-bearing decisions ("why is X like this") with context, decision, and consequences | +| `docs/project.md` | Full project description and feature specification | + +## Build and Run -See 
`config/config.readme` for detailed documentation of all options. +```bash +# Docker development (recommended) +cd docker && docker compose up -d # Start all services +docker compose up --build # Rebuild binary and start +docker compose down # Stop services +docker compose down -v # Stop and remove volumes (fresh start) + +# Build binaries directly +make all + +# Use a non-default Go binary when needed +make GO=/path/to/go all + +# Run tests +make test +make test-race +make lint + +# Run with race detector +go run -race ./cmd/jetmon2/ + +# Validate config +./jetmon2 validate-config + +# CLI subcommands +./jetmon2 version +./jetmon2 migrate +./jetmon2 status +./jetmon2 audit --blog-id 12345 --since 2h +./jetmon2 rollout guided +./jetmon2 rollout host-preflight +./jetmon2 rollout pinned-check +./jetmon2 rollout cutover-check +./jetmon2 rollout dynamic-check +./jetmon2 rollout projection-drift +./jetmon2 rollout state-report +./jetmon2 site-tenants import --file site-tenants.csv --dry-run +./jetmon2 drain +./jetmon2 reload +``` -## Key Files +## Configuration -| File | Purpose | -|------|---------| -| `lib/jetmon.js` | Master process orchestration | -| `lib/httpcheck.js` | Worker process HTTP checking | -| `lib/database.js` | MySQL queries and connection | -| `lib/comms.js` | HTTPS communication with Verifliers | -| `lib/wpcom.js` | WordPress.com API notifications | -| `lib/server.js` | SSL server for Veriflier responses | -| `lib/statsd.js` | StatsD metrics client | -| `src/http_checker.cpp` | C++ native addon for HTTP checks | -| `binding.gyp` | Node-gyp build configuration | +Copy `config/config-sample.json` to `config/config.json`. All keys from the original Jetmon are honoured; new keys are additive. Send SIGHUP to hot-reload config without restarting. 
+ +**Existing keys (unchanged behaviour):** +- `NUM_WORKERS`: Goroutine pool size (replaces worker process count) +- `NUM_TO_PROCESS`: Legacy compatibility setting retained so copied v1-style configs parse; it does not cap Go scheduler throughput +- `DATASET_SIZE`: Database fetch page size for scheduler work; the scheduler continues fetching pages until due work is drained +- `MIN_TIME_BETWEEN_ROUNDS_SEC`: Fixed-cadence full-fleet pass interval when `USE_VARIABLE_CHECK_INTERVALS` is false +- `NET_COMMS_TIMEOUT`: Default per-check HTTP timeout in seconds +- `PEER_OFFLINE_LIMIT`: Veriflier agreements required to confirm downtime +- `WORKER_MAX_MEM_MB`: Go runtime memory threshold that triggers worker-pool drain (replaces worker recycling) + +**New keys:** +- `BUCKET_TOTAL`: Total bucket range (e.g. 1000); replaces static `BUCKET_NO_MIN/MAX` +- `BUCKET_TARGET`: Maximum buckets this host should own +- `BUCKET_HEARTBEAT_GRACE_SEC`: Seconds before an unresponsive host's buckets are reclaimed (suggested: 2× round time) +- `PINNED_BUCKET_MIN/MAX`: Migration-only static bucket range for replacing one v1 host with one v2 host; disables `jetmon_hosts` dynamic ownership while set. Legacy `BUCKET_NO_MIN/MAX` are accepted as aliases for this mode. 
+- `ALERT_COOLDOWN_MINUTES`: Default cooldown between repeated alerts for the same site +- `LEGACY_STATUS_PROJECTION_ENABLE`: Keep v1 `site_status` / `last_status_change` projection updated during shadow-v2-state migration +- `LOG_FORMAT`: `text` (default, drop-in compatible) or `json` (structured logging) +- `USE_VARIABLE_CHECK_INTERVALS`: Respect per-site `check_interval`; the scheduler uses a short idle poll and maintained `jetmon_site_runtime.next_check_at` timestamps control which sites are ready in legacy round-scheduler mode +- `DASHBOARD_PORT`: Internal port for the operator dashboard (0 to disable) +- `DEBUG_PORT`: localhost-only pprof port, default 6060 (0 to disable; never exposed remotely) + +See `config/config.readme` for the full option reference. + +## Drop-in Compatibility Requirements + +These interfaces must remain identical to the original Jetmon. Do not change them without explicit discussion: + +| Interface | Constraint | +|-----------|-----------| +| MySQL schema | Read same columns; additive migrations only | +| WPCOM notification payload | Same JSON structure and field names | +| StatsD metric names | Same dotted paths; new metrics may be added | +| Log file paths and format | `logs/jetmon.log`, `logs/status-change.log` | +| `stats/` file outputs | `sitespersec`, `sitesqueue`, `totals` — same format | +| `config/config.json` keys | All existing keys honoured | +| SIGHUP config reload | Same behaviour | +| SIGINT graceful shutdown | Same behaviour | ## Site Status Values -- `0` SITE_DOWN: Local checks failed +- `0` SITE_DOWN: Local checks failed, retry/verification in progress - `1` SITE_RUNNING: Confirmed online -- `2` SITE_CONFIRMED_DOWN: Verified down by Verifliers +- `2` SITE_CONFIRMED_DOWN: Verified down by Verifliers, WPCOM notified -## Monitoring Behavior +## Monitoring Behaviour **Check Process:** -- Initial timeout: 10 seconds -- Verification timeout: 20 seconds (on retry from different locations) -- Max redirects: 3 (beyond this 
triggers "redirect" error) -- HTTP response code < 400 is considered success -- User Agent: `jetmon/1.0 (Jetpack Site Uptime Monitor by WordPress.com)` +- Default timeout: `NET_COMMS_TIMEOUT` seconds (configurable per site via `jetmon_site_check_config.timeout_seconds`) +- HTTP response code < 400 is success +- Redirect policy configurable per site: `follow` (default), `alert` (warn on chain change), `fail` +- Max redirects when following: 10 +- Keyword check: if `check_keyword` is set, GET the body and confirm the string is present +- User-Agent: `jetmon/2.0 (Jetpack Site Uptime Monitor by WordPress.com)` +- Per-site custom headers merged from `jetmon_site_check_config.custom_headers` + +**Timing Breakdown (via `net/http/httptrace`):** +Every check records composite RTT plus DNS lookup, TCP connect, TLS handshake, and first response byte (TTFB) timings. These samples are stored in `jetmon_check_history` for trending and API statistics. Scheduler-level StatsD metrics expose phase timing and write volume so capacity tests can separate check execution, freshness writes, check-history inserts, SSL expiry updates, and event handling. + +**SSL Monitoring:** +Every HTTPS check inspects `tls.ConnectionState` for: +- Certificate `NotAfter` — alerts at 30, 14, and 7 days before expiry +- TLS version — flags TLS 1.0/1.1 as deprecated +- Cipher suite — recorded in audit log **Downtime Verification:** -When a site appears down, Jetmon retries from the same location twice, then verifies from 2 other locations on different continents via Verifliers before confirming downtime. - -**Status Change Email Types:** -- `server`: 5xx response (internal/fatal error) -- `blocked`: 403 response (monitoring blocked) -- `client`: 4xx response other than 403 (auth/DNS issues) -- `https`: SSL certificate problems -- `intermittent`: Request timeout (>10 seconds but site may load) -- `redirect`: Too many redirects (>3) -- `success`: Normal response (used in "site is back up" emails) +1. 
Local check fails → open a `Seems Down` event (severity 3) and enter the local retry queue. The event opens on the **first** failure so `started_at` reflects the actual incident start. Subsequent failures during retry are no-ops on the events table (idempotent dedup). +2. After `NUM_OF_CHECKS` local failures → dispatch to Verifliers (event stays Seems Down) +3. `PEER_OFFLINE_LIMIT` Veriflier agreements required to confirm +4. Veriflier outcomes: + - **Confirms** → Promote event to `Down` (severity 4) with `reason = verifier_confirmed`. WPCOM notification via same payload as original. + - **Disagrees** → Close event with `resolution_reason = false_alarm`. +5. Recovery (any successful probe while an event is open): + - From `Seems Down` → close with `resolution_reason = probe_cleared`. + - From `Down` → close with `resolution_reason = verifier_cleared` and send recovery notification. + +Shadow-v2-state migration keeps incidents authoritative in `jetmon_events` + `jetmon_event_transitions` while `jetpack_monitor_sites` remains the v1-owned site identity/cadence/projection table. V2-only check config lives in `jetmon_site_check_config`. When `LEGACY_STATUS_PROJECTION_ENABLE` is true, the `jetpack_monitor_sites.site_status` / `last_status_change` projection is updated in the same transaction as every event mutation (no drift). v1 mapping: open Seems Down → `site_status = SITE_DOWN (0)`; promoted to Down → `site_status = SITE_CONFIRMED_DOWN (2)`; closed → `site_status = SITE_RUNNING (1)`. After legacy readers move to the v2 API/event tables, this projection can be disabled. + +**Alert Deduplication:** +After an alert fires, subsequent alerts for the same site are suppressed for the global `ALERT_COOLDOWN_MINUTES` value or `jetmon_site_check_config.alert_cooldown_minutes`. Suppression is recorded in the audit log. 
+ +**Status Change Types (unchanged):** +- `server`: 5xx response +- `blocked`: 403 response +- `client`: 4xx other than 403 +- `https`: SSL/TLS problems +- `intermittent`: Request timeout +- `redirect`: Redirect policy failure +- `success`: Site recovered ## Database Schema -Sites are stored in `jetpack_monitor_sites` with bucket-based sharding. The `bucket_no` field (0-511) enables horizontal scaling across multiple Jetmon instances. +Sites are stored in the v1-shaped `jetpack_monitor_sites` table with +bucket-based sharding. The `bucket_no` field enables horizontal scaling. Jetmon +v2 keeps v2-only site config and runtime state out of that legacy table: rich +probe config lives in `jetmon_site_check_config`, and freshness / SSL +observation state lives in `jetmon_site_runtime`. During rollout, v2 writes +only the v1 compatibility projection fields `site_status` and +`last_status_change` back to `jetpack_monitor_sites`. -## Metrics +New tables introduced by Jetmon 2: -StatsD metrics are sent with prefix `com.jetpack.jetmon.`. Key metrics include worker lifecycle events, queue sizes, database timing, and memory usage. +| Table | Purpose | +|-------|---------| +| `jetmon_hosts` | MySQL-coordinated bucket ownership and heartbeat | +| `jetmon_events` | Current state of every incident — one row per `(blog_id, endpoint_id, check_type, discriminator)` while open; mutable until `ended_at` is set, then frozen | +| `jetmon_event_transitions` | Append-only history of every mutation to `jetmon_events` (open, severity change, state change, cause link, close) | +| `jetmon_audit_log` | Operational trail — WPCOM notifications, retry dispatch, verifier RPCs, alert/maintenance suppression, config reloads. 
Site-state changes do **not** flow through here | +| `jetmon_check_history` | RTT and timing samples for trending | +| `jetmon_site_check_config` | V2-only per-site check policy/config: HEAD/GET mode, detection profile, keywords, maintenance windows, headers, timeout, redirect policy, cooldown | +| `jetmon_site_runtime` | V2-only runtime freshness and observation projection: last checked, next check, last alert, SSL expiry | +| `jetmon_false_positives` | Veriflier non-confirmation events | -**Grafana Dashboard:** Production metrics are visualized in the Jetmon Health Dashboard using Graphite as the StatsD backend. The dashboard tracks free/active workers, sites processed, round times, and memory usage. +## Multi-Host Bucket Coordination -**StatsD Configuration Notes:** -- Flush interval: 5 seconds (`STATS_UPDATE_INTERVAL_MS`) -- Graphite retention: 10s:6h, 1m:7d, 10m:5y -- Counter metrics use `sum` aggregation; gauges use `average` +Jetmon 2 normally replaces static `BUCKET_NO_MIN/MAX` config with runtime bucket ownership via the `jetmon_hosts` table. On startup, each instance claims unclaimed or expired bucket ranges using `SELECT ... FOR UPDATE` transactions. A heartbeat query runs each round; hosts with stale heartbeats (older than `BUCKET_HEARTBEAT_GRACE_SEC`) have their buckets absorbed by surviving peers. On SIGINT, the instance releases its buckets immediately. During the initial v1-to-v2 migration only, `PINNED_BUCKET_MIN/MAX` (or legacy `BUCKET_NO_MIN/MAX`) can pin one v2 host to its v1 predecessor's exact bucket range and disables `jetmon_hosts` ownership for that host. -## WPCOM Integration +This enables zero-config horizontal scaling (spin up a host, it claims buckets) and self-healing coverage (a failed host's buckets are absorbed within one grace period) without a cluster orchestrator. 
+ +## Metrics -**Jetmon Endpoint:** WPCOM receives status change notifications from Jetmon and triggers the `jetpack_monitor_site_status_change` hook for consumers (notifications, Activity Log, etc.). +StatsD metrics retain the same prefix and dotted path format as Jetmon 1: `com.jetpack.jetmon.`. New metrics added by Jetmon 2 follow the same naming convention and are additive. -**Email Notification Options (stored on WPCOM):** -- `jetpack_monitor_notifications_users_ids`: WPCOM user IDs to notify -- `jetpack_monitor_notify_email_addresses`: Additional email addresses +StatsD is the primary metrics transport. No Prometheus endpoint is provided. -**REST API Endpoints:** -- `GET /sites/{site}/jetpack-monitor-status`: Current monitoring status -- `GET /sites/{site}/jetpack-monitor-incidents`: Historical incidents -- `GET/POST /sites/{site}/jetpack-monitor-settings`: Monitor configuration +## WPCOM Integration + +Jetmon notifies WPCOM of status changes via the same JSON payload format as Jetmon 1. The `jetpack_monitor_site_status_change` hook on WPCOM is triggered for consumers (notifications, Activity Log, etc.). A circuit breaker protects against WPCOM API failures: after N consecutive failures the circuit opens, pending notifications are queued in memory, and retries are attempted on a backoff schedule. ## Production Deployment -Jetmon runs on 6 production hosts managed by the Systems team. To deploy changes: -1. Test changes locally using Docker environment -2. Create a Systems Request with PR links for review -3. Systems team deploys to production hosts +Jetmon runs on production hosts managed by the Systems team. To deploy changes: +1. Test locally using the Docker environment (`go test ./...`, manual Docker verification) +2. Create a PR and request a Systems Request with PR links +3. Systems team performs a rolling update: one host at a time, SIGINT → drain → deploy binary → restart +4. 
Surviving hosts absorb the draining host's buckets during each update window + +Rolling updates require no simultaneous restart of all hosts and leave no sites unchecked during the update. + +## Architectural Decisions — Event and State Model + +These decisions govern how Jetmon models site state. They must be maintained consistently across all changes. Full design rationale is in [`docs/taxonomy.md`](docs/taxonomy.md) (Parts 2–3) and [`docs/events.md`](docs/events.md). -## Worker Lifecycle +**Events are the source of truth.** Site status is event-sourced across two tables: `jetmon_events` (one row per incident, holding the current severity/state/metadata) and `jetmon_event_transitions` (append-only history of every mutation). The site row stores a denormalized projection for read performance. Update events, transitions, and the projection in the same transaction — they must not drift. If the projection is ever suspect, rebuild it from the events tables. + +**Every event mutation writes a transition row in the same transaction.** Open, severity bump, state change, cause-link change, close — no carve-outs. The `eventstore` package is the only writer for `jetmon_events` and `jetmon_event_transitions`; external callers must go through it. This keeps the invariant testable with one integration test surface. + +**Severity and state are separate fields.** Severity is numeric — use it for ordering, thresholds, and rollup. State is a human-readable label — use it for display and lifecycle transitions. A live event's severity can be updated in place without changing its state (a worsening degradation is not a new kind of problem). + +**"Seems Down" is a first-class lifecycle state.** Between first probe failure and verifier confirmation, a site is Seems Down. It is not an implementation detail — dashboards show it, alert rules can key off it. 
The lifecycle is: +``` +Up → Seems Down → Down → Resolved + ↓ + Up (false alarm) +``` -Workers exit and are respawned when: -- Memory exceeds `WORKER_MAX_MEM_MB` (53MB default) -- Check count exceeds `WORKER_MAX_CHECKS` (10,000 default) -- Process receives termination signal +**Events update in place on severity change.** When a Seems Down event is verifier-confirmed to Down, update the same event row — do not close and open a new one. The event's `started_at` stays at first-failure time. Incident duration is honest: it starts from first failure, not from confirmation. -The master process tracks worker states and gracefully handles recycling. +**Event identity is idempotent.** The same underlying failure must not produce duplicate events. Deduplication lives in the shared probe runner, not in individual check types. Key events by `(blog_id, endpoint_id, check_type, [discriminator])` so repeated detection of the same condition updates the existing open event. + +**Resolution reason is required on close.** When an event closes, record why: `verifier_cleared`, `false_alarm`, `manual_override`, `auto_timeout`. Don't just set `ended_at` — capture the cause. This affects uptime calculations and report accuracy. + +**Causal links are separate from hierarchical rollup.** An endpoint event rolling up to site level is a hierarchy relationship. A Layer-3 event caused by a Layer-1 failure is a causal relationship. Store these in separate structures. Conflating them creates bugs where dismissing a cause accidentally dismisses a rollup. + +**Unknown is not downtime.** If the probe crashes, a region loses network, or the Jetpack agent stops reporting, the result is Unknown — not Down. Monitor-side failures must never be reported as customer-site downtime. ## Known Pitfalls -**Retry Queue Persistence:** Retry queues must persist between rounds. 
Flushing queues at round start prevents sites from being confirmed as down, since the 1-minute recheck cannot complete before the next round. +**Retry Queue Persistence:** The local retry queue must persist between rounds. Do not flush it at round start — a site must accumulate `NUM_OF_CHECKS` failures before Veriflier escalation, and flushing resets that counter, preventing downtime confirmation. + +**Bucket Claiming Races:** When dynamic ownership is active, the `SELECT ... FOR UPDATE` transaction on `jetmon_hosts` is the only safe way to claim buckets. Do not claim buckets outside a transaction — two hosts starting simultaneously will both see the same unclaimed range and must not both write it. Pinned v1-to-v2 migration hosts intentionally do not claim buckets in `jetmon_hosts`. + +**Circuit Breaker Floor:** The WPCOM API circuit breaker queue is bounded. If the queue fills, the oldest pending notifications are dropped with an error log. Monitor the circuit breaker state in the operator dashboard during any WPCOM API incident. + +**Veriflier Quorum Floor:** When Verifliers are marked unhealthy and excluded, `PEER_OFFLINE_LIMIT` adjusts dynamically, but there is a configured floor to prevent a single healthy Veriflier from confirming downtime alone. Ensure the floor is set appropriately for the number of deployed Verifliers. + +**Delivery Ownership During Rollout:** Webhook and alert-contact workers claim delivery rows transactionally. Use `DELIVERY_OWNER_HOST` when you want to keep only one delivery owner active per database cluster during migration from embedded `jetmon2` delivery to standalone `jetmon-deliverer`. + +**Maintenance Windows:** Checks continue during a maintenance window and data is recorded in the audit log, but no alerts fire. Verify that `maintenance_end` is correctly set — an open-ended maintenance window silently suppresses all alerts for that site indefinitely. 
+ + +**Memory Pressure Drain:** If RSS exceeds the configured threshold, the goroutine pool shrinks by 10% via graceful drain. This reduces throughput temporarily. If memory pressure is sustained, investigate for goroutine leaks using the pprof endpoint at `http://localhost:6060/debug/pprof/` (or the configured `DEBUG_PORT`; localhost only) before increasing `WORKER_MAX_MEM_MB`. -**Bucket Configuration:** The `BUCKET_NO_MIN/MAX` configuration must not overlap between hosts. A past misconfiguration caused hosts to process only half their intended sites, masking performance issues. +## Agent Workflow Notes -**Node Version Sensitivity:** RTT (round-trip time) calculations can vary between Node.js versions. Version changes should be tested thoroughly as they can affect timeout behaviors. +These notes are for Codex and other coding agents working for Chris. -**Memory Pressure:** When checking more sites (due to shorter intervals or configuration fixes), memory usage increases. Monitor memory metrics and consider scaling hosts horizontally if workers frequently hit memory limits. +- If uptime-bench or Jetmon capacity tests are running, do not change deployed +  services, support hosts, databases, provider state, fleet config, or runtime +  config without explicit permission. +- When a request could touch both `jetmon` and `uptime-bench`, state the repo +  path before acting. Treat "this repo" as ambiguous when multiple agents or +  worktrees are active. +- Prefer local analysis, agent files, branch inspection, code review, and +  handoff preparation while tests are active. +- Project-local agent playbooks live under `.agents/skills`. +- For uptime-bench-specific report or fleet rules, also read +  `/home/gaarai/code/uptime-bench/AGENTS.md`. 
diff --git a/Makefile b/Makefile new file mode 100644 index 00000000..3f60c7a1 --- /dev/null +++ b/Makefile @@ -0,0 +1,155 @@ +BINARY := bin/jetmon2 +DELIVERER := bin/jetmon-deliverer +VERIFLIER := bin/veriflier2 +API_SMOKE_BATCH ?= local-smoke +API_SMOKE_ARGS ?= +API_VALIDATE_BATCH ?= api-cli-validate +API_VALIDATE_COUNT ?= 1 +API_VALIDATE_MODE ?= http-500 +API_VALIDATE_WAIT ?= 30s +API_VALIDATE_WEBHOOK_WAIT ?= 60s +API_VALIDATE_SKIP_WEBHOOK ?= 0 +API_VALIDATE_SKIP_FAILURE ?= 0 +DOCKER_COMPOSE ?= docker compose -f docker/docker-compose.yml +API_CLI_TOKEN_CONSUMER ?= api-cli +API_CLI_TOKEN_SCOPE ?= admin +API_CLI_TOKEN_CREATED_BY ?= docker-local +API_CLI_TOKEN_TTL ?= 0 +API_CLI_TOKEN_ID ?= +ROLLOUT_VM_LAB_HOST ?= jetmon-vm-host-1 +ROLLOUT_VM_LAB_SSH ?= ssh -F $(HOME)/.ssh/config -o ControlMaster=no -o ControlPath=none -o BatchMode=yes -o ConnectTimeout=10 +ROLLOUT_VM_LAB_SNAPSHOT ?= pre-guided-flow +GO ?= $(shell if command -v go >/dev/null 2>&1; then command -v go; elif [ -x /usr/local/go/bin/go ]; then printf /usr/local/go/bin/go; else printf go; fi) +GOCACHE ?= /tmp/jetmon-go-cache +GOMODCACHE ?= /tmp/jetmon-gomod-cache +GO_ENV := GOCACHE=$(GOCACHE) GOMODCACHE=$(GOMODCACHE) +BUILD_FLAGS := -ldflags "-X main.version=$(shell git describe --tags --always --dirty) \ + -X main.buildDate=$(shell date -u +%Y-%m-%dT%H:%M:%SZ) \ + -X main.goVersion=$(shell $(GO) version | awk '{print $$3}')" + +.PHONY: all build build-deliverer build-veriflier generate test test-race test-veriflier-soak lint vet rollout-docs-verify rollout-rehearsal-verify rollout-vm-lab-sync rollout-vm-lab-sync-artifacts rollout-vm-lab-stage-v2 rollout-vm-lab-doctor rollout-vm-lab-prepare rollout-vm-lab-smoke rollout-vm-lab-execute-smoke rollout-vm-lab-failure-smoke rollout-vm-lab-resume-smoke rollout-vm-lab-post-start-rollback-smoke rollout-vm-lab-bad-ssh-smoke rollout-vm-lab-v2-start-failure-smoke rollout-vm-lab-runtime-guard-smoke rollout-vm-lab-real-activity-smoke 
rollout-vm-lab-snapshot-execute-smoke rollout-vm-lab-snapshot-all-smoke api-cli-smoke api-cli-validate api-cli-token-create api-cli-token-list api-cli-token-revoke clean + +all: build build-deliverer build-veriflier + +build: + mkdir -p bin + $(GO_ENV) CGO_ENABLED=0 $(GO) build $(BUILD_FLAGS) -o $(BINARY) ./cmd/jetmon2/ + +build-deliverer: + mkdir -p bin + $(GO_ENV) CGO_ENABLED=0 $(GO) build $(BUILD_FLAGS) -o $(DELIVERER) ./cmd/jetmon-deliverer/ + +build-veriflier: + mkdir -p bin + $(GO_ENV) CGO_ENABLED=0 $(GO) build $(BUILD_FLAGS) -o $(VERIFLIER) ./veriflier2/cmd/ + + +generate: + protoc --go_out=. --go_opt=paths=source_relative \ + --go-grpc_out=. --go-grpc_opt=paths=source_relative \ + proto/veriflier.proto + +test: + $(GO_ENV) $(GO) test ./... + +test-race: + $(GO_ENV) $(GO) test -race ./... + +test-veriflier-soak: + $(GO_ENV) $(GO) test ./internal/veriflier ./cmd/jetmon2 -run 'Test(V2Soak|VeriflierDiscoverySoak)' + +lint: + $(GO_ENV) $(GO) vet ./... + +vet: lint + +rollout-docs-verify: all test lint + scripts/rollout-docs-verify.sh + +rollout-rehearsal-verify: build + scripts/rollout-rehearsal-verify.sh + +rollout-vm-lab-sync: + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'mkdir -p ~/jetmon-rollout-tools/scripts ~/jetmon-rollout-tools/docs' + rsync -e "$(ROLLOUT_VM_LAB_SSH)" -a scripts/rollout-vm-lab.sh $(ROLLOUT_VM_LAB_HOST):~/jetmon-rollout-tools/scripts/ + rsync -e "$(ROLLOUT_VM_LAB_SSH)" -a docs/rollout-vm-lab.md $(ROLLOUT_VM_LAB_HOST):~/jetmon-rollout-tools/docs/ + +rollout-vm-lab-sync-artifacts: build rollout-vm-lab-sync + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'mkdir -p ~/jetmon-rollout-tools/bin ~/jetmon-rollout-tools/systemd ~/jetmon-rollout-tools/config' + rsync -e "$(ROLLOUT_VM_LAB_SSH)" -a bin/jetmon2 $(ROLLOUT_VM_LAB_HOST):~/jetmon-rollout-tools/bin/ + rsync -e "$(ROLLOUT_VM_LAB_SSH)" -a systemd/jetmon2.service systemd/jetmon2-logrotate $(ROLLOUT_VM_LAB_HOST):~/jetmon-rollout-tools/systemd/ + rsync -e "$(ROLLOUT_VM_LAB_SSH)" -a 
config/config-sample.json config/db-config-sample.conf $(ROLLOUT_VM_LAB_HOST):~/jetmon-rollout-tools/config/ + +rollout-vm-lab-stage-v2: rollout-vm-lab-sync-artifacts + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh install-v2' + +rollout-vm-lab-doctor: rollout-vm-lab-sync + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh doctor' + +rollout-vm-lab-prepare: rollout-vm-lab-sync-artifacts + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh prepare-topology' + +rollout-vm-lab-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-preflight' + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-guided-dry-run' + +rollout-vm-lab-execute-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-guided-execute-rollback' + +rollout-vm-lab-failure-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-failure-gates' + +rollout-vm-lab-resume-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-interrupted-resume' + +rollout-vm-lab-post-start-rollback-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-post-start-rollback' + +rollout-vm-lab-bad-ssh-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-bad-ssh' + +rollout-vm-lab-v2-start-failure-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && 
scripts/rollout-vm-lab.sh smoke-v2-start-failure' + +rollout-vm-lab-runtime-guard-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-runtime-guards' + +rollout-vm-lab-real-activity-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh smoke-real-activity' + +rollout-vm-lab-snapshot-execute-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh snapshot-run $(ROLLOUT_VM_LAB_SNAPSHOT) execute-rollback' + +rollout-vm-lab-snapshot-all-smoke: rollout-vm-lab-stage-v2 + $(ROLLOUT_VM_LAB_SSH) $(ROLLOUT_VM_LAB_HOST) 'cd ~/jetmon-rollout-tools && scripts/rollout-vm-lab.sh snapshot-run-all $(ROLLOUT_VM_LAB_SNAPSHOT)' + +api-cli-smoke: build + @test -n "$$JETMON_API_TOKEN" || { echo "JETMON_API_TOKEN is required"; exit 1; } + $(BINARY) api health --pretty + $(BINARY) api me --pretty + $(BINARY) api sites bulk-add --count 3 --batch $(API_SMOKE_BATCH) --dry-run --pretty + $(BINARY) api smoke --batch $(API_SMOKE_BATCH) --pretty $(API_SMOKE_ARGS) + +api-cli-validate: build + API_CLI_BINARY=$(BINARY) \ + API_VALIDATE_BATCH=$(API_VALIDATE_BATCH) \ + API_VALIDATE_COUNT=$(API_VALIDATE_COUNT) \ + API_VALIDATE_MODE=$(API_VALIDATE_MODE) \ + API_VALIDATE_WAIT=$(API_VALIDATE_WAIT) \ + API_VALIDATE_WEBHOOK_WAIT=$(API_VALIDATE_WEBHOOK_WAIT) \ + API_VALIDATE_SKIP_WEBHOOK=$(API_VALIDATE_SKIP_WEBHOOK) \ + API_VALIDATE_SKIP_FAILURE=$(API_VALIDATE_SKIP_FAILURE) \ + scripts/api-cli-validate.sh + +api-cli-token-create: + $(DOCKER_COMPOSE) exec jetmon ./jetmon2 keys create --consumer $(API_CLI_TOKEN_CONSUMER) --scope $(API_CLI_TOKEN_SCOPE) --ttl $(API_CLI_TOKEN_TTL) --created-by $(API_CLI_TOKEN_CREATED_BY) + +api-cli-token-list: + $(DOCKER_COMPOSE) exec jetmon ./jetmon2 keys list + +api-cli-token-revoke: + @test -n "$(API_CLI_TOKEN_ID)" || { echo 
"API_CLI_TOKEN_ID is required"; exit 1; } + $(DOCKER_COMPOSE) exec jetmon ./jetmon2 keys revoke $(API_CLI_TOKEN_ID) + +clean: + rm -f $(BINARY) $(DELIVERER) $(VERIFLIER) diff --git a/README.md b/README.md index b6c90975..c99caed9 100644 --- a/README.md +++ b/README.md @@ -1,97 +1,177 @@ -jetmon.js -========= - -Overview --------- - -Parallel HTTP health monitoring using HEAD requests for large scale website monitoring. - -The service relies on confirmation from external servers to verify that sites are indeed offline. This mitigates the Internet weather issue sometimes giving false positives. The code for these servers can be found in the verifliers directory. - -Architecture --------- -![jetmon_chart](https://user-images.githubusercontent.com/1758399/201877599-8992b68a-9ca7-4984-9de7-abe99f989d88.png) - -Jetmon will periodically (every 5 minutes) loop over a list of Jetpack sites and perform a HEAD request to check their current status. - -When a status change is detected, Jetmon will notify WPCOM including the related notification data in the request. - -Here are the possible flows, depending on the status change: - -| Previous Status | Current status | Action | -| ---------------- | ---------------- | ---------------------------------------------------------------------------------- | -| DOWN | UP | Notify WPCOM about status change | -| UP | DOWN | Verify status down via the Veriflier services and notify WPCOM about status change | -| DOWN | DOWN (confirmed) | Notify WPCOM about status change | - -### Jetmon service - -The Jetmon master service is responsible for communicating with the database in order to fetch a list of sites to check. It will spawn and re-allocate workers every five seconds and update stats repeatedly based on `STATS_UPDATE_INTERVAL_MS`. - -The jetmon-workers internally use an Node Addon written in C++ to check the connection by sending a HEAD request to the server. 
- - -### Verifliers - -The Veriflier service, which is written in C++ and uses the QT Framework, does something similar to the Node Addon mentioned before, but lives in its own server. Note that the production environment consists of multiple Verifliers, though the local development environment consists of a single Veriflier service. - -### Notification data - -Here are the current notification data, Jetmon sends to WPCOM upon detecting a site status change: -- `blog_id`: The site's WPCOM ID -- `monitor_url`: The URL Jetmon checked -- `status_id`: The site's current status. Enum: `0` is status down, `1` is status running and `2` status confirmed down. -- `last_check`: The datetime of the last check -- `last_status_change`: The datetime of the last status change -- `checks`: An array of the checks results from both Jetmon and Veriflier services. Each entry consists of: - - `type`: Enum: `1` refers to a Jetmon check, while `2` to a Veriflier check. - - `host`: The server hostname. - - `status`: The site's current status. Enum: `0` is status down, `1` is status running and `2` status confirmed down. - - `rtt`: Round-trip time (RTT) in milliseconds (ms). - - `code`: The HTTP response status code. - - -Installation ------------- - -1) Make sure you have installed [Docker](https://docs.docker.com/get-docker/) and [docker-compose](https://docs.docker.com/compose/install/) - -2) Clone the Jetmon monorepo - -3) Copy the environment variables file from within the `docker` folder: `cp jetmon/docker/.env-sample jetmon/docker/.env` - -4) Open `jetmon/docker/.env` and make any modifications you'd like. - -5) Run `docker compose build` from within the `docker` folder - -Configuration -------------- - -The Jetmon configuration lives under `config/config.json`. This file is generated on the fly, if not present, each time you run the Jetmon service, using the `config-sample.json` and the corresponding environment variables defined in `docker/.env`. 
-Feel free to modify your local config file as needed. - -The Veriflier configuration lives under `veriflier/config/veriflier.json`. This file is generated on the fly, if not present, each time you run the Veriflier service, using the `veriflier-sample.json` and the corresponding environment variables defined in `docker/.env`. - -Running -------- - -Run `docker compose up -d` from within the `docker` folder. - -Database -------- - -Main Table Schema: - - CREATE TABLE `jetpack_monitor_sites` ( - `jetpack_monitor_site_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY, - `blog_id` bigint(20) unsigned NOT NULL, - `bucket_no` smallint(2) unsigned NOT NULL, - `monitor_url` varchar(300) NOT NULL, - `monitor_active` tinyint(1) unsigned NOT NULL DEFAULT 1, - `site_status` tinyint(1) unsigned NOT NULL DEFAULT 1, - `last_status_change` timestamp NULL DEFAULT current_timestamp(), - `check_interval` tinyint(1) unsigned NOT NULL DEFAULT 5, - INDEX `blog_id_monitor_url` (`blog_id`, `monitor_url`), - INDEX `bucket_no_monitor_active_check_interval` (`bucket_no`, `monitor_active`, `check_interval`) - ); +# Jetmon 2 +Jetmon 2 is the Go rewrite of Jetpack's uptime monitor: the same production +contract v1 consumers depend on, with a cleaner runtime, an event-sourced health +model, richer diagnostics, and API-first automation. + +The core detection story stays familiar: + +```text +local checks -> local retries -> geo Veriflier confirmation -> notify +``` + +The first difference is correctness: v2 checks sites with `GET`, not the +`HEAD`-only probes that made v1 disagree with real visitor behavior on too many +VIP and Agency sites. Around that more realistic probe, Jetmon 2 records what +it saw, why it believed a site was down, which Verifliers agreed, which +notifications were sent, and how every incident changed over time. It turns "up +or down" into an auditable health platform. 
+ +## Why This Matters + +| Audience | What Gets Better | +|---|---| +| Systems | Static Go binaries, no `npm`, `node-gyp`, Qt, or worker process tree. Bucket ownership is coordinated in MySQL, hosts drain cleanly, and memory pressure is handled inside the goroutine pool. | +| VIP and Agency | GET-based checks that match customer-visible behavior better than v1's HEAD probes, plus fewer noisy pages and fewer missed incidents through local retries, Veriflier quorum, maintenance windows, keyword checks, redirect policy, SSL/TLS checks, and clearer failure classes. | +| Leadership | A foundation for differentiated uptime monitoring: internal API, webhooks, managed alert contacts, tenant-aware gateway paths, and future Jetpack/WPCOM integrations. | +| Happiness Engineers | Incident answers with evidence: audit logs, event transitions, check timing, Veriflier votes, WPCOM payloads, and suppression reasons are all queryable. | +| Jetpack | A monitor that can grow into a product surface, not just a backend notification hook. 
| + +## What Changed + +| Area | Jetmon 1 | Jetmon 2 | +|---|---|---| +| Runtime | Node master, Node workers, C++ native addon, Qt Veriflier | Go monitor, Go Veriflier, optional Go deliverer | +| Probe method | `HEAD` requests that could disagree with real page loads | `GET` requests for local checks and Veriflier checks | +| State | Mutable `site_status` projection | `jetmon_events` plus append-only `jetmon_event_transitions` | +| Detection | Binary status changes | `Seems Down`, `Down`, recovery, false-alarm, and severity transitions | +| Evidence | Basic logs | Audit log, check history, timing breakdown, verifier outcomes, API request logs | +| Integrations | WPCOM notification path | WPCOM, REST API, HMAC webhooks, email, PagerDuty, Slack, Teams | +| Operations | Static bucket config and process recycling | Dynamic bucket ownership, graceful drain, hot reload, dashboard, pprof | + +Jetmon 2 keeps the compatibility surfaces that matter during rollout: + +- MySQL changes are additive. +- WPCOM notification payloads stay compatible. +- StatsD metric naming remains `com.jetpack.jetmon.`. +- Legacy log and stats file paths remain available. +- `jetpack_monitor_sites.site_status` can be projected from v2 events during + the [v1-to-v2 migration](docs/v1-to-v2-migration.md). + +## How Incidents Flow + +1. The monitor checks active sites with a bounded Go worker pool. +2. A first local failure opens a `Seems Down` event so the incident start time is + honest. +3. Local retries absorb one-off network blips before customer notification. +4. Geo-distributed Verifliers confirm or reject the outage. +5. Confirmed outages become `Down`; rejected outages close as false alarms. +6. WPCOM, webhooks, alert contacts, the dashboard, and the API all read from the + same event and transition history. + +That model gives operators and support teams the part v1 could not: a coherent +timeline for every incident, not just the final status bit. 
+ +## Try It Locally + +Docker Compose is the fastest path for local development: + +```bash +cd docker +cp .env-sample .env +docker compose up --build -d +``` + +Build and test from the repository root: + +```bash +make all +make test +make test-race +``` + +The API CLI can exercise the internal REST API and local failure fixture: + +```bash +make build +make api-cli-token-create + +export JETMON_API_URL=http://localhost:${API_HOST_PORT:-8090} +export JETMON_API_TOKEN=jm_replace_with_the_printed_token + +./bin/jetmon2 api health --pretty +./bin/jetmon2 api commands --output table +make api-cli-smoke +``` + +See [docs/getting-started.md](docs/getting-started.md) for the full local loop. + +## Documentation + +| Document | Start Here For | +|---|---| +| [docs/project.md](docs/project.md) | Full product and implementation specification | +| [docs/internal-api-reference.md](docs/internal-api-reference.md) | Internal REST API reference | +| [docs/events.md](docs/events.md) | Event lifecycle and transition semantics | +| [docs/taxonomy.md](docs/taxonomy.md) | Severity, state, cause, and rollup taxonomy | +| [docs/getting-started.md](docs/getting-started.md) | Docker setup, builds, tests, API CLI smoke runs | +| [docs/docker-images.md](docs/docker-images.md) | Pulling and running the published GHCR images | +| [docs/operations-guide.md](docs/operations-guide.md) | Production config, rollout, delivery workers, metrics, debugging | +| [docs/data-model.md](docs/data-model.md) | Tables, migrations, event projection, tenant mapping | +| [docs/support-guide.md](docs/support-guide.md) | HE workflows for explaining alerts and missed alerts | +| [docs/api-cli-guide.md](docs/api-cli-guide.md) | API CLI examples and automation patterns | +| [docs/v1-to-v2-migration.md](docs/v1-to-v2-migration.md) | Full v1-to-v2 production migration and rollback runbook | +| [docs/jetmon-deliverer-rollout.md](docs/jetmon-deliverer-rollout.md) | Moving outbound delivery to `jetmon-deliverer` | +| 
[docs/roadmap.md](docs/roadmap.md) | Broader v2 and v3 planning | + +Longer design decisions live in [docs/adr/](docs/adr/). + +## Production Posture + +Jetmon 2 is designed for a cautious host-by-host rollout. The complete process +is in [docs/v1-to-v2-migration.md](docs/v1-to-v2-migration.md). Use +[docs/rollout-quick-reference.md](docs/rollout-quick-reference.md) as the +one-page command checklist during rehearsals and rollout windows: + +- Run `./jetmon2 migrate` before first start. Migrations are embedded and + additive. +- Run `./jetmon2 validate-config` before deploy to check config shape, + database connectivity, email transport mode, verifier config, and rollout + safety commands. +- Use pinned bucket mode for the first v1-to-v2 migration so one v1 host can be + replaced by one v2 host with the same bucket range. +- Prefer `rollout guided` during production rollout windows so operators get a + transcript, resume state, typed confirmations, and fail-closed rollout gates. + Run it from the staged v2 runtime host. For fresh-server takeovers, that + runtime host must have SSH access to the old v1 host when the configured v1 + stop/start commands use SSH. + Use `rollout static-plan-check`, `rollout host-preflight`, + `rollout cutover-check`, `rollout rollback-check`, and targeted + `rollout activity-check` / `rollout projection-drift` from the migration + runbook before changing the next host. Use `rollout state-report` for a + quick handoff snapshot. +- Keep `LEGACY_STATUS_PROJECTION_ENABLE` on until legacy readers have moved to + the v2 API or event tables. +- Use `SIGINT` or `./jetmon2 drain` for graceful shutdown. +- Use `SIGHUP` or `./jetmon2 reload` for config reload without restart. +- Use the host dashboard at `/` and the fleet dashboard at `/fleet` during + rollout windows. Keep `DASHBOARD_BIND_ADDR` on loopback unless the listener is + protected by trusted operator-network controls. 
+ +After the fleet is fully on v2, dynamic bucket ownership lets surviving hosts +absorb work during rolling updates. + +## Main Binaries + +| Binary | Purpose | +|---|---| +| `bin/jetmon2` | Monitor, orchestrator, REST API, dashboard, embedded delivery workers | +| `bin/veriflier2` | Remote confirmation worker used by the monitor | +| `bin/jetmon-deliverer` | Standalone webhook and alert-contact delivery worker | + +## Development Commands + +```bash +make all # Build jetmon2, jetmon-deliverer, and veriflier2 +make build # Build only jetmon2 +make build-deliverer # Build only jetmon-deliverer +make build-veriflier # Build only veriflier2 +make test # Run the Go test suite +make test-race # Run tests with the race detector +make lint # Run lint checks +make rollout-docs-verify # Verify rollout docs/tooling alignment +``` + +`make generate` is intentionally separate. It requires `protoc` and Go protobuf +plugins, and the generated stubs are not part of the production JSON-over-HTTP +Veriflier transport. 
diff --git a/binding.gyp b/binding.gyp deleted file mode 100644 index 7e0e2186..00000000 --- a/binding.gyp +++ /dev/null @@ -1,17 +0,0 @@ -{ - 'targets':[ { - 'target_name':'jetmon', - 'cflags_cc': [ '-fexceptions','-O3' ], - 'sources':[ - './src/main.cpp', - './src/http_checker.cpp', - ], - 'conditions': [ - ['node_shared_openssl=="false"', { - 'include_dirs': [ - '<(node_root_dir)/deps/openssl/openssl/include' - ], - }] - ] - } ] -} diff --git a/cmd/jetmon-deliverer/delivery_check.go b/cmd/jetmon-deliverer/delivery_check.go new file mode 100644 index 00000000..4fdbe0a0 --- /dev/null +++ b/cmd/jetmon-deliverer/delivery_check.go @@ -0,0 +1,416 @@ +package main + +import ( + "context" + "database/sql" + "encoding/json" + "errors" + "flag" + "fmt" + "io" + "os" + "strings" + "text/tabwriter" + "time" + + "github.com/Automattic/jetmon/internal/config" + "github.com/Automattic/jetmon/internal/db" +) + +const deliveryCheckDefaultSince = "15m" + +type deliveryCheckOptions struct { + HostOverride string + Since string + Output string + MaxPending int64 + MaxDue int64 + MaxAbandoned int64 + MaxFailed int64 + RequireRecentDelivery bool + RequireRecentWebhookDelivery bool + RequireRecentAlertDelivery bool +} + +type deliveryTableSummary struct { + Kind string `json:"kind"` + Pending int64 `json:"pending"` + DueNow int64 `json:"due_now"` + FutureRetry int64 `json:"future_retry"` + DeliveredSince int64 `json:"delivered_since"` + AbandonedSince int64 `json:"abandoned_since"` + FailedSince int64 `json:"failed_since"` + OldestPendingAgeSec int64 `json:"oldest_pending_age_sec"` + OldestDueAgeSec int64 `json:"oldest_due_age_sec"` +} + +type deliveryCheckReport struct { + OK bool `json:"ok"` + Host string `json:"host"` + GeneratedAt time.Time `json:"generated_at"` + Since time.Time `json:"since"` + OwnerLevel string `json:"owner_level,omitempty"` + OwnerMessage string `json:"owner_message,omitempty"` + Tables []deliveryTableSummary `json:"tables"` + Total deliveryTableSummary 
`json:"total"` + Failures []string `json:"failures,omitempty"` +} + +func parseDeliveryCheckOptions(args []string) (deliveryCheckOptions, error) { + opts := deliveryCheckOptions{ + Since: deliveryCheckDefaultSince, + Output: "text", + MaxPending: -1, + MaxDue: -1, + MaxAbandoned: -1, + MaxFailed: -1, + } + fs := flag.NewFlagSet("delivery-check", flag.ContinueOnError) + fs.SetOutput(io.Discard) + fs.StringVar(&opts.HostOverride, "host", "", "host id to use for DELIVERY_OWNER_HOST context (default current hostname)") + fs.StringVar(&opts.Since, "since", deliveryCheckDefaultSince, "report cutoff as duration like 15m or RFC3339 timestamp") + fs.StringVar(&opts.Output, "output", "text", "output format: text or json") + fs.Int64Var(&opts.MaxPending, "max-pending", -1, "fail when total pending deliveries exceed this count (-1 disables)") + fs.Int64Var(&opts.MaxDue, "max-due", -1, "fail when total due deliveries exceed this count (-1 disables)") + fs.Int64Var(&opts.MaxAbandoned, "max-abandoned", -1, "fail when abandoned deliveries since cutoff exceed this count (-1 disables)") + fs.Int64Var(&opts.MaxFailed, "max-failed", -1, "fail when failed deliveries since cutoff exceed this count (-1 disables)") + fs.BoolVar(&opts.RequireRecentDelivery, "require-recent-delivery", false, "fail unless at least one delivery succeeded since cutoff") + fs.BoolVar(&opts.RequireRecentWebhookDelivery, "require-recent-webhook-delivery", false, "fail unless at least one webhook delivery succeeded since cutoff") + fs.BoolVar(&opts.RequireRecentAlertDelivery, "require-recent-alert-delivery", false, "fail unless at least one alert-contact delivery succeeded since cutoff") + if err := fs.Parse(args); err != nil { + return opts, err + } + if fs.NArg() != 0 { + return opts, fmt.Errorf("unexpected argument %q", fs.Arg(0)) + } + opts.Output = strings.ToLower(strings.TrimSpace(opts.Output)) + if opts.Output != "text" && opts.Output != "json" { + return opts, fmt.Errorf("--output must be text or json") + 
} + if opts.MaxPending < -1 { + return opts, fmt.Errorf("--max-pending must be >= 0, or -1 to disable") + } + if opts.MaxDue < -1 { + return opts, fmt.Errorf("--max-due must be >= 0, or -1 to disable") + } + if opts.MaxAbandoned < -1 { + return opts, fmt.Errorf("--max-abandoned must be >= 0, or -1 to disable") + } + if opts.MaxFailed < -1 { + return opts, fmt.Errorf("--max-failed must be >= 0, or -1 to disable") + } + return opts, nil +} + +func cmdDeliveryCheck(args []string) { + opts, err := parseDeliveryCheckOptions(args) + if err != nil { + fmt.Fprintln(os.Stderr, "usage: jetmon-deliverer delivery-check [--host=] [--since=15m] [--max-pending=N] [--max-due=N] [--max-abandoned=N] [--max-failed=N] [--require-recent-delivery] [--require-recent-webhook-delivery] [--require-recent-alert-delivery] [--output=text|json]") + fmt.Fprintf(os.Stderr, "FAIL %v\n", err) + os.Exit(2) + } + emitProgress := opts.Output != "json" + + configPath := envOrDefault("JETMON_CONFIG", "config/config.json") + if err := config.Load(configPath); err != nil { + fmt.Fprintf(os.Stderr, "FAIL config parse: %v\n", err) + os.Exit(1) + } + if emitProgress { + fmt.Println("PASS config parse") + } + + config.LoadDB() + if err := db.ConnectWithRetry(3); err != nil { + fmt.Fprintf(os.Stderr, "FAIL db connect: %v\n", err) + os.Exit(1) + } + if emitProgress { + fmt.Println("PASS db connect") + } + + hostID := strings.TrimSpace(opts.HostOverride) + if hostID == "" { + hostID = db.Hostname() + } + report, err := buildDeliveryCheckReport(context.Background(), db.DB(), config.Get(), hostID, opts, time.Now().UTC()) + if err != nil { + fmt.Fprintf(os.Stderr, "FAIL delivery check: %v\n", err) + os.Exit(1) + } + if err := renderDeliveryCheckReport(os.Stdout, report, opts.Output); err != nil { + fmt.Fprintf(os.Stderr, "FAIL render delivery check: %v\n", err) + os.Exit(1) + } + if !report.OK { + os.Exit(1) + } +} + +func buildDeliveryCheckReport(ctx context.Context, conn *sql.DB, cfg *config.Config, hostID 
string, opts deliveryCheckOptions, now time.Time) (deliveryCheckReport, error) { + if conn == nil { + return deliveryCheckReport{}, errors.New("database handle is nil") + } + now = now.UTC() + cutoff, err := resolveDeliveryCheckCutoff(now, opts.Since) + if err != nil { + return deliveryCheckReport{}, err + } + hostID = strings.TrimSpace(hostID) + + report := deliveryCheckReport{ + Host: hostID, + GeneratedAt: now, + Since: cutoff, + Total: deliveryTableSummary{Kind: "total"}, + } + if cfg != nil { + report.OwnerLevel, report.OwnerMessage = deliveryOwnerStatus(cfg, hostID) + } + + tables := []struct { + kind string + name string + }{ + {kind: "webhook", name: "jetmon_webhook_deliveries"}, + {kind: "alert", name: "jetmon_alert_deliveries"}, + } + for _, table := range tables { + summary, err := queryDeliveryTableSummary(ctx, conn, table.kind, table.name, now, cutoff) + if err != nil { + return deliveryCheckReport{}, err + } + report.Tables = append(report.Tables, summary) + report.Total.Pending += summary.Pending + report.Total.DueNow += summary.DueNow + report.Total.FutureRetry += summary.FutureRetry + report.Total.DeliveredSince += summary.DeliveredSince + report.Total.AbandonedSince += summary.AbandonedSince + report.Total.FailedSince += summary.FailedSince + report.Total.OldestPendingAgeSec = maxInt64(report.Total.OldestPendingAgeSec, summary.OldestPendingAgeSec) + report.Total.OldestDueAgeSec = maxInt64(report.Total.OldestDueAgeSec, summary.OldestDueAgeSec) + } + + report.Failures = evaluateDeliveryCheckFailures(report, opts) + report.OK = len(report.Failures) == 0 + return report, nil +} + +func resolveDeliveryCheckCutoff(now time.Time, raw string) (time.Time, error) { + raw = strings.TrimSpace(raw) + if raw == "" { + return time.Time{}, errors.New("--since must not be empty") + } + if d, err := time.ParseDuration(raw); err == nil { + if d <= 0 { + return time.Time{}, errors.New("--since duration must be > 0") + } + return now.Add(-d).UTC(), nil + } + cutoff, 
err := time.Parse(time.RFC3339, raw) + if err != nil { + return time.Time{}, fmt.Errorf("--since must be a duration or RFC3339 timestamp") + } + if cutoff.After(now) { + return time.Time{}, errors.New("--since timestamp must not be in the future") + } + return cutoff.UTC(), nil +} + +func queryDeliveryTableSummary(ctx context.Context, conn *sql.DB, kind, table string, now, cutoff time.Time) (deliveryTableSummary, error) { + switch table { + case "jetmon_webhook_deliveries", "jetmon_alert_deliveries": + default: + return deliveryTableSummary{}, fmt.Errorf("unsupported delivery table %q", table) + } + + summary := deliveryTableSummary{Kind: kind} + + pendingQuery := fmt.Sprintf(` + SELECT COUNT(*), + COALESCE(TIMESTAMPDIFF(SECOND, MIN(created_at), ?), 0) + FROM %s + WHERE status = 'pending'`, table) + if err := conn.QueryRowContext(ctx, pendingQuery, now).Scan( + &summary.Pending, + &summary.OldestPendingAgeSec, + ); err != nil { + return deliveryTableSummary{}, fmt.Errorf("%s pending delivery summary: %w", kind, err) + } + + dueQuery := fmt.Sprintf(` + SELECT COUNT(*), + COALESCE(TIMESTAMPDIFF(SECOND, MIN(COALESCE(next_attempt_at, created_at)), ?), 0) + FROM %s + WHERE status = 'pending' + AND (next_attempt_at IS NULL OR next_attempt_at <= ?)`, table) + if err := conn.QueryRowContext(ctx, dueQuery, now, now).Scan( + &summary.DueNow, + &summary.OldestDueAgeSec, + ); err != nil { + return deliveryTableSummary{}, fmt.Errorf("%s due delivery summary: %w", kind, err) + } + + futureQuery := fmt.Sprintf(` + SELECT COUNT(*) + FROM %s + WHERE status = 'pending' + AND next_attempt_at > ?`, table) + if err := conn.QueryRowContext(ctx, futureQuery, now).Scan(&summary.FutureRetry); err != nil { + return deliveryTableSummary{}, fmt.Errorf("%s future delivery summary: %w", kind, err) + } + + deliveredQuery := fmt.Sprintf(` + SELECT COUNT(*) + FROM %s + WHERE status = 'delivered' + AND delivered_at >= ?`, table) + if err := conn.QueryRowContext(ctx, deliveredQuery, 
cutoff).Scan(&summary.DeliveredSince); err != nil { + return deliveryTableSummary{}, fmt.Errorf("%s delivered summary: %w", kind, err) + } + + abandonedSince, err := queryRecentTerminalDeliveryCount(ctx, conn, table, "abandoned", cutoff) + if err != nil { + return deliveryTableSummary{}, fmt.Errorf("%s abandoned summary: %w", kind, err) + } + summary.AbandonedSince = abandonedSince + + failedSince, err := queryRecentTerminalDeliveryCount(ctx, conn, table, "failed", cutoff) + if err != nil { + return deliveryTableSummary{}, fmt.Errorf("%s failed summary: %w", kind, err) + } + summary.FailedSince = failedSince + summary.OldestPendingAgeSec = maxInt64(0, summary.OldestPendingAgeSec) + summary.OldestDueAgeSec = maxInt64(0, summary.OldestDueAgeSec) + return summary, nil +} + +func queryRecentTerminalDeliveryCount(ctx context.Context, conn *sql.DB, table, status string, cutoff time.Time) (int64, error) { + switch table { + case "jetmon_webhook_deliveries", "jetmon_alert_deliveries": + default: + return 0, fmt.Errorf("unsupported delivery table %q", table) + } + switch status { + case "abandoned", "failed": + default: + return 0, fmt.Errorf("unsupported terminal status %q", status) + } + + withAttemptQuery := fmt.Sprintf(` + SELECT COUNT(*) + FROM %s + WHERE status = ? + AND last_attempt_at >= ?`, table) + var withAttempt int64 + if err := conn.QueryRowContext(ctx, withAttemptQuery, status, cutoff).Scan(&withAttempt); err != nil { + return 0, err + } + + createdFallbackQuery := fmt.Sprintf(` + SELECT COUNT(*) + FROM %s + WHERE status = ? 
+ AND last_attempt_at IS NULL + AND created_at >= ?`, table) + var createdFallback int64 + if err := conn.QueryRowContext(ctx, createdFallbackQuery, status, cutoff).Scan(&createdFallback); err != nil { + return 0, err + } + return withAttempt + createdFallback, nil +} + +func evaluateDeliveryCheckFailures(report deliveryCheckReport, opts deliveryCheckOptions) []string { + var failures []string + if opts.MaxPending >= 0 && report.Total.Pending > opts.MaxPending { + failures = append(failures, fmt.Sprintf("pending deliveries total=%d exceeds max-pending=%d", report.Total.Pending, opts.MaxPending)) + } + if opts.MaxDue >= 0 && report.Total.DueNow > opts.MaxDue { + failures = append(failures, fmt.Sprintf("due deliveries total=%d exceeds max-due=%d", report.Total.DueNow, opts.MaxDue)) + } + if opts.MaxAbandoned >= 0 && report.Total.AbandonedSince > opts.MaxAbandoned { + failures = append(failures, fmt.Sprintf("abandoned deliveries since %s total=%d exceeds max-abandoned=%d", report.Since.Format(time.RFC3339), report.Total.AbandonedSince, opts.MaxAbandoned)) + } + if opts.MaxFailed >= 0 && report.Total.FailedSince > opts.MaxFailed { + failures = append(failures, fmt.Sprintf("failed deliveries since %s total=%d exceeds max-failed=%d", report.Since.Format(time.RFC3339), report.Total.FailedSince, opts.MaxFailed)) + } + if opts.RequireRecentDelivery && report.Total.DeliveredSince == 0 { + failures = append(failures, fmt.Sprintf("no delivered rows since %s", report.Since.Format(time.RFC3339))) + } + if opts.RequireRecentWebhookDelivery && deliveredSince(report, "webhook") == 0 { + failures = append(failures, fmt.Sprintf("no webhook deliveries since %s", report.Since.Format(time.RFC3339))) + } + if opts.RequireRecentAlertDelivery && deliveredSince(report, "alert") == 0 { + failures = append(failures, fmt.Sprintf("no alert-contact deliveries since %s", report.Since.Format(time.RFC3339))) + } + return failures +} + +func renderDeliveryCheckReport(out io.Writer, report 
deliveryCheckReport, output string) error { + if output == "json" { + enc := json.NewEncoder(out) + enc.SetIndent("", " ") + return enc.Encode(report) + } + return renderDeliveryCheckText(out, report) +} + +func renderDeliveryCheckText(out io.Writer, report deliveryCheckReport) error { + fmt.Fprintf(out, "INFO deliverer_host=%q\n", report.Host) + fmt.Fprintf(out, "INFO delivery_check_generated_at=%s\n", report.GeneratedAt.Format(time.RFC3339)) + fmt.Fprintf(out, "INFO delivery_check_since=%s\n", report.Since.Format(time.RFC3339)) + if report.OwnerMessage != "" { + fmt.Fprintf(out, "%s %s\n", report.OwnerLevel, report.OwnerMessage) + } + + tw := tabwriter.NewWriter(out, 0, 0, 2, ' ', 0) + fmt.Fprintln(tw, "KIND\tPENDING\tDUE_NOW\tFUTURE_RETRY\tDELIVERED_SINCE\tABANDONED_SINCE\tFAILED_SINCE\tOLDEST_PENDING_SEC\tOLDEST_DUE_SEC") + for _, summary := range report.Tables { + writeDeliverySummaryRow(tw, summary) + } + writeDeliverySummaryRow(tw, report.Total) + if err := tw.Flush(); err != nil { + return err + } + + if report.OK { + fmt.Fprintln(out, "PASS delivery_check=ok") + return nil + } + for _, failure := range report.Failures { + fmt.Fprintf(out, "FAIL %s\n", failure) + } + return nil +} + +func writeDeliverySummaryRow(out io.Writer, summary deliveryTableSummary) { + fmt.Fprintf( + out, + "%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n", + summary.Kind, + summary.Pending, + summary.DueNow, + summary.FutureRetry, + summary.DeliveredSince, + summary.AbandonedSince, + summary.FailedSince, + summary.OldestPendingAgeSec, + summary.OldestDueAgeSec, + ) +} + +func deliveredSince(report deliveryCheckReport, kind string) int64 { + for _, summary := range report.Tables { + if summary.Kind == kind { + return summary.DeliveredSince + } + } + return 0 +} + +func maxInt64(a, b int64) int64 { + if a > b { + return a + } + return b +} diff --git a/cmd/jetmon-deliverer/delivery_check_test.go b/cmd/jetmon-deliverer/delivery_check_test.go new file mode 100644 index 00000000..99963c51 --- 
/dev/null +++ b/cmd/jetmon-deliverer/delivery_check_test.go @@ -0,0 +1,366 @@ +package main + +import ( + "bytes" + "context" + "encoding/json" + "regexp" + "strings" + "testing" + "time" + + "github.com/Automattic/jetmon/internal/config" + "github.com/DATA-DOG/go-sqlmock" +) + +func TestParseDeliveryCheckOptions(t *testing.T) { + opts, err := parseDeliveryCheckOptions([]string{ + "--host=deliverer-1", + "--since=30m", + "--output=json", + "--max-pending=10", + "--max-due=0", + "--max-abandoned=1", + "--max-failed=2", + "--require-recent-delivery", + "--require-recent-webhook-delivery", + "--require-recent-alert-delivery", + }) + if err != nil { + t.Fatalf("parseDeliveryCheckOptions: %v", err) + } + if opts.HostOverride != "deliverer-1" { + t.Fatalf("HostOverride = %q, want deliverer-1", opts.HostOverride) + } + if opts.Since != "30m" || opts.Output != "json" { + t.Fatalf("parsed since/output = %q/%q", opts.Since, opts.Output) + } + if opts.MaxPending != 10 || opts.MaxDue != 0 || opts.MaxAbandoned != 1 || opts.MaxFailed != 2 { + t.Fatalf("parsed thresholds = pending:%d due:%d abandoned:%d failed:%d", opts.MaxPending, opts.MaxDue, opts.MaxAbandoned, opts.MaxFailed) + } + if !opts.RequireRecentDelivery || !opts.RequireRecentWebhookDelivery || !opts.RequireRecentAlertDelivery { + t.Fatalf("recent delivery flags = %+v, want all true", opts) + } + + defaults, err := parseDeliveryCheckOptions(nil) + if err != nil { + t.Fatalf("parseDeliveryCheckOptions(defaults): %v", err) + } + if defaults.Since != deliveryCheckDefaultSince || defaults.Output != "text" { + t.Fatalf("defaults = %+v", defaults) + } + if defaults.MaxPending != -1 || defaults.MaxDue != -1 || defaults.MaxAbandoned != -1 || defaults.MaxFailed != -1 { + t.Fatalf("default thresholds = %+v, want disabled", defaults) + } + + if _, err := parseDeliveryCheckOptions([]string{"--output=xml"}); err == nil { + t.Fatal("parseDeliveryCheckOptions accepted invalid output") + } + if _, err := 
parseDeliveryCheckOptions([]string{"--max-due=-2"}); err == nil { + t.Fatal("parseDeliveryCheckOptions accepted invalid threshold") + } + if _, err := parseDeliveryCheckOptions([]string{"--max-failed=-2"}); err == nil { + t.Fatal("parseDeliveryCheckOptions accepted invalid failed threshold") + } + if _, err := parseDeliveryCheckOptions([]string{"extra"}); err == nil { + t.Fatal("parseDeliveryCheckOptions accepted positional argument") + } +} + +func TestResolveDeliveryCheckCutoff(t *testing.T) { + now := time.Date(2026, 4, 29, 18, 30, 0, 0, time.UTC) + + durationCutoff, err := resolveDeliveryCheckCutoff(now, "45m") + if err != nil { + t.Fatalf("resolveDeliveryCheckCutoff(duration): %v", err) + } + if want := now.Add(-45 * time.Minute); !durationCutoff.Equal(want) { + t.Fatalf("duration cutoff = %s, want %s", durationCutoff, want) + } + + timestampCutoff, err := resolveDeliveryCheckCutoff(now, "2026-04-29T18:00:00Z") + if err != nil { + t.Fatalf("resolveDeliveryCheckCutoff(timestamp): %v", err) + } + if want := time.Date(2026, 4, 29, 18, 0, 0, 0, time.UTC); !timestampCutoff.Equal(want) { + t.Fatalf("timestamp cutoff = %s, want %s", timestampCutoff, want) + } + + for _, raw := range []string{"", "0s", "-1m", "not-time", "2026-04-29T19:00:00Z"} { + t.Run(raw, func(t *testing.T) { + if _, err := resolveDeliveryCheckCutoff(now, raw); err == nil { + t.Fatalf("resolveDeliveryCheckCutoff(%q) returned nil error", raw) + } + }) + } +} + +func TestBuildDeliveryCheckReportSummarizesAndAppliesThresholds(t *testing.T) { + sqlDB, mock, err := sqlmock.New() + if err != nil { + t.Fatalf("sqlmock.New: %v", err) + } + defer sqlDB.Close() + + now := time.Date(2026, 4, 29, 18, 30, 0, 0, time.UTC) + cutoff := now.Add(-15 * time.Minute) + expectDeliverySummaryQueries(t, mock, "jetmon_webhook_deliveries", now, cutoff, deliveryTableSummary{ + Pending: 2, + DueNow: 1, + FutureRetry: 1, + DeliveredSince: 4, + AbandonedSince: 0, + FailedSince: 2, + OldestPendingAgeSec: 120, + OldestDueAgeSec: 
60, + }) + expectDeliverySummaryQueries(t, mock, "jetmon_alert_deliveries", now, cutoff, deliveryTableSummary{ + Pending: 4, + DueNow: 2, + FutureRetry: 2, + DeliveredSince: 0, + AbandonedSince: 1, + FailedSince: 0, + OldestPendingAgeSec: 90, + OldestDueAgeSec: 30, + }) + + opts := deliveryCheckOptions{ + Since: "15m", + MaxPending: 5, + MaxDue: 2, + MaxAbandoned: 0, + MaxFailed: 1, + RequireRecentDelivery: true, + } + report, err := buildDeliveryCheckReport(context.Background(), sqlDB, &config.Config{ + DeliveryOwnerHost: "deliverer-1", + }, "deliverer-1", opts, now) + if err != nil { + t.Fatalf("buildDeliveryCheckReport: %v", err) + } + if report.OK { + t.Fatal("report.OK = true, want false because thresholds fail") + } + if report.Total.Pending != 6 || report.Total.DueNow != 3 || report.Total.FutureRetry != 3 { + t.Fatalf("total queue summary = %+v", report.Total) + } + if report.Total.DeliveredSince != 4 || report.Total.AbandonedSince != 1 { + t.Fatalf("total terminal summary = %+v", report.Total) + } + if report.Total.FailedSince != 2 || report.Total.OldestPendingAgeSec != 120 || report.Total.OldestDueAgeSec != 60 { + t.Fatalf("total failed/age summary = %+v", report.Total) + } + if report.OwnerLevel != "INFO" || !strings.Contains(report.OwnerMessage, "matched") { + t.Fatalf("owner status = %q %q", report.OwnerLevel, report.OwnerMessage) + } + wantFailures := []string{ + "pending deliveries total=6 exceeds max-pending=5", + "due deliveries total=3 exceeds max-due=2", + "abandoned deliveries since 2026-04-29T18:15:00Z total=1 exceeds max-abandoned=0", + "failed deliveries since 2026-04-29T18:15:00Z total=2 exceeds max-failed=1", + } + if len(report.Failures) != len(wantFailures) { + t.Fatalf("failures = %v, want %d failures", report.Failures, len(wantFailures)) + } + for i, want := range wantFailures { + if report.Failures[i] != want { + t.Fatalf("failure[%d] = %q, want %q", i, report.Failures[i], want) + } + } + if err := mock.ExpectationsWereMet(); err != nil 
{ + t.Fatalf("sql expectations: %v", err) + } +} + +func TestBuildDeliveryCheckReportRequiresRecentDelivery(t *testing.T) { + sqlDB, mock, err := sqlmock.New() + if err != nil { + t.Fatalf("sqlmock.New: %v", err) + } + defer sqlDB.Close() + + now := time.Date(2026, 4, 29, 18, 30, 0, 0, time.UTC) + cutoff := now.Add(-15 * time.Minute) + expectDeliverySummaryQueries(t, mock, "jetmon_webhook_deliveries", now, cutoff, deliveryTableSummary{}) + expectDeliverySummaryQueries(t, mock, "jetmon_alert_deliveries", now, cutoff, deliveryTableSummary{}) + + report, err := buildDeliveryCheckReport(context.Background(), sqlDB, &config.Config{}, "deliverer-1", deliveryCheckOptions{ + Since: "15m", + MaxPending: -1, + MaxDue: -1, + MaxAbandoned: -1, + MaxFailed: -1, + RequireRecentDelivery: true, + }, now) + if err != nil { + t.Fatalf("buildDeliveryCheckReport: %v", err) + } + if report.OK { + t.Fatal("report.OK = true, want false") + } + if len(report.Failures) != 1 || !strings.Contains(report.Failures[0], "no delivered rows since") { + t.Fatalf("failures = %v", report.Failures) + } + if err := mock.ExpectationsWereMet(); err != nil { + t.Fatalf("sql expectations: %v", err) + } +} + +func TestBuildDeliveryCheckReportRequiresRecentDeliveryByKind(t *testing.T) { + sqlDB, mock, err := sqlmock.New() + if err != nil { + t.Fatalf("sqlmock.New: %v", err) + } + defer sqlDB.Close() + + now := time.Date(2026, 4, 29, 18, 30, 0, 0, time.UTC) + cutoff := now.Add(-15 * time.Minute) + expectDeliverySummaryQueries(t, mock, "jetmon_webhook_deliveries", now, cutoff, deliveryTableSummary{DeliveredSince: 1}) + expectDeliverySummaryQueries(t, mock, "jetmon_alert_deliveries", now, cutoff, deliveryTableSummary{}) + + report, err := buildDeliveryCheckReport(context.Background(), sqlDB, &config.Config{}, "deliverer-1", deliveryCheckOptions{ + Since: "15m", + MaxPending: -1, + MaxDue: -1, + MaxAbandoned: -1, + MaxFailed: -1, + RequireRecentWebhookDelivery: true, + RequireRecentAlertDelivery: true, + }, now) 
+ if err != nil { + t.Fatalf("buildDeliveryCheckReport: %v", err) + } + if report.OK { + t.Fatal("report.OK = true, want false") + } + if len(report.Failures) != 1 || !strings.Contains(report.Failures[0], "no alert-contact deliveries since") { + t.Fatalf("failures = %v", report.Failures) + } + if err := mock.ExpectationsWereMet(); err != nil { + t.Fatalf("sql expectations: %v", err) + } +} + +func TestQueryRecentTerminalDeliveryCountUsesAttemptAndCreatedFallback(t *testing.T) { + sqlDB, mock, err := sqlmock.New() + if err != nil { + t.Fatalf("sqlmock.New: %v", err) + } + defer sqlDB.Close() + + cutoff := time.Date(2026, 4, 29, 18, 15, 0, 0, time.UTC) + mock.ExpectQuery(`(?s)FROM jetmon_webhook_deliveries.*status = \?.*last_attempt_at >= \?`). + WithArgs("abandoned", cutoff). + WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(2)) + mock.ExpectQuery(`(?s)FROM jetmon_webhook_deliveries.*status = \?.*last_attempt_at IS NULL.*created_at >= \?`). + WithArgs("abandoned", cutoff). + WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(1)) + + got, err := queryRecentTerminalDeliveryCount(context.Background(), sqlDB, "jetmon_webhook_deliveries", "abandoned", cutoff) + if err != nil { + t.Fatalf("queryRecentTerminalDeliveryCount: %v", err) + } + if got != 3 { + t.Fatalf("queryRecentTerminalDeliveryCount() = %d, want 3", got) + } + if _, err := queryRecentTerminalDeliveryCount(context.Background(), sqlDB, "bad_table", "abandoned", cutoff); err == nil { + t.Fatal("queryRecentTerminalDeliveryCount accepted bad table") + } + if _, err := queryRecentTerminalDeliveryCount(context.Background(), sqlDB, "jetmon_webhook_deliveries", "delivered", cutoff); err == nil { + t.Fatal("queryRecentTerminalDeliveryCount accepted bad status") + } + if err := mock.ExpectationsWereMet(); err != nil { + t.Fatalf("sql expectations: %v", err) + } +} + +func TestRenderDeliveryCheckReport(t *testing.T) { + report := deliveryCheckReport{ + OK: true, + Host: "deliverer-1", + GeneratedAt: 
time.Date(2026, 4, 29, 18, 30, 0, 0, time.UTC), + Since: time.Date(2026, 4, 29, 18, 15, 0, 0, time.UTC), + Tables: []deliveryTableSummary{ + {Kind: "webhook", Pending: 1, DueNow: 0, FutureRetry: 1, DeliveredSince: 2, FailedSince: 1, OldestPendingAgeSec: 120}, + {Kind: "alert", DeliveredSince: 3}, + }, + Total: deliveryTableSummary{Kind: "total", Pending: 1, FutureRetry: 1, DeliveredSince: 5, FailedSince: 1, OldestPendingAgeSec: 120}, + } + + var textOut bytes.Buffer + if err := renderDeliveryCheckReport(&textOut, report, "text"); err != nil { + t.Fatalf("renderDeliveryCheckReport(text): %v", err) + } + text := textOut.String() + for _, want := range []string{"INFO deliverer_host=\"deliverer-1\"", "FAILED_SINCE", "OLDEST_PENDING_SEC", "webhook", "total", "PASS delivery_check=ok"} { + if !strings.Contains(text, want) { + t.Fatalf("text output missing %q:\n%s", want, text) + } + } + + var jsonOut bytes.Buffer + if err := renderDeliveryCheckReport(&jsonOut, report, "json"); err != nil { + t.Fatalf("renderDeliveryCheckReport(json): %v", err) + } + var decoded deliveryCheckReport + if err := json.Unmarshal(jsonOut.Bytes(), &decoded); err != nil { + t.Fatalf("json output did not decode: %v\n%s", err, jsonOut.String()) + } + if !decoded.OK || decoded.Host != "deliverer-1" || decoded.Total.DeliveredSince != 5 { + t.Fatalf("decoded json = %+v", decoded) + } + if decoded.Total.FailedSince != 1 || decoded.Total.OldestPendingAgeSec != 120 { + t.Fatalf("decoded json summary = %+v", decoded.Total) + } +} + +func TestRenderDeliveryCheckReportFailureText(t *testing.T) { + report := deliveryCheckReport{ + OK: false, + Host: "deliverer-1", + GeneratedAt: time.Date(2026, 4, 29, 18, 30, 0, 0, time.UTC), + Since: time.Date(2026, 4, 29, 18, 15, 0, 0, time.UTC), + Total: deliveryTableSummary{Kind: "total"}, + Failures: []string{"due deliveries total=1 exceeds max-due=0"}, + } + + var out bytes.Buffer + if err := renderDeliveryCheckReport(&out, report, "text"); err != nil { + 
t.Fatalf("renderDeliveryCheckReport(text): %v", err) + } + if !strings.Contains(out.String(), "FAIL due deliveries total=1 exceeds max-due=0") { + t.Fatalf("failure text missing:\n%s", out.String()) + } +} + +func expectDeliverySummaryQueries(t *testing.T, mock sqlmock.Sqlmock, table string, now, cutoff time.Time, summary deliveryTableSummary) { + t.Helper() + quotedTable := regexp.QuoteMeta(table) + mock.ExpectQuery(`(?s)MIN\(created_at\).*FROM ` + quotedTable + `.*WHERE status = 'pending'`). + WithArgs(now). + WillReturnRows(sqlmock.NewRows([]string{"count", "oldest_pending_age_sec"}). + AddRow(summary.Pending, summary.OldestPendingAgeSec)) + mock.ExpectQuery(`(?s)MIN\(COALESCE\(next_attempt_at, created_at\)\).*FROM `+quotedTable+`.*next_attempt_at IS NULL`). + WithArgs(now, now). + WillReturnRows(sqlmock.NewRows([]string{"count", "oldest_due_age_sec"}). + AddRow(summary.DueNow, summary.OldestDueAgeSec)) + mock.ExpectQuery(`(?s)FROM ` + quotedTable + `.*status = 'pending'.*next_attempt_at > \?`). + WithArgs(now). + WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(summary.FutureRetry)) + mock.ExpectQuery(`(?s)FROM ` + quotedTable + `.*status = 'delivered'.*delivered_at >= \?`). + WithArgs(cutoff). + WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(summary.DeliveredSince)) + mock.ExpectQuery(`(?s)FROM `+quotedTable+`.*status = \?.*last_attempt_at >= \?`). + WithArgs("abandoned", cutoff). + WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(summary.AbandonedSince)) + mock.ExpectQuery(`(?s)FROM `+quotedTable+`.*status = \?.*last_attempt_at IS NULL.*created_at >= \?`). + WithArgs("abandoned", cutoff). + WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(0)) + mock.ExpectQuery(`(?s)FROM `+quotedTable+`.*status = \?.*last_attempt_at >= \?`). + WithArgs("failed", cutoff). 
+ WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(summary.FailedSince)) + mock.ExpectQuery(`(?s)FROM `+quotedTable+`.*status = \?.*last_attempt_at IS NULL.*created_at >= \?`). + WithArgs("failed", cutoff). + WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(0)) +} diff --git a/cmd/jetmon-deliverer/main.go b/cmd/jetmon-deliverer/main.go new file mode 100644 index 00000000..f2e9b5a0 --- /dev/null +++ b/cmd/jetmon-deliverer/main.go @@ -0,0 +1,349 @@ +package main + +import ( + "context" + "database/sql" + "flag" + "fmt" + "io" + "log" + "os" + "os/signal" + "strings" + "syscall" + "time" + + "github.com/Automattic/jetmon/internal/audit" + "github.com/Automattic/jetmon/internal/config" + "github.com/Automattic/jetmon/internal/db" + "github.com/Automattic/jetmon/internal/deliverer" + "github.com/Automattic/jetmon/internal/fleethealth" + "github.com/Automattic/jetmon/internal/metrics" + "github.com/Automattic/jetmon/internal/processmetrics" +) + +const processHealthWriteTimeout = 2 * time.Second + +// Injected at build time via -ldflags. 
+var ( + version = "dev" + buildDate = "unknown" + goVersion = "unknown" +) + +func main() { + if len(os.Args) > 1 { + switch os.Args[1] { + case "version": + fmt.Printf("jetmon-deliverer %s (built %s with %s)\n", version, buildDate, goVersion) + return + case "validate-config": + cmdValidateConfig(os.Args[2:]) + return + case "delivery-check": + cmdDeliveryCheck(os.Args[2:]) + return + default: + fmt.Fprintf(os.Stderr, "unknown command %q (want: version, validate-config, delivery-check)\n", os.Args[1]) + os.Exit(2) + } + } + run() +} + +type delivererValidationOptions struct { + HostOverride string + RequireOwnerMatch bool + RequireEmailDelivery bool + RequireAPIDisabled bool +} + +func parseValidateConfigOptions(args []string) (delivererValidationOptions, error) { + var opts delivererValidationOptions + fs := flag.NewFlagSet("validate-config", flag.ContinueOnError) + fs.SetOutput(io.Discard) + fs.StringVar(&opts.HostOverride, "host", "", "host id to validate against DELIVERY_OWNER_HOST (default current hostname)") + fs.BoolVar(&opts.RequireOwnerMatch, "require-owner-match", false, "fail unless DELIVERY_OWNER_HOST exactly matches the validated host") + fs.BoolVar(&opts.RequireEmailDelivery, "require-email-delivery", false, "fail unless EMAIL_TRANSPORT is smtp or wpcom") + fs.BoolVar(&opts.RequireAPIDisabled, "require-api-disabled", false, "fail unless API_PORT is 0 in the deliverer config") + if err := fs.Parse(args); err != nil { + return opts, err + } + if fs.NArg() != 0 { + return opts, fmt.Errorf("unexpected argument %q", fs.Arg(0)) + } + return opts, nil +} + +func cmdValidateConfig(args []string) { + opts, err := parseValidateConfigOptions(args) + if err != nil { + fmt.Fprintf(os.Stderr, "usage: jetmon-deliverer validate-config [--host=] [--require-owner-match] [--require-email-delivery] [--require-api-disabled]\n") + fmt.Fprintf(os.Stderr, "FAIL %v\n", err) + os.Exit(2) + } + + configPath := envOrDefault("JETMON_CONFIG", "config/config.json") + if err := 
config.Load(configPath); err != nil { + fmt.Fprintf(os.Stderr, "FAIL config parse: %v\n", err) + os.Exit(1) + } + fmt.Println("PASS config parse") + + config.LoadDB() + if err := db.ConnectWithRetry(3); err != nil { + fmt.Fprintf(os.Stderr, "FAIL db connect: %v\n", err) + os.Exit(1) + } + fmt.Println("PASS db connect") + + cfg := config.Get() + hostID := strings.TrimSpace(opts.HostOverride) + if hostID == "" { + hostID = db.Hostname() + } + fmt.Printf("INFO deliverer_host=%q\n", hostID) + fmt.Printf("INFO email_transport=%s\n", emailTransportLabel(cfg)) + if !emailTransportDelivers(cfg) { + fmt.Printf("WARN email_transport=%s; alert-contact emails will be logged but not delivered\n", emailTransportLabel(cfg)) + } + if cfg.APIPort > 0 { + fmt.Printf("WARN api_port=%d; standalone deliverer ignores API_PORT, confirm this is a process-specific config\n", cfg.APIPort) + } else { + fmt.Println("PASS api_port=disabled") + } + if level, msg := deliveryOwnerStatus(cfg, hostID); msg != "" { + fmt.Printf("%s %s\n", level, msg) + } + failures := validateDelivererConfigRequirements(cfg, hostID, opts) + if len(failures) > 0 { + for _, failure := range failures { + fmt.Fprintf(os.Stderr, "FAIL %s\n", failure) + } + os.Exit(1) + } + + fmt.Println("\nvalidation passed") +} + +func run() { + configPath := envOrDefault("JETMON_CONFIG", "config/config.json") + if err := config.Load(configPath); err != nil { + log.Fatalf("load config: %v", err) + } + cfg := config.Get() + log.Printf("config: email_transport=%s", emailTransportLabel(cfg)) + if !emailTransportDelivers(cfg) { + log.Printf("WARN: email_transport=%s; alert-contact emails will be logged but not delivered", emailTransportLabel(cfg)) + } + + config.LoadDB() + if err := db.ConnectWithRetry(10); err != nil { + log.Fatalf("db connect: %v", err) + } + audit.Init(db.DB()) + + if err := metrics.Init("statsd:8125", db.Hostname()); err != nil { + log.Printf("warning: statsd init failed: %v", err) + } + + hostname := db.Hostname() + 
processStartedAt := time.Now().UTC() + processID := fleethealth.ProcessID(hostname, fleethealth.ProcessDeliverer) + workersEnabled := deliveryWorkersShouldStart(cfg, hostname) + publishProcessHealth := func(state string) { + snapshot := delivererProcessHealthSnapshot(hostname, processStartedAt, state, cfg, workersEnabled, delivererDependencyHealth(context.Background(), db.DB(), metrics.Global() != nil, time.Now().UTC())) + ctx, cancel := context.WithTimeout(context.Background(), processHealthWriteTimeout) + if err := fleethealth.Upsert(ctx, db.DB(), snapshot); err != nil { + log.Printf("process health: %v", err) + } + cancel() + } + if level, msg := deliveryOwnerStatus(cfg, hostname); msg != "" { + if level == "WARN" { + log.Printf("WARN: %s", msg) + } else { + log.Printf("config: %s", msg) + } + } + initialState := fleethealth.StateRunning + if !workersEnabled { + initialState = fleethealth.StateIdle + } + publishProcessHealth(initialState) + stopHealth := make(chan struct{}) + go func() { + ticker := time.NewTicker(10 * time.Second) + defer ticker.Stop() + for { + select { + case <-ticker.C: + publishProcessHealth(initialState) + case <-stopHealth: + return + } + } + }() + + if !workersEnabled { + waitForShutdown() + close(stopHealth) + publishProcessHealth(fleethealth.StateStopping) + ctx, cancel := context.WithTimeout(context.Background(), processHealthWriteTimeout) + if err := fleethealth.MarkStopped(ctx, db.DB(), processID, time.Now().UTC()); err != nil { + log.Printf("process health: %v", err) + } + cancel() + log.Println("jetmon-deliverer: shutdown complete") + return + } + + runtime := deliverer.Start(deliverer.Config{ + DB: db.DB(), + InstanceID: hostname, + Dispatchers: deliverer.BuildAlertDispatchers(cfg), + }) + waitForShutdown() + close(stopHealth) + publishProcessHealth(fleethealth.StateStopping) + runtime.Stop() + ctx, cancel := context.WithTimeout(context.Background(), processHealthWriteTimeout) + if err := fleethealth.MarkStopped(ctx, db.DB(), 
processID, time.Now().UTC()); err != nil { + log.Printf("process health: %v", err) + } + cancel() + log.Println("jetmon-deliverer: shutdown complete") +} + +func deliveryWorkersShouldStart(cfg *config.Config, hostname string) bool { + owner := strings.TrimSpace(cfg.DeliveryOwnerHost) + return owner == "" || owner == hostname +} + +func deliveryOwnerStatus(cfg *config.Config, hostname string) (string, string) { + owner := strings.TrimSpace(cfg.DeliveryOwnerHost) + if owner == "" { + return "WARN", fmt.Sprintf("delivery_owner_host is unset; standalone deliverer on host %q will run delivery workers", hostname) + } + if owner == hostname { + return "INFO", fmt.Sprintf("delivery_owner_host=%q matched; delivery workers enabled on this host", owner) + } + return "INFO", fmt.Sprintf("delivery_owner_host=%q; standalone deliverer idle on host %q", owner, hostname) +} + +func validateDelivererConfigRequirements(cfg *config.Config, hostname string, opts delivererValidationOptions) []string { + if cfg == nil { + return []string{"config is not loaded"} + } + hostID := strings.TrimSpace(hostname) + failures := []string{} + if opts.RequireOwnerMatch { + owner := strings.TrimSpace(cfg.DeliveryOwnerHost) + if hostID == "" { + failures = append(failures, "validated host id is empty") + } else if owner == "" { + failures = append(failures, fmt.Sprintf("DELIVERY_OWNER_HOST must be set to %q for single-owner deliverer rollout", hostID)) + } else if owner != hostID { + failures = append(failures, fmt.Sprintf("DELIVERY_OWNER_HOST=%q does not match deliverer host %q", owner, hostID)) + } + } + if opts.RequireEmailDelivery && !emailTransportDelivers(cfg) { + failures = append(failures, fmt.Sprintf("EMAIL_TRANSPORT=%q does not deliver email; set smtp or wpcom", emailTransportLabel(cfg))) + } + if opts.RequireAPIDisabled && cfg.APIPort > 0 { + failures = append(failures, fmt.Sprintf("API_PORT=%d must be 0 for standalone deliverer config", cfg.APIPort)) + } + return failures +} + +func 
delivererProcessHealthSnapshot(hostname string, startedAt time.Time, state string, cfg *config.Config, workersEnabled bool, health []fleethealth.DependencyHealth) fleethealth.Snapshot { + mem := processmetrics.CurrentMemory() + healthStatus := fleethealth.RollupHealthStatus(health) + if workersEnabled && strings.TrimSpace(cfg.DeliveryOwnerHost) == "" && healthStatus == fleethealth.HealthGreen { + healthStatus = fleethealth.HealthAmber + } + if state == fleethealth.StateStopping || state == fleethealth.StateStopped { + healthStatus = fleethealth.HealthAmber + } + return fleethealth.Snapshot{ + HostID: hostname, + ProcessType: fleethealth.ProcessDeliverer, + PID: os.Getpid(), + Version: version, + BuildDate: buildDate, + GoVersion: goVersion, + State: state, + HealthStatus: healthStatus, + StartedAt: startedAt, + UpdatedAt: time.Now().UTC(), + DeliveryWorkersEnabled: workersEnabled, + DeliveryOwnerHost: cfg.DeliveryOwnerHost, + GoSysMemMB: mem.GoSysMemMB, + RSSMemMB: mem.RSSMemMB, + DependencyHealth: health, + } +} + +func delivererDependencyHealth(ctx context.Context, sqlDB *sql.DB, statsdReady bool, checkedAt time.Time) []fleethealth.DependencyHealth { + return []fleethealth.DependencyHealth{ + delivererMySQLHealth(ctx, sqlDB, checkedAt), + delivererStatsDHealth(statsdReady, checkedAt), + } +} + +func delivererMySQLHealth(ctx context.Context, sqlDB *sql.DB, checkedAt time.Time) fleethealth.DependencyHealth { + entry := fleethealth.DependencyHealth{Name: "mysql", CheckedAt: checkedAt} + if sqlDB == nil { + entry.Status = "red" + entry.LastError = "database pool is not initialized" + return entry + } + pingCtx, cancel := context.WithTimeout(ctx, 2*time.Second) + defer cancel() + start := time.Now() + if err := sqlDB.PingContext(pingCtx); err != nil { + entry.Status = "red" + entry.LatencyMS = time.Since(start).Milliseconds() + entry.LastError = err.Error() + return entry + } + entry.Status = "green" + entry.LatencyMS = time.Since(start).Milliseconds() + return entry 
+} + +func delivererStatsDHealth(ready bool, checkedAt time.Time) fleethealth.DependencyHealth { + entry := fleethealth.DependencyHealth{Name: "statsd", CheckedAt: checkedAt} + if !ready { + entry.Status = "amber" + entry.LastError = "statsd client is not initialized" + return entry + } + entry.Status = "green" + return entry +} + +func waitForShutdown() { + sigCh := make(chan os.Signal, 1) + signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM) + sig := <-sigCh + log.Printf("received %s, stopping", sig) +} + +func emailTransportLabel(cfg *config.Config) string { + if cfg.EmailTransport == "" { + return "stub" + } + return cfg.EmailTransport +} + +func emailTransportDelivers(cfg *config.Config) bool { + return cfg.EmailTransport == "smtp" || cfg.EmailTransport == "wpcom" +} + +func envOrDefault(key, def string) string { + if v := os.Getenv(key); v != "" { + return v + } + return def +} diff --git a/cmd/jetmon-deliverer/main_test.go b/cmd/jetmon-deliverer/main_test.go new file mode 100644 index 00000000..a8816623 --- /dev/null +++ b/cmd/jetmon-deliverer/main_test.go @@ -0,0 +1,269 @@ +package main + +import ( + "context" + "strings" + "testing" + "time" + + "github.com/Automattic/jetmon/internal/config" + "github.com/Automattic/jetmon/internal/fleethealth" + "github.com/DATA-DOG/go-sqlmock" +) + +func TestDeliveryWorkersShouldStart(t *testing.T) { + tests := []struct { + name string + cfg config.Config + hostname string + wantStart bool + wantLevel string + wantMsg string + }{ + { + name: "empty owner starts with warning", + cfg: config.Config{}, + hostname: "host-a", + wantStart: true, + wantLevel: "WARN", + wantMsg: "delivery_owner_host is unset", + }, + { + name: "matching owner starts", + cfg: config.Config{ + DeliveryOwnerHost: "host-a", + }, + hostname: "host-a", + wantStart: true, + wantLevel: "INFO", + wantMsg: "matched", + }, + { + name: "non-owner idles", + cfg: config.Config{ + DeliveryOwnerHost: "host-a", + }, + hostname: "host-b", + wantLevel: "INFO", + 
wantMsg: "idle on host", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if got := deliveryWorkersShouldStart(&tt.cfg, tt.hostname); got != tt.wantStart { + t.Fatalf("deliveryWorkersShouldStart() = %v, want %v", got, tt.wantStart) + } + level, msg := deliveryOwnerStatus(&tt.cfg, tt.hostname) + if level != tt.wantLevel { + t.Fatalf("deliveryOwnerStatus() level = %q, want %q", level, tt.wantLevel) + } + if !strings.Contains(msg, tt.wantMsg) { + t.Fatalf("deliveryOwnerStatus() message = %q, want substring %q", msg, tt.wantMsg) + } + }) + } +} + +func TestParseValidateConfigOptions(t *testing.T) { + opts, err := parseValidateConfigOptions([]string{ + "--host=deliverer-1", + "--require-owner-match", + "--require-email-delivery", + "--require-api-disabled", + }) + if err != nil { + t.Fatalf("parseValidateConfigOptions: %v", err) + } + if opts.HostOverride != "deliverer-1" { + t.Fatalf("HostOverride = %q, want deliverer-1", opts.HostOverride) + } + if !opts.RequireOwnerMatch || !opts.RequireEmailDelivery || !opts.RequireAPIDisabled { + t.Fatalf("parsed options = %+v, want all requirements enabled", opts) + } + + if _, err := parseValidateConfigOptions([]string{"extra"}); err == nil { + t.Fatal("parseValidateConfigOptions accepted unexpected positional argument") + } +} + +func TestValidateDelivererConfigRequirements(t *testing.T) { + tests := []struct { + name string + cfg config.Config + hostname string + opts delivererValidationOptions + want []string + }{ + { + name: "single owner production config passes", + cfg: config.Config{ + DeliveryOwnerHost: "deliverer-1", + EmailTransport: "smtp", + }, + hostname: "deliverer-1", + opts: delivererValidationOptions{ + RequireOwnerMatch: true, + RequireEmailDelivery: true, + RequireAPIDisabled: true, + }, + }, + { + name: "owner required but empty", + cfg: config.Config{EmailTransport: "smtp"}, + hostname: "deliverer-1", + opts: delivererValidationOptions{RequireOwnerMatch: true}, + want: 
[]string{"DELIVERY_OWNER_HOST must be set"}, + }, + { + name: "owner mismatch", + cfg: config.Config{ + DeliveryOwnerHost: "deliverer-2", + EmailTransport: "smtp", + }, + hostname: "deliverer-1", + opts: delivererValidationOptions{RequireOwnerMatch: true}, + want: []string{"does not match"}, + }, + { + name: "stub email rejected", + cfg: config.Config{ + DeliveryOwnerHost: "deliverer-1", + EmailTransport: "stub", + }, + hostname: "deliverer-1", + opts: delivererValidationOptions{RequireEmailDelivery: true}, + want: []string{"does not deliver email"}, + }, + { + name: "api port rejected", + cfg: config.Config{ + DeliveryOwnerHost: "deliverer-1", + EmailTransport: "smtp", + APIPort: 8090, + }, + hostname: "deliverer-1", + opts: delivererValidationOptions{RequireAPIDisabled: true}, + want: []string{"API_PORT=8090"}, + }, + { + name: "empty host rejected when owner must match", + cfg: config.Config{DeliveryOwnerHost: "deliverer-1"}, + hostname: " ", + opts: delivererValidationOptions{RequireOwnerMatch: true}, + want: []string{"host id is empty"}, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + failures := validateDelivererConfigRequirements(&tt.cfg, tt.hostname, tt.opts) + if len(tt.want) == 0 { + if len(failures) != 0 { + t.Fatalf("failures = %v, want none", failures) + } + return + } + if len(failures) != len(tt.want) { + t.Fatalf("failures = %v, want %d failures", failures, len(tt.want)) + } + for i, want := range tt.want { + if !strings.Contains(failures[i], want) { + t.Fatalf("failure[%d] = %q, want substring %q", i, failures[i], want) + } + } + }) + } +} + +func TestEmailTransportLabelAndDelivery(t *testing.T) { + tests := []struct { + name string + cfg config.Config + label string + delivers bool + }{ + {name: "empty is stub alias", cfg: config.Config{}, label: "stub"}, + {name: "stub logs only", cfg: config.Config{EmailTransport: "stub"}, label: "stub"}, + {name: "smtp delivers", cfg: config.Config{EmailTransport: "smtp"}, label: 
"smtp", delivers: true}, + {name: "wpcom delivers", cfg: config.Config{EmailTransport: "wpcom"}, label: "wpcom", delivers: true}, + {name: "unknown does not deliver", cfg: config.Config{EmailTransport: "sendmail"}, label: "sendmail"}, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if got := emailTransportLabel(&tt.cfg); got != tt.label { + t.Fatalf("emailTransportLabel() = %q, want %q", got, tt.label) + } + if got := emailTransportDelivers(&tt.cfg); got != tt.delivers { + t.Fatalf("emailTransportDelivers() = %v, want %v", got, tt.delivers) + } + }) + } +} + +func TestDelivererProcessHealthSnapshot(t *testing.T) { + started := time.Date(2026, 4, 30, 11, 0, 0, 0, time.UTC) + cfg := &config.Config{DeliveryOwnerHost: "deliverer-1"} + snapshot := delivererProcessHealthSnapshot("deliverer-1", started, fleethealth.StateRunning, cfg, true, []fleethealth.DependencyHealth{{ + Name: "mysql", + Status: "green", + CheckedAt: started, + }}) + + if snapshot.HostID != "deliverer-1" { + t.Fatalf("HostID = %q, want deliverer-1", snapshot.HostID) + } + if snapshot.ProcessType != fleethealth.ProcessDeliverer { + t.Fatalf("ProcessType = %q, want deliverer", snapshot.ProcessType) + } + if !snapshot.DeliveryWorkersEnabled { + t.Fatal("DeliveryWorkersEnabled = false, want true") + } + if snapshot.DeliveryOwnerHost != "deliverer-1" { + t.Fatalf("DeliveryOwnerHost = %q, want deliverer-1", snapshot.DeliveryOwnerHost) + } + if snapshot.HealthStatus != fleethealth.HealthGreen { + t.Fatalf("HealthStatus = %q, want green", snapshot.HealthStatus) + } + if len(snapshot.DependencyHealth) != 1 { + t.Fatalf("DependencyHealth len = %d, want 1", len(snapshot.DependencyHealth)) + } +} + +func TestDelivererDependencyHealth(t *testing.T) { + sqlDB, mock, err := sqlmock.New(sqlmock.MonitorPingsOption(true)) + if err != nil { + t.Fatalf("sqlmock.New: %v", err) + } + defer sqlDB.Close() + mock.ExpectPing() + + checkedAt := time.Date(2026, 4, 30, 11, 1, 0, 0, time.UTC) + entries := 
delivererDependencyHealth(context.Background(), sqlDB, false, checkedAt) + if len(entries) != 2 { + t.Fatalf("entries len = %d, want 2", len(entries)) + } + if entries[0].Name != "mysql" || entries[0].Status != "green" { + t.Fatalf("mysql entry = %+v, want green", entries[0]) + } + if entries[1].Name != "statsd" || entries[1].Status != "amber" { + t.Fatalf("statsd entry = %+v, want amber", entries[1]) + } + if err := mock.ExpectationsWereMet(); err != nil { + t.Fatalf("sql expectations: %v", err) + } +} + +func TestEnvOrDefault(t *testing.T) { + const key = "JETMON_DELIVERER_TEST_ENV_OR_DEFAULT" + t.Setenv(key, "") + if got := envOrDefault(key, "fallback"); got != "fallback" { + t.Fatalf("envOrDefault() = %q, want fallback", got) + } + + t.Setenv(key, "set-value") + if got := envOrDefault(key, "fallback"); got != "set-value" { + t.Fatalf("envOrDefault() = %q, want set-value", got) + } +} diff --git a/cmd/jetmon-testsite/main.go b/cmd/jetmon-testsite/main.go new file mode 100644 index 00000000..39a8a3b8 --- /dev/null +++ b/cmd/jetmon-testsite/main.go @@ -0,0 +1,309 @@ +package main + +import ( + "context" + "crypto/hmac" + "crypto/rand" + "crypto/rsa" + "crypto/sha256" + "crypto/tls" + "crypto/x509" + "crypto/x509/pkix" + "encoding/hex" + "encoding/json" + "encoding/pem" + "errors" + "fmt" + "io" + "log" + "math/big" + "net" + "net/http" + "os" + "os/signal" + "strconv" + "strings" + "sync" + "syscall" + "time" +) + +const ( + defaultHTTPAddr = ":8091" + defaultHTTPSAddr = ":8443" +) + +func main() { + if len(os.Args) > 1 && os.Args[1] == "healthcheck" { + if err := healthcheck(); err != nil { + fmt.Fprintln(os.Stderr, err) + os.Exit(1) + } + return + } + + httpAddr := envOrDefault("FIXTURE_HTTP_ADDR", defaultHTTPAddr) + httpsAddr := envOrDefault("FIXTURE_HTTPS_ADDR", defaultHTTPSAddr) + handler := newFixtureHandler() + + servers := []*http.Server{{ + Addr: httpAddr, + Handler: handler, + }} + if httpsAddr != "" { + cert, err := selfSignedCert() + if err != nil { + 
log.Fatalf("generate tls cert: %v", err) + } + servers = append(servers, &http.Server{ + Addr: httpsAddr, + Handler: handler, + TLSConfig: &tls.Config{Certificates: []tls.Certificate{cert}, MinVersion: tls.VersionTLS12}, + }) + } + + errCh := make(chan error, len(servers)) + for _, srv := range servers { + srv := srv + go func() { + log.Printf("jetmon-testsite: listening on %s", srv.Addr) + var err error + if srv.TLSConfig != nil { + err = srv.ListenAndServeTLS("", "") + } else { + err = srv.ListenAndServe() + } + if err != nil && !errors.Is(err, http.ErrServerClosed) { + errCh <- err + } + }() + } + + sigCh := make(chan os.Signal, 1) + signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM) + select { + case sig := <-sigCh: + log.Printf("jetmon-testsite: shutdown signal=%s", sig) + case err := <-errCh: + log.Printf("jetmon-testsite: server error: %v", err) + } + + ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + for _, srv := range servers { + if err := srv.Shutdown(ctx); err != nil { + log.Printf("jetmon-testsite: shutdown %s: %v", srv.Addr, err) + } + } +} + +func newFixtureHandler() http.Handler { + mux := http.NewServeMux() + webhooks := &fixtureWebhookReceiver{} + mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "text/plain; charset=utf-8") + _, _ = io.WriteString(w, "ok\n") + }) + mux.HandleFunc("/webhook", webhooks.handleWebhook) + mux.HandleFunc("/webhook/requests", webhooks.handleRequests) + mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "text/plain; charset=utf-8") + _, _ = io.WriteString(w, "jetmon fixture ok\n") + }) + mux.HandleFunc("/tls", func(w http.ResponseWriter, r *http.Request) { + w.Header().Set("Content-Type", "text/plain; charset=utf-8") + _, _ = io.WriteString(w, "jetmon fixture tls endpoint\n") + }) + mux.HandleFunc("/keyword", func(w http.ResponseWriter, r *http.Request) { + 
w.Header().Set("Content-Type", "text/plain; charset=utf-8") + _, _ = io.WriteString(w, "jetmon fixture keyword present\n") + }) + mux.HandleFunc("/redirect", func(w http.ResponseWriter, r *http.Request) { + http.Redirect(w, r, "/ok", http.StatusFound) + }) + mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) { + delay := fixtureDelay(r.URL.Query().Get("delay"), 5*time.Second) + time.Sleep(delay) + w.Header().Set("Content-Type", "text/plain; charset=utf-8") + fmt.Fprintf(w, "slow response after %s\n", delay) + }) + mux.HandleFunc("/status/", func(w http.ResponseWriter, r *http.Request) { + raw := strings.TrimPrefix(r.URL.Path, "/status/") + code, err := strconv.Atoi(raw) + if err != nil || code < 100 || code > 599 { + http.Error(w, "status must be 100-599", http.StatusBadRequest) + return + } + w.WriteHeader(code) + if code != http.StatusNoContent && code != http.StatusNotModified { + fmt.Fprintf(w, "status %d\n", code) + } + }) + return mux +} + +type fixtureWebhookReceiver struct { + mu sync.Mutex + nextID int + requests []fixtureWebhookRequest +} + +type fixtureWebhookRequest struct { + ID int `json:"id"` + ReceivedAt string `json:"received_at"` + Event string `json:"event,omitempty"` + Delivery string `json:"delivery,omitempty"` + Signature string `json:"signature,omitempty"` + SignatureValid *bool `json:"signature_valid,omitempty"` + Body string `json:"body"` +} + +func (f *fixtureWebhookReceiver) handleWebhook(w http.ResponseWriter, r *http.Request) { + if r.Method != http.MethodPost { + http.Error(w, "method not allowed", http.StatusMethodNotAllowed) + return + } + body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 1<<20)) + if err != nil { + http.Error(w, "read body: "+err.Error(), http.StatusBadRequest) + return + } + signature := r.Header.Get("X-Jetmon-Signature") + var signatureValid *bool + if secret := r.URL.Query().Get("secret"); secret != "" { + valid := verifyJetmonSignature(signature, body, secret) + signatureValid = &valid + } 
	// Record the delivery under the lock; IDs are 1-based and monotonic until
	// a DELETE on /webhook/requests resets the recorder.

	f.mu.Lock()
	f.nextID++
	f.requests = append(f.requests, fixtureWebhookRequest{
		ID:             f.nextID,
		ReceivedAt:     time.Now().UTC().Format(time.RFC3339Nano),
		Event:          r.Header.Get("X-Jetmon-Event"),
		Delivery:       r.Header.Get("X-Jetmon-Delivery"),
		Signature:      signature,
		SignatureValid: signatureValid,
		Body:           string(body),
	})
	f.mu.Unlock()

	w.WriteHeader(http.StatusNoContent)
}

// handleRequests serves the recorder's inspection API on /webhook/requests:
// GET returns {"count": N, "requests": [...]} and DELETE clears all recorded
// deliveries and resets the ID counter. Any other method is rejected with 405.
func (f *fixtureWebhookReceiver) handleRequests(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodGet:
		f.mu.Lock()
		// Snapshot under the lock so JSON encoding never races a concurrent POST.
		requests := append([]fixtureWebhookRequest(nil), f.requests...)
		f.mu.Unlock()
		writeFixtureJSON(w, map[string]any{
			"count":    len(requests),
			"requests": requests,
		})
	case http.MethodDelete:
		f.mu.Lock()
		f.nextID = 0
		f.requests = nil
		f.mu.Unlock()
		w.WriteHeader(http.StatusNoContent)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

// verifyJetmonSignature checks a signature header of the form
// "t=<timestamp>,v1=<hex>" where <hex> is HMAC-SHA256(secret, "<timestamp>." + body).
// Unrecognized comma-separated parts are ignored; both "t" and "v1" must be
// present. NOTE(review): the timestamp is not checked for freshness here —
// this fixture only validates the MAC itself.
func verifyJetmonSignature(signature string, body []byte, secret string) bool {
	var timestamp string
	var got string
	for _, part := range strings.Split(signature, ",") {
		k, v, ok := strings.Cut(part, "=")
		if !ok {
			continue
		}
		switch k {
		case "t":
			timestamp = v
		case "v1":
			got = v
		}
	}
	if timestamp == "" || got == "" {
		return false
	}
	mac := hmac.New(sha256.New, []byte(secret))
	_, _ = mac.Write([]byte(timestamp))
	_, _ = mac.Write([]byte("."))
	_, _ = mac.Write(body)
	want := hex.EncodeToString(mac.Sum(nil))
	// hmac.Equal gives a constant-time comparison of the hex digests.
	return hmac.Equal([]byte(got), []byte(want))
}

// writeFixtureJSON encodes v as JSON onto the response. Encode failures are
// logged but not surfaced to the client, since headers are already written.
func writeFixtureJSON(w http.ResponseWriter, v any) {
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(v); err != nil {
		log.Printf("jetmon-testsite: encode json: %v", err)
	}
}

// fixtureDelay parses raw as a time.Duration for the /slow endpoint, falling
// back to fallback on empty, invalid, or negative input, and clamping the
// result to a 30-second ceiling.
func fixtureDelay(raw string, fallback time.Duration) time.Duration {
	if raw == "" {
		return fallback
	}
	delay, err := time.ParseDuration(raw)
	if err != nil || delay < 0 {
		return fallback
	}
	if delay > 30*time.Second {
		return 30 * time.Second
	}
	return delay
}

// selfSignedCert generates a throwaway RSA-2048 self-signed certificate for
// the fixture's HTTPS listener. NotBefore is backdated one hour (clock-skew
// tolerance) and the cert expires after 24 hours; SANs cover localhost,
// api-fixture, jetmon-testsite, and 127.0.0.1.
func selfSignedCert() (tls.Certificate, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return tls.Certificate{}, err
	}
	// Random 128-bit serial number.
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return tls.Certificate{}, err
	}
	tmpl := x509.Certificate{
		SerialNumber: serial,
		Subject:      pkix.Name{CommonName: "jetmon-testsite"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageKeyEncipherment | x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		DNSNames:     []string{"localhost", "api-fixture", "jetmon-testsite"},
		IPAddresses:  []net.IP{net.ParseIP("127.0.0.1")},
	}
	certDER, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	keyDER := x509.MarshalPKCS1PrivateKey(key)
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: keyDER})
	return tls.X509KeyPair(certPEM, keyPEM)
}

// healthcheck probes the plain-HTTP /health endpoint on loopback; invoked by
// the "healthcheck" CLI mode in main.
// NOTE(review): the HTTP server binds FIXTURE_HTTP_ADDR but this probe reads
// FIXTURE_HEALTH_PORT; if only FIXTURE_HTTP_ADDR is overridden the probe still
// targets :8091 — confirm the env-var split is intentional.
func healthcheck() error {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://127.0.0.1" + envOrDefault("FIXTURE_HEALTH_PORT", ":8091") + "/health")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health returned %s", resp.Status)
	}
	return nil
}

// envOrDefault returns the value of environment variable name, or fallback
// when it is unset or empty.
func envOrDefault(name, fallback string) string {
	if v := os.Getenv(name); v != "" {
		return v
	}
	return fallback
}
diff --git a/cmd/jetmon-testsite/main_test.go b/cmd/jetmon-testsite/main_test.go
new file mode 100644
index 00000000..1bdbd6db
--- /dev/null
+++ b/cmd/jetmon-testsite/main_test.go
@@ -0,0 +1,149 @@
package main

import (
	"bytes"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"net/http"
"net/http/httptest" + "strings" + "testing" + "time" +) + +func TestFixtureHandlerEndpoints(t *testing.T) { + srv := httptest.NewServer(newFixtureHandler()) + defer srv.Close() + + tests := []struct { + path string + code int + body string + }{ + {path: "/health", code: http.StatusOK, body: "ok"}, + {path: "/ok", code: http.StatusOK, body: "fixture ok"}, + {path: "/keyword", code: http.StatusOK, body: "keyword present"}, + {path: "/status/403", code: http.StatusForbidden, body: "status 403"}, + {path: "/status/500", code: http.StatusInternalServerError, body: "status 500"}, + } + for _, tt := range tests { + t.Run(tt.path, func(t *testing.T) { + resp, err := http.Get(srv.URL + tt.path) + if err != nil { + t.Fatalf("GET %s: %v", tt.path, err) + } + defer resp.Body.Close() + if resp.StatusCode != tt.code { + t.Fatalf("status = %d, want %d", resp.StatusCode, tt.code) + } + buf := make([]byte, 256) + n, _ := resp.Body.Read(buf) + if !strings.Contains(string(buf[:n]), tt.body) { + t.Fatalf("body = %q, want substring %q", string(buf[:n]), tt.body) + } + }) + } +} + +func TestFixtureRedirectAndDelay(t *testing.T) { + srv := httptest.NewServer(newFixtureHandler()) + defer srv.Close() + + client := &http.Client{CheckRedirect: func(*http.Request, []*http.Request) error { + return http.ErrUseLastResponse + }} + resp, err := client.Get(srv.URL + "/redirect") + if err != nil { + t.Fatalf("GET redirect: %v", err) + } + resp.Body.Close() + if resp.StatusCode != http.StatusFound || resp.Header.Get("Location") != "/ok" { + t.Fatalf("redirect status=%d location=%q", resp.StatusCode, resp.Header.Get("Location")) + } + + start := time.Now() + resp, err = http.Get(srv.URL + "/slow?delay=10ms") + if err != nil { + t.Fatalf("GET slow: %v", err) + } + resp.Body.Close() + if elapsed := time.Since(start); elapsed < 10*time.Millisecond { + t.Fatalf("slow endpoint returned too quickly: %s", elapsed) + } +} + +func TestFixtureWebhookReceiverRecordsAndVerifiesSignature(t *testing.T) { + srv := 
httptest.NewServer(newFixtureHandler()) + defer srv.Close() + + secret := "whsec_test_secret" + body := []byte(`{"type":"event.opened"}`) + req, err := http.NewRequest(http.MethodPost, srv.URL+"/webhook?secret="+secret, bytes.NewReader(body)) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-Jetmon-Event", "event.opened") + req.Header.Set("X-Jetmon-Delivery", "123") + req.Header.Set("X-Jetmon-Signature", fixtureTestSignature(1700000000, body, secret)) + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("POST webhook: %v", err) + } + resp.Body.Close() + if resp.StatusCode != http.StatusNoContent { + t.Fatalf("POST status = %d, want 204", resp.StatusCode) + } + + resp, err = http.Get(srv.URL + "/webhook/requests") + if err != nil { + t.Fatalf("GET webhook requests: %v", err) + } + defer resp.Body.Close() + var got struct { + Count int `json:"count"` + Requests []struct { + Event string `json:"event"` + Delivery string `json:"delivery"` + SignatureValid *bool `json:"signature_valid"` + Body string `json:"body"` + } `json:"requests"` + } + if err := json.NewDecoder(resp.Body).Decode(&got); err != nil { + t.Fatalf("decode webhook requests: %v", err) + } + if got.Count != 1 || len(got.Requests) != 1 { + t.Fatalf("requests = %+v, want one", got) + } + if got.Requests[0].Event != "event.opened" || got.Requests[0].Delivery != "123" { + t.Fatalf("request headers = %+v", got.Requests[0]) + } + if got.Requests[0].SignatureValid == nil || !*got.Requests[0].SignatureValid { + t.Fatalf("signature_valid = %v, want true", got.Requests[0].SignatureValid) + } + if got.Requests[0].Body != string(body) { + t.Fatalf("body = %q, want %q", got.Requests[0].Body, string(body)) + } + + req, err = http.NewRequest(http.MethodDelete, srv.URL+"/webhook/requests", nil) + if err != nil { + t.Fatalf("new delete request: %v", err) + } + resp, err = http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("DELETE webhook requests: %v", err) + } + 
resp.Body.Close() + if resp.StatusCode != http.StatusNoContent { + t.Fatalf("DELETE status = %d, want 204", resp.StatusCode) + } +} + +func fixtureTestSignature(ts int64, body []byte, secret string) string { + mac := hmac.New(sha256.New, []byte(secret)) + _, _ = mac.Write([]byte(fmt.Sprintf("%d.", ts))) + _, _ = mac.Write(body) + return fmt.Sprintf("t=%d,v1=%s", ts, hex.EncodeToString(mac.Sum(nil))) +} diff --git a/cmd/jetmon2/api_cli.go b/cmd/jetmon2/api_cli.go new file mode 100644 index 00000000..8f73f712 --- /dev/null +++ b/cmd/jetmon2/api_cli.go @@ -0,0 +1,858 @@ +package main + +import ( + "bytes" + "context" + "encoding/json" + "errors" + "flag" + "fmt" + "io" + "net" + "net/http" + "net/url" + "os" + "sort" + "strconv" + "strings" + "text/tabwriter" + "time" +) + +const defaultAPIBaseURL = "http://localhost:8090" +const defaultAPIAuthPolicy = "same-origin" + +type apiCLIOptions struct { + baseURL string + token string + authPolicy string + allowRemote bool + verbose bool + pretty bool + output string + timeout time.Duration + body string + bodyFile string + idempotencyKey string + headers apiHeaderFlags + out io.Writer + errOut io.Writer + in io.Reader + commandName string +} + +type apiHeaderFlags []string + +type apiHTTPResponse struct { + StatusCode int + Status string + Body []byte +} + +type apiCommandInfo struct { + Command string `json:"command"` + Description string `json:"description"` + Example string `json:"example"` +} + +var apiCommandCatalog = []apiCommandInfo{ + {Command: "health", Description: "check API and database health", Example: "jetmon2 api health --pretty"}, + {Command: "me", Description: "show the authenticated API key identity", Example: "jetmon2 api me --pretty"}, + {Command: "request", Description: "send an arbitrary request to an API path", Example: "jetmon2 api request --output table GET /api/v1/sites"}, + {Command: "sites list", Description: "list monitored sites with filters", Example: "jetmon2 api sites list --limit 20 
--output table"}, + {Command: "sites get", Description: "show one monitored site", Example: "jetmon2 api sites get 12345 --pretty"}, + {Command: "sites create", Description: "create a monitored site", Example: "jetmon2 api sites create --blog-id 12345 --url https://example.com --pretty"}, + {Command: "sites update", Description: "update check settings for a site", Example: "jetmon2 api sites update 12345 --url https://example.com/health --pretty"}, + {Command: "sites delete", Description: "delete a monitored site", Example: "jetmon2 api sites delete 12345"}, + {Command: "sites pause", Description: "pause monitoring for a site", Example: "jetmon2 api sites pause 12345 --idempotency-key site-12345-pause"}, + {Command: "sites resume", Description: "resume monitoring for a site", Example: "jetmon2 api sites resume 12345 --idempotency-key site-12345-resume"}, + {Command: "sites trigger-now", Description: "run an immediate check", Example: "jetmon2 api sites trigger-now 12345 --pretty"}, + {Command: "sites bulk-add", Description: "create bounded local test-site batches", Example: "jetmon2 api sites bulk-add --count 3 --batch local-smoke --dry-run --pretty"}, + {Command: "sites cleanup", Description: "delete deterministic CLI-created site batches", Example: "jetmon2 api sites cleanup --batch local-smoke --count 3 --output table"}, + {Command: "sites simulate-failure", Description: "mutate test sites into known failure modes", Example: "jetmon2 api sites simulate-failure --batch local-smoke --mode http-500 --wait 30s --output table"}, + {Command: "events list", Description: "list events for a site", Example: "jetmon2 api events list 12345 --active=true --output table"}, + {Command: "events get", Description: "show one event", Example: "jetmon2 api events get --site-id 12345 98765 --pretty"}, + {Command: "events transitions", Description: "list event transition history", Example: "jetmon2 api events transitions 12345 98765 --output table"}, + {Command: "events close", 
Description: "manually close an event", Example: "jetmon2 api events close 12345 98765 --reason manual_override --pretty"}, + {Command: "webhooks list", Description: "list webhook registrations", Example: "jetmon2 api webhooks list --output table"}, + {Command: "webhooks create", Description: "create a webhook registration", Example: "jetmon2 api webhooks create --url https://receiver.example.test/jetmon --event event.opened --pretty"}, + {Command: "webhooks deliveries", Description: "list webhook delivery rows", Example: "jetmon2 api webhooks deliveries 77 --status failed --output table"}, + {Command: "webhooks retry", Description: "retry an abandoned webhook delivery", Example: "jetmon2 api webhooks retry 77 555 --idempotency-key webhook-77-555-retry --pretty"}, + {Command: "alert-contacts list", Description: "list managed alert contacts", Example: "jetmon2 api alert-contacts list --output table"}, + {Command: "alert-contacts create", Description: "create an email, PagerDuty, Slack, or Teams contact", Example: "jetmon2 api alert-contacts create --label Local --transport email --address alerts@example.test --pretty"}, + {Command: "alert-contacts test", Description: "send a managed alert-contact test", Example: "jetmon2 api alert-contacts test 12 --idempotency-key alert-12-test --pretty"}, + {Command: "alert-contacts deliveries", Description: "list managed alert delivery rows", Example: "jetmon2 api alert-contacts deliveries 12 --status failed --output table"}, + {Command: "smoke", Description: "run the Docker-local API smoke workflow", Example: "jetmon2 api smoke --batch local-smoke --exercise webhook --pretty"}, + {Command: "commands", Description: "list API CLI commands and examples", Example: "jetmon2 api commands --output table"}, +} + +func (h *apiHeaderFlags) String() string { + return strings.Join(*h, ",") +} + +func (h *apiHeaderFlags) Set(v string) error { + if !strings.Contains(v, ":") { + return fmt.Errorf("header %q must be in Name: Value form", v) + } 
+ *h = append(*h, v) + return nil +} + +func cmdAPI(args []string) { + if len(args) == 0 { + printAPIUsage(os.Stderr) + os.Exit(1) + } + + sub := args[0] + rest := args[1:] + var err error + switch sub { + case "health": + err = cmdAPIHealth(rest) + case "me": + err = cmdAPIMe(rest) + case "request": + err = cmdAPIRequest(rest) + case "commands": + err = cmdAPICommands(rest) + case "sites": + err = cmdAPISites(rest) + case "events": + err = cmdAPIEvents(rest) + case "webhooks": + err = cmdAPIWebhooks(rest) + case "alert-contacts": + err = cmdAPIAlertContacts(rest) + case "smoke": + err = cmdAPISmoke(rest) + default: + fmt.Fprintf(os.Stderr, "unknown api subcommand %q (want: health, me, request, commands, sites, events, webhooks, alert-contacts, smoke)\n", sub) + printAPIUsage(os.Stderr) + os.Exit(1) + } + if err != nil { + logAPIErrorAndExit(err) + } +} + +func printAPIUsage(w io.Writer) { + fmt.Fprintln(w, "usage: jetmon2 api [flags]") + fmt.Fprintln(w) + fmt.Fprintln(w, "Run `jetmon2 api commands --output table` for the command catalog.") + fmt.Fprintln(w) + fmt.Fprintln(w, "Environment:") + fmt.Fprintln(w, " JETMON_API_URL API base URL (default: http://localhost:8090)") + fmt.Fprintln(w, " JETMON_API_TOKEN Bearer token for authenticated routes") + fmt.Fprintln(w, " JETMON_API_AUTH_POLICY automatic auth policy: same-origin or any-origin (default: same-origin)") +} + +func cmdAPIHealth(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api health", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return fmt.Errorf("usage: jetmon2 api health [flags]") + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, "/api/v1/health", nil) +} + +func cmdAPIMe(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api me", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return fmt.Errorf("usage: jetmon2 api me 
[flags]") + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, "/api/v1/me", nil) +} + +func cmdAPIRequest(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api request", &opts) + fs.StringVar(&opts.body, "body", "", "literal request body") + fs.StringVar(&opts.bodyFile, "body-file", "", "file containing request body (- reads stdin)") + fs.StringVar(&opts.idempotencyKey, "idempotency-key", "", "Idempotency-Key header for POST retries") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 2 { + return fmt.Errorf("usage: jetmon2 api request [flags] ") + } + + body, err := readAPIRequestBody(opts) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, fs.Arg(0), fs.Arg(1), body) +} + +func cmdAPICommands(args []string) error { + opts := defaultAPIOptions() + opts.output = "table" + fs := newAPIFlagSet("api commands", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return fmt.Errorf("usage: jetmon2 api commands [flags]") + } + return writeAPICommands(opts) +} + +func writeAPICommands(opts apiCLIOptions) error { + return writeAPIValueOutput(opts.out, map[string]any{"commands": apiCommandCatalog}, opts) +} + +func defaultAPIOptions() apiCLIOptions { + return apiCLIOptions{ + baseURL: envOrDefault("JETMON_API_URL", defaultAPIBaseURL), + token: os.Getenv("JETMON_API_TOKEN"), + authPolicy: envOrDefault("JETMON_API_AUTH_POLICY", defaultAPIAuthPolicy), + timeout: 10 * time.Second, + out: os.Stdout, + errOut: os.Stderr, + in: os.Stdin, + } +} + +func newAPIFlagSet(name string, opts *apiCLIOptions) *flag.FlagSet { + opts.commandName = name + fs := flag.NewFlagSet(name, flag.ContinueOnError) + fs.SetOutput(opts.errOut) + fs.StringVar(&opts.baseURL, "base-url", opts.baseURL, "API base URL") + fs.StringVar(&opts.token, "token", opts.token, "Bearer token") + if tokenFlag := fs.Lookup("token"); tokenFlag != nil 
	{
		// Hide the env-derived token default so it never leaks into --help output.
		tokenFlag.DefValue = ""
	}
	fs.StringVar(&opts.authPolicy, "auth-policy", opts.authPolicy, "automatic auth policy: same-origin or any-origin")
	fs.BoolVar(&opts.allowRemote, "allow-remote", opts.allowRemote, "allow writes to a non-local API base URL")
	// -v and --verbose are aliases bound to the same field.
	fs.BoolVar(&opts.verbose, "v", false, "print request and response headers to stderr")
	fs.BoolVar(&opts.verbose, "verbose", false, "print request and response headers to stderr")
	fs.BoolVar(&opts.pretty, "pretty", false, "pretty-print JSON response bodies")
	defaultOutput := opts.output
	if defaultOutput == "" {
		defaultOutput = "json"
	}
	fs.StringVar(&opts.output, "output", defaultOutput, "response output format: json or table")
	fs.DurationVar(&opts.timeout, "timeout", opts.timeout, "request timeout")
	fs.Var(&opts.headers, "header", "additional request header in Name: Value form (repeatable)")
	fs.Usage = func() {
		printAPIFlagUsage(fs.Output(), fs)
	}
	return fs
}

// apiBoolFlag matches flag.Value implementations that report themselves as
// boolean (the stdlib's bool flags implement IsBoolFlag).
type apiBoolFlag interface {
	IsBoolFlag() bool
}

// parseAPIFlags parses args after normalizing them so flags may appear after
// positional arguments, which the stdlib flag package does not allow.
func parseAPIFlags(fs *flag.FlagSet, args []string) error {
	normalized := normalizeAPIFlagArgs(fs, args)
	return fs.Parse(normalized)
}

// normalizeAPIFlagArgs reorders args into "all flags first, then all
// positionals" so invocations like `sites get 12345 --pretty` parse. Rules:
//   - "--" stops flag collection and is re-emitted before the positionals.
//   - A bare "-" is treated as a positional.
//   - A flag unknown to fs is passed through in place so fs.Parse reports it.
//   - A known non-boolean flag consumes the following argument as its value
//     unless the value was attached with "=".
func normalizeAPIFlagArgs(fs *flag.FlagSet, args []string) []string {
	flags := []string{}
	positionals := []string{}
	onlyPositionals := false
	hasTerminator := false
	for i := 0; i < len(args); i++ {
		arg := args[i]
		if onlyPositionals || arg == "-" || !strings.HasPrefix(arg, "-") {
			positionals = append(positionals, arg)
			continue
		}
		if arg == "--" {
			onlyPositionals = true
			hasTerminator = true
			continue
		}

		name, hasValue := apiFlagName(arg)
		f := fs.Lookup(name)
		if f == nil {
			flags = append(flags, arg)
			continue
		}
		flags = append(flags, arg)
		if hasValue || apiFlagIsBool(f) {
			continue
		}
		if i+1 < len(args) {
			i++
			flags = append(flags, args[i])
		}
	}
	if hasTerminator {
		flags = append(flags, "--")
	}
	return append(flags, positionals...)
+} + +func apiFlagName(arg string) (string, bool) { + name := strings.TrimLeft(arg, "-") + if idx := strings.IndexByte(name, '='); idx >= 0 { + return name[:idx], true + } + return name, false +} + +func apiFlagIsBool(f *flag.Flag) bool { + bf, ok := f.Value.(apiBoolFlag) + return ok && bf.IsBoolFlag() +} + +func printAPIFlagUsage(w io.Writer, fs *flag.FlagSet) { + fmt.Fprintf(w, "Usage of %s:\n", fs.Name()) + printAPIFlagDefaults(w, fs) +} + +func printAPIFlagDefaults(w io.Writer, fs *flag.FlagSet) { + flags := []*flag.Flag{} + fs.VisitAll(func(f *flag.Flag) { + flags = append(flags, f) + }) + sort.Slice(flags, func(i, j int) bool { + return flags[i].Name < flags[j].Name + }) + + for _, f := range flags { + valueName, usage := flag.UnquoteUsage(f) + prefix := "--" + if len(f.Name) == 1 { + prefix = "-" + } + fmt.Fprintf(w, " %s%s", prefix, f.Name) + if valueName != "" { + fmt.Fprintf(w, " %s", valueName) + } + fmt.Fprintf(w, "\n \t%s", usage) + if defaultValue := apiFlagDefaultValue(f, valueName); defaultValue != "" { + fmt.Fprintf(w, " (default %s)", defaultValue) + } + fmt.Fprintln(w) + } +} + +func apiFlagDefaultValue(f *flag.Flag, valueName string) string { + if f.DefValue == "" || f.DefValue == "0" || f.DefValue == "0s" || f.DefValue == "false" { + return "" + } + if valueName == "string" { + return strconv.Quote(f.DefValue) + } + return f.DefValue +} + +func readAPIRequestBody(opts apiCLIOptions) ([]byte, error) { + if opts.body != "" && opts.bodyFile != "" { + return nil, errors.New("use --body or --body-file, not both") + } + if opts.body != "" { + return []byte(opts.body), nil + } + if opts.bodyFile == "" { + return nil, nil + } + if opts.bodyFile == "-" { + return io.ReadAll(opts.in) + } + return os.ReadFile(opts.bodyFile) +} + +func executeAPIRequest(ctx context.Context, client *http.Client, opts apiCLIOptions, method, target string, body []byte) error { + if opts.out == nil { + opts.out = io.Discard + } + if err := validateAPIOutputFormat(opts.output); 
err != nil { + return err + } + resp, err := doAPIRequest(ctx, client, opts, method, target, body) + if err != nil { + return err + } + if err := writeAPIOutput(opts.out, resp.Body, opts); err != nil { + return err + } + if resp.StatusCode >= 400 { + return fmt.Errorf("api returned %s", resp.Status) + } + return nil +} + +func doAPIRequest(ctx context.Context, client *http.Client, opts apiCLIOptions, method, target string, body []byte) (apiHTTPResponse, error) { + if opts.errOut == nil { + opts.errOut = io.Discard + } + if opts.timeout <= 0 { + opts.timeout = 10 * time.Second + } + if client == nil { + client = &http.Client{Timeout: opts.timeout} + } + + requestURL, err := apiRequestURL(opts.baseURL, target) + if err != nil { + return apiHTTPResponse{}, err + } + if apiMethodRequiresRemoteWriteGuard(method) { + if _, err := requireAPILocalURLOrAllowRemote(requestURL, opts.allowRemote, apiRemoteGuardCommand(opts)); err != nil { + return apiHTTPResponse{}, err + } + } + + var bodyReader io.Reader + if len(body) > 0 { + bodyReader = bytes.NewReader(body) + } + req, err := http.NewRequestWithContext(ctx, strings.ToUpper(method), requestURL, bodyReader) + if err != nil { + return apiHTTPResponse{}, err + } + sendAutoAuth, err := shouldSendAPIAutoAuth(opts.baseURL, requestURL, opts.authPolicy) + if err != nil { + return apiHTTPResponse{}, err + } + applyAPIRequestHeaders(req, opts, len(body) > 0, sendAutoAuth) + + if opts.verbose { + writeAPIRequestHeaders(opts.errOut, req) + } + + resp, err := client.Do(req) + if err != nil { + return apiHTTPResponse{}, err + } + defer resp.Body.Close() + + if opts.verbose { + writeAPIResponseHeaders(opts.errOut, resp) + } + + respBody, err := io.ReadAll(resp.Body) + if err != nil { + return apiHTTPResponse{}, err + } + return apiHTTPResponse{ + StatusCode: resp.StatusCode, + Status: resp.Status, + Body: respBody, + }, nil +} + +func apiRequestURL(baseURL, target string) (string, error) { + if strings.TrimSpace(target) == "" { + return 
"", errors.New("request path is required") + } + if u, err := url.Parse(target); err == nil && u.IsAbs() { + return u.String(), nil + } + + base, err := url.Parse(strings.TrimRight(baseURL, "/")) + if err != nil { + return "", fmt.Errorf("invalid API base URL %q: %w", baseURL, err) + } + if !base.IsAbs() || base.Host == "" { + return "", fmt.Errorf("invalid API base URL %q: must include scheme and host", baseURL) + } + rel, err := url.Parse(target) + if err != nil { + return "", fmt.Errorf("invalid API path %q: %w", target, err) + } + if !strings.HasPrefix(rel.Path, "/") { + rel.Path = "/" + rel.Path + } + return base.ResolveReference(rel).String(), nil +} + +func shouldSendAPIAutoAuth(baseURL, requestURL, policy string) (bool, error) { + policy = strings.ToLower(strings.TrimSpace(policy)) + if policy == "" { + policy = defaultAPIAuthPolicy + } + switch policy { + case "any-origin": + return true, nil + case "same-origin": + base, err := url.Parse(strings.TrimRight(baseURL, "/")) + if err != nil { + return false, fmt.Errorf("invalid API base URL %q: %w", baseURL, err) + } + target, err := url.Parse(requestURL) + if err != nil { + return false, fmt.Errorf("invalid request URL %q: %w", requestURL, err) + } + return sameAPIOrigin(base, target), nil + default: + return false, fmt.Errorf("invalid auth policy %q (want: same-origin or any-origin)", policy) + } +} + +func sameAPIOrigin(a, b *url.URL) bool { + if a == nil || b == nil { + return false + } + return strings.EqualFold(a.Scheme, b.Scheme) && strings.EqualFold(a.Host, b.Host) +} + +func apiMethodRequiresRemoteWriteGuard(method string) bool { + switch strings.ToUpper(strings.TrimSpace(method)) { + case http.MethodPost, http.MethodPut, http.MethodPatch, http.MethodDelete: + return true + default: + return false + } +} + +func apiRemoteGuardCommand(opts apiCLIOptions) string { + if strings.TrimSpace(opts.commandName) != "" { + return strings.TrimSpace(opts.commandName) + } + return "api" +} + +func 
requireAPILocalOrAllowRemote(opts apiCLIOptions, allowRemote bool, command string) (bool, error) {
	// Convenience wrapper that applies the remote-write guard to the
	// configured base URL from opts.
	return requireAPILocalURLOrAllowRemote(opts.baseURL, allowRemote, command)
}

// requireAPILocalURLOrAllowRemote returns (remote, err): remote is true when
// rawURL does not point at localhost or a loopback IP, and err is non-nil
// when the URL is remote and allowRemote is false, or when the URL itself is
// invalid. command is used only to attribute the refusal message.
func requireAPILocalURLOrAllowRemote(rawURL string, allowRemote bool, command string) (bool, error) {
	local, err := isLocalAPIURL(rawURL)
	if err != nil {
		return false, err
	}
	if local {
		return false, nil
	}
	if allowRemote {
		return true, nil
	}
	return true, fmt.Errorf("%s refuses to modify non-local API URL %q without --allow-remote (local means localhost or loopback IP)", command, rawURL)
}

// isLocalAPIURL reports whether rawURL targets "localhost", a "*.localhost"
// name, or a loopback IP. The URL must be absolute with a host; a trailing
// dot on the hostname is ignored and comparison is case-insensitive.
func isLocalAPIURL(rawURL string) (bool, error) {
	u, err := url.Parse(strings.TrimSpace(rawURL))
	if err != nil {
		return false, fmt.Errorf("invalid API URL %q: %w", rawURL, err)
	}
	if !u.IsAbs() || u.Host == "" {
		return false, fmt.Errorf("invalid API URL %q: must include scheme and host", rawURL)
	}
	host := strings.ToLower(strings.TrimSuffix(u.Hostname(), "."))
	if host == "localhost" || strings.HasSuffix(host, ".localhost") {
		return true, nil
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback(), nil
}

// applyAPIRequestHeaders sets the standard headers on req: JSON Accept (plus
// Content-Type when a body is present), then — only when sendAutoAuth allows —
// the bearer token and Idempotency-Key, and finally any explicit --header
// values, which override anything set before them. Malformed --header entries
// (no colon) are silently skipped.
func applyAPIRequestHeaders(req *http.Request, opts apiCLIOptions, hasBody bool, sendAutoAuth bool) {
	req.Header.Set("Accept", "application/json")
	if hasBody {
		req.Header.Set("Content-Type", "application/json")
	}
	if sendAutoAuth && strings.TrimSpace(opts.token) != "" {
		req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(opts.token))
	}
	if sendAutoAuth && strings.TrimSpace(opts.idempotencyKey) != "" {
		req.Header.Set("Idempotency-Key", strings.TrimSpace(opts.idempotencyKey))
	}
	for _, raw := range opts.headers {
		name, value, ok := strings.Cut(raw, ":")
		if !ok {
			continue
		}
		req.Header.Set(strings.TrimSpace(name), strings.TrimSpace(value))
	}
}

// writeAPIRequestHeaders prints a curl-style ("> ") dump of the outgoing
// request line and headers to w; sensitive header values are redacted by
// writeSortedHeaders.
func writeAPIRequestHeaders(w io.Writer, req *http.Request) {
	path := req.URL.RequestURI()
	if path == "" {
		path = "/"
	}
	fmt.Fprintf(w, "> %s %s %s\n",
req.Method, path, req.Proto) + fmt.Fprintf(w, "> Host: %s\n", req.URL.Host) + writeSortedHeaders(w, "> ", req.Header) + fmt.Fprintln(w, ">") +} + +func writeAPIResponseHeaders(w io.Writer, resp *http.Response) { + fmt.Fprintf(w, "< %s %s\n", resp.Proto, resp.Status) + writeSortedHeaders(w, "< ", resp.Header) + fmt.Fprintln(w, "<") +} + +func writeSortedHeaders(w io.Writer, prefix string, h http.Header) { + keys := make([]string, 0, len(h)) + for k := range h { + keys = append(keys, k) + } + sort.Strings(keys) + for _, k := range keys { + for _, v := range h.Values(k) { + if isSensitiveAPIHeader(k) { + v = "[redacted]" + } + fmt.Fprintf(w, "%s%s: %s\n", prefix, k, v) + } + } +} + +func isSensitiveAPIHeader(name string) bool { + switch strings.ToLower(strings.TrimSpace(name)) { + case "authorization", "proxy-authorization", "idempotency-key", "cookie", "set-cookie", "x-api-key": + return true + default: + return false + } +} + +func writeAPIResponseBody(w io.Writer, body []byte, pretty bool) error { + body = bytes.TrimSpace(body) + if len(body) == 0 { + return nil + } + if pretty && json.Valid(body) { + var formatted bytes.Buffer + if err := json.Indent(&formatted, body, "", " "); err != nil { + return err + } + body = formatted.Bytes() + } + if _, err := w.Write(body); err != nil { + return err + } + if !bytes.HasSuffix(body, []byte("\n")) { + _, err := fmt.Fprintln(w) + return err + } + return nil +} + +func writeAPIValueOutput(w io.Writer, value any, opts apiCLIOptions) error { + if w == nil { + w = io.Discard + } + body, err := json.Marshal(value) + if err != nil { + return err + } + return writeAPIOutput(w, body, opts) +} + +func writeAPIOutput(w io.Writer, body []byte, opts apiCLIOptions) error { + if err := validateAPIOutputFormat(opts.output); err != nil { + return err + } + switch opts.output { + case "", "json": + return writeAPIResponseBody(w, body, opts.pretty) + case "table": + return writeAPIResponseTable(w, body) + } + return nil +} + +func 
validateAPIOutputFormat(output string) error { + switch output { + case "", "json", "table": + return nil + default: + return fmt.Errorf("output must be one of: json, table") + } +} + +func writeAPIResponseTable(w io.Writer, body []byte) error { + body = bytes.TrimSpace(body) + if len(body) == 0 { + return nil + } + var value any + if err := json.Unmarshal(body, &value); err != nil { + return err + } + rows := apiTableRows(value) + if len(rows) == 0 { + _, err := fmt.Fprintln(w, "no rows") + return err + } + columns := apiTableColumns(rows) + tw := tabwriter.NewWriter(w, 0, 0, 2, ' ', 0) + for i, col := range columns { + if i > 0 { + fmt.Fprint(tw, "\t") + } + fmt.Fprint(tw, col) + } + fmt.Fprintln(tw) + for _, row := range rows { + for i, col := range columns { + if i > 0 { + fmt.Fprint(tw, "\t") + } + fmt.Fprint(tw, apiTableValue(row[col])) + } + fmt.Fprintln(tw) + } + return tw.Flush() +} + +func apiTableRows(value any) []map[string]any { + switch v := value.(type) { + case map[string]any: + if rows := apiWorkflowTableRows(v); len(rows) > 0 { + return rows + } + for _, key := range []string{"data", "created", "sites", "steps", "commands"} { + if data, ok := v[key].([]any); ok { + return apiRowsFromArray(data) + } + } + return []map[string]any{v} + case []any: + return apiRowsFromArray(v) + default: + return nil + } +} + +func apiWorkflowTableRows(value map[string]any) []map[string]any { + steps, ok := value["steps"].([]any) + if !ok { + return nil + } + rows := make([]map[string]any, 0, len(steps)) + for _, item := range steps { + step, ok := item.(map[string]any) + if !ok { + continue + } + row := map[string]any{"kind": "step"} + for k, v := range step { + row[k] = v + } + rows = append(rows, row) + } + cleanupResults, _ := value["cleanup_results"].([]any) + for _, item := range cleanupResults { + cleanup, ok := item.(map[string]any) + if !ok { + continue + } + row := map[string]any{ + "kind": "cleanup", + "name": cleanup["resource"], + "id": cleanup["id"], + 
"status": cleanup["status"], + } + if errText, ok := cleanup["error"]; ok { + row["detail"] = errText + } + rows = append(rows, row) + } + return rows +} + +func apiRowsFromArray(data []any) []map[string]any { + rows := make([]map[string]any, 0, len(data)) + for _, item := range data { + row, ok := item.(map[string]any) + if !ok { + continue + } + rows = append(rows, row) + } + return rows +} + +func apiTableColumns(rows []map[string]any) []string { + best := []string{} + for _, cols := range [][]string{ + {"id", "blog_id", "monitor_url", "monitor_active", "current_state", "current_severity", "active_event_id"}, + {"blog_id", "monitor_url", "monitor_active", "request_method", "detection_profile", "check_keyword", "redirect_policy", "timeout_seconds"}, + {"id", "site_id", "check_type", "state", "severity", "started_at", "ended_at"}, + {"id", "url", "active", "events", "secret_preview", "created_at"}, + {"id", "label", "active", "transport", "min_severity", "max_per_hour", "destination_preview"}, + {"id", "status", "attempt", "event_id", "event_type", "last_status_code", "created_at"}, + {"site_id", "status", "error"}, + {"site_id", "action", "trigger_status", "event_ids", "event_states", "event_severities", "transition_count", "note", "error"}, + {"site_id", "action", "note", "error"}, + {"kind", "name", "id", "status", "detail"}, + {"command", "description", "example"}, + {"name", "status", "detail"}, + } { + present := apiColumnsPresent(rows, cols) + if len(present) > len(best) { + best = present + } + } + if len(best) > 0 { + return best + } + seen := map[string]struct{}{} + for _, row := range rows { + for k := range row { + seen[k] = struct{}{} + } + } + cols := make([]string, 0, len(seen)) + for k := range seen { + cols = append(cols, k) + } + sort.Strings(cols) + return cols +} + +func apiColumnsPresent(rows []map[string]any, cols []string) []string { + out := []string{} + for _, col := range cols { + for _, row := range rows { + if _, ok := row[col]; ok { + 
out = append(out, col) + break + } + } + } + return out +} + +func apiTableValue(v any) string { + switch value := v.(type) { + case nil: + return "" + case string: + return value + case bool: + return fmt.Sprintf("%t", value) + case float64: + if value == float64(int64(value)) { + return fmt.Sprintf("%d", int64(value)) + } + return fmt.Sprintf("%g", value) + case []any: + parts := make([]string, 0, len(value)) + for _, item := range value { + parts = append(parts, apiTableValue(item)) + } + return strings.Join(parts, ",") + default: + b, err := json.Marshal(value) + if err != nil { + return fmt.Sprint(value) + } + return string(b) + } +} + +func logAPIErrorAndExit(err error) { + if errors.Is(err, flag.ErrHelp) { + os.Exit(0) + } + fmt.Fprintf(os.Stderr, "api: %v\n", err) + os.Exit(1) +} diff --git a/cmd/jetmon2/api_cli_alert_contacts.go b/cmd/jetmon2/api_cli_alert_contacts.go new file mode 100644 index 00000000..8dea9822 --- /dev/null +++ b/cmd/jetmon2/api_cli_alert_contacts.go @@ -0,0 +1,405 @@ +package main + +import ( + "context" + "encoding/json" + "errors" + "flag" + "fmt" + "net/http" + "net/url" + "strconv" + "strings" +) + +type apiAlertContactCreateOptions struct { + label string + active apiOptionalBoolFlag + transport string + destination apiAlertDestinationOptions + siteIDs apiInt64SliceFlags + minSeverity apiOptionalStringFlag + maxPerHour apiOptionalIntFlag +} + +type apiAlertContactUpdateOptions struct { + label apiOptionalStringFlag + active apiOptionalBoolFlag + destination apiAlertDestinationOptions + siteIDs apiInt64SliceFlags + clearSites bool + minSeverity apiOptionalStringFlag + maxPerHour apiOptionalIntFlag +} + +type apiAlertDestinationOptions struct { + raw string + address string + integrationKey string + webhookURL string +} + +type apiAlertDeliveriesFilters struct { + cursor string + limit int + status string +} + +type apiAlertContactSiteFilter struct { + SiteIDs []int64 `json:"site_ids,omitempty"` +} + +type 
apiAlertContactCreateRequest struct { + Label string `json:"label"` + Active *bool `json:"active,omitempty"` + Transport string `json:"transport"` + Destination json.RawMessage `json:"destination"` + SiteFilter apiAlertContactSiteFilter `json:"site_filter"` + MinSeverity *string `json:"min_severity,omitempty"` + MaxPerHour *int `json:"max_per_hour,omitempty"` +} + +type apiAlertContactUpdateRequest struct { + Label *string `json:"label,omitempty"` + Active *bool `json:"active,omitempty"` + Destination json.RawMessage `json:"destination,omitempty"` + SiteFilter *apiAlertContactSiteFilter `json:"site_filter,omitempty"` + MinSeverity *string `json:"min_severity,omitempty"` + MaxPerHour *int `json:"max_per_hour,omitempty"` +} + +func cmdAPIAlertContacts(args []string) error { + if len(args) == 0 { + return errors.New("usage: jetmon2 api alert-contacts [flags]") + } + + sub := args[0] + rest := args[1:] + switch sub { + case "list": + return cmdAPIAlertContactsList(rest) + case "get": + return cmdAPIAlertContactsGet(rest) + case "create": + return cmdAPIAlertContactsCreate(rest) + case "update": + return cmdAPIAlertContactsUpdate(rest) + case "delete": + return cmdAPIAlertContactsDelete(rest) + case "test": + return cmdAPIAlertContactsTest(rest) + case "deliveries": + return cmdAPIAlertContactsDeliveries(rest) + case "retry": + return cmdAPIAlertContactsRetry(rest) + default: + return fmt.Errorf("unknown api alert-contacts subcommand %q (want: list, get, create, update, delete, test, deliveries, retry)", sub) + } +} + +func cmdAPIAlertContactsList(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api alert-contacts list", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api alert-contacts list [flags]") + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, "/api/v1/alert-contacts", nil) +} + +func cmdAPIAlertContactsGet(args []string) 
error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api alert-contacts get", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api alert-contacts get [flags] ") + } + target, err := apiAlertContactPath(fs.Arg(0), "") + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPIAlertContactsCreate(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api alert-contacts create", &opts) + addAPIIdempotencyFlag(fs, &opts) + create := apiAlertContactCreateOptions{} + fs.StringVar(&create.label, "label", "", "alert contact label") + fs.Var(&create.active, "active", "alert contact enabled: true or false") + fs.StringVar(&create.transport, "transport", "", "transport: email, pagerduty, slack, or teams") + addAPIAlertDestinationFlags(fs, &create.destination) + fs.Var(&create.siteIDs, "site-id", "site id filter (repeatable or comma-separated)") + fs.Var(&create.minSeverity, "min-severity", "minimum severity: Up, Warning, Degraded, SeemsDown, or Down") + fs.Var(&create.maxPerHour, "max-per-hour", "maximum notifications per hour, 0 for unlimited") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api alert-contacts create [flags]") + } + body, err := marshalAPIAlertContactCreateBody(create) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, "/api/v1/alert-contacts", body) +} + +func cmdAPIAlertContactsUpdate(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api alert-contacts update", &opts) + update := apiAlertContactUpdateOptions{} + fs.Var(&update.label, "label", "alert contact label") + fs.Var(&update.active, "active", "alert contact enabled: true or false") + addAPIAlertDestinationFlags(fs, &update.destination) + 
fs.Var(&update.siteIDs, "site-id", "site id filter (repeatable or comma-separated)") + fs.BoolVar(&update.clearSites, "clear-sites", false, "clear site filters") + fs.Var(&update.minSeverity, "min-severity", "minimum severity: Up, Warning, Degraded, SeemsDown, or Down") + fs.Var(&update.maxPerHour, "max-per-hour", "maximum notifications per hour, 0 for unlimited") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api alert-contacts update [flags] ") + } + target, err := apiAlertContactPath(fs.Arg(0), "") + if err != nil { + return err + } + body, err := marshalAPIAlertContactUpdateBody(update) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPatch, target, body) +} + +func cmdAPIAlertContactsDelete(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api alert-contacts delete", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api alert-contacts delete [flags] ") + } + target, err := apiAlertContactPath(fs.Arg(0), "") + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodDelete, target, nil) +} + +func cmdAPIAlertContactsTest(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api alert-contacts test", &opts) + addAPIIdempotencyFlag(fs, &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api alert-contacts test [flags] ") + } + target, err := apiAlertContactPath(fs.Arg(0), "test") + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, target, nil) +} + +func cmdAPIAlertContactsDeliveries(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api alert-contacts deliveries", &opts) + filters := 
apiAlertDeliveriesFilters{} + fs.StringVar(&filters.cursor, "cursor", "", "pagination cursor") + fs.IntVar(&filters.limit, "limit", 0, "page size (1-200)") + fs.StringVar(&filters.status, "status", "", "delivery status: pending, delivered, failed, or abandoned") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api alert-contacts deliveries [flags] ") + } + target, err := apiAlertContactDeliveriesPath(fs.Arg(0), filters) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPIAlertContactsRetry(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api alert-contacts retry", &opts) + addAPIIdempotencyFlag(fs, &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 2 { + return errors.New("usage: jetmon2 api alert-contacts retry [flags] ") + } + target, err := apiAlertContactRetryPath(fs.Arg(0), fs.Arg(1)) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, target, nil) +} + +func addAPIAlertDestinationFlags(fs *flag.FlagSet, dest *apiAlertDestinationOptions) { + fs.StringVar(&dest.raw, "destination", "", "raw destination JSON") + fs.StringVar(&dest.address, "address", "", "email destination address") + fs.StringVar(&dest.integrationKey, "integration-key", "", "PagerDuty Events API v2 integration key") + fs.StringVar(&dest.webhookURL, "webhook-url", "", "Slack or Teams incoming webhook URL") +} + +func apiAlertContactPath(rawID, suffix string) (string, error) { + id, err := apiPositiveID(rawID, "alert contact") + if err != nil { + return "", err + } + path := "/api/v1/alert-contacts/" + strconv.FormatInt(id, 10) + if suffix != "" { + path += "/" + strings.TrimPrefix(suffix, "/") + } + return path, nil +} + +func apiAlertContactDeliveriesPath(rawID string, filters apiAlertDeliveriesFilters) 
(string, error) { + path, err := apiAlertContactPath(rawID, "deliveries") + if err != nil { + return "", err + } + if filters.limit < 0 { + return "", errors.New("limit must be positive") + } + + values := url.Values{} + if filters.cursor != "" { + values.Set("cursor", filters.cursor) + } + if filters.limit > 0 { + values.Set("limit", strconv.Itoa(filters.limit)) + } + if filters.status != "" { + switch filters.status { + case "pending", "delivered", "failed", "abandoned": + values.Set("status", filters.status) + default: + return "", errors.New("status must be one of: pending, delivered, failed, abandoned") + } + } + if len(values) == 0 { + return path, nil + } + return path + "?" + values.Encode(), nil +} + +func apiAlertContactRetryPath(rawContactID, rawDeliveryID string) (string, error) { + contactID, err := apiPositiveID(rawContactID, "alert contact") + if err != nil { + return "", err + } + deliveryID, err := apiPositiveID(rawDeliveryID, "delivery") + if err != nil { + return "", err + } + return fmt.Sprintf("/api/v1/alert-contacts/%d/deliveries/%d/retry", contactID, deliveryID), nil +} + +func marshalAPIAlertContactCreateBody(opts apiAlertContactCreateOptions) ([]byte, error) { + if strings.TrimSpace(opts.label) == "" { + return nil, errors.New("label is required") + } + if strings.TrimSpace(opts.transport) == "" { + return nil, errors.New("transport is required") + } + destination, err := opts.destination.rawForTransport(opts.transport, true) + if err != nil { + return nil, err + } + req := apiAlertContactCreateRequest{ + Label: opts.label, + Active: opts.active.ptr(), + Transport: opts.transport, + Destination: destination, + SiteFilter: apiAlertContactSiteFilter{SiteIDs: opts.siteIDs.valuesOrEmpty()}, + MinSeverity: opts.minSeverity.ptr(), + MaxPerHour: opts.maxPerHour.ptr(), + } + return json.Marshal(req) +} + +func marshalAPIAlertContactUpdateBody(opts apiAlertContactUpdateOptions) ([]byte, error) { + if opts.clearSites && opts.siteIDs.set { + return 
nil, errors.New("use --site-id or --clear-sites, not both") + } + destination, err := opts.destination.rawForTransport("", false) + if err != nil { + return nil, err + } + req := apiAlertContactUpdateRequest{ + Label: opts.label.ptr(), + Active: opts.active.ptr(), + Destination: destination, + MinSeverity: opts.minSeverity.ptr(), + MaxPerHour: opts.maxPerHour.ptr(), + } + if opts.siteIDs.set || opts.clearSites { + req.SiteFilter = &apiAlertContactSiteFilter{SiteIDs: opts.siteIDs.valuesOrEmpty()} + } + return json.Marshal(req) +} + +func (opts apiAlertDestinationOptions) rawForTransport(transport string, required bool) (json.RawMessage, error) { + set := 0 + for _, v := range []string{opts.raw, opts.address, opts.integrationKey, opts.webhookURL} { + if strings.TrimSpace(v) != "" { + set++ + } + } + if set == 0 { + if required { + return nil, errors.New("destination is required") + } + return nil, nil + } + if set > 1 { + return nil, errors.New("use only one destination flag") + } + if opts.raw != "" { + if !json.Valid([]byte(opts.raw)) { + return nil, errors.New("destination must be valid JSON") + } + return json.RawMessage(opts.raw), nil + } + + var value any + switch { + case opts.address != "": + if transport != "" && transport != "email" { + return nil, errors.New("--address requires --transport email") + } + value = map[string]string{"address": opts.address} + case opts.integrationKey != "": + if transport != "" && transport != "pagerduty" { + return nil, errors.New("--integration-key requires --transport pagerduty") + } + value = map[string]string{"integration_key": opts.integrationKey} + case opts.webhookURL != "": + if transport != "" && transport != "slack" && transport != "teams" { + return nil, errors.New("--webhook-url requires --transport slack or teams") + } + value = map[string]string{"webhook_url": opts.webhookURL} + default: + return nil, errors.New("destination is required") + } + + b, err := json.Marshal(value) + if err != nil { + return nil, err 
+ } + return json.RawMessage(b), nil +} diff --git a/cmd/jetmon2/api_cli_alert_contacts_test.go b/cmd/jetmon2/api_cli_alert_contacts_test.go new file mode 100644 index 00000000..eb0be12e --- /dev/null +++ b/cmd/jetmon2/api_cli_alert_contacts_test.go @@ -0,0 +1,196 @@ +package main + +import ( + "encoding/json" + "net/url" + "testing" +) + +func TestMarshalAPIAlertContactCreateBody(t *testing.T) { + var active apiOptionalBoolFlag + setTestFlag(t, &active, "false") + var siteIDs apiInt64SliceFlags + setTestFlag(t, &siteIDs, "42,99") + var minSeverity apiOptionalStringFlag + setTestFlag(t, &minSeverity, "Warning") + var maxPerHour apiOptionalIntFlag + setTestFlag(t, &maxPerHour, "0") + + body, err := marshalAPIAlertContactCreateBody(apiAlertContactCreateOptions{ + label: "ops-email", + active: active, + transport: "email", + destination: apiAlertDestinationOptions{address: "ops@example.com"}, + siteIDs: siteIDs, + minSeverity: minSeverity, + maxPerHour: maxPerHour, + }) + if err != nil { + t.Fatalf("marshalAPIAlertContactCreateBody() error = %v", err) + } + var got map[string]any + if err := json.Unmarshal(body, &got); err != nil { + t.Fatalf("unmarshal body: %v", err) + } + if got["label"] != "ops-email" { + t.Fatalf("label = %#v", got["label"]) + } + if got["active"] != false { + t.Fatalf("active = %#v, want false", got["active"]) + } + if got["transport"] != "email" { + t.Fatalf("transport = %#v, want email", got["transport"]) + } + dest := got["destination"].(map[string]any) + if dest["address"] != "ops@example.com" { + t.Fatalf("destination.address = %#v", dest["address"]) + } + siteFilter := got["site_filter"].(map[string]any) + assertNumberArray(t, siteFilter["site_ids"], []int64{42, 99}) + if got["min_severity"] != "Warning" { + t.Fatalf("min_severity = %#v, want Warning", got["min_severity"]) + } + if got["max_per_hour"] != float64(0) { + t.Fatalf("max_per_hour = %#v, want 0", got["max_per_hour"]) + } +} + +func 
TestMarshalAPIAlertContactCreateBodyBuildsTransportDestinations(t *testing.T) { + tests := []struct { + name string + transport string + destination apiAlertDestinationOptions + wantKey string + wantValue string + }{ + { + name: "pagerduty", + transport: "pagerduty", + destination: apiAlertDestinationOptions{integrationKey: "pd-key"}, + wantKey: "integration_key", + wantValue: "pd-key", + }, + { + name: "slack", + transport: "slack", + destination: apiAlertDestinationOptions{webhookURL: "https://hooks.slack.com/services/test"}, + wantKey: "webhook_url", + wantValue: "https://hooks.slack.com/services/test", + }, + { + name: "teams", + transport: "teams", + destination: apiAlertDestinationOptions{webhookURL: "https://outlook.office.com/webhook/test"}, + wantKey: "webhook_url", + wantValue: "https://outlook.office.com/webhook/test", + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + body, err := marshalAPIAlertContactCreateBody(apiAlertContactCreateOptions{ + label: tt.name, + transport: tt.transport, + destination: tt.destination, + }) + if err != nil { + t.Fatalf("marshalAPIAlertContactCreateBody() error = %v", err) + } + var got map[string]any + if err := json.Unmarshal(body, &got); err != nil { + t.Fatalf("unmarshal body: %v", err) + } + dest := got["destination"].(map[string]any) + if dest[tt.wantKey] != tt.wantValue { + t.Fatalf("destination[%s] = %#v, want %q", tt.wantKey, dest[tt.wantKey], tt.wantValue) + } + }) + } +} + +func TestMarshalAPIAlertContactUpdateBodySupportsDestinationAndClearSites(t *testing.T) { + var label apiOptionalStringFlag + setTestFlag(t, &label, "platform-oncall") + + body, err := marshalAPIAlertContactUpdateBody(apiAlertContactUpdateOptions{ + label: label, + destination: apiAlertDestinationOptions{raw: `{"webhook_url":"https://example.com/hook"}`}, + clearSites: true, + }) + if err != nil { + t.Fatalf("marshalAPIAlertContactUpdateBody() error = %v", err) + } + var got map[string]any + if err := 
json.Unmarshal(body, &got); err != nil { + t.Fatalf("unmarshal body: %v", err) + } + if got["label"] != "platform-oncall" { + t.Fatalf("label = %#v", got["label"]) + } + dest := got["destination"].(map[string]any) + if dest["webhook_url"] != "https://example.com/hook" { + t.Fatalf("destination.webhook_url = %#v", dest["webhook_url"]) + } + if _, ok := got["site_filter"].(map[string]any)["site_ids"]; ok { + t.Fatalf("site_ids present in cleared site_filter: %#v", got["site_filter"]) + } +} + +func TestMarshalAPIAlertContactUpdateBodyRejectsConflicts(t *testing.T) { + var siteIDs apiInt64SliceFlags + setTestFlag(t, &siteIDs, "42") + if _, err := marshalAPIAlertContactUpdateBody(apiAlertContactUpdateOptions{siteIDs: siteIDs, clearSites: true}); err == nil { + t.Fatal("site filter conflict error = nil, want error") + } + if _, err := (apiAlertDestinationOptions{raw: `{}`, address: "ops@example.com"}).rawForTransport("", false); err == nil { + t.Fatal("destination conflict error = nil, want error") + } + if _, err := (apiAlertDestinationOptions{raw: `{not-json}`}).rawForTransport("", false); err == nil { + t.Fatal("invalid raw destination error = nil, want error") + } +} + +func TestAPIAlertContactPaths(t *testing.T) { + got, err := apiAlertContactPath("17", "test") + if err != nil { + t.Fatalf("apiAlertContactPath() error = %v", err) + } + if got != "/api/v1/alert-contacts/17/test" { + t.Fatalf("path = %q, want test path", got) + } + + got, err = apiAlertContactRetryPath("17", "88") + if err != nil { + t.Fatalf("apiAlertContactRetryPath() error = %v", err) + } + if got != "/api/v1/alert-contacts/17/deliveries/88/retry" { + t.Fatalf("retry path = %q, want delivery retry path", got) + } +} + +func TestAPIAlertContactDeliveriesPath(t *testing.T) { + got, err := apiAlertContactDeliveriesPath("17", apiAlertDeliveriesFilters{ + cursor: "cur-5", + limit: 50, + status: "failed", + }) + if err != nil { + t.Fatalf("apiAlertContactDeliveriesPath() error = %v", err) + } + u, err 
:= url.Parse(got) + if err != nil { + t.Fatalf("parse path: %v", err) + } + if u.Path != "/api/v1/alert-contacts/17/deliveries" { + t.Fatalf("path = %q, want deliveries path", u.Path) + } + for key, want := range map[string]string{ + "cursor": "cur-5", + "limit": "50", + "status": "failed", + } { + if got := u.Query().Get(key); got != want { + t.Fatalf("query %s = %q, want %q", key, got, want) + } + } +} diff --git a/cmd/jetmon2/api_cli_events.go b/cmd/jetmon2/api_cli_events.go new file mode 100644 index 00000000..3f66a682 --- /dev/null +++ b/cmd/jetmon2/api_cli_events.go @@ -0,0 +1,270 @@ +package main + +import ( + "context" + "encoding/json" + "errors" + "fmt" + "net/http" + "net/url" + "strconv" + "strings" +) + +type apiEventsListFilters struct { + cursor string + limit int + state string + stateIn string + checkType string + checkTypeIn string + startedAtGTE string + startedAtLT string + active string +} + +type apiTransitionsListFilters struct { + cursor string + limit int +} + +type apiEventCloseOptions struct { + reason string + note string +} + +type apiEventCloseRequest struct { + Reason string `json:"reason,omitempty"` + Note string `json:"note,omitempty"` +} + +func cmdAPIEvents(args []string) error { + if len(args) == 0 { + return errors.New("usage: jetmon2 api events [flags]") + } + + sub := args[0] + rest := args[1:] + switch sub { + case "list": + return cmdAPIEventsList(rest) + case "get": + return cmdAPIEventsGet(rest) + case "transitions": + return cmdAPIEventsTransitions(rest) + case "close": + return cmdAPIEventsClose(rest) + default: + return fmt.Errorf("unknown api events subcommand %q (want: list, get, transitions, close)", sub) + } +} + +func cmdAPIEventsList(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api events list", &opts) + filters := apiEventsListFilters{} + fs.StringVar(&filters.cursor, "cursor", "", "pagination cursor") + fs.IntVar(&filters.limit, "limit", 0, "page size (1-200)") + 
fs.StringVar(&filters.state, "state", "", "filter by event state") + fs.StringVar(&filters.stateIn, "state-in", "", "comma-separated event states") + fs.StringVar(&filters.checkType, "check-type", "", "filter by check type") + fs.StringVar(&filters.checkTypeIn, "check-type-in", "", "comma-separated check types") + fs.StringVar(&filters.startedAtGTE, "started-at-gte", "", "filter events started at or after this RFC3339 timestamp") + fs.StringVar(&filters.startedAtLT, "started-at-lt", "", "filter events started before this RFC3339 timestamp") + fs.StringVar(&filters.active, "active", "", "filter open events: true or false") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api events list [flags] ") + } + target, err := apiEventsListPath(fs.Arg(0), filters) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPIEventsGet(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api events get", &opts) + var siteID string + fs.StringVar(&siteID, "site-id", "", "optional site id for site-scoped event lookup") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api events get [flags] ") + } + target, err := apiEventDetailPath(siteID, fs.Arg(0)) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPIEventsTransitions(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api events transitions", &opts) + filters := apiTransitionsListFilters{} + fs.StringVar(&filters.cursor, "cursor", "", "pagination cursor") + fs.IntVar(&filters.limit, "limit", 0, "page size (1-200)") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 2 { + return errors.New("usage: jetmon2 api events transitions 
[flags] ") + } + target, err := apiEventTransitionsPath(fs.Arg(0), fs.Arg(1), filters) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPIEventsClose(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api events close", &opts) + addAPIIdempotencyFlag(fs, &opts) + closeOpts := apiEventCloseOptions{} + fs.StringVar(&closeOpts.reason, "reason", "", "resolution reason (default: manual_override)") + fs.StringVar(&closeOpts.note, "note", "", "operator note recorded in transition metadata") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 2 { + return errors.New("usage: jetmon2 api events close [flags] ") + } + target, err := apiEventClosePath(fs.Arg(0), fs.Arg(1)) + if err != nil { + return err + } + body, err := marshalAPIEventCloseBody(closeOpts) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, target, body) +} + +func apiEventsListPath(rawSiteID string, filters apiEventsListFilters) (string, error) { + siteID, err := apiPositiveID(rawSiteID, "site") + if err != nil { + return "", err + } + if filters.limit < 0 { + return "", errors.New("limit must be positive") + } + if filters.state != "" && filters.stateIn != "" { + return "", errors.New("use --state or --state-in, not both") + } + if filters.checkType != "" && filters.checkTypeIn != "" { + return "", errors.New("use --check-type or --check-type-in, not both") + } + + values := url.Values{} + if filters.cursor != "" { + values.Set("cursor", filters.cursor) + } + if filters.limit > 0 { + values.Set("limit", strconv.Itoa(filters.limit)) + } + if filters.state != "" { + values.Set("state", filters.state) + } + if filters.stateIn != "" { + values.Set("state__in", filters.stateIn) + } + if filters.checkType != "" { + values.Set("check_type", filters.checkType) + } + if filters.checkTypeIn != "" { + 
values.Set("check_type__in", filters.checkTypeIn) + } + if filters.startedAtGTE != "" { + values.Set("started_at__gte", filters.startedAtGTE) + } + if filters.startedAtLT != "" { + values.Set("started_at__lt", filters.startedAtLT) + } + if strings.TrimSpace(filters.active) != "" { + active, err := strconv.ParseBool(filters.active) + if err != nil { + return "", errors.New("active must be true or false") + } + values.Set("active", strconv.FormatBool(active)) + } + + path := "/api/v1/sites/" + strconv.FormatInt(siteID, 10) + "/events" + if len(values) == 0 { + return path, nil + } + return path + "?" + values.Encode(), nil +} + +func apiEventDetailPath(rawSiteID, rawEventID string) (string, error) { + eventID, err := apiPositiveID(rawEventID, "event") + if err != nil { + return "", err + } + if rawSiteID == "" { + return "/api/v1/events/" + strconv.FormatInt(eventID, 10), nil + } + siteID, err := apiPositiveID(rawSiteID, "site") + if err != nil { + return "", err + } + return fmt.Sprintf("/api/v1/sites/%d/events/%d", siteID, eventID), nil +} + +func apiEventTransitionsPath(rawSiteID, rawEventID string, filters apiTransitionsListFilters) (string, error) { + path, err := apiEventDetailPath(rawSiteID, rawEventID) + if err != nil { + return "", err + } + if rawSiteID == "" { + return "", errors.New("site id is required for transitions") + } + if filters.limit < 0 { + return "", errors.New("limit must be positive") + } + + values := url.Values{} + if filters.cursor != "" { + values.Set("cursor", filters.cursor) + } + if filters.limit > 0 { + values.Set("limit", strconv.Itoa(filters.limit)) + } + path += "/transitions" + if len(values) == 0 { + return path, nil + } + return path + "?" 
+ values.Encode(), nil +} + +func apiEventClosePath(rawSiteID, rawEventID string) (string, error) { + path, err := apiEventDetailPath(rawSiteID, rawEventID) + if err != nil { + return "", err + } + if rawSiteID == "" { + return "", errors.New("site id is required for close") + } + return path + "/close", nil +} + +func marshalAPIEventCloseBody(opts apiEventCloseOptions) ([]byte, error) { + req := apiEventCloseRequest{ + Reason: opts.reason, + Note: opts.note, + } + return json.Marshal(req) +} + +func apiPositiveID(raw, label string) (int64, error) { + id, err := strconv.ParseInt(raw, 10, 64) + if err != nil || id <= 0 { + return 0, fmt.Errorf("%s id must be a positive integer", label) + } + return id, nil +} diff --git a/cmd/jetmon2/api_cli_events_test.go b/cmd/jetmon2/api_cli_events_test.go new file mode 100644 index 00000000..a7c619ef --- /dev/null +++ b/cmd/jetmon2/api_cli_events_test.go @@ -0,0 +1,131 @@ +package main + +import ( + "encoding/json" + "net/url" + "testing" +) + +func TestAPIEventsListPath(t *testing.T) { + got, err := apiEventsListPath("42", apiEventsListFilters{ + cursor: "cur-2", + limit: 20, + state: "Down", + checkTypeIn: "http,tls_expiry", + startedAtGTE: "2026-04-28T10:00:00Z", + startedAtLT: "2026-04-29T10:00:00Z", + active: "true", + }) + if err != nil { + t.Fatalf("apiEventsListPath() error = %v", err) + } + u, err := url.Parse(got) + if err != nil { + t.Fatalf("parse path: %v", err) + } + if u.Path != "/api/v1/sites/42/events" { + t.Fatalf("path = %q, want site events path", u.Path) + } + q := u.Query() + for key, want := range map[string]string{ + "cursor": "cur-2", + "limit": "20", + "state": "Down", + "check_type__in": "http,tls_expiry", + "started_at__gte": "2026-04-28T10:00:00Z", + "started_at__lt": "2026-04-29T10:00:00Z", + "active": "true", + } { + if got := q.Get(key); got != want { + t.Fatalf("query %s = %q, want %q", key, got, want) + } + } +} + +func TestAPIEventsListPathRejectsAmbiguousFilters(t *testing.T) { + if _, err := 
apiEventsListPath("42", apiEventsListFilters{state: "Down", stateIn: "Up,Down"}); err == nil { + t.Fatal("apiEventsListPath() state error = nil, want error") + } + if _, err := apiEventsListPath("42", apiEventsListFilters{checkType: "http", checkTypeIn: "http,tls"}); err == nil { + t.Fatal("apiEventsListPath() check type error = nil, want error") + } +} + +func TestAPIEventDetailPath(t *testing.T) { + got, err := apiEventDetailPath("", "99") + if err != nil { + t.Fatalf("apiEventDetailPath() direct error = %v", err) + } + if got != "/api/v1/events/99" { + t.Fatalf("direct path = %q, want /api/v1/events/99", got) + } + + got, err = apiEventDetailPath("42", "99") + if err != nil { + t.Fatalf("apiEventDetailPath() scoped error = %v", err) + } + if got != "/api/v1/sites/42/events/99" { + t.Fatalf("scoped path = %q, want site-scoped event path", got) + } +} + +func TestAPIEventTransitionsPath(t *testing.T) { + got, err := apiEventTransitionsPath("42", "99", apiTransitionsListFilters{ + cursor: "cur-3", + limit: 100, + }) + if err != nil { + t.Fatalf("apiEventTransitionsPath() error = %v", err) + } + u, err := url.Parse(got) + if err != nil { + t.Fatalf("parse path: %v", err) + } + if u.Path != "/api/v1/sites/42/events/99/transitions" { + t.Fatalf("path = %q, want transitions path", u.Path) + } + if got := u.Query().Get("cursor"); got != "cur-3" { + t.Fatalf("cursor = %q, want cur-3", got) + } + if got := u.Query().Get("limit"); got != "100" { + t.Fatalf("limit = %q, want 100", got) + } +} + +func TestAPIEventClosePath(t *testing.T) { + got, err := apiEventClosePath("42", "99") + if err != nil { + t.Fatalf("apiEventClosePath() error = %v", err) + } + if got != "/api/v1/sites/42/events/99/close" { + t.Fatalf("path = %q, want close path", got) + } +} + +func TestMarshalAPIEventCloseBody(t *testing.T) { + body, err := marshalAPIEventCloseBody(apiEventCloseOptions{ + reason: "false_alarm", + note: "verified from dashboard", + }) + if err != nil { + 
t.Fatalf("marshalAPIEventCloseBody() error = %v", err) + } + var got map[string]any + if err := json.Unmarshal(body, &got); err != nil { + t.Fatalf("unmarshal body: %v", err) + } + if got["reason"] != "false_alarm" { + t.Fatalf("reason = %#v, want false_alarm", got["reason"]) + } + if got["note"] != "verified from dashboard" { + t.Fatalf("note = %#v, want dashboard note", got["note"]) + } + + body, err = marshalAPIEventCloseBody(apiEventCloseOptions{}) + if err != nil { + t.Fatalf("marshalAPIEventCloseBody(empty) error = %v", err) + } + if string(body) != "{}" { + t.Fatalf("empty body = %s, want {}", body) + } +} diff --git a/cmd/jetmon2/api_cli_remote_guard_test.go b/cmd/jetmon2/api_cli_remote_guard_test.go new file mode 100644 index 00000000..5ceac23c --- /dev/null +++ b/cmd/jetmon2/api_cli_remote_guard_test.go @@ -0,0 +1,253 @@ +package main + +import ( + "bytes" + "context" + "strings" + "testing" +) + +func TestIsLocalAPIBaseURL(t *testing.T) { + tests := []struct { + name string + baseURL string + want bool + }{ + {name: "localhost", baseURL: "http://localhost:8090", want: true}, + {name: "localhost subdomain", baseURL: "http://jetmon.localhost:8090", want: true}, + {name: "ipv4 loopback", baseURL: "http://127.0.0.1:8090", want: true}, + {name: "ipv6 loopback", baseURL: "http://[::1]:8090", want: true}, + {name: "private lan is remote", baseURL: "http://10.0.0.171:8090", want: false}, + {name: "public hostname", baseURL: "https://jetmon-api.example.test", want: false}, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got, err := isLocalAPIURL(tt.baseURL) + if err != nil { + t.Fatalf("isLocalAPIURL() error = %v", err) + } + if got != tt.want { + t.Fatalf("isLocalAPIURL(%q) = %v, want %v", tt.baseURL, got, tt.want) + } + }) + } +} + +func TestExecuteAPIRequestRejectsRemoteWrite(t *testing.T) { + err := executeAPIRequest(context.Background(), nil, apiCLIOptions{ + baseURL: "https://jetmon-api.example.test", + out: ioDiscard{}, + errOut: 
ioDiscard{}, + }, "POST", "/api/v1/sites", []byte(`{}`)) + if err == nil || !strings.Contains(err.Error(), "--allow-remote") { + t.Fatalf("executeAPIRequest() error = %v, want --allow-remote refusal", err) + } +} + +func TestExecuteAPIRequestRejectsAbsoluteRemoteWriteWithLocalBase(t *testing.T) { + err := executeAPIRequest(context.Background(), nil, apiCLIOptions{ + baseURL: "http://localhost:8090", + out: ioDiscard{}, + errOut: ioDiscard{}, + }, "DELETE", "https://jetmon-api.example.test/api/v1/sites/42", nil) + if err == nil || !strings.Contains(err.Error(), "--allow-remote") { + t.Fatalf("executeAPIRequest() error = %v, want --allow-remote refusal", err) + } +} + +func TestRemoteWorkflowGuardRequiresAllowRemote(t *testing.T) { + opts := apiCLIOptions{baseURL: "https://jetmon-api.example.test"} + remote, err := requireAPILocalOrAllowRemote(opts, false, "api smoke") + if err == nil { + t.Fatal("requireAPILocalOrAllowRemote() error = nil, want refusal") + } + if !remote { + t.Fatal("remote = false, want true") + } + if !strings.Contains(err.Error(), "--allow-remote") { + t.Fatalf("error = %v, want --allow-remote hint", err) + } + + remote, err = requireAPILocalOrAllowRemote(opts, true, "api smoke") + if err != nil { + t.Fatalf("requireAPILocalOrAllowRemote(... 
allow) error = %v", err) + } + if !remote { + t.Fatal("remote = false with remote URL and allow flag, want true") + } +} + +func TestRunAPISitesBulkAddRemoteGuard(t *testing.T) { + opts := apiCLIOptions{baseURL: "https://jetmon-api.example.test", out: ioDiscard{}, errOut: ioDiscard{}} + bulk := apiSitesBulkAddOptions{ + count: 1, + batch: "remote-batch", + source: "fixture", + blogIDStart: defaultAPIBulkAddBlogIDStart, + } + err := runAPISitesBulkAdd(context.Background(), nil, opts, bulk) + if err == nil || !strings.Contains(err.Error(), "--allow-remote") { + t.Fatalf("runAPISitesBulkAdd() error = %v, want --allow-remote refusal", err) + } + + opts.allowRemote = true + bulk.batch = "" + err = runAPISitesBulkAdd(context.Background(), nil, opts, bulk) + if err == nil || !strings.Contains(err.Error(), "requires --batch") { + t.Fatalf("runAPISitesBulkAdd() error = %v, want remote batch requirement", err) + } +} + +func TestRunAPISitesBulkAddDryRunAllowsRemotePlanning(t *testing.T) { + var stdout bytes.Buffer + err := runAPISitesBulkAdd(context.Background(), nil, apiCLIOptions{ + baseURL: "https://jetmon-api.example.test", + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesBulkAddOptions{ + count: 1, + source: "fixture", + blogIDStart: defaultAPIBulkAddBlogIDStart, + dryRun: true, + }) + if err != nil { + t.Fatalf("runAPISitesBulkAdd() dry-run error = %v", err) + } + if !strings.Contains(stdout.String(), `"dry_run":true`) { + t.Fatalf("stdout = %s, want dry-run output", stdout.String()) + } +} + +func TestRunAPISitesCleanupRemoteGuard(t *testing.T) { + opts := apiCLIOptions{baseURL: "https://jetmon-api.example.test", out: ioDiscard{}, errOut: ioDiscard{}} + cleanup := apiSitesCleanupOptions{batch: "remote-batch", count: 1, ignoreNotFound: true} + err := runAPISitesCleanup(context.Background(), nil, opts, cleanup) + if err == nil || !strings.Contains(err.Error(), "--allow-remote") { + t.Fatalf("runAPISitesCleanup() error = %v, want --allow-remote refusal", err) + } + + 
opts.allowRemote = true + cleanup.allowUnmarked = true + err = runAPISitesCleanup(context.Background(), nil, opts, cleanup) + if err == nil || !strings.Contains(err.Error(), "cannot use --allow-unmarked") { + t.Fatalf("runAPISitesCleanup() error = %v, want allow-unmarked refusal", err) + } +} + +func TestRunAPISitesSimulateFailureRemoteGuard(t *testing.T) { + opts := apiCLIOptions{baseURL: "https://jetmon-api.example.test", out: ioDiscard{}, errOut: ioDiscard{}} + sim := apiSitesSimulateFailureOptions{ + mode: "http-500", + batch: "remote-batch", + count: 1, + trigger: false, + pollInterval: 1, + } + err := runAPISitesSimulateFailure(context.Background(), nil, opts, sim) + if err == nil || !strings.Contains(err.Error(), "--allow-remote") { + t.Fatalf("runAPISitesSimulateFailure() error = %v, want --allow-remote refusal", err) + } + + opts.allowRemote = true + sim.batch = "" + err = runAPISitesSimulateFailure(context.Background(), nil, opts, sim) + if err == nil || !strings.Contains(err.Error(), "requires --batch") { + t.Fatalf("runAPISitesSimulateFailure() error = %v, want remote batch requirement", err) + } +} + +func TestRunAPISmokeRemoteGuard(t *testing.T) { + err := runAPISmoke(context.Background(), nil, apiCLIOptions{ + baseURL: "https://jetmon-api.example.test", + out: ioDiscard{}, + errOut: ioDiscard{}, + }, apiSmokeOptions{batch: "remote-smoke", exercise: "none"}) + if err == nil || !strings.Contains(err.Error(), "--allow-remote") { + t.Fatalf("runAPISmoke() error = %v, want --allow-remote refusal", err) + } + + err = runAPISmoke(context.Background(), nil, apiCLIOptions{ + baseURL: "https://jetmon-api.example.test", + allowRemote: true, + out: ioDiscard{}, + errOut: ioDiscard{}, + }, apiSmokeOptions{exercise: "none"}) + if err == nil || !strings.Contains(err.Error(), "requires --batch") { + t.Fatalf("runAPISmoke() error = %v, want remote batch requirement", err) + } +} + +func TestRunAPISmokeWebhookExerciseRemoteGuard(t *testing.T) { + err := 
runAPISmoke(context.Background(), nil, apiCLIOptions{ + baseURL: "https://jetmon-api.example.test", + allowRemote: true, + out: ioDiscard{}, + errOut: ioDiscard{}, + }, apiSmokeOptions{ + batch: "remote-smoke", + exercise: "webhook", + }) + if err == nil || !strings.Contains(err.Error(), "Docker-local only") { + t.Fatalf("runAPISmoke() error = %v, want Docker-local webhook refusal", err) + } +} + +func TestRunAPISmokeWebhookRequiresLocalRequestsURL(t *testing.T) { + err := runAPISmoke(context.Background(), nil, apiCLIOptions{ + baseURL: "http://localhost:8090", + out: ioDiscard{}, + errOut: ioDiscard{}, + }, apiSmokeOptions{ + batch: "local-smoke", + exercise: "webhook", + webhookRequestsURL: "https://fixture.example.test/webhook/requests", + }) + if err == nil || !strings.Contains(err.Error(), "webhook-requests-url must be local") { + t.Fatalf("runAPISmoke() error = %v, want local webhook requests URL refusal", err) + } +} + +func TestRunAPISmokeWebhookRejectsExternalWebhookURL(t *testing.T) { + err := runAPISmoke(context.Background(), nil, apiCLIOptions{ + baseURL: "http://localhost:8090", + out: ioDiscard{}, + errOut: ioDiscard{}, + }, apiSmokeOptions{ + batch: "local-smoke", + exercise: "webhook", + webhookURL: "https://receiver.example.test/webhook", + webhookRequestsURL: "http://localhost:18091/webhook/requests", + }) + if err == nil || !strings.Contains(err.Error(), "allow-external-webhook-url") { + t.Fatalf("runAPISmoke() error = %v, want external webhook URL refusal", err) + } +} + +func TestRequireAPIWebhookFixtureURLAllowed(t *testing.T) { + tests := []struct { + name string + rawURL string + allowExternal bool + wantErr bool + }{ + {name: "api fixture", rawURL: "http://api-fixture:8091/webhook"}, + {name: "localhost", rawURL: "http://localhost:18091/webhook"}, + {name: "loopback", rawURL: "http://127.0.0.1:18091/webhook"}, + {name: "external blocked", rawURL: "https://receiver.example.test/webhook", wantErr: true}, + {name: "external explicit", rawURL: 
"https://receiver.example.test/webhook", allowExternal: true}, + {name: "relative rejected", rawURL: "/webhook", wantErr: true}, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + err := requireAPIWebhookFixtureURLAllowed(tt.rawURL, tt.allowExternal) + if tt.wantErr && err == nil { + t.Fatal("requireAPIWebhookFixtureURLAllowed() error = nil, want error") + } + if !tt.wantErr && err != nil { + t.Fatalf("requireAPIWebhookFixtureURLAllowed() error = %v", err) + } + }) + } +} diff --git a/cmd/jetmon2/api_cli_sites.go b/cmd/jetmon2/api_cli_sites.go new file mode 100644 index 00000000..a7489018 --- /dev/null +++ b/cmd/jetmon2/api_cli_sites.go @@ -0,0 +1,538 @@ +package main + +import ( + "context" + "encoding/json" + "errors" + "flag" + "fmt" + "io" + "net/http" + "net/url" + "sort" + "strconv" + "strings" +) + +type apiSitesListFilters struct { + cursor string + limit int + state string + stateIn string + severityGTE int + monitorActive string + q string +} + +type apiSiteCreateOptions struct { + blogID int64 + monitorURL string + monitorActive apiOptionalBoolFlag + bucketNo apiOptionalIntFlag + checkKeyword apiOptionalStringFlag + forbiddenKeyword apiOptionalStringFlag + forbiddenKeywords apiStringSliceFlags + redirectPolicy apiOptionalStringFlag + requestMethod apiOptionalStringFlag + detectionProfile apiOptionalStringFlag + timeoutSeconds apiOptionalIntFlag + customHeaders apiStringMapFlags + alertCooldownMinutes apiOptionalIntFlag + checkInterval apiOptionalIntFlag +} + +type apiSiteUpdateOptions struct { + monitorURL apiOptionalStringFlag + monitorActive apiOptionalBoolFlag + bucketNo apiOptionalIntFlag + checkKeyword apiOptionalStringFlag + forbiddenKeyword apiOptionalStringFlag + forbiddenKeywords apiStringSliceFlags + clearForbiddenKeywords bool + redirectPolicy apiOptionalStringFlag + requestMethod apiOptionalStringFlag + detectionProfile apiOptionalStringFlag + timeoutSeconds apiOptionalIntFlag + customHeaders apiStringMapFlags + 
clearCustomHeaders bool + alertCooldownMinutes apiOptionalIntFlag + checkInterval apiOptionalIntFlag + maintenanceStart apiOptionalStringFlag + maintenanceEnd apiOptionalStringFlag +} + +type apiSiteCreateRequest struct { + BlogID int64 `json:"blog_id"` + MonitorURL string `json:"monitor_url"` + MonitorActive *bool `json:"monitor_active,omitempty"` + BucketNo *int `json:"bucket_no,omitempty"` + CheckKeyword *string `json:"check_keyword,omitempty"` + ForbiddenKeyword *string `json:"forbidden_keyword,omitempty"` + ForbiddenKeywords *[]string `json:"forbidden_keywords,omitempty"` + RedirectPolicy *string `json:"redirect_policy,omitempty"` + RequestMethod *string `json:"request_method,omitempty"` + DetectionProfile *string `json:"detection_profile,omitempty"` + TimeoutSeconds *int `json:"timeout_seconds,omitempty"` + CustomHeaders *map[string]string `json:"custom_headers,omitempty"` + AlertCooldownMinutes *int `json:"alert_cooldown_minutes,omitempty"` + CheckInterval *int `json:"check_interval,omitempty"` +} + +type apiSiteUpdateRequest struct { + MonitorURL *string `json:"monitor_url,omitempty"` + MonitorActive *bool `json:"monitor_active,omitempty"` + BucketNo *int `json:"bucket_no,omitempty"` + CheckKeyword *string `json:"check_keyword,omitempty"` + ForbiddenKeyword *string `json:"forbidden_keyword,omitempty"` + ForbiddenKeywords *[]string `json:"forbidden_keywords,omitempty"` + RedirectPolicy *string `json:"redirect_policy,omitempty"` + RequestMethod *string `json:"request_method,omitempty"` + DetectionProfile *string `json:"detection_profile,omitempty"` + TimeoutSeconds *int `json:"timeout_seconds,omitempty"` + CustomHeaders *map[string]string `json:"custom_headers,omitempty"` + AlertCooldownMinutes *int `json:"alert_cooldown_minutes,omitempty"` + CheckInterval *int `json:"check_interval,omitempty"` + MaintenanceStart *string `json:"maintenance_start,omitempty"` + MaintenanceEnd *string `json:"maintenance_end,omitempty"` +} + +func cmdAPISites(args []string) error 
{ + if len(args) == 0 { + return errors.New("usage: jetmon2 api sites [flags]") + } + + sub := args[0] + rest := args[1:] + switch sub { + case "list": + return cmdAPISitesList(rest) + case "get": + return cmdAPISitesGet(rest) + case "create": + return cmdAPISitesCreate(rest) + case "update": + return cmdAPISitesUpdate(rest) + case "delete": + return cmdAPISitesDelete(rest) + case "pause": + return cmdAPISitesPostAction(rest, "pause", "pause") + case "resume": + return cmdAPISitesPostAction(rest, "resume", "resume") + case "trigger-now": + return cmdAPISitesPostAction(rest, "trigger-now", "trigger-now") + case "bulk-add": + return cmdAPISitesBulkAdd(rest) + case "cleanup": + return cmdAPISitesCleanup(rest) + case "simulate-failure": + return cmdAPISitesSimulateFailure(rest) + default: + return fmt.Errorf("unknown api sites subcommand %q (want: list, get, create, update, delete, pause, resume, trigger-now, bulk-add, cleanup, simulate-failure)", sub) + } +} + +func printAPISitesUsage(w io.Writer) { + fmt.Fprintln(w, "usage: jetmon2 api sites [flags]") +} + +func cmdAPISitesList(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites list", &opts) + filters := apiSitesListFilters{severityGTE: -1} + fs.StringVar(&filters.cursor, "cursor", "", "pagination cursor") + fs.IntVar(&filters.limit, "limit", 0, "page size (1-200)") + fs.StringVar(&filters.state, "state", "", "filter by current state") + fs.StringVar(&filters.stateIn, "state-in", "", "comma-separated current states") + fs.IntVar(&filters.severityGTE, "severity-gte", -1, "minimum current severity") + fs.StringVar(&filters.monitorActive, "monitor-active", "", "filter active sites: true or false") + fs.StringVar(&filters.q, "q", "", "monitor URL substring search") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api sites list [flags]") + } + target, err := apiSitesListPath(filters) + if err != nil { + return 
err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPISitesGet(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites get", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api sites get [flags] ") + } + target, err := apiSiteResourcePath(fs.Arg(0), "") + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPISitesCreate(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites create", &opts) + addAPIIdempotencyFlag(fs, &opts) + create := apiSiteCreateOptions{} + fs.Int64Var(&create.blogID, "blog-id", 0, "site blog_id") + fs.StringVar(&create.monitorURL, "url", "", "site monitor URL") + fs.Var(&create.monitorActive, "monitor-active", "monitoring enabled: true or false") + fs.Var(&create.bucketNo, "bucket-no", "bucket number") + fs.Var(&create.checkKeyword, "check-keyword", "keyword required in response body") + fs.Var(&create.forbiddenKeyword, "forbidden-keyword", "keyword forbidden in response body") + fs.Var(&create.forbiddenKeywords, "forbidden-keyword-list", "additional forbidden body keyword (repeatable or comma-separated)") + fs.Var(&create.redirectPolicy, "redirect-policy", "redirect policy: follow, alert, or fail") + fs.Var(&create.requestMethod, "request-method", "HTTP check method: HEAD or GET") + fs.Var(&create.detectionProfile, "detection-profile", "detection profile: legacy, simple_http, or full") + fs.Var(&create.timeoutSeconds, "timeout-seconds", "per-site timeout in seconds") + fs.Var(&create.customHeaders, "custom-header", "site custom header in Name: Value form (repeatable)") + fs.Var(&create.alertCooldownMinutes, "alert-cooldown-minutes", "per-site alert cooldown in minutes") + fs.Var(&create.checkInterval, "check-interval", "check interval in minutes") + if err := 
parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api sites create [flags]") + } + body, err := marshalAPISiteCreateBody(create) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, "/api/v1/sites", body) +} + +func cmdAPISitesUpdate(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites update", &opts) + update := apiSiteUpdateOptions{} + fs.Var(&update.monitorURL, "url", "site monitor URL") + fs.Var(&update.monitorActive, "monitor-active", "monitoring enabled: true or false") + fs.Var(&update.bucketNo, "bucket-no", "bucket number") + fs.Var(&update.checkKeyword, "check-keyword", "keyword required in response body; empty clears it") + fs.Var(&update.forbiddenKeyword, "forbidden-keyword", "keyword forbidden in response body; empty clears it") + fs.Var(&update.forbiddenKeywords, "forbidden-keyword-list", "replacement forbidden body keyword list (repeatable or comma-separated)") + fs.BoolVar(&update.clearForbiddenKeywords, "clear-forbidden-keywords", false, "clear the forbidden body keyword list") + fs.Var(&update.redirectPolicy, "redirect-policy", "redirect policy: follow, alert, or fail") + fs.Var(&update.requestMethod, "request-method", "HTTP check method: HEAD or GET; empty inherits default") + fs.Var(&update.detectionProfile, "detection-profile", "detection profile: legacy, simple_http, or full; empty inherits default") + fs.Var(&update.timeoutSeconds, "timeout-seconds", "per-site timeout in seconds") + fs.Var(&update.customHeaders, "custom-header", "site custom header in Name: Value form (repeatable)") + fs.BoolVar(&update.clearCustomHeaders, "clear-custom-headers", false, "clear all site custom headers") + fs.Var(&update.alertCooldownMinutes, "alert-cooldown-minutes", "per-site alert cooldown in minutes") + fs.Var(&update.checkInterval, "check-interval", "check interval in minutes") + 
fs.Var(&update.maintenanceStart, "maintenance-start", "maintenance start RFC3339 timestamp; empty clears it") + fs.Var(&update.maintenanceEnd, "maintenance-end", "maintenance end RFC3339 timestamp; empty clears it") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api sites update [flags] ") + } + target, err := apiSiteResourcePath(fs.Arg(0), "") + if err != nil { + return err + } + body, err := marshalAPISiteUpdateBody(update) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPatch, target, body) +} + +func cmdAPISitesDelete(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites delete", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api sites delete [flags] ") + } + target, err := apiSiteResourcePath(fs.Arg(0), "") + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodDelete, target, nil) +} + +func cmdAPISitesPostAction(args []string, usageName, suffix string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites "+usageName, &opts) + addAPIIdempotencyFlag(fs, &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return fmt.Errorf("usage: jetmon2 api sites %s [flags] ", usageName) + } + target, err := apiSiteResourcePath(fs.Arg(0), suffix) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, target, nil) +} + +func addAPIIdempotencyFlag(fs *flag.FlagSet, opts *apiCLIOptions) { + fs.StringVar(&opts.idempotencyKey, "idempotency-key", "", "Idempotency-Key header for POST retries") +} + +func apiSitesListPath(filters apiSitesListFilters) (string, error) { + if filters.limit < 0 { + return "", errors.New("limit must be positive") + } + if 
filters.severityGTE < -1 { + return "", errors.New("severity-gte must be zero or greater") + } + if filters.state != "" && filters.stateIn != "" { + return "", errors.New("use --state or --state-in, not both") + } + + values := url.Values{} + if filters.cursor != "" { + values.Set("cursor", filters.cursor) + } + if filters.limit > 0 { + values.Set("limit", strconv.Itoa(filters.limit)) + } + if filters.state != "" { + values.Set("state", filters.state) + } + if filters.stateIn != "" { + values.Set("state__in", filters.stateIn) + } + if filters.severityGTE >= 0 { + values.Set("severity__gte", strconv.Itoa(filters.severityGTE)) + } + if strings.TrimSpace(filters.monitorActive) != "" { + active, err := strconv.ParseBool(filters.monitorActive) + if err != nil { + return "", errors.New("monitor-active must be true or false") + } + values.Set("monitor_active", strconv.FormatBool(active)) + } + if filters.q != "" { + values.Set("q", filters.q) + } + + if len(values) == 0 { + return "/api/v1/sites", nil + } + return "/api/v1/sites?" 
+ values.Encode(), nil +} + +func apiSiteResourcePath(rawID, suffix string) (string, error) { + id, err := strconv.ParseInt(rawID, 10, 64) + if err != nil || id <= 0 { + return "", errors.New("site id must be a positive integer") + } + path := "/api/v1/sites/" + strconv.FormatInt(id, 10) + if suffix != "" { + path += "/" + strings.TrimPrefix(suffix, "/") + } + return path, nil +} + +func marshalAPISiteCreateBody(opts apiSiteCreateOptions) ([]byte, error) { + if opts.blogID <= 0 { + return nil, errors.New("blog-id is required and must be a positive integer") + } + if strings.TrimSpace(opts.monitorURL) == "" { + return nil, errors.New("url is required") + } + + req := apiSiteCreateRequest{ + BlogID: opts.blogID, + MonitorURL: opts.monitorURL, + MonitorActive: opts.monitorActive.ptr(), + BucketNo: opts.bucketNo.ptr(), + CheckKeyword: opts.checkKeyword.ptr(), + ForbiddenKeyword: opts.forbiddenKeyword.ptr(), + ForbiddenKeywords: opts.forbiddenKeywords.ptr(), + RedirectPolicy: opts.redirectPolicy.ptr(), + RequestMethod: opts.requestMethod.ptr(), + DetectionProfile: opts.detectionProfile.ptr(), + TimeoutSeconds: opts.timeoutSeconds.ptr(), + CustomHeaders: opts.customHeaders.ptr(), + AlertCooldownMinutes: opts.alertCooldownMinutes.ptr(), + CheckInterval: opts.checkInterval.ptr(), + } + return json.Marshal(req) +} + +func marshalAPISiteUpdateBody(opts apiSiteUpdateOptions) ([]byte, error) { + if opts.clearCustomHeaders && opts.customHeaders.set { + return nil, errors.New("use --custom-header or --clear-custom-headers, not both") + } + if opts.clearForbiddenKeywords && opts.forbiddenKeywords.set { + return nil, errors.New("use --forbidden-keyword-list or --clear-forbidden-keywords, not both") + } + + req := apiSiteUpdateRequest{ + MonitorURL: opts.monitorURL.ptr(), + MonitorActive: opts.monitorActive.ptr(), + BucketNo: opts.bucketNo.ptr(), + CheckKeyword: opts.checkKeyword.ptr(), + ForbiddenKeyword: opts.forbiddenKeyword.ptr(), + ForbiddenKeywords: 
opts.forbiddenKeywords.ptr(), + RedirectPolicy: opts.redirectPolicy.ptr(), + RequestMethod: opts.requestMethod.ptr(), + DetectionProfile: opts.detectionProfile.ptr(), + TimeoutSeconds: opts.timeoutSeconds.ptr(), + CustomHeaders: opts.customHeaders.ptr(), + AlertCooldownMinutes: opts.alertCooldownMinutes.ptr(), + CheckInterval: opts.checkInterval.ptr(), + MaintenanceStart: opts.maintenanceStart.ptr(), + MaintenanceEnd: opts.maintenanceEnd.ptr(), + } + if opts.clearCustomHeaders { + empty := map[string]string{} + req.CustomHeaders = &empty + } + if opts.clearForbiddenKeywords { + empty := []string{} + req.ForbiddenKeywords = &empty + } + return json.Marshal(req) +} + +type apiOptionalBoolFlag struct { + value bool + set bool +} + +func (f *apiOptionalBoolFlag) Set(v string) error { + parsed, err := strconv.ParseBool(v) + if err != nil { + return err + } + f.value = parsed + f.set = true + return nil +} + +func (f *apiOptionalBoolFlag) String() string { + if !f.set { + return "" + } + return strconv.FormatBool(f.value) +} + +func (f *apiOptionalBoolFlag) IsBoolFlag() bool { + return true +} + +func (f apiOptionalBoolFlag) ptr() *bool { + if !f.set { + return nil + } + v := f.value + return &v +} + +type apiOptionalIntFlag struct { + value int + set bool +} + +func (f *apiOptionalIntFlag) Set(v string) error { + parsed, err := strconv.Atoi(v) + if err != nil { + return err + } + f.value = parsed + f.set = true + return nil +} + +func (f *apiOptionalIntFlag) String() string { + if !f.set { + return "" + } + return strconv.Itoa(f.value) +} + +func (f apiOptionalIntFlag) ptr() *int { + if !f.set { + return nil + } + v := f.value + return &v +} + +type apiOptionalStringFlag struct { + value string + set bool +} + +func (f *apiOptionalStringFlag) Set(v string) error { + f.value = v + f.set = true + return nil +} + +func (f *apiOptionalStringFlag) String() string { + return f.value +} + +func (f apiOptionalStringFlag) ptr() *string { + if !f.set { + return nil + } + v := 
f.value + return &v +} + +type apiStringMapFlags struct { + values map[string]string + set bool +} + +func (f *apiStringMapFlags) Set(v string) error { + name, value, ok := strings.Cut(v, ":") + if !ok { + return fmt.Errorf("custom header %q must be in Name: Value form", v) + } + name = strings.TrimSpace(name) + if name == "" { + return errors.New("custom header name must not be empty") + } + if f.values == nil { + f.values = map[string]string{} + } + f.values[name] = strings.TrimSpace(value) + f.set = true + return nil +} + +func (f *apiStringMapFlags) String() string { + if !f.set { + return "" + } + keys := make([]string, 0, len(f.values)) + for k := range f.values { + keys = append(keys, k) + } + sort.Strings(keys) + parts := make([]string, 0, len(keys)) + for _, k := range keys { + parts = append(parts, k+": "+f.values[k]) + } + return strings.Join(parts, ", ") +} + +func (f apiStringMapFlags) ptr() *map[string]string { + if !f.set { + return nil + } + values := make(map[string]string, len(f.values)) + for k, v := range f.values { + values[k] = v + } + return &values +} diff --git a/cmd/jetmon2/api_cli_sites_bulk.go b/cmd/jetmon2/api_cli_sites_bulk.go new file mode 100644 index 00000000..f045bc73 --- /dev/null +++ b/cmd/jetmon2/api_cli_sites_bulk.go @@ -0,0 +1,416 @@ +package main + +import ( + "bytes" + "context" + "encoding/csv" + "encoding/json" + "errors" + "fmt" + "io" + "net/http" + "os" + "strconv" + "strings" +) + +const ( + apiSitesBulkAddMaxCount = 200 + defaultAPIBulkAddBlogIDStart = int64(900000000) +) + +type apiSitesBulkAddOptions struct { + count int + batch string + source string + file string + blogIDStart int64 + dryRun bool + idempotencyKeyPrefix string + monitorActive apiOptionalBoolFlag +} + +type apiBulkSiteEntry struct { + MonitorURL string `json:"monitor_url"` + CheckKeyword *string `json:"check_keyword,omitempty"` + ForbiddenKeyword *string `json:"forbidden_keyword,omitempty"` + ForbiddenKeywords []string 
`json:"forbidden_keywords,omitempty"` + RedirectPolicy *string `json:"redirect_policy,omitempty"` + RequestMethod *string `json:"request_method,omitempty"` + DetectionProfile *string `json:"detection_profile,omitempty"` + TimeoutSeconds *int `json:"timeout_seconds,omitempty"` + CustomHeaders map[string]string `json:"custom_headers,omitempty"` + AlertCooldownMinutes *int `json:"alert_cooldown_minutes,omitempty"` + CheckInterval *int `json:"check_interval,omitempty"` +} + +type apiSitesBulkAddOutput struct { + DryRun bool `json:"dry_run,omitempty"` + Count int `json:"count"` + Sites []json.RawMessage `json:"sites,omitempty"` + Created []json.RawMessage `json:"created,omitempty"` +} + +func cmdAPISitesBulkAdd(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites bulk-add", &opts) + bulk := apiSitesBulkAddOptions{ + source: "fixture", + blogIDStart: defaultAPIBulkAddBlogIDStart, + } + fs.IntVar(&bulk.count, "count", 0, "number of sites to create, max 200") + fs.StringVar(&bulk.batch, "batch", "", "stable batch label; derives blog ids and stores a custom header marker") + fs.StringVar(&bulk.source, "source", bulk.source, "site source: fixture, file, or stdin") + fs.StringVar(&bulk.file, "file", "", "source file for --source file") + fs.Int64Var(&bulk.blogIDStart, "blog-id-start", bulk.blogIDStart, "first blog_id to assign") + fs.BoolVar(&bulk.dryRun, "dry-run", false, "print planned create payloads without sending requests") + fs.StringVar(&bulk.idempotencyKeyPrefix, "idempotency-key-prefix", "", "prefix for per-site Idempotency-Key headers") + fs.Var(&bulk.monitorActive, "monitor-active", "override monitor_active for every generated site") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api sites bulk-add [flags]") + } + return runAPISitesBulkAdd(context.Background(), nil, opts, bulk) +} + +func runAPISitesBulkAdd(ctx context.Context, client *http.Client, opts 
apiCLIOptions, bulk apiSitesBulkAddOptions) error { + if opts.out == nil { + opts.out = io.Discard + } + entries, err := loadAPIBulkSiteEntries(bulk, opts.in) + if err != nil { + return err + } + planned, err := planAPIBulkSiteCreates(entries, bulk) + if err != nil { + return err + } + + if bulk.dryRun { + sites, err := marshalAPIBulkSiteRequests(planned) + if err != nil { + return err + } + return writeAPIValueOutput(opts.out, apiSitesBulkAddOutput{ + DryRun: true, + Count: len(sites), + Sites: sites, + }, opts) + } + + remote, err := requireAPILocalOrAllowRemote(opts, opts.allowRemote, "api sites bulk-add") + if err != nil { + return err + } + if remote && strings.TrimSpace(bulk.batch) == "" { + return errors.New("api sites bulk-add requires --batch when --allow-remote targets a non-local API") + } + + created := make([]json.RawMessage, 0, len(planned)) + for i, req := range planned { + body, err := json.Marshal(req) + if err != nil { + return err + } + requestOpts := opts + var response bytes.Buffer + requestOpts.out = &response + if bulk.idempotencyKeyPrefix != "" { + requestOpts.idempotencyKey = fmt.Sprintf("%s-%03d", bulk.idempotencyKeyPrefix, i+1) + } + if err := executeAPIRequest(ctx, client, requestOpts, http.MethodPost, "/api/v1/sites", body); err != nil { + if response.Len() > 0 { + _, _ = opts.out.Write(response.Bytes()) + } + return fmt.Errorf("create site %d (%s): %w", req.BlogID, req.MonitorURL, err) + } + created = append(created, json.RawMessage(bytes.TrimSpace(response.Bytes()))) + } + + return writeAPIValueOutput(opts.out, apiSitesBulkAddOutput{ + Count: len(created), + Created: created, + }, opts) +} + +func loadAPIBulkSiteEntries(opts apiSitesBulkAddOptions, in io.Reader) ([]apiBulkSiteEntry, error) { + var data []byte + var err error + switch opts.source { + case "fixture": + if opts.file != "" { + return nil, errors.New("--file is only valid with --source file") + } + data = apiCLISiteFixture + case "file": + if opts.file == "" { + return 
nil, errors.New("--file is required with --source file") + } + data, err = os.ReadFile(opts.file) + if err != nil { + return nil, err + } + case "stdin": + if in == nil { + return nil, errors.New("stdin source requires an input reader") + } + data, err = io.ReadAll(in) + if err != nil { + return nil, err + } + default: + return nil, errors.New("source must be one of: fixture, file, stdin") + } + return parseAPIBulkSiteEntries(data) +} + +func parseAPIBulkSiteEntries(data []byte) ([]apiBulkSiteEntry, error) { + trimmed := bytes.TrimSpace(data) + if len(trimmed) == 0 { + return nil, errors.New("site source is empty") + } + if trimmed[0] == '[' || trimmed[0] == '{' || trimmed[0] == '"' { + return parseAPIBulkJSONSiteEntries(trimmed) + } + return parseAPIBulkCSVSiteEntries(trimmed) +} + +func parseAPIBulkJSONSiteEntries(data []byte) ([]apiBulkSiteEntry, error) { + var raw []json.RawMessage + if data[0] == '[' { + if err := json.Unmarshal(data, &raw); err != nil { + return nil, err + } + } else { + raw = []json.RawMessage{data} + } + entries := make([]apiBulkSiteEntry, 0, len(raw)) + for _, item := range raw { + var entry apiBulkSiteEntry + if err := json.Unmarshal(item, &entry); err != nil { + return nil, err + } + entries = append(entries, entry) + } + return validateAPIBulkSiteEntries(entries) +} + +func parseAPIBulkCSVSiteEntries(data []byte) ([]apiBulkSiteEntry, error) { + r := csv.NewReader(bytes.NewReader(data)) + r.TrimLeadingSpace = true + r.FieldsPerRecord = -1 + records, err := r.ReadAll() + if err != nil { + return nil, err + } + if len(records) == 0 { + return nil, errors.New("site source is empty") + } + + header := apiBulkCSVHeader(records[0]) + start := 0 + if len(header) > 0 { + start = 1 + } + + entries := make([]apiBulkSiteEntry, 0, len(records)-start) + for _, record := range records[start:] { + if len(record) == 0 || strings.TrimSpace(record[0]) == "" { + continue + } + if len(header) == 0 { + entries = append(entries, apiBulkSiteEntry{MonitorURL: 
strings.TrimSpace(record[0])}) + continue + } + entry, err := apiBulkSiteEntryFromCSVRecord(header, record) + if err != nil { + return nil, err + } + entries = append(entries, entry) + } + return validateAPIBulkSiteEntries(entries) +} + +func apiBulkCSVHeader(record []string) map[string]int { + header := map[string]int{} + hasURL := false + for i, col := range record { + name := strings.ToLower(strings.TrimSpace(col)) + header[name] = i + if name == "monitor_url" || name == "url" { + hasURL = true + } + } + if !hasURL { + return nil + } + return header +} + +func apiBulkSiteEntryFromCSVRecord(header map[string]int, record []string) (apiBulkSiteEntry, error) { + entry := apiBulkSiteEntry{} + entry.MonitorURL = csvField(header, record, "monitor_url") + if entry.MonitorURL == "" { + entry.MonitorURL = csvField(header, record, "url") + } + if v := csvField(header, record, "check_keyword"); v != "" { + entry.CheckKeyword = &v + } + if v := csvField(header, record, "forbidden_keyword"); v != "" { + entry.ForbiddenKeyword = &v + } + if v := csvField(header, record, "forbidden_keywords"); v != "" { + entry.ForbiddenKeywords = splitAPIBulkStringList(v) + } + if v := csvField(header, record, "redirect_policy"); v != "" { + entry.RedirectPolicy = &v + } + if v := csvField(header, record, "request_method"); v != "" { + entry.RequestMethod = &v + } + if v := csvField(header, record, "detection_profile"); v != "" { + entry.DetectionProfile = &v + } + if v := csvField(header, record, "timeout_seconds"); v != "" { + parsed, err := strconv.Atoi(v) + if err != nil { + return entry, fmt.Errorf("timeout_seconds must be an integer: %w", err) + } + entry.TimeoutSeconds = &parsed + } + if v := csvField(header, record, "check_interval"); v != "" { + parsed, err := strconv.Atoi(v) + if err != nil { + return entry, fmt.Errorf("check_interval must be an integer: %w", err) + } + entry.CheckInterval = &parsed + } + return entry, nil +} + +func csvField(header map[string]int, record []string, 
name string) string { + idx, ok := header[name] + if !ok || idx >= len(record) { + return "" + } + return strings.TrimSpace(record[idx]) +} + +func validateAPIBulkSiteEntries(entries []apiBulkSiteEntry) ([]apiBulkSiteEntry, error) { + if len(entries) == 0 { + return nil, errors.New("no sites found in source") + } + for i := range entries { + entries[i].MonitorURL = strings.TrimSpace(entries[i].MonitorURL) + if entries[i].MonitorURL == "" { + return nil, fmt.Errorf("site source entry %d is missing monitor_url", i+1) + } + } + return entries, nil +} + +func planAPIBulkSiteCreates(entries []apiBulkSiteEntry, opts apiSitesBulkAddOptions) ([]apiSiteCreateRequest, error) { + if opts.count <= 0 { + return nil, errors.New("count is required and must be positive") + } + if opts.count > apiSitesBulkAddMaxCount { + return nil, fmt.Errorf("count must be <= %d", apiSitesBulkAddMaxCount) + } + if opts.blogIDStart <= 0 { + return nil, errors.New("blog-id-start must be a positive integer") + } + if opts.batch != "" && opts.blogIDStart == defaultAPIBulkAddBlogIDStart { + opts.blogIDStart = apiCLIBatchBlogIDStart(opts.batch) + } + if len(entries) == 0 { + return nil, errors.New("no sites found in source") + } + + out := make([]apiSiteCreateRequest, 0, opts.count) + for i := 0; i < opts.count; i++ { + entry := entries[i%len(entries)] + req := apiSiteCreateRequest{ + BlogID: opts.blogIDStart + int64(i), + MonitorURL: entry.MonitorURL, + MonitorActive: opts.monitorActive.ptr(), + CheckKeyword: entry.CheckKeyword, + ForbiddenKeyword: entry.ForbiddenKeyword, + ForbiddenKeywords: forbiddenKeywordsPtr(entry.ForbiddenKeywords), + RedirectPolicy: entry.RedirectPolicy, + RequestMethod: entry.RequestMethod, + DetectionProfile: entry.DetectionProfile, + TimeoutSeconds: entry.TimeoutSeconds, + AlertCooldownMinutes: entry.AlertCooldownMinutes, + CheckInterval: entry.CheckInterval, + } + if len(entry.CustomHeaders) > 0 || opts.batch != "" { + headers := make(map[string]string, 
len(entry.CustomHeaders)+1) + for k, v := range entry.CustomHeaders { + headers[k] = v + } + if opts.batch != "" { + headers[apiCLIBatchHeader] = opts.batch + } + req.CustomHeaders = &headers + } + out = append(out, req) + } + return out, nil +} + +func marshalAPIBulkSiteRequests(requests []apiSiteCreateRequest) ([]json.RawMessage, error) { + out := make([]json.RawMessage, 0, len(requests)) + for _, req := range requests { + b, err := json.Marshal(req) + if err != nil { + return nil, err + } + out = append(out, json.RawMessage(b)) + } + return out, nil +} + +func splitAPIBulkStringList(raw string) []string { + parts := strings.Split(raw, ",") + out := make([]string, 0, len(parts)) + for _, part := range parts { + part = strings.TrimSpace(part) + if part != "" { + out = append(out, part) + } + } + return out +} + +func forbiddenKeywordsPtr(values []string) *[]string { + if len(values) == 0 { + return nil + } + out := make([]string, len(values)) + copy(out, values) + return &out +} + +func (e *apiBulkSiteEntry) UnmarshalJSON(data []byte) error { + var urlOnly string + if err := json.Unmarshal(data, &urlOnly); err == nil { + e.MonitorURL = urlOnly + return nil + } + + type bulkSiteEntry apiBulkSiteEntry + var aux struct { + bulkSiteEntry + URL string `json:"url"` + } + if err := json.Unmarshal(data, &aux); err != nil { + return err + } + *e = apiBulkSiteEntry(aux.bulkSiteEntry) + if e.MonitorURL == "" { + e.MonitorURL = aux.URL + } + return nil +} diff --git a/cmd/jetmon2/api_cli_sites_bulk_test.go b/cmd/jetmon2/api_cli_sites_bulk_test.go new file mode 100644 index 00000000..f569c8d1 --- /dev/null +++ b/cmd/jetmon2/api_cli_sites_bulk_test.go @@ -0,0 +1,182 @@ +package main + +import ( + "encoding/json" + "strings" + "testing" +) + +func TestParseAPIBulkJSONSiteEntries(t *testing.T) { + entries, err := parseAPIBulkSiteEntries([]byte(`[ + "https://example.com/", + {"url":"https://wordpress.com/","check_keyword":"WordPress","forbidden_keyword":"database 
error","forbidden_keywords":["metrics.evil-cdn.example/collect.js","buy cheap viagra"],"redirect_policy":"follow","timeout_seconds":5} + ]`)) + if err != nil { + t.Fatalf("parseAPIBulkSiteEntries() error = %v", err) + } + if len(entries) != 2 { + t.Fatalf("len(entries) = %d, want 2", len(entries)) + } + if entries[0].MonitorURL != "https://example.com/" { + t.Fatalf("first URL = %q", entries[0].MonitorURL) + } + if entries[1].MonitorURL != "https://wordpress.com/" { + t.Fatalf("second URL = %q", entries[1].MonitorURL) + } + if entries[1].CheckKeyword == nil || *entries[1].CheckKeyword != "WordPress" { + t.Fatalf("check_keyword = %#v, want WordPress", entries[1].CheckKeyword) + } + if entries[1].ForbiddenKeyword == nil || *entries[1].ForbiddenKeyword != "database error" { + t.Fatalf("forbidden_keyword = %#v, want database error", entries[1].ForbiddenKeyword) + } + if len(entries[1].ForbiddenKeywords) != 2 || entries[1].ForbiddenKeywords[0] != "metrics.evil-cdn.example/collect.js" { + t.Fatalf("forbidden_keywords = %#v", entries[1].ForbiddenKeywords) + } + if entries[1].TimeoutSeconds == nil || *entries[1].TimeoutSeconds != 5 { + t.Fatalf("timeout_seconds = %#v, want 5", entries[1].TimeoutSeconds) + } +} + +func TestParseAPIBulkCSVSiteEntries(t *testing.T) { + source := strings.NewReader("monitor_url,check_keyword,forbidden_keyword,forbidden_keywords,redirect_policy,check_interval\nhttps://example.com/,Example Domain,database error,\"metrics.evil-cdn.example/collect.js,buy cheap viagra\",follow,5\n") + entries, err := loadAPIBulkSiteEntries(apiSitesBulkAddOptions{source: "stdin"}, source) + if err != nil { + t.Fatalf("loadAPIBulkSiteEntries() error = %v", err) + } + if len(entries) != 1 { + t.Fatalf("len(entries) = %d, want 1", len(entries)) + } + if entries[0].MonitorURL != "https://example.com/" { + t.Fatalf("monitor_url = %q", entries[0].MonitorURL) + } + if entries[0].CheckKeyword == nil || *entries[0].CheckKeyword != "Example Domain" { + t.Fatalf("check_keyword 
= %#v, want Example Domain", entries[0].CheckKeyword) + } + if entries[0].ForbiddenKeyword == nil || *entries[0].ForbiddenKeyword != "database error" { + t.Fatalf("forbidden_keyword = %#v, want database error", entries[0].ForbiddenKeyword) + } + if len(entries[0].ForbiddenKeywords) != 2 || entries[0].ForbiddenKeywords[1] != "buy cheap viagra" { + t.Fatalf("forbidden_keywords = %#v", entries[0].ForbiddenKeywords) + } + if entries[0].CheckInterval == nil || *entries[0].CheckInterval != 5 { + t.Fatalf("check_interval = %#v, want 5", entries[0].CheckInterval) + } +} + +func TestParseAPIBulkNewlineSiteEntries(t *testing.T) { + entries, err := parseAPIBulkSiteEntries([]byte("https://example.com/\nhttps://wordpress.com/\n")) + if err != nil { + t.Fatalf("parseAPIBulkSiteEntries() error = %v", err) + } + if len(entries) != 2 { + t.Fatalf("len(entries) = %d, want 2", len(entries)) + } + if entries[1].MonitorURL != "https://wordpress.com/" { + t.Fatalf("second URL = %q", entries[1].MonitorURL) + } +} + +func TestPlanAPIBulkSiteCreatesCyclesFixtureEntries(t *testing.T) { + var active apiOptionalBoolFlag + setTestFlag(t, &active, "false") + forbidden := "database error" + forbiddenKeywords := []string{"metrics.evil-cdn.example/collect.js", "buy cheap viagra"} + entries := []apiBulkSiteEntry{ + {MonitorURL: "https://example.com/", ForbiddenKeyword: &forbidden, ForbiddenKeywords: forbiddenKeywords}, + {MonitorURL: "https://wordpress.com/"}, + } + planned, err := planAPIBulkSiteCreates(entries, apiSitesBulkAddOptions{ + count: 3, + blogIDStart: 900, + monitorActive: active, + }) + if err != nil { + t.Fatalf("planAPIBulkSiteCreates() error = %v", err) + } + if len(planned) != 3 { + t.Fatalf("len(planned) = %d, want 3", len(planned)) + } + if planned[0].BlogID != 900 || planned[2].BlogID != 902 { + t.Fatalf("blog ids = %d, %d; want 900, 902", planned[0].BlogID, planned[2].BlogID) + } + if planned[2].MonitorURL != "https://example.com/" { + t.Fatalf("cycled URL = %q, want first 
source URL", planned[2].MonitorURL) + } + if planned[2].ForbiddenKeyword == nil || *planned[2].ForbiddenKeyword != "database error" { + t.Fatalf("cycled forbidden_keyword = %#v, want database error", planned[2].ForbiddenKeyword) + } + if planned[2].ForbiddenKeywords == nil || len(*planned[2].ForbiddenKeywords) != 2 { + t.Fatalf("cycled forbidden_keywords = %#v, want two values", planned[2].ForbiddenKeywords) + } + if planned[0].MonitorActive == nil || *planned[0].MonitorActive { + t.Fatalf("monitor_active = %#v, want false", planned[0].MonitorActive) + } +} + +func TestPlanAPIBulkSiteCreatesUsesBatchMarker(t *testing.T) { + entries := []apiBulkSiteEntry{{MonitorURL: "https://example.com/"}} + planned, err := planAPIBulkSiteCreates(entries, apiSitesBulkAddOptions{ + count: 1, + batch: "batch-a", + blogIDStart: defaultAPIBulkAddBlogIDStart, + }) + if err != nil { + t.Fatalf("planAPIBulkSiteCreates() error = %v", err) + } + if planned[0].BlogID != apiCLIBatchBlogIDStart("batch-a") { + t.Fatalf("blog_id = %d, want batch-derived id", planned[0].BlogID) + } + if planned[0].CustomHeaders == nil || (*planned[0].CustomHeaders)[apiCLIBatchHeader] != "batch-a" { + t.Fatalf("custom headers = %#v, want batch marker", planned[0].CustomHeaders) + } +} + +func TestPlanAPIBulkSiteCreatesRejectsUnboundedCount(t *testing.T) { + _, err := planAPIBulkSiteCreates([]apiBulkSiteEntry{{MonitorURL: "https://example.com/"}}, apiSitesBulkAddOptions{ + count: apiSitesBulkAddMaxCount + 1, + blogIDStart: 900, + }) + if err == nil { + t.Fatal("planAPIBulkSiteCreates() error = nil, want max count error") + } +} + +func TestLoadAPIBulkFixture(t *testing.T) { + entries, err := loadAPIBulkSiteEntries(apiSitesBulkAddOptions{source: "fixture"}, nil) + if err != nil { + t.Fatalf("load fixture error = %v", err) + } + if len(entries) < 8 { + t.Fatalf("fixture entries = %d, want at least 8", len(entries)) + } +} + +func TestMarshalAPIBulkSiteRequests(t *testing.T) { + keyword := "Example Domain" + 
forbidden := "database error" + requests := []apiSiteCreateRequest{{ + BlogID: 900, + MonitorURL: "https://example.com/", + CheckKeyword: &keyword, + ForbiddenKeyword: &forbidden, + ForbiddenKeywords: &[]string{"metrics.evil-cdn.example/collect.js", "buy cheap viagra"}, + }} + raw, err := marshalAPIBulkSiteRequests(requests) + if err != nil { + t.Fatalf("marshalAPIBulkSiteRequests() error = %v", err) + } + var got map[string]any + if err := json.Unmarshal(raw[0], &got); err != nil { + t.Fatalf("unmarshal request: %v", err) + } + if got["blog_id"] != float64(900) { + t.Fatalf("blog_id = %#v, want 900", got["blog_id"]) + } + if got["check_keyword"] != "Example Domain" { + t.Fatalf("check_keyword = %#v, want Example Domain", got["check_keyword"]) + } + if got["forbidden_keyword"] != "database error" { + t.Fatalf("forbidden_keyword = %#v, want database error", got["forbidden_keyword"]) + } + assertStringArray(t, got["forbidden_keywords"], []string{"metrics.evil-cdn.example/collect.js", "buy cheap viagra"}) +} diff --git a/cmd/jetmon2/api_cli_sites_cleanup.go b/cmd/jetmon2/api_cli_sites_cleanup.go new file mode 100644 index 00000000..e371efd0 --- /dev/null +++ b/cmd/jetmon2/api_cli_sites_cleanup.go @@ -0,0 +1,214 @@ +package main + +import ( + "context" + "encoding/json" + "errors" + "fmt" + "io" + "net/http" + "strconv" + "strings" +) + +type apiSitesCleanupOptions struct { + batch string + siteIDs apiInt64SliceFlags + count int + blogIDStart int64 + dryRun bool + ignoreNotFound bool + allowUnmarked bool +} + +type apiSitesCleanupSummary struct { + DryRun bool `json:"dry_run,omitempty"` + Batch string `json:"batch,omitempty"` + Count int `json:"count"` + Sites []apiSitesCleanupResult `json:"sites"` +} + +type apiSitesCleanupResult struct { + SiteID int64 `json:"site_id"` + Status string `json:"status"` + Error string `json:"error,omitempty"` +} + +func cmdAPISitesCleanup(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites cleanup", 
&opts) + cleanup := apiSitesCleanupOptions{ + count: apiSitesBulkAddMaxCount, + ignoreNotFound: true, + } + fs.StringVar(&cleanup.batch, "batch", "", "batch label whose deterministic site ids should be deleted") + fs.Var(&cleanup.siteIDs, "site-id", "explicit site id to delete (repeatable or comma-separated)") + fs.IntVar(&cleanup.count, "count", cleanup.count, "number of batch-derived site ids to delete, max 200") + fs.Int64Var(&cleanup.blogIDStart, "blog-id-start", 0, "first batch blog_id; default derives from --batch") + fs.BoolVar(&cleanup.dryRun, "dry-run", false, "print the planned deletes without sending requests") + fs.BoolVar(&cleanup.ignoreNotFound, "ignore-not-found", cleanup.ignoreNotFound, "treat 404 responses as already cleaned") + fs.BoolVar(&cleanup.allowUnmarked, "allow-unmarked", false, "allow cleanup of --batch targets that do not expose the matching CLI batch marker") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api sites cleanup [flags]") + } + return runAPISitesCleanup(context.Background(), nil, opts, cleanup) +} + +func runAPISitesCleanup(ctx context.Context, client *http.Client, opts apiCLIOptions, cleanup apiSitesCleanupOptions) error { + if opts.out == nil { + opts.out = io.Discard + } + siteIDs, err := apiCleanupSiteIDs(cleanup) + if err != nil { + return err + } + if !cleanup.dryRun { + remote, err := requireAPILocalOrAllowRemote(opts, opts.allowRemote, "api sites cleanup") + if err != nil { + return err + } + if remote { + if strings.TrimSpace(cleanup.batch) == "" { + return errors.New("api sites cleanup requires --batch when --allow-remote targets a non-local API") + } + if cleanup.allowUnmarked { + return errors.New("api sites cleanup cannot use --allow-unmarked with --allow-remote") + } + } + } + + summary := apiSitesCleanupSummary{ + DryRun: cleanup.dryRun, + Batch: cleanup.batch, + Count: len(siteIDs), + Sites: make([]apiSitesCleanupResult, 0, 
len(siteIDs)), + } + for _, siteID := range siteIDs { + result := apiSitesCleanupResult{SiteID: siteID} + if cleanup.dryRun { + result.Status = "would_delete" + summary.Sites = append(summary.Sites, result) + continue + } + if cleanup.batch != "" && !cleanup.allowUnmarked { + ok, exists, err := apiSiteBelongsToBatch(ctx, client, opts, siteID, cleanup.batch) + if err != nil { + result.Status = "failed" + result.Error = err.Error() + summary.Sites = append(summary.Sites, result) + _ = writeAPIValueOutput(opts.out, summary, opts) + return fmt.Errorf("verify site %d batch marker: %w", siteID, err) + } + if !exists && cleanup.ignoreNotFound { + result.Status = "not_found" + summary.Sites = append(summary.Sites, result) + continue + } + if !exists { + result.Status = "failed" + result.Error = "site not found" + summary.Sites = append(summary.Sites, result) + _ = writeAPIValueOutput(opts.out, summary, opts) + return fmt.Errorf("site %d not found", siteID) + } + if !ok { + result.Status = "skipped_unmatched_batch" + result.Error = fmt.Sprintf("site does not expose cli_batch %q", cleanup.batch) + summary.Sites = append(summary.Sites, result) + _ = writeAPIValueOutput(opts.out, summary, opts) + return fmt.Errorf("site %d does not belong to CLI batch %q", siteID, cleanup.batch) + } + } + resp, err := doAPIRequest(ctx, client, opts, http.MethodDelete, "/api/v1/sites/"+strconv.FormatInt(siteID, 10), nil) + if err != nil { + result.Status = "failed" + result.Error = err.Error() + summary.Sites = append(summary.Sites, result) + _ = writeAPIValueOutput(opts.out, summary, opts) + return fmt.Errorf("delete site %d: %w", siteID, err) + } + switch { + case resp.StatusCode == http.StatusNotFound && cleanup.ignoreNotFound: + result.Status = "not_found" + case resp.StatusCode >= 400: + result.Status = "failed" + result.Error = strings.TrimSpace(string(resp.Body)) + if result.Error == "" { + result.Error = resp.Status + } + summary.Sites = append(summary.Sites, result) + _ = 
writeAPIValueOutput(opts.out, summary, opts) + return fmt.Errorf("delete site %d returned %s", siteID, resp.Status) + default: + result.Status = "deleted" + } + summary.Sites = append(summary.Sites, result) + } + return writeAPIValueOutput(opts.out, summary, opts) +} + +func apiCleanupSiteIDs(cleanup apiSitesCleanupOptions) ([]int64, error) { + if cleanup.siteIDs.set { + return cleanup.siteIDs.valuesOrEmpty(), nil + } + if cleanup.batch == "" && cleanup.blogIDStart == 0 { + return nil, errors.New("use --batch, --blog-id-start, or --site-id") + } + if cleanup.count <= 0 { + return nil, errors.New("count must be positive") + } + if cleanup.count > apiSitesBulkAddMaxCount { + return nil, fmt.Errorf("count must be <= %d", apiSitesBulkAddMaxCount) + } + start := cleanup.blogIDStart + if start == 0 { + start = apiCLIBatchBlogIDStart(cleanup.batch) + } + if start <= 0 { + return nil, errors.New("blog-id-start must be positive") + } + ids := make([]int64, 0, cleanup.count) + for i := 0; i < cleanup.count; i++ { + ids = append(ids, start+int64(i)) + } + return ids, nil +} + +func apiSiteBelongsToBatch(ctx context.Context, client *http.Client, opts apiCLIOptions, siteID int64, batch string) (bool, bool, error) { + resp, err := doAPIRequest(ctx, client, opts, http.MethodGet, apiSitePathWithCLIMetadata(siteID), nil) + if err != nil { + return false, false, err + } + if resp.StatusCode == http.StatusNotFound { + return false, false, nil + } + if resp.StatusCode >= 400 { + body := strings.TrimSpace(string(resp.Body)) + if body == "" { + body = resp.Status + } + return false, true, fmt.Errorf("site lookup returned %s: %s", resp.Status, body) + } + siteBatch, err := apiSiteCLIBatch(resp.Body) + if err != nil { + return false, true, err + } + return siteBatch == batch, true, nil +} + +func apiSitePathWithCLIMetadata(siteID int64) string { + return "/api/v1/sites/" + strconv.FormatInt(siteID, 10) + "?include_cli_metadata=true" +} + +func apiSiteCLIBatch(body []byte) (string, error) 
{ + var site struct { + CLIBatch string `json:"cli_batch"` + } + if err := json.Unmarshal(body, &site); err != nil { + return "", err + } + return site.CLIBatch, nil +} diff --git a/cmd/jetmon2/api_cli_sites_cleanup_test.go b/cmd/jetmon2/api_cli_sites_cleanup_test.go new file mode 100644 index 00000000..04ac57fc --- /dev/null +++ b/cmd/jetmon2/api_cli_sites_cleanup_test.go @@ -0,0 +1,150 @@ +package main + +import ( + "bytes" + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "strconv" + "strings" + "testing" + "time" +) + +func TestRunAPISitesCleanupDeletesBatchAndIgnoresMissing(t *testing.T) { + var calls []string + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + calls = append(calls, r.Method+" "+r.URL.RequestURI()) + switch { + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "000"): + writeTestJSON(t, w, map[string]any{"id": apiCLIBatchBlogIDStart("cleanup-batch"), "cli_batch": "cleanup-batch"}) + case r.Method == http.MethodDelete && strings.HasSuffix(r.URL.Path, "000"): + w.WriteHeader(http.StatusNoContent) + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "001"): + writeTestStatusJSON(t, w, http.StatusNotFound, map[string]string{"code": "site_not_found"}) + default: + t.Fatalf("unexpected request: %s %s", r.Method, r.URL.String()) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + start := apiCLIBatchBlogIDStart("cleanup-batch") + err := runAPISitesCleanup(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesCleanupOptions{ + batch: "cleanup-batch", + count: 2, + ignoreNotFound: true, + }) + if err != nil { + t.Fatalf("runAPISitesCleanup() error = %v\nstdout=%s", err, stdout.String()) + } + var summary apiSitesCleanupSummary + if err := json.Unmarshal(stdout.Bytes(), &summary); err != nil { + t.Fatalf("unmarshal summary: %v\n%s", err, stdout.String()) + } + if 
summary.Batch != "cleanup-batch" || summary.Count != 2 { + t.Fatalf("summary = %#v", summary) + } + if summary.Sites[0].SiteID != start || summary.Sites[0].Status != "deleted" { + t.Fatalf("first cleanup result = %#v", summary.Sites[0]) + } + if summary.Sites[1].SiteID != start+1 || summary.Sites[1].Status != "not_found" { + t.Fatalf("second cleanup result = %#v", summary.Sites[1]) + } + wantCalls := []string{ + "GET /api/v1/sites/" + strconvInt64(start) + "?include_cli_metadata=true", + "DELETE /api/v1/sites/" + strconvInt64(start), + "GET /api/v1/sites/" + strconvInt64(start+1) + "?include_cli_metadata=true", + } + if strings.Join(calls, "\n") != strings.Join(wantCalls, "\n") { + t.Fatalf("calls:\n%s\nwant:\n%s", strings.Join(calls, "\n"), strings.Join(wantCalls, "\n")) + } +} + +func TestRunAPISitesCleanupRejectsUnmatchedBatchMarker(t *testing.T) { + var calls []string + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + calls = append(calls, r.Method+" "+r.URL.RequestURI()) + switch { + case r.Method == http.MethodGet: + writeTestJSON(t, w, map[string]any{"id": 42, "cli_batch": "other-batch"}) + default: + t.Fatalf("unexpected request: %s %s", r.Method, r.URL.String()) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISitesCleanup(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesCleanupOptions{ + batch: "cleanup-batch", + siteIDs: mustSiteIDs(t, "42"), + ignoreNotFound: true, + }) + if err == nil { + t.Fatal("runAPISitesCleanup() error = nil, want batch mismatch") + } + var summary apiSitesCleanupSummary + if err := json.Unmarshal(stdout.Bytes(), &summary); err != nil { + t.Fatalf("unmarshal summary: %v\n%s", err, stdout.String()) + } + if got := summary.Sites[0].Status; got != "skipped_unmatched_batch" { + t.Fatalf("status = %q, want skipped_unmatched_batch", got) + } + if strings.Join(calls, "\n") != 
"GET /api/v1/sites/42?include_cli_metadata=true" { + t.Fatalf("calls:\n%s\nwant only GET", strings.Join(calls, "\n")) + } +} + +func TestRunAPISitesCleanupDryRunTable(t *testing.T) { + var stdout bytes.Buffer + err := runAPISitesCleanup(context.Background(), nil, apiCLIOptions{ + output: "table", + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesCleanupOptions{ + siteIDs: mustSiteIDs(t, "42,43"), + dryRun: true, + }) + if err != nil { + t.Fatalf("runAPISitesCleanup() error = %v", err) + } + got := stdout.String() + for _, want := range []string{ + "site_id status", + "42 would_delete", + "43 would_delete", + } { + if !strings.Contains(got, want) { + t.Fatalf("table missing %q:\n%s", want, got) + } + } +} + +func TestAPICleanupSiteIDsFromBatch(t *testing.T) { + ids, err := apiCleanupSiteIDs(apiSitesCleanupOptions{batch: "batch-a", count: 3}) + if err != nil { + t.Fatalf("apiCleanupSiteIDs() error = %v", err) + } + start := apiCLIBatchBlogIDStart("batch-a") + want := []int64{start, start + 1, start + 2} + for i := range want { + if ids[i] != want[i] { + t.Fatalf("ids[%d] = %d, want %d", i, ids[i], want[i]) + } + } +} + +func strconvInt64(v int64) string { + return strconv.FormatInt(v, 10) +} diff --git a/cmd/jetmon2/api_cli_sites_fixture.go b/cmd/jetmon2/api_cli_sites_fixture.go new file mode 100644 index 00000000..29e3829e --- /dev/null +++ b/cmd/jetmon2/api_cli_sites_fixture.go @@ -0,0 +1,6 @@ +package main + +import _ "embed" + +//go:embed testdata/api-cli-sites.json +var apiCLISiteFixture []byte diff --git a/cmd/jetmon2/api_cli_sites_simulate.go b/cmd/jetmon2/api_cli_sites_simulate.go new file mode 100644 index 00000000..d2e36ec1 --- /dev/null +++ b/cmd/jetmon2/api_cli_sites_simulate.go @@ -0,0 +1,697 @@ +package main + +import ( + "context" + "encoding/json" + "errors" + "fmt" + "io" + "net" + "net/http" + "net/url" + "strings" + "time" +) + +const ( + apiFixtureAuto = "auto" + apiFixtureOff = "off" + defaultAPIFixtureMonitorURL = "http://api-fixture:8091" + 
defaultAPIFixtureProbeURL = "http://localhost:18091/health" +) + +type apiSitesSimulateFailureOptions struct { + mode string + batch string + siteIDs apiInt64SliceFlags + count int + blogIDStart int64 + createMissing bool + trigger bool + wait time.Duration + pollInterval time.Duration + idempotencyKeyPrefix string + fixtureURL string + fixtureProbeURL string + allowUnmarkedBatch bool + expectEventState string + expectEventSeverity apiOptionalIntFlag + requireTransition bool + expectTransitionReason string +} + +type apiFailureModeDefinition struct { + Mode string + Description string + MonitorURL string + CheckKeyword *string + RedirectPolicy string + TimeoutSeconds *int + CustomHeaders map[string]string +} + +type apiSimulateFailureSummary struct { + Mode string `json:"mode"` + Batch string `json:"batch,omitempty"` + Wait string `json:"wait"` + Trigger bool `json:"trigger"` + CreateMissing bool `json:"create_missing"` + FixtureURL string `json:"fixture_url,omitempty"` + Sites []apiSimulatedSiteResult `json:"sites"` +} + +type apiSimulatedSiteResult struct { + SiteID int64 `json:"site_id"` + Action string `json:"action"` + TriggerStatus string `json:"trigger_status,omitempty"` + EventIDs []int64 `json:"event_ids,omitempty"` + EventStates []string `json:"event_states,omitempty"` + EventSeverities []int `json:"event_severities,omitempty"` + TransitionCount int `json:"transition_count"` + Site json.RawMessage `json:"site,omitempty"` + TriggerNow json.RawMessage `json:"trigger_now,omitempty"` + Events json.RawMessage `json:"events,omitempty"` + Transitions []apiSimulatedTransition `json:"transitions,omitempty"` + Note string `json:"note,omitempty"` + Error string `json:"error,omitempty"` +} + +type apiSimulatedTransition struct { + EventID int64 `json:"event_id"` + Transitions json.RawMessage `json:"transitions"` +} + +func cmdAPISitesSimulateFailure(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api sites simulate-failure", &opts) + sim 
:= apiSitesSimulateFailureOptions{ + mode: "http-500", + count: 1, + trigger: true, + pollInterval: 2 * time.Second, + fixtureURL: envOrDefault("JETMON_API_FIXTURE_URL", apiFixtureAuto), + fixtureProbeURL: envOrDefault( + "JETMON_API_FIXTURE_PROBE_URL", + defaultAPIFixtureProbeURL, + ), + } + fs.StringVar(&sim.mode, "mode", sim.mode, "failure mode: unreachable, http-500, http-403, redirect, keyword, timeout, or tls") + fs.StringVar(&sim.batch, "batch", "", "batch label whose deterministic site ids should be mutated") + fs.Var(&sim.siteIDs, "site-id", "explicit site id to mutate (repeatable or comma-separated)") + fs.IntVar(&sim.count, "count", sim.count, "number of batch-derived site ids to mutate") + fs.Int64Var(&sim.blogIDStart, "blog-id-start", 0, "first batch blog_id; default derives from --batch") + fs.BoolVar(&sim.createMissing, "create-missing", false, "create a site if the target id does not exist") + fs.BoolVar(&sim.trigger, "trigger", sim.trigger, "call trigger-now after mutation") + fs.DurationVar(&sim.wait, "wait", 0, "poll duration for active events after mutation") + fs.DurationVar(&sim.pollInterval, "poll-interval", sim.pollInterval, "active-event poll interval when --wait is set") + fs.StringVar(&sim.idempotencyKeyPrefix, "idempotency-key-prefix", "", "prefix for per-site POST Idempotency-Key headers") + fs.StringVar(&sim.fixtureURL, "fixture-url", sim.fixtureURL, "Docker fixture monitor URL, auto, or off") + fs.StringVar(&sim.fixtureProbeURL, "fixture-probe-url", sim.fixtureProbeURL, "URL used when --fixture-url=auto") + fs.BoolVar(&sim.allowUnmarkedBatch, "allow-unmarked", false, "allow mutation of --batch targets that do not expose the matching CLI batch marker") + fs.StringVar(&sim.expectEventState, "expect-event-state", "", "require at least one active event with this state after polling") + fs.Var(&sim.expectEventSeverity, "expect-event-severity", "require at least one active event with this severity after polling") + 
fs.BoolVar(&sim.requireTransition, "require-transition", false, "require at least one event transition after polling") + fs.StringVar(&sim.expectTransitionReason, "expect-transition-reason", "", "require at least one transition with this reason after polling") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api sites simulate-failure [flags]") + } + return runAPISitesSimulateFailure(context.Background(), nil, opts, sim) +} + +func runAPISitesSimulateFailure(ctx context.Context, client *http.Client, opts apiCLIOptions, sim apiSitesSimulateFailureOptions) error { + if opts.out == nil { + opts.out = io.Discard + } + remote, err := requireAPILocalOrAllowRemote(opts, opts.allowRemote, "api sites simulate-failure") + if err != nil { + return err + } + if remote { + if strings.TrimSpace(sim.batch) == "" { + return errors.New("api sites simulate-failure requires --batch when --allow-remote targets a non-local API") + } + if sim.allowUnmarkedBatch { + return errors.New("api sites simulate-failure cannot use --allow-unmarked with --allow-remote") + } + } + fixtureURL := apiSimulationFixtureURL(ctx, sim) + def, err := apiFailureMode(sim.mode, fixtureURL) + if err != nil { + return err + } + siteIDs, err := apiSimulationSiteIDs(sim) + if err != nil { + return err + } + if sim.pollInterval <= 0 { + return errors.New("poll-interval must be positive") + } + + summary := apiSimulateFailureSummary{ + Mode: def.Mode, + Batch: sim.batch, + Wait: sim.wait.String(), + Trigger: sim.trigger, + CreateMissing: sim.createMissing, + FixtureURL: fixtureURL, + Sites: make([]apiSimulatedSiteResult, 0, len(siteIDs)), + } + for i, siteID := range siteIDs { + result, err := runAPISiteSimulation(ctx, client, opts, sim, def, siteID, i) + summary.Sites = append(summary.Sites, result) + if err != nil { + summary.Sites[len(summary.Sites)-1].Error = err.Error() + _ = writeAPIValueOutput(opts.out, summary, opts) + return 
fmt.Errorf("simulate failure for site %d: %w", siteID, err) + } + } + return writeAPIValueOutput(opts.out, summary, opts) +} + +func runAPISiteSimulation(ctx context.Context, client *http.Client, opts apiCLIOptions, sim apiSitesSimulateFailureOptions, def apiFailureModeDefinition, siteID int64, index int) (apiSimulatedSiteResult, error) { + result := apiSimulatedSiteResult{SiteID: siteID} + if sim.batch != "" && !sim.allowUnmarkedBatch { + ok, exists, err := apiSiteBelongsToBatch(ctx, client, opts, siteID, sim.batch) + if err != nil { + return result, err + } + if exists && !ok { + return result, fmt.Errorf("site %d does not belong to CLI batch %q", siteID, sim.batch) + } + } + update := apiSiteUpdateRequest{ + MonitorURL: &def.MonitorURL, + CheckKeyword: def.CheckKeyword, + RedirectPolicy: &def.RedirectPolicy, + TimeoutSeconds: def.TimeoutSeconds, + } + if len(def.CustomHeaders) > 0 || sim.batch != "" { + headers := make(map[string]string, len(def.CustomHeaders)+1) + for k, v := range def.CustomHeaders { + headers[k] = v + } + if sim.batch != "" { + headers[apiCLIBatchHeader] = sim.batch + } + update.CustomHeaders = &headers + } + + site, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodPatch, fmt.Sprintf("/api/v1/sites/%d", siteID), update, "") + if err != nil { + var httpErr apiWorkflowHTTPError + if errors.As(err, &httpErr) && strings.Contains(httpErr.Status, "404") && sim.createMissing { + site, err = createMissingSimulationSite(ctx, client, opts, sim, def, siteID, index) + if err != nil { + return result, err + } + result.Action = "created" + } else { + return result, err + } + } else { + result.Action = "updated" + } + result.Site = site + + if sim.trigger { + body, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodPost, fmt.Sprintf("/api/v1/sites/%d/trigger-now", siteID), nil, apiSimulationIDKey(sim, index, "trigger-now")) + if err != nil { + return result, err + } + result.TriggerNow = body + result.TriggerStatus = 
apiTriggerNowStatus(body) + } else { + result.TriggerStatus = "skipped" + } + + events, transitions, err := waitForSimulationEvents(ctx, client, opts, siteID, sim) + if err != nil { + return result, err + } + result.Events = events + result.Transitions = transitions + result.EventIDs, result.EventStates, result.EventSeverities = summarizeSimulationEvents(events) + result.TransitionCount = simulationTransitionCount(transitions) + if len(transitions) == 0 { + result.Note = "no active events returned; trigger-now reports check results but regular orchestrator rounds create failure events" + } + if err := validateSimulationExpectations(result, sim); err != nil { + return result, err + } + return result, nil +} + +func createMissingSimulationSite(ctx context.Context, client *http.Client, opts apiCLIOptions, sim apiSitesSimulateFailureOptions, def apiFailureModeDefinition, siteID int64, index int) (json.RawMessage, error) { + headers := map[string]string{} + for k, v := range def.CustomHeaders { + headers[k] = v + } + if sim.batch != "" { + headers[apiCLIBatchHeader] = sim.batch + } + req := apiSiteCreateRequest{ + BlogID: siteID, + MonitorURL: def.MonitorURL, + CheckKeyword: def.CheckKeyword, + RedirectPolicy: &def.RedirectPolicy, + TimeoutSeconds: def.TimeoutSeconds, + } + if len(headers) > 0 { + req.CustomHeaders = &headers + } + return apiWorkflowRequestJSON(ctx, client, opts, http.MethodPost, "/api/v1/sites", req, apiSimulationIDKey(sim, index, "create-site")) +} + +func waitForSimulationEvents(ctx context.Context, client *http.Client, opts apiCLIOptions, siteID int64, sim apiSitesSimulateFailureOptions) (json.RawMessage, []apiSimulatedTransition, error) { + deadline := time.Now().Add(sim.wait) + for { + body, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodGet, fmt.Sprintf("/api/v1/sites/%d/events?active=true&limit=10", siteID), nil, "") + if err != nil { + return nil, nil, err + } + ids := eventIDsFromList(body) + transitions, err := 
querySimulationTransitions(ctx, client, opts, siteID, ids) + if err != nil { + return nil, nil, err + } + if simulationHasExpectations(sim) && sim.wait > 0 { + result := apiSimulatedSiteResult{SiteID: siteID, Events: body, Transitions: transitions} + if validateSimulationExpectations(result, sim) == nil { + return body, transitions, nil + } + } else if len(ids) > 0 || sim.wait <= 0 || time.Now().After(deadline) { + return body, transitions, nil + } + if sim.wait <= 0 || time.Now().After(deadline) { + return body, transitions, nil + } + select { + case <-ctx.Done(): + return nil, nil, ctx.Err() + case <-time.After(sim.pollInterval): + } + } +} + +func querySimulationTransitions(ctx context.Context, client *http.Client, opts apiCLIOptions, siteID int64, eventIDs []int64) ([]apiSimulatedTransition, error) { + out := make([]apiSimulatedTransition, 0, len(eventIDs)) + for _, eventID := range eventIDs { + body, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodGet, fmt.Sprintf("/api/v1/sites/%d/events/%d/transitions", siteID, eventID), nil, "") + if err != nil { + return nil, err + } + out = append(out, apiSimulatedTransition{EventID: eventID, Transitions: body}) + } + return out, nil +} + +func eventIDsFromList(body json.RawMessage) []int64 { + events, err := simulationEventsFromList(body) + if err != nil { + return nil + } + ids := make([]int64, 0, len(events)) + for _, event := range events { + if event.ID > 0 { + ids = append(ids, event.ID) + } + } + return ids +} + +type apiSimulationListedEvent struct { + ID int64 `json:"id"` + State string `json:"state"` + Severity int `json:"severity"` +} + +type apiSimulationListedTransition struct { + ID int64 `json:"id"` + EventID int64 `json:"event_id"` + Reason string `json:"reason"` + StateAfter *string `json:"state_after"` + SeverityAfter *int `json:"severity_after"` +} + +func simulationEventsFromList(body json.RawMessage) ([]apiSimulationListedEvent, error) { + var envelope struct { + Data 
[]apiSimulationListedEvent `json:"data"` + } + if err := json.Unmarshal(body, &envelope); err != nil { + return nil, err + } + return envelope.Data, nil +} + +func simulationTransitionsFromResults(results []apiSimulatedTransition) ([]apiSimulationListedTransition, error) { + rows := []apiSimulationListedTransition{} + for _, result := range results { + var envelope struct { + Data []apiSimulationListedTransition `json:"data"` + } + if err := json.Unmarshal(result.Transitions, &envelope); err != nil { + return nil, err + } + rows = append(rows, envelope.Data...) + } + return rows, nil +} + +func apiTriggerNowStatus(body json.RawMessage) string { + var envelope struct { + Result struct { + HTTPCode int `json:"http_code"` + ErrorCode int `json:"error_code"` + Success bool `json:"success"` + } `json:"result"` + } + if err := json.Unmarshal(body, &envelope); err != nil { + return "unknown" + } + if envelope.Result.Success { + return "success" + } + if envelope.Result.HTTPCode > 0 { + return fmt.Sprintf("failed_http_%d", envelope.Result.HTTPCode) + } + if envelope.Result.ErrorCode > 0 { + return fmt.Sprintf("failed_error_%d", envelope.Result.ErrorCode) + } + return "failed" +} + +func summarizeSimulationEvents(body json.RawMessage) ([]int64, []string, []int) { + events, err := simulationEventsFromList(body) + if err != nil { + return nil, nil, nil + } + ids := make([]int64, 0, len(events)) + states := make([]string, 0, len(events)) + severities := make([]int, 0, len(events)) + for _, event := range events { + ids = append(ids, event.ID) + states = append(states, event.State) + severities = append(severities, event.Severity) + } + return ids, states, severities +} + +func simulationTransitionCount(results []apiSimulatedTransition) int { + transitions, err := simulationTransitionsFromResults(results) + if err != nil { + return 0 + } + return len(transitions) +} + +func validateSimulationExpectations(result apiSimulatedSiteResult, sim apiSitesSimulateFailureOptions) error { 
+ if !simulationHasExpectations(sim) { + return nil + } + events, err := simulationEventsFromList(result.Events) + if err != nil { + return fmt.Errorf("decode active events response: %w", err) + } + var failures []string + if sim.expectEventState != "" && !simulationHasEventState(events, sim.expectEventState) { + failures = append(failures, fmt.Sprintf("expected active event state %q, got %s", sim.expectEventState, formatSimulationEvents(events))) + } + if sim.expectEventSeverity.set && !simulationHasEventSeverity(events, sim.expectEventSeverity.value) { + failures = append(failures, fmt.Sprintf("expected active event severity %d, got %s", sim.expectEventSeverity.value, formatSimulationEvents(events))) + } + transitions, err := simulationTransitionsFromResults(result.Transitions) + if err != nil { + return fmt.Errorf("decode transition response: %w", err) + } + if sim.requireTransition && len(transitions) == 0 { + failures = append(failures, "expected at least one transition, got none") + } + if sim.expectTransitionReason != "" && !simulationHasTransitionReason(transitions, sim.expectTransitionReason) { + failures = append(failures, fmt.Sprintf("expected transition reason %q, got %s", sim.expectTransitionReason, formatSimulationTransitions(transitions))) + } + if len(failures) > 0 { + return errors.New(strings.Join(failures, "; ")) + } + return nil +} + +func simulationHasExpectations(sim apiSitesSimulateFailureOptions) bool { + return sim.expectEventState != "" || + sim.expectEventSeverity.set || + sim.requireTransition || + sim.expectTransitionReason != "" +} + +func simulationHasEventState(events []apiSimulationListedEvent, state string) bool { + for _, event := range events { + if event.State == state { + return true + } + } + return false +} + +func simulationHasEventSeverity(events []apiSimulationListedEvent, severity int) bool { + for _, event := range events { + if event.Severity == severity { + return true + } + } + return false +} + +func 
simulationHasTransitionReason(transitions []apiSimulationListedTransition, reason string) bool { + for _, transition := range transitions { + if transition.Reason == reason { + return true + } + } + return false +} + +func formatSimulationEvents(events []apiSimulationListedEvent) string { + if len(events) == 0 { + return "none" + } + parts := make([]string, 0, len(events)) + for _, event := range events { + parts = append(parts, fmt.Sprintf("#%d state=%q severity=%d", event.ID, event.State, event.Severity)) + } + return strings.Join(parts, ", ") +} + +func formatSimulationTransitions(transitions []apiSimulationListedTransition) string { + if len(transitions) == 0 { + return "none" + } + parts := make([]string, 0, len(transitions)) + for _, transition := range transitions { + parts = append(parts, fmt.Sprintf("#%d event=%d reason=%q", transition.ID, transition.EventID, transition.Reason)) + } + return strings.Join(parts, ", ") +} + +func apiSimulationSiteIDs(sim apiSitesSimulateFailureOptions) ([]int64, error) { + if sim.siteIDs.set { + return sim.siteIDs.valuesOrEmpty(), nil + } + if sim.batch == "" && sim.blogIDStart == 0 { + return nil, errors.New("use --batch, --blog-id-start, or --site-id") + } + if sim.count <= 0 { + return nil, errors.New("count must be positive") + } + start := sim.blogIDStart + if start == 0 { + start = apiCLIBatchBlogIDStart(sim.batch) + } + if start <= 0 { + return nil, errors.New("blog-id-start must be positive") + } + ids := make([]int64, 0, sim.count) + for i := 0; i < sim.count; i++ { + ids = append(ids, start+int64(i)) + } + return ids, nil +} + +func apiSimulationIDKey(sim apiSitesSimulateFailureOptions, index int, suffix string) string { + if sim.idempotencyKeyPrefix == "" { + return "" + } + return fmt.Sprintf("%s-%03d-%s", sim.idempotencyKeyPrefix, index+1, suffix) +} + +func apiSimulationFixtureURL(ctx context.Context, sim apiSitesSimulateFailureOptions) string { + fixtureURL := strings.TrimSpace(sim.fixtureURL) + switch 
strings.ToLower(fixtureURL) { + case "", apiFixtureOff, "none", "false": + return "" + case apiFixtureAuto: + if apiFixtureAvailable(ctx, sim.fixtureProbeURL) { + return defaultAPIFixtureMonitorURL + } + return "" + default: + return strings.TrimRight(fixtureURL, "/") + } +} + +func apiFixtureAvailable(ctx context.Context, probeURL string) bool { + probeURL = strings.TrimSpace(probeURL) + if probeURL == "" { + return false + } + probeCtx, cancel := context.WithTimeout(ctx, 750*time.Millisecond) + defer cancel() + req, err := http.NewRequestWithContext(probeCtx, http.MethodGet, probeURL, nil) + if err != nil { + return false + } + resp, err := http.DefaultClient.Do(req) + if err != nil { + return false + } + defer resp.Body.Close() + return resp.StatusCode >= 200 && resp.StatusCode < 300 +} + +func apiFailureMode(mode, fixtureBase string) (apiFailureModeDefinition, error) { + if strings.TrimSpace(fixtureBase) != "" { + return apiFixtureFailureMode(mode, strings.TrimRight(fixtureBase, "/")) + } + + policyFollow := "follow" + policyFail := "fail" + missingKeyword := "jetmon-api-cli-keyword-that-should-not-exist" + timeoutShort := 2 + switch mode { + case "unreachable": + return apiFailureModeDefinition{ + Mode: mode, + Description: "reserved TEST-NET-1 address expected to be unreachable", + MonitorURL: "http://192.0.2.1/", + RedirectPolicy: policyFollow, + TimeoutSeconds: &timeoutShort, + }, nil + case "http-500": + return apiFailureModeDefinition{ + Mode: mode, + Description: "HTTP 500 response", + MonitorURL: "https://httpbin.org/status/500", + RedirectPolicy: policyFollow, + }, nil + case "http-403": + return apiFailureModeDefinition{ + Mode: mode, + Description: "HTTP 403 response", + MonitorURL: "https://httpbin.org/status/403", + RedirectPolicy: policyFollow, + }, nil + case "redirect": + return apiFailureModeDefinition{ + Mode: mode, + Description: "redirect response with fail policy", + MonitorURL: 
"https://httpbin.org/redirect-to?url=https%3A%2F%2Fexample.com%2F", + RedirectPolicy: policyFail, + }, nil + case "keyword": + return apiFailureModeDefinition{ + Mode: mode, + Description: "keyword mismatch against example.com", + MonitorURL: "https://example.com/", + CheckKeyword: &missingKeyword, + RedirectPolicy: policyFollow, + }, nil + case "timeout": + return apiFailureModeDefinition{ + Mode: mode, + Description: "slow response with short timeout", + MonitorURL: "https://httpbin.org/delay/10", + RedirectPolicy: policyFollow, + TimeoutSeconds: &timeoutShort, + }, nil + case "tls": + return apiFailureModeDefinition{ + Mode: mode, + Description: "expired TLS certificate", + MonitorURL: "https://expired.badssl.com/", + RedirectPolicy: policyFollow, + }, nil + default: + return apiFailureModeDefinition{}, errors.New("mode must be one of: unreachable, http-500, http-403, redirect, keyword, timeout, tls") + } +} + +func apiFixtureFailureMode(mode, fixtureBase string) (apiFailureModeDefinition, error) { + policyFollow := "follow" + policyFail := "fail" + missingKeyword := "jetmon-api-cli-keyword-that-should-not-exist" + timeoutShort := 1 + switch mode { + case "unreachable": + return apiFailureMode(mode, "") + case "http-500": + return apiFailureModeDefinition{ + Mode: mode, + Description: "Docker fixture HTTP 500 response", + MonitorURL: fixtureBase + "/status/500", + RedirectPolicy: policyFollow, + }, nil + case "http-403": + return apiFailureModeDefinition{ + Mode: mode, + Description: "Docker fixture HTTP 403 response", + MonitorURL: fixtureBase + "/status/403", + RedirectPolicy: policyFollow, + }, nil + case "redirect": + return apiFailureModeDefinition{ + Mode: mode, + Description: "Docker fixture redirect response with fail policy", + MonitorURL: fixtureBase + "/redirect", + RedirectPolicy: policyFail, + }, nil + case "keyword": + return apiFailureModeDefinition{ + Mode: mode, + Description: "Docker fixture keyword mismatch", + MonitorURL: fixtureBase + 
"/keyword", + CheckKeyword: &missingKeyword, + RedirectPolicy: policyFollow, + }, nil + case "timeout": + return apiFailureModeDefinition{ + Mode: mode, + Description: "Docker fixture slow response with short timeout", + MonitorURL: fixtureBase + "/slow?delay=5s", + RedirectPolicy: policyFollow, + TimeoutSeconds: &timeoutShort, + }, nil + case "tls": + return apiFailureModeDefinition{ + Mode: mode, + Description: "Docker fixture self-signed TLS certificate", + MonitorURL: apiFixtureTLSBase(fixtureBase) + "/tls", + RedirectPolicy: policyFollow, + }, nil + default: + return apiFailureModeDefinition{}, errors.New("mode must be one of: unreachable, http-500, http-403, redirect, keyword, timeout, tls") + } +} + +func apiFixtureTLSBase(fixtureBase string) string { + u, err := url.Parse(fixtureBase) + if err != nil || u.Host == "" { + return strings.TrimRight(fixtureBase, "/") + } + u.Scheme = "https" + host, port, err := net.SplitHostPort(u.Host) + if err == nil && port == "8091" { + u.Host = net.JoinHostPort(host, "8443") + } + return strings.TrimRight(u.String(), "/") +} diff --git a/cmd/jetmon2/api_cli_sites_simulate_test.go b/cmd/jetmon2/api_cli_sites_simulate_test.go new file mode 100644 index 00000000..71c7f833 --- /dev/null +++ b/cmd/jetmon2/api_cli_sites_simulate_test.go @@ -0,0 +1,376 @@ +package main + +import ( + "bytes" + "context" + "encoding/json" + "net/http" + "net/http/httptest" + "strings" + "testing" + "time" +) + +func TestRunAPISitesSimulateFailureUpdatesAndReportsEvents(t *testing.T) { + var severity apiOptionalIntFlag + setTestFlag(t, &severity, "3") + var calls []string + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + calls = append(calls, r.Method+" "+r.URL.String()) + switch { + case r.Method == http.MethodPatch && r.URL.Path == "/api/v1/sites/42": + var body map[string]any + decodeTestJSON(t, r, &body) + if body["monitor_url"] != "https://httpbin.org/status/500" { + t.Fatalf("monitor_url = %#v, want 
http-500 URL", body["monitor_url"]) + } + writeTestJSON(t, w, map[string]any{"id": 42, "monitor_url": body["monitor_url"]}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/sites/42/trigger-now": + writeTestJSON(t, w, map[string]any{"result": map[string]any{"success": false, "http_code": 500}}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/42/events": + writeTestJSON(t, w, map[string]any{ + "data": []any{map[string]any{"id": 99, "state": "Seems Down", "severity": 3}}, + "page": map[string]any{"limit": 10}, + }) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/42/events/99/transitions": + writeTestJSON(t, w, map[string]any{ + "data": []any{map[string]any{ + "id": 1, + "event_id": 99, + "severity_after": 3, + "state_after": "Seems Down", + "reason": "opened", + }}, + }) + default: + t.Fatalf("unexpected request: %s %s", r.Method, r.URL.String()) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISitesSimulateFailure(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + token: "token-123", + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesSimulateFailureOptions{ + mode: "http-500", + siteIDs: mustSiteIDs(t, "42"), + trigger: true, + pollInterval: time.Millisecond, + expectEventState: "Seems Down", + expectEventSeverity: severity, + requireTransition: true, + expectTransitionReason: "opened", + }) + if err != nil { + t.Fatalf("runAPISitesSimulateFailure() error = %v\nstdout=%s", err, stdout.String()) + } + var summary apiSimulateFailureSummary + if err := json.Unmarshal(stdout.Bytes(), &summary); err != nil { + t.Fatalf("unmarshal summary: %v\n%s", err, stdout.String()) + } + if summary.Mode != "http-500" || len(summary.Sites) != 1 { + t.Fatalf("summary = %#v", summary) + } + if summary.Sites[0].Action != "updated" { + t.Fatalf("action = %q, want updated", summary.Sites[0].Action) + } + if summary.Sites[0].TriggerStatus != "failed_http_500" { + 
t.Fatalf("trigger status = %q, want failed_http_500", summary.Sites[0].TriggerStatus) + } + if got := summary.Sites[0].EventIDs; len(got) != 1 || got[0] != 99 { + t.Fatalf("event ids = %#v, want [99]", got) + } + if got := summary.Sites[0].EventStates; len(got) != 1 || got[0] != "Seems Down" { + t.Fatalf("event states = %#v, want [Seems Down]", got) + } + if got := summary.Sites[0].EventSeverities; len(got) != 1 || got[0] != 3 { + t.Fatalf("event severities = %#v, want [3]", got) + } + if summary.Sites[0].TransitionCount != 1 { + t.Fatalf("transition count = %d, want 1", summary.Sites[0].TransitionCount) + } + if len(summary.Sites[0].Transitions) != 1 || summary.Sites[0].Transitions[0].EventID != 99 { + t.Fatalf("transitions = %#v, want event 99", summary.Sites[0].Transitions) + } + wantCalls := []string{ + "PATCH /api/v1/sites/42", + "POST /api/v1/sites/42/trigger-now", + "GET /api/v1/sites/42/events?active=true&limit=10", + "GET /api/v1/sites/42/events/99/transitions", + } + if strings.Join(calls, "\n") != strings.Join(wantCalls, "\n") { + t.Fatalf("calls:\n%s\nwant:\n%s", strings.Join(calls, "\n"), strings.Join(wantCalls, "\n")) + } +} + +func TestRunAPISitesSimulateFailurePollsUntilAssertionsMatch(t *testing.T) { + var eventPolls int + var transitionPolls int + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodPatch && r.URL.Path == "/api/v1/sites/42": + writeTestJSON(t, w, map[string]any{"id": 42}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/42/events": + eventPolls++ + state := "Seems Down" + severity := 3 + if eventPolls > 1 { + state = "Down" + severity = 4 + } + writeTestJSON(t, w, map[string]any{ + "data": []any{map[string]any{"id": 99, "state": state, "severity": severity}}, + "page": map[string]any{"limit": 10}, + }) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/42/events/99/transitions": + transitionPolls++ + reason := "opened" + 
if transitionPolls > 1 { + reason = "verifier_confirmed" + } + writeTestJSON(t, w, map[string]any{ + "data": []any{map[string]any{"id": transitionPolls, "event_id": 99, "reason": reason}}, + }) + default: + t.Fatalf("unexpected request: %s %s", r.Method, r.URL.String()) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISitesSimulateFailure(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesSimulateFailureOptions{ + mode: "http-500", + siteIDs: mustSiteIDs(t, "42"), + trigger: false, + wait: 100 * time.Millisecond, + pollInterval: time.Millisecond, + expectEventState: "Down", + expectTransitionReason: "verifier_confirmed", + }) + if err != nil { + t.Fatalf("runAPISitesSimulateFailure() error = %v\nstdout=%s", err, stdout.String()) + } + if eventPolls < 2 || transitionPolls < 2 { + t.Fatalf("eventPolls=%d transitionPolls=%d, want at least 2 each", eventPolls, transitionPolls) + } +} + +func TestRunAPISitesSimulateFailureFailsWhenAssertionsDoNotMatch(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodPatch && r.URL.Path == "/api/v1/sites/42": + writeTestJSON(t, w, map[string]any{"id": 42}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/42/events": + writeTestJSON(t, w, map[string]any{"data": []any{}, "page": map[string]any{"limit": 10}}) + default: + t.Fatalf("unexpected request: %s %s", r.Method, r.URL.String()) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISitesSimulateFailure(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesSimulateFailureOptions{ + mode: "http-500", + siteIDs: mustSiteIDs(t, "42"), + trigger: false, + pollInterval: time.Millisecond, + expectEventState: "Seems Down", + requireTransition: true, 
+ }) + if err == nil { + t.Fatalf("runAPISitesSimulateFailure() error = nil\nstdout=%s", stdout.String()) + } + if !strings.Contains(err.Error(), `expected active event state "Seems Down"`) { + t.Fatalf("error = %v, want event-state assertion failure", err) + } + if !strings.Contains(stdout.String(), "expected at least one transition") { + t.Fatalf("stdout = %s, want transition assertion failure", stdout.String()) + } +} + +func TestRunAPISitesSimulateFailureCanCreateMissing(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodPatch && r.URL.Path == "/api/v1/sites/42": + writeTestStatusJSON(t, w, http.StatusNotFound, map[string]string{"code": "site_not_found"}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/sites": + var body map[string]any + decodeTestJSON(t, r, &body) + if body["blog_id"] != float64(42) { + t.Fatalf("blog_id = %#v, want 42", body["blog_id"]) + } + writeTestStatusJSON(t, w, http.StatusCreated, map[string]any{"id": 42}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/42/events": + writeTestJSON(t, w, map[string]any{"data": []any{}, "page": map[string]any{"limit": 10}}) + default: + t.Fatalf("unexpected request: %s %s", r.Method, r.URL.String()) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISitesSimulateFailure(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesSimulateFailureOptions{ + mode: "keyword", + siteIDs: mustSiteIDs(t, "42"), + createMissing: true, + trigger: false, + pollInterval: time.Millisecond, + }) + if err != nil { + t.Fatalf("runAPISitesSimulateFailure() error = %v\nstdout=%s", err, stdout.String()) + } + var summary apiSimulateFailureSummary + if err := json.Unmarshal(stdout.Bytes(), &summary); err != nil { + t.Fatalf("unmarshal summary: %v\n%s", err, stdout.String()) + } + if 
summary.Sites[0].Action != "created" { + t.Fatalf("action = %q, want created", summary.Sites[0].Action) + } +} + +func TestRunAPISitesSimulateFailureRejectsUnmatchedBatchMarker(t *testing.T) { + start := apiCLIBatchBlogIDStart("simulation-batch") + var calls []string + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + calls = append(calls, r.Method+" "+r.URL.RequestURI()) + switch { + case r.Method == http.MethodGet && + r.URL.Path == "/api/v1/sites/"+strconvInt64(start) && + r.URL.Query().Get("include_cli_metadata") == "true": + writeTestJSON(t, w, map[string]any{"id": start, "cli_batch": "other-batch"}) + default: + t.Fatalf("unexpected request: %s %s", r.Method, r.URL.String()) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISitesSimulateFailure(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSitesSimulateFailureOptions{ + mode: "http-500", + batch: "simulation-batch", + count: 1, + trigger: false, + pollInterval: time.Millisecond, + }) + if err == nil { + t.Fatal("runAPISitesSimulateFailure() error = nil, want batch mismatch") + } + if !strings.Contains(err.Error(), `does not belong to CLI batch "simulation-batch"`) { + t.Fatalf("error = %v, want batch mismatch", err) + } + if strings.Join(calls, "\n") != "GET /api/v1/sites/"+strconvInt64(start)+"?include_cli_metadata=true" { + t.Fatalf("calls:\n%s\nwant only GET", strings.Join(calls, "\n")) + } +} + +func TestAPISimulationSiteIDsFromBatch(t *testing.T) { + ids, err := apiSimulationSiteIDs(apiSitesSimulateFailureOptions{batch: "batch-a", count: 3}) + if err != nil { + t.Fatalf("apiSimulationSiteIDs() error = %v", err) + } + start := apiCLIBatchBlogIDStart("batch-a") + want := []int64{start, start + 1, start + 2} + for i := range want { + if ids[i] != want[i] { + t.Fatalf("ids[%d] = %d, want %d", i, ids[i], want[i]) + } + } +} + +func 
TestAPIFailureModesCoverRoadmapTargets(t *testing.T) { + for _, mode := range []string{"unreachable", "http-500", "http-403", "redirect", "keyword", "timeout", "tls"} { + t.Run(mode, func(t *testing.T) { + def, err := apiFailureMode(mode, "") + if err != nil { + t.Fatalf("apiFailureMode(%q) error = %v", mode, err) + } + if def.MonitorURL == "" || def.RedirectPolicy == "" { + t.Fatalf("definition = %#v, want URL and redirect policy", def) + } + }) + } +} + +func TestAPIFailureModesPreferFixtureWhenConfigured(t *testing.T) { + tests := []struct { + mode string + url string + }{ + {mode: "http-500", url: "http://api-fixture:8091/status/500"}, + {mode: "http-403", url: "http://api-fixture:8091/status/403"}, + {mode: "redirect", url: "http://api-fixture:8091/redirect"}, + {mode: "keyword", url: "http://api-fixture:8091/keyword"}, + {mode: "timeout", url: "http://api-fixture:8091/slow?delay=5s"}, + {mode: "tls", url: "https://api-fixture:8443/tls"}, + } + for _, tt := range tests { + t.Run(tt.mode, func(t *testing.T) { + def, err := apiFailureMode(tt.mode, "http://api-fixture:8091") + if err != nil { + t.Fatalf("apiFailureMode(%q) error = %v", tt.mode, err) + } + if def.MonitorURL != tt.url { + t.Fatalf("MonitorURL = %q, want %q", def.MonitorURL, tt.url) + } + }) + } +} + +func TestAPISimulationFixtureURLAutoDetection(t *testing.T) { + fixture := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path != "/health" { + t.Fatalf("probe path = %q, want /health", r.URL.Path) + } + w.WriteHeader(http.StatusOK) + })) + defer fixture.Close() + + got := apiSimulationFixtureURL(context.Background(), apiSitesSimulateFailureOptions{ + fixtureURL: apiFixtureAuto, + fixtureProbeURL: fixture.URL + "/health", + }) + if got != defaultAPIFixtureMonitorURL { + t.Fatalf("fixture URL = %q, want default Docker monitor URL", got) + } + + got = apiSimulationFixtureURL(context.Background(), apiSitesSimulateFailureOptions{ + fixtureURL: apiFixtureAuto, 
+		fixtureProbeURL: "http://127.0.0.1:1/health",
+	})
+	if got != "" {
+		t.Fatalf("fixture URL = %q, want fallback to public endpoints", got)
+	}
+}
+
+// mustSiteIDs parses a comma-separated id list into the flag type, failing the
+// test on any parse error.
+func mustSiteIDs(t *testing.T, raw string) apiInt64SliceFlags {
+	t.Helper()
+	var ids apiInt64SliceFlags
+	if err := ids.Set(raw); err != nil {
+		t.Fatalf("set site ids: %v", err)
+	}
+	return ids
+}
diff --git a/cmd/jetmon2/api_cli_sites_test.go b/cmd/jetmon2/api_cli_sites_test.go
new file mode 100644
index 00000000..530d9c2f
--- /dev/null
+++ b/cmd/jetmon2/api_cli_sites_test.go
@@ -0,0 +1,181 @@
+package main
+
+import (
+	"encoding/json"
+	"net/url"
+	"testing"
+)
+
+func TestAPISitesListPath(t *testing.T) {
+	got, err := apiSitesListPath(apiSitesListFilters{
+		cursor:        "cur-1",
+		limit:         25,
+		stateIn:       "Down,Seems Down",
+		severityGTE:   3,
+		monitorActive: "1",
+		q:             "example.com",
+	})
+	if err != nil {
+		t.Fatalf("apiSitesListPath() error = %v", err)
+	}
+	u, err := url.Parse(got)
+	if err != nil {
+		t.Fatalf("parse path: %v", err)
+	}
+	if u.Path != "/api/v1/sites" {
+		t.Fatalf("path = %q, want /api/v1/sites", u.Path)
+	}
+	q := u.Query()
+	for key, want := range map[string]string{
+		"cursor":         "cur-1",
+		"limit":          "25",
+		"state__in":      "Down,Seems Down",
+		"severity__gte":  "3",
+		"monitor_active": "true",
+		"q":              "example.com",
+	} {
+		if got := q.Get(key); got != want {
+			t.Fatalf("query %s = %q, want %q (raw query %q)", key, got, want, u.RawQuery)
+		}
+	}
+}
+
+func TestAPISitesListPathRejectsAmbiguousStateFilter(t *testing.T) {
+	_, err := apiSitesListPath(apiSitesListFilters{state: "Down", stateIn: "Up,Down"})
+	if err == nil {
+		t.Fatal("apiSitesListPath() error = nil, want error")
+	}
+}
+
+func TestAPISiteResourcePath(t *testing.T) {
+	got, err := apiSiteResourcePath("42", "trigger-now")
+	if err != nil {
+		t.Fatalf("apiSiteResourcePath() error = %v", err)
+	}
+	if got != "/api/v1/sites/42/trigger-now" {
+		t.Fatalf("path = %q, want trigger-now path", got)
+	}
+	if _, err := apiSiteResourcePath("0", ""); err == nil {
+		
t.Fatal("apiSiteResourcePath() error = nil, want invalid id error") + } +} + +func TestMarshalAPISiteCreateBody(t *testing.T) { + var active apiOptionalBoolFlag + setTestFlag(t, &active, "false") + var bucket apiOptionalIntFlag + setTestFlag(t, &bucket, "7") + var redirect apiOptionalStringFlag + setTestFlag(t, &redirect, "alert") + var headers apiStringMapFlags + setTestFlag(t, &headers, "X-Jetmon-Test: yes") + var forbiddenKeywords apiStringSliceFlags + setTestFlag(t, &forbiddenKeywords, "metrics.evil-cdn.example/collect.js") + setTestFlag(t, &forbiddenKeywords, "buy cheap viagra") + + body, err := marshalAPISiteCreateBody(apiSiteCreateOptions{ + blogID: 12345, + monitorURL: "https://example.com", + monitorActive: active, + bucketNo: bucket, + forbiddenKeywords: forbiddenKeywords, + redirectPolicy: redirect, + customHeaders: headers, + }) + if err != nil { + t.Fatalf("marshalAPISiteCreateBody() error = %v", err) + } + var got map[string]any + if err := json.Unmarshal(body, &got); err != nil { + t.Fatalf("unmarshal body: %v", err) + } + if got["blog_id"] != float64(12345) { + t.Fatalf("blog_id = %#v, want 12345", got["blog_id"]) + } + if got["monitor_url"] != "https://example.com" { + t.Fatalf("monitor_url = %#v", got["monitor_url"]) + } + if got["monitor_active"] != false { + t.Fatalf("monitor_active = %#v, want false", got["monitor_active"]) + } + if got["bucket_no"] != float64(7) { + t.Fatalf("bucket_no = %#v, want 7", got["bucket_no"]) + } + if got["redirect_policy"] != "alert" { + t.Fatalf("redirect_policy = %#v, want alert", got["redirect_policy"]) + } + assertStringArray(t, got["forbidden_keywords"], []string{"metrics.evil-cdn.example/collect.js", "buy cheap viagra"}) + custom, ok := got["custom_headers"].(map[string]any) + if !ok { + t.Fatalf("custom_headers = %#v, want object", got["custom_headers"]) + } + if custom["X-Jetmon-Test"] != "yes" { + t.Fatalf("custom header = %#v, want yes", custom["X-Jetmon-Test"]) + } +} + +func 
TestMarshalAPISiteUpdateBodySupportsClears(t *testing.T) { + var keyword apiOptionalStringFlag + setTestFlag(t, &keyword, "") + var maintenanceEnd apiOptionalStringFlag + setTestFlag(t, &maintenanceEnd, "") + + body, err := marshalAPISiteUpdateBody(apiSiteUpdateOptions{ + checkKeyword: keyword, + clearCustomHeaders: true, + clearForbiddenKeywords: true, + maintenanceEnd: maintenanceEnd, + }) + if err != nil { + t.Fatalf("marshalAPISiteUpdateBody() error = %v", err) + } + var got map[string]any + if err := json.Unmarshal(body, &got); err != nil { + t.Fatalf("unmarshal body: %v", err) + } + if got["check_keyword"] != "" { + t.Fatalf("check_keyword = %#v, want empty string", got["check_keyword"]) + } + if got["maintenance_end"] != "" { + t.Fatalf("maintenance_end = %#v, want empty string", got["maintenance_end"]) + } + custom, ok := got["custom_headers"].(map[string]any) + if !ok { + t.Fatalf("custom_headers = %#v, want object", got["custom_headers"]) + } + if len(custom) != 0 { + t.Fatalf("custom_headers = %#v, want empty object", custom) + } + assertStringArray(t, got["forbidden_keywords"], []string{}) +} + +func TestMarshalAPISiteUpdateBodyRejectsCustomHeaderConflict(t *testing.T) { + var headers apiStringMapFlags + setTestFlag(t, &headers, "X-Test: yes") + _, err := marshalAPISiteUpdateBody(apiSiteUpdateOptions{ + customHeaders: headers, + clearCustomHeaders: true, + }) + if err == nil { + t.Fatal("marshalAPISiteUpdateBody() error = nil, want conflict error") + } +} + +func TestMarshalAPISiteUpdateBodyRejectsForbiddenKeywordConflict(t *testing.T) { + var forbiddenKeywords apiStringSliceFlags + setTestFlag(t, &forbiddenKeywords, "bad") + _, err := marshalAPISiteUpdateBody(apiSiteUpdateOptions{ + forbiddenKeywords: forbiddenKeywords, + clearForbiddenKeywords: true, + }) + if err == nil { + t.Fatal("marshalAPISiteUpdateBody() error = nil, want conflict error") + } +} + +func setTestFlag(t *testing.T, v interface{ Set(string) error }, raw string) { + t.Helper() + if 
err := v.Set(raw); err != nil { + t.Fatalf("Set(%q) error = %v", raw, err) + } +} diff --git a/cmd/jetmon2/api_cli_test.go b/cmd/jetmon2/api_cli_test.go new file mode 100644 index 00000000..6b1ae5e2 --- /dev/null +++ b/cmd/jetmon2/api_cli_test.go @@ -0,0 +1,496 @@ +package main + +import ( + "bytes" + "context" + "errors" + "flag" + "net/http" + "net/http/httptest" + "strings" + "testing" + "time" +) + +func TestAPIRequestURL(t *testing.T) { + tests := []struct { + name string + baseURL string + target string + want string + wantErr bool + }{ + { + name: "absolute path", + baseURL: "http://localhost:8090", + target: "/api/v1/health", + want: "http://localhost:8090/api/v1/health", + }, + { + name: "relative path", + baseURL: "http://localhost:8090/", + target: "api/v1/me", + want: "http://localhost:8090/api/v1/me", + }, + { + name: "absolute url", + baseURL: "http://localhost:8090", + target: "http://127.0.0.1:9000/api/v1/health", + want: "http://127.0.0.1:9000/api/v1/health", + }, + { + name: "base requires host", + baseURL: "localhost:8090", + target: "/api/v1/health", + wantErr: true, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got, err := apiRequestURL(tt.baseURL, tt.target) + if tt.wantErr { + if err == nil { + t.Fatal("apiRequestURL() error = nil, want error") + } + return + } + if err != nil { + t.Fatalf("apiRequestURL() error = %v", err) + } + if got != tt.want { + t.Fatalf("apiRequestURL() = %q, want %q", got, tt.want) + } + }) + } +} + +func TestExecuteAPIRequestSendsAuthAndVerboseHeaders(t *testing.T) { + var sawAuth, sawIDKey bool + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if got := r.Header.Get("Authorization"); got == "Bearer token-123" { + sawAuth = true + } + if got := r.Header.Get("Idempotency-Key"); got == "idem-1" { + sawIDKey = true + } + w.Header().Set("X-Test-Response", "yes") + w.Header().Set("Set-Cookie", "session=secret-cookie") + 
w.Header().Set("X-Api-Key", "response-api-key") + w.WriteHeader(http.StatusCreated) + _, _ = w.Write([]byte(`{"ok":true}`)) + })) + defer srv.Close() + + var stdout, stderr bytes.Buffer + opts := apiCLIOptions{ + baseURL: srv.URL, + token: "token-123", + idempotencyKey: "idem-1", + verbose: true, + pretty: true, + timeout: time.Second, + out: &stdout, + errOut: &stderr, + } + if err := executeAPIRequest(context.Background(), srv.Client(), opts, http.MethodPost, "/api/v1/sites/42/trigger-now", []byte(`{}`)); err != nil { + t.Fatalf("executeAPIRequest() error = %v", err) + } + if !sawAuth { + t.Fatal("Authorization header was not sent") + } + if !sawIDKey { + t.Fatal("Idempotency-Key header was not sent") + } + if got := stdout.String(); !strings.Contains(got, "{\n \"ok\": true\n}") { + t.Fatalf("stdout = %q, want pretty JSON body", got) + } + errOut := stderr.String() + for _, want := range []string{ + "> POST /api/v1/sites/42/trigger-now HTTP/1.1", + "> Authorization: [redacted]", + "> Idempotency-Key: [redacted]", + "< HTTP/1.1 201 Created", + "< Set-Cookie: [redacted]", + "< X-Api-Key: [redacted]", + "< X-Test-Response: yes", + } { + if !strings.Contains(errOut, want) { + t.Fatalf("stderr missing %q:\n%s", want, errOut) + } + } + for _, secret := range []string{"token-123", "idem-1", "secret-cookie", "response-api-key"} { + if strings.Contains(errOut, secret) { + t.Fatalf("stderr leaked %q:\n%s", secret, errOut) + } + } +} + +func TestExecuteAPIRequestSkipsAutomaticAuthForDifferentOrigin(t *testing.T) { + base := httptest.NewServer(http.NotFoundHandler()) + defer base.Close() + + var sawAuth, sawIDKey bool + target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + sawAuth = r.Header.Get("Authorization") != "" + sawIDKey = r.Header.Get("Idempotency-Key") != "" + _, _ = w.Write([]byte(`{"ok":true}`)) + })) + defer target.Close() + + var stdout bytes.Buffer + opts := apiCLIOptions{ + baseURL: base.URL, + token: "token-123", + 
idempotencyKey: "idem-1", + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + } + if err := executeAPIRequest(context.Background(), target.Client(), opts, http.MethodPost, target.URL+"/api/v1/sites", []byte(`{}`)); err != nil { + t.Fatalf("executeAPIRequest() error = %v", err) + } + if sawAuth { + t.Fatal("Authorization header was sent to a different origin") + } + if sawIDKey { + t.Fatal("Idempotency-Key header was sent to a different origin") + } +} + +func TestExecuteAPIRequestAnyOriginPolicySendsAutomaticAuth(t *testing.T) { + base := httptest.NewServer(http.NotFoundHandler()) + defer base.Close() + + var sawAuth, sawIDKey bool + target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + sawAuth = r.Header.Get("Authorization") == "Bearer token-123" + sawIDKey = r.Header.Get("Idempotency-Key") == "idem-1" + _, _ = w.Write([]byte(`{"ok":true}`)) + })) + defer target.Close() + + var stdout bytes.Buffer + opts := apiCLIOptions{ + baseURL: base.URL, + token: "token-123", + authPolicy: "any-origin", + idempotencyKey: "idem-1", + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + } + if err := executeAPIRequest(context.Background(), target.Client(), opts, http.MethodPost, target.URL+"/api/v1/sites", []byte(`{}`)); err != nil { + t.Fatalf("executeAPIRequest() error = %v", err) + } + if !sawAuth { + t.Fatal("Authorization header was not sent with any-origin policy") + } + if !sawIDKey { + t.Fatal("Idempotency-Key header was not sent with any-origin policy") + } +} + +func TestExecuteAPIRequestRejectsInvalidAuthPolicy(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + t.Fatal("server should not be called") + })) + defer srv.Close() + + err := executeAPIRequest(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + authPolicy: "sometimes", + timeout: time.Second, + out: ioDiscard{}, + errOut: ioDiscard{}, + }, http.MethodGet, 
"/api/v1/health", nil) + if err == nil { + t.Fatal("executeAPIRequest() error = nil, want invalid auth policy") + } + if !strings.Contains(err.Error(), "invalid auth policy") { + t.Fatalf("error = %v, want invalid auth policy", err) + } +} + +func TestExecuteAPIRequestReturnsErrorForHTTPFailureAfterWritingBody(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + w.WriteHeader(http.StatusUnauthorized) + _, _ = w.Write([]byte(`{"error":"missing token"}`)) + })) + defer srv.Close() + + var stdout bytes.Buffer + opts := apiCLIOptions{ + baseURL: srv.URL, + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + } + err := executeAPIRequest(context.Background(), srv.Client(), opts, http.MethodGet, "/api/v1/me", nil) + if err == nil { + t.Fatal("executeAPIRequest() error = nil, want error") + } + if got := stdout.String(); !strings.Contains(got, `"missing token"`) { + t.Fatalf("stdout = %q, want error body", got) + } +} + +func TestAPIFlagUsageUsesLongDashesAndHidesTokenDefault(t *testing.T) { + var stderr bytes.Buffer + opts := apiCLIOptions{ + baseURL: "http://localhost:8090", + token: "token-should-not-print", + timeout: 10 * time.Second, + errOut: &stderr, + } + fs := newAPIFlagSet("api health", &opts) + fs.Usage() + + got := stderr.String() + for _, want := range []string{ + "Usage of api health:", + "--allow-remote", + "--auth-policy string", + "--base-url string", + "--header value", + "--output string", + "--pretty", + "--timeout duration", + "--token string", + "-v", + "--verbose", + `API base URL (default "http://localhost:8090")`, + `request timeout (default 10s)`, + } { + if !strings.Contains(got, want) { + t.Fatalf("usage missing %q:\n%s", want, got) + } + } + for _, unwanted := range []string{ + " -base-url", + " -allow-remote", + " -header", + " -output", + " -pretty", + " -timeout", + " -token", + " -verbose", + "token-should-not-print", + } { + if strings.Contains(got, unwanted) { + 
t.Fatalf("usage contains %q:\n%s", unwanted, got) + } + } +} + +func TestAPIHelpReturnsFlagErrHelp(t *testing.T) { + var stderr bytes.Buffer + opts := apiCLIOptions{baseURL: "http://localhost:8090", timeout: 10 * time.Second, errOut: &stderr} + fs := newAPIFlagSet("api health", &opts) + err := parseAPIFlags(fs, []string{"--help"}) + if !errors.Is(err, flag.ErrHelp) { + t.Fatalf("Parse(--help) error = %v, want flag.ErrHelp", err) + } + if got := stderr.String(); !strings.Contains(got, "--base-url string") { + t.Fatalf("usage = %q, want long-dash flag output", got) + } +} + +func TestParseAPIFlagsAllowsFlagsAfterPositionals(t *testing.T) { + var stderr bytes.Buffer + opts := apiCLIOptions{baseURL: "http://localhost:8090", timeout: 10 * time.Second, errOut: &stderr} + fs := newAPIFlagSet("api sites get", &opts) + + err := parseAPIFlags(fs, []string{"12345", "--pretty", "--output", "table", "--header", "X-Test: yes"}) + if err != nil { + t.Fatalf("parseAPIFlags() error = %v", err) + } + if !opts.pretty { + t.Fatal("pretty = false, want true") + } + if opts.output != "table" { + t.Fatalf("output = %q, want table", opts.output) + } + if got := opts.headers; len(got) != 1 || got[0] != "X-Test: yes" { + t.Fatalf("headers = %#v, want X-Test header", got) + } + if got := fs.Args(); len(got) != 1 || got[0] != "12345" { + t.Fatalf("args = %#v, want [12345]", got) + } +} + +func TestParseAPIFlagsPreservesPositionalsAfterDoubleDash(t *testing.T) { + var stderr bytes.Buffer + opts := apiCLIOptions{baseURL: "http://localhost:8090", timeout: 10 * time.Second, errOut: &stderr} + fs := newAPIFlagSet("api request", &opts) + + err := parseAPIFlags(fs, []string{"GET", "--", "--not-a-flag"}) + if err != nil { + t.Fatalf("parseAPIFlags() error = %v", err) + } + if got := fs.Args(); len(got) != 2 || got[0] != "GET" || got[1] != "--not-a-flag" { + t.Fatalf("args = %#v, want GET and literal --not-a-flag", got) + } +} + +func TestNewAPIFlagSetHonorsPresetOutputDefault(t *testing.T) { + var 
stderr bytes.Buffer + opts := apiCLIOptions{output: "table", errOut: &stderr} + fs := newAPIFlagSet("api commands", &opts) + if err := parseAPIFlags(fs, nil); err != nil { + t.Fatalf("parseAPIFlags() error = %v", err) + } + if opts.output != "table" { + t.Fatalf("output = %q, want table", opts.output) + } +} + +func TestWriteAPICommandsTable(t *testing.T) { + var out bytes.Buffer + err := writeAPICommands(apiCLIOptions{output: "table", out: &out}) + if err != nil { + t.Fatalf("writeAPICommands() error = %v", err) + } + got := out.String() + for _, want := range []string{ + "command description", + "sites simulate-failure", + "mutate test sites into known failure modes", + "commands", + "list API CLI commands and examples", + } { + if !strings.Contains(got, want) { + t.Fatalf("commands table missing %q:\n%s", want, got) + } + } +} + +func TestWriteAPIResponseTableForSiteList(t *testing.T) { + body := []byte(`{ + "data": [ + {"id": 42, "monitor_url": "https://example.com", "monitor_active": true, "current_state": "Up", "current_severity": 0}, + {"id": 43, "monitor_url": "https://wordpress.com", "monitor_active": false, "current_state": "Paused", "current_severity": 0} + ], + "page": {"limit": 50} + }`) + var out bytes.Buffer + if err := writeAPIResponseTable(&out, body); err != nil { + t.Fatalf("writeAPIResponseTable() error = %v", err) + } + got := out.String() + for _, want := range []string{ + "id monitor_url monitor_active current_state current_severity", + "42 https://example.com true Up 0", + "43 https://wordpress.com false Paused 0", + } { + if !strings.Contains(got, want) { + t.Fatalf("table missing %q:\n%s", want, got) + } + } +} + +func TestWriteAPIResponseTableUsesNestedWorkflowRows(t *testing.T) { + body := []byte(`{ + "mode": "http-500", + "sites": [ + {"site_id": 42, "action": "updated", "note": "no active events returned"}, + {"site_id": 43, "action": "created", "error": "trigger failed"} + ] + }`) + var out bytes.Buffer + if err := 
writeAPIResponseTable(&out, body); err != nil { + t.Fatalf("writeAPIResponseTable() error = %v", err) + } + got := out.String() + for _, want := range []string{ + "site_id action note error", + "42 updated no active events returned", + "43 created trigger failed", + } { + if !strings.Contains(got, want) { + t.Fatalf("table missing %q:\n%s", want, got) + } + } +} + +func TestWriteAPIResponseTableIncludesSimulationSummaryColumns(t *testing.T) { + body := []byte(`{ + "mode": "http-500", + "sites": [ + { + "site_id": 42, + "action": "updated", + "trigger_status": "failed_http_500", + "event_ids": [99], + "event_states": ["Seems Down"], + "event_severities": [3], + "transition_count": 1 + } + ] + }`) + var out bytes.Buffer + if err := writeAPIResponseTable(&out, body); err != nil { + t.Fatalf("writeAPIResponseTable() error = %v", err) + } + got := out.String() + for _, want := range []string{ + "site_id action trigger_status event_ids event_states event_severities transition_count", + "42 updated failed_http_500 99 Seems Down 3 1", + } { + if !strings.Contains(got, want) { + t.Fatalf("table missing %q:\n%s", want, got) + } + } +} + +func TestWriteAPIResponseTableIncludesSmokeCleanupRows(t *testing.T) { + body := []byte(`{ + "steps": [ + {"name": "health", "status": "ok"}, + {"name": "me", "status": "ok"} + ], + "cleanup_results": [ + {"resource": "alert_contact", "id": 77, "status": "deleted"}, + {"resource": "site", "id": 910, "status": "failed", "error": "not found"} + ] + }`) + var out bytes.Buffer + if err := writeAPIResponseTable(&out, body); err != nil { + t.Fatalf("writeAPIResponseTable() error = %v", err) + } + got := out.String() + for _, want := range []string{ + "kind name id status detail", + "step health ok", + "cleanup alert_contact 77 deleted", + "cleanup site 910 failed not found", + } { + if !strings.Contains(got, want) { + t.Fatalf("table missing %q:\n%s", want, got) + } + } +} + +func TestWriteAPIResponseTableFallsBackToSortedColumns(t *testing.T) { + 
body := []byte(`{"zeta":"last","alpha":"first"}`) + var out bytes.Buffer + if err := writeAPIResponseTable(&out, body); err != nil { + t.Fatalf("writeAPIResponseTable() error = %v", err) + } + if got := out.String(); !strings.HasPrefix(got, "alpha zeta\n") { + t.Fatalf("table = %q, want sorted fallback columns", got) + } +} + +func TestWriteAPIOutputRejectsUnknownFormat(t *testing.T) { + err := writeAPIOutput(ioDiscard{}, []byte(`{"ok":true}`), apiCLIOptions{output: "yaml"}) + if err == nil { + t.Fatal("writeAPIOutput() error = nil, want bad output format") + } +} + +type ioDiscard struct{} + +func (ioDiscard) Write(p []byte) (int, error) { + return len(p), nil +} diff --git a/cmd/jetmon2/api_cli_webhooks.go b/cmd/jetmon2/api_cli_webhooks.go new file mode 100644 index 00000000..0295c9cb --- /dev/null +++ b/cmd/jetmon2/api_cli_webhooks.go @@ -0,0 +1,412 @@ +package main + +import ( + "context" + "encoding/json" + "errors" + "fmt" + "net/http" + "net/url" + "strconv" + "strings" +) + +type apiWebhookCreateOptions struct { + url string + active apiOptionalBoolFlag + events apiStringSliceFlags + siteIDs apiInt64SliceFlags + states apiStringSliceFlags +} + +type apiWebhookUpdateOptions struct { + url apiOptionalStringFlag + active apiOptionalBoolFlag + events apiStringSliceFlags + clearEvents bool + siteIDs apiInt64SliceFlags + clearSites bool + states apiStringSliceFlags + clearStates bool +} + +type apiWebhookDeliveriesFilters struct { + cursor string + limit int + status string +} + +type apiWebhookSiteFilter struct { + SiteIDs []int64 `json:"site_ids,omitempty"` +} + +type apiWebhookStateFilter struct { + States []string `json:"states,omitempty"` +} + +type apiWebhookCreateRequest struct { + URL string `json:"url"` + Active *bool `json:"active,omitempty"` + Events []string `json:"events"` + SiteFilter apiWebhookSiteFilter `json:"site_filter"` + StateFilter apiWebhookStateFilter `json:"state_filter"` +} + +type apiWebhookUpdateRequest struct { + URL *string 
`json:"url,omitempty"` + Active *bool `json:"active,omitempty"` + Events *[]string `json:"events,omitempty"` + SiteFilter *apiWebhookSiteFilter `json:"site_filter,omitempty"` + StateFilter *apiWebhookStateFilter `json:"state_filter,omitempty"` +} + +func cmdAPIWebhooks(args []string) error { + if len(args) == 0 { + return errors.New("usage: jetmon2 api webhooks [flags]") + } + + sub := args[0] + rest := args[1:] + switch sub { + case "list": + return cmdAPIWebhooksList(rest) + case "get": + return cmdAPIWebhooksGet(rest) + case "create": + return cmdAPIWebhooksCreate(rest) + case "update": + return cmdAPIWebhooksUpdate(rest) + case "delete": + return cmdAPIWebhooksDelete(rest) + case "rotate-secret": + return cmdAPIWebhooksRotateSecret(rest) + case "deliveries": + return cmdAPIWebhooksDeliveries(rest) + case "retry": + return cmdAPIWebhooksRetry(rest) + default: + return fmt.Errorf("unknown api webhooks subcommand %q (want: list, get, create, update, delete, rotate-secret, deliveries, retry)", sub) + } +} + +func cmdAPIWebhooksList(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api webhooks list", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api webhooks list [flags]") + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, "/api/v1/webhooks", nil) +} + +func cmdAPIWebhooksGet(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api webhooks get", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api webhooks get [flags] ") + } + target, err := apiWebhookPath(fs.Arg(0), "") + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPIWebhooksCreate(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api webhooks create", 
&opts) + addAPIIdempotencyFlag(fs, &opts) + create := apiWebhookCreateOptions{} + fs.StringVar(&create.url, "url", "", "webhook destination URL") + fs.Var(&create.active, "active", "webhook enabled: true or false") + fs.Var(&create.events, "event", "event type filter (repeatable or comma-separated)") + fs.Var(&create.siteIDs, "site-id", "site id filter (repeatable or comma-separated)") + fs.Var(&create.states, "state", "state filter (repeatable or comma-separated)") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 0 { + return errors.New("usage: jetmon2 api webhooks create [flags]") + } + body, err := marshalAPIWebhookCreateBody(create) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, "/api/v1/webhooks", body) +} + +func cmdAPIWebhooksUpdate(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api webhooks update", &opts) + update := apiWebhookUpdateOptions{} + fs.Var(&update.url, "url", "webhook destination URL") + fs.Var(&update.active, "active", "webhook enabled: true or false") + fs.Var(&update.events, "event", "event type filter (repeatable or comma-separated)") + fs.BoolVar(&update.clearEvents, "clear-events", false, "clear event filters") + fs.Var(&update.siteIDs, "site-id", "site id filter (repeatable or comma-separated)") + fs.BoolVar(&update.clearSites, "clear-sites", false, "clear site filters") + fs.Var(&update.states, "state", "state filter (repeatable or comma-separated)") + fs.BoolVar(&update.clearStates, "clear-states", false, "clear state filters") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api webhooks update [flags] ") + } + target, err := apiWebhookPath(fs.Arg(0), "") + if err != nil { + return err + } + body, err := marshalAPIWebhookUpdateBody(update) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, 
http.MethodPatch, target, body) +} + +func cmdAPIWebhooksDelete(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api webhooks delete", &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api webhooks delete [flags] ") + } + target, err := apiWebhookPath(fs.Arg(0), "") + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodDelete, target, nil) +} + +func cmdAPIWebhooksRotateSecret(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api webhooks rotate-secret", &opts) + addAPIIdempotencyFlag(fs, &opts) + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api webhooks rotate-secret [flags] ") + } + target, err := apiWebhookPath(fs.Arg(0), "rotate-secret") + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, target, nil) +} + +func cmdAPIWebhooksDeliveries(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api webhooks deliveries", &opts) + filters := apiWebhookDeliveriesFilters{} + fs.StringVar(&filters.cursor, "cursor", "", "pagination cursor") + fs.IntVar(&filters.limit, "limit", 0, "page size (1-200)") + fs.StringVar(&filters.status, "status", "", "delivery status: pending, delivered, failed, or abandoned") + if err := parseAPIFlags(fs, args); err != nil { + return err + } + if fs.NArg() != 1 { + return errors.New("usage: jetmon2 api webhooks deliveries [flags] ") + } + target, err := apiWebhookDeliveriesPath(fs.Arg(0), filters) + if err != nil { + return err + } + return executeAPIRequest(context.Background(), nil, opts, http.MethodGet, target, nil) +} + +func cmdAPIWebhooksRetry(args []string) error { + opts := defaultAPIOptions() + fs := newAPIFlagSet("api webhooks retry", &opts) + addAPIIdempotencyFlag(fs, &opts) + if err := 
parseAPIFlags(fs, args); err != nil {
+		return err
+	}
+	if fs.NArg() != 2 {
+		return errors.New("usage: jetmon2 api webhooks retry [flags] <webhook-id> <delivery-id>")
+	}
+	target, err := apiWebhookRetryPath(fs.Arg(0), fs.Arg(1))
+	if err != nil {
+		return err
+	}
+	return executeAPIRequest(context.Background(), nil, opts, http.MethodPost, target, nil)
+}
+
+// apiWebhookPath builds /api/v1/webhooks/<id>[/<suffix>] after validating the id.
+func apiWebhookPath(rawID, suffix string) (string, error) {
+	id, err := apiPositiveID(rawID, "webhook")
+	if err != nil {
+		return "", err
+	}
+	path := "/api/v1/webhooks/" + strconv.FormatInt(id, 10)
+	if suffix != "" {
+		path += "/" + strings.TrimPrefix(suffix, "/")
+	}
+	return path, nil
+}
+
+// apiWebhookDeliveriesPath builds the deliveries listing path with optional
+// cursor/limit/status query parameters. A zero limit means "not provided".
+func apiWebhookDeliveriesPath(rawID string, filters apiWebhookDeliveriesFilters) (string, error) {
+	path, err := apiWebhookPath(rawID, "deliveries")
+	if err != nil {
+		return "", err
+	}
+	if filters.limit < 0 {
+		// Zero is the unset default, so only negative values are invalid.
+		return "", errors.New("limit must not be negative")
+	}
+	values := url.Values{}
+	if filters.cursor != "" {
+		values.Set("cursor", filters.cursor)
+	}
+	if filters.limit > 0 {
+		values.Set("limit", strconv.Itoa(filters.limit))
+	}
+	if filters.status != "" {
+		switch filters.status {
+		case "pending", "delivered", "failed", "abandoned":
+			values.Set("status", filters.status)
+		default:
+			return "", errors.New("status must be one of: pending, delivered, failed, abandoned")
+		}
+	}
+	if len(values) == 0 {
+		return path, nil
+	}
+	return path + "?" 
+ values.Encode(), nil +} + +func apiWebhookRetryPath(rawWebhookID, rawDeliveryID string) (string, error) { + webhookID, err := apiPositiveID(rawWebhookID, "webhook") + if err != nil { + return "", err + } + deliveryID, err := apiPositiveID(rawDeliveryID, "delivery") + if err != nil { + return "", err + } + return fmt.Sprintf("/api/v1/webhooks/%d/deliveries/%d/retry", webhookID, deliveryID), nil +} + +func marshalAPIWebhookCreateBody(opts apiWebhookCreateOptions) ([]byte, error) { + if strings.TrimSpace(opts.url) == "" { + return nil, errors.New("url is required") + } + req := apiWebhookCreateRequest{ + URL: opts.url, + Active: opts.active.ptr(), + Events: opts.events.valuesOrEmpty(), + SiteFilter: apiWebhookSiteFilter{SiteIDs: opts.siteIDs.valuesOrEmpty()}, + StateFilter: apiWebhookStateFilter{States: opts.states.valuesOrEmpty()}, + } + return json.Marshal(req) +} + +func marshalAPIWebhookUpdateBody(opts apiWebhookUpdateOptions) ([]byte, error) { + if opts.clearEvents && opts.events.set { + return nil, errors.New("use --event or --clear-events, not both") + } + if opts.clearSites && opts.siteIDs.set { + return nil, errors.New("use --site-id or --clear-sites, not both") + } + if opts.clearStates && opts.states.set { + return nil, errors.New("use --state or --clear-states, not both") + } + + req := apiWebhookUpdateRequest{ + URL: opts.url.ptr(), + Active: opts.active.ptr(), + } + if opts.events.set || opts.clearEvents { + events := opts.events.valuesOrEmpty() + req.Events = &events + } + if opts.siteIDs.set || opts.clearSites { + req.SiteFilter = &apiWebhookSiteFilter{SiteIDs: opts.siteIDs.valuesOrEmpty()} + } + if opts.states.set || opts.clearStates { + req.StateFilter = &apiWebhookStateFilter{States: opts.states.valuesOrEmpty()} + } + return json.Marshal(req) +} + +type apiStringSliceFlags struct { + values []string + set bool +} + +func (f *apiStringSliceFlags) Set(v string) error { + for _, part := range strings.Split(v, ",") { + part = strings.TrimSpace(part) + 
if part == "" { + continue + } + f.values = append(f.values, part) + f.set = true + } + return nil +} + +func (f *apiStringSliceFlags) String() string { + return strings.Join(f.values, ",") +} + +func (f apiStringSliceFlags) valuesOrEmpty() []string { + if !f.set { + return []string{} + } + out := make([]string, len(f.values)) + copy(out, f.values) + return out +} + +func (f apiStringSliceFlags) ptr() *[]string { + if !f.set { + return nil + } + out := f.valuesOrEmpty() + return &out +} + +type apiInt64SliceFlags struct { + values []int64 + set bool +} + +func (f *apiInt64SliceFlags) Set(v string) error { + for _, part := range strings.Split(v, ",") { + part = strings.TrimSpace(part) + if part == "" { + continue + } + id, err := apiPositiveID(part, "site") + if err != nil { + return err + } + f.values = append(f.values, id) + f.set = true + } + return nil +} + +func (f *apiInt64SliceFlags) String() string { + parts := make([]string, len(f.values)) + for i, v := range f.values { + parts[i] = strconv.FormatInt(v, 10) + } + return strings.Join(parts, ",") +} + +func (f apiInt64SliceFlags) valuesOrEmpty() []int64 { + if !f.set { + return []int64{} + } + out := make([]int64, len(f.values)) + copy(out, f.values) + return out +} diff --git a/cmd/jetmon2/api_cli_webhooks_test.go b/cmd/jetmon2/api_cli_webhooks_test.go new file mode 100644 index 00000000..85ce187c --- /dev/null +++ b/cmd/jetmon2/api_cli_webhooks_test.go @@ -0,0 +1,190 @@ +package main + +import ( + "encoding/json" + "net/url" + "testing" +) + +func TestMarshalAPIWebhookCreateBody(t *testing.T) { + var active apiOptionalBoolFlag + setTestFlag(t, &active, "false") + var events apiStringSliceFlags + setTestFlag(t, &events, "event.opened,event.closed") + var siteIDs apiInt64SliceFlags + setTestFlag(t, &siteIDs, "42,99") + var states apiStringSliceFlags + setTestFlag(t, &states, "Down") + + body, err := marshalAPIWebhookCreateBody(apiWebhookCreateOptions{ + url: "https://example.com/hook", + active: active, + 
// TestMarshalAPIWebhookCreateBody checks that fully-populated create flags
// produce the expected JSON: url/active at the top level, events as an
// array, and site/state filters nested under their filter objects.
func TestMarshalAPIWebhookCreateBody(t *testing.T) {
	var active apiOptionalBoolFlag
	setTestFlag(t, &active, "false")
	var events apiStringSliceFlags
	setTestFlag(t, &events, "event.opened,event.closed")
	var siteIDs apiInt64SliceFlags
	setTestFlag(t, &siteIDs, "42,99")
	var states apiStringSliceFlags
	setTestFlag(t, &states, "Down")

	body, err := marshalAPIWebhookCreateBody(apiWebhookCreateOptions{
		url:     "https://example.com/hook",
		active:  active,
		events:  events,
		siteIDs: siteIDs,
		states:  states,
	})
	if err != nil {
		t.Fatalf("marshalAPIWebhookCreateBody() error = %v", err)
	}
	var got map[string]any
	if err := json.Unmarshal(body, &got); err != nil {
		t.Fatalf("unmarshal body: %v", err)
	}
	if got["url"] != "https://example.com/hook" {
		t.Fatalf("url = %#v", got["url"])
	}
	if got["active"] != false {
		t.Fatalf("active = %#v, want false", got["active"])
	}
	assertStringArray(t, got["events"], []string{"event.opened", "event.closed"})
	siteFilter := got["site_filter"].(map[string]any)
	assertNumberArray(t, siteFilter["site_ids"], []int64{42, 99})
	stateFilter := got["state_filter"].(map[string]any)
	assertStringArray(t, stateFilter["states"], []string{"Down"})
}

// TestMarshalAPIWebhookCreateBodyDefaultsFiltersToMatchAll checks the
// minimal create body: events is an explicit empty array while the nested
// filter lists are absent entirely (presumably via omitempty on the filter
// structs — TODO confirm against the request type definitions).
func TestMarshalAPIWebhookCreateBodyDefaultsFiltersToMatchAll(t *testing.T) {
	body, err := marshalAPIWebhookCreateBody(apiWebhookCreateOptions{
		url: "https://example.com/hook",
	})
	if err != nil {
		t.Fatalf("marshalAPIWebhookCreateBody() error = %v", err)
	}
	var got map[string]any
	if err := json.Unmarshal(body, &got); err != nil {
		t.Fatalf("unmarshal body: %v", err)
	}
	assertStringArray(t, got["events"], []string{})
	if _, ok := got["site_filter"].(map[string]any)["site_ids"]; ok {
		t.Fatalf("site_ids present in empty site_filter: %#v", got["site_filter"])
	}
	if _, ok := got["state_filter"].(map[string]any)["states"]; ok {
		t.Fatalf("states present in empty state_filter: %#v", got["state_filter"])
	}
}

// TestMarshalAPIWebhookUpdateBodySupportsClears checks that the --clear-*
// flags emit explicit empty lists (events) or empty filter objects.
func TestMarshalAPIWebhookUpdateBodySupportsClears(t *testing.T) {
	body, err := marshalAPIWebhookUpdateBody(apiWebhookUpdateOptions{
		clearEvents: true,
		clearSites:  true,
		clearStates: true,
	})
	if err != nil {
		t.Fatalf("marshalAPIWebhookUpdateBody() error = %v", err)
	}
	var got map[string]any
	if err := json.Unmarshal(body, &got); err != nil {
		t.Fatalf("unmarshal body: %v", err)
	}
	assertStringArray(t, got["events"], []string{})
	if _, ok := got["site_filter"].(map[string]any)["site_ids"]; ok {
		t.Fatalf("site_ids present in cleared site_filter: %#v", got["site_filter"])
	}
	if _, ok := got["state_filter"].(map[string]any)["states"]; ok {
		t.Fatalf("states present in cleared state_filter: %#v", got["state_filter"])
	}
}

// TestMarshalAPIWebhookUpdateBodyRejectsClearConflicts checks that setting a
// list flag together with its matching --clear-* flag is an error, for each
// of the three lists.
func TestMarshalAPIWebhookUpdateBodyRejectsClearConflicts(t *testing.T) {
	var events apiStringSliceFlags
	setTestFlag(t, &events, "event.opened")
	if _, err := marshalAPIWebhookUpdateBody(apiWebhookUpdateOptions{events: events, clearEvents: true}); err == nil {
		t.Fatal("events conflict error = nil, want error")
	}

	var siteIDs apiInt64SliceFlags
	setTestFlag(t, &siteIDs, "42")
	if _, err := marshalAPIWebhookUpdateBody(apiWebhookUpdateOptions{siteIDs: siteIDs, clearSites: true}); err == nil {
		t.Fatal("sites conflict error = nil, want error")
	}

	var states apiStringSliceFlags
	setTestFlag(t, &states, "Down")
	if _, err := marshalAPIWebhookUpdateBody(apiWebhookUpdateOptions{states: states, clearStates: true}); err == nil {
		t.Fatal("states conflict error = nil, want error")
	}
}

// TestAPIWebhookPaths checks the action path and delivery-retry path builders.
func TestAPIWebhookPaths(t *testing.T) {
	got, err := apiWebhookPath("7", "rotate-secret")
	if err != nil {
		t.Fatalf("apiWebhookPath() error = %v", err)
	}
	if got != "/api/v1/webhooks/7/rotate-secret" {
		t.Fatalf("path = %q, want rotate-secret path", got)
	}

	got, err = apiWebhookRetryPath("7", "44")
	if err != nil {
		t.Fatalf("apiWebhookRetryPath() error = %v", err)
	}
	if got != "/api/v1/webhooks/7/deliveries/44/retry" {
		t.Fatalf("retry path = %q, want delivery retry path", got)
	}
}

// TestAPIWebhookDeliveriesPath checks that cursor/limit/status filters all
// land in the query string of the deliveries path.
func TestAPIWebhookDeliveriesPath(t *testing.T) {
	got, err := apiWebhookDeliveriesPath("7", apiWebhookDeliveriesFilters{
		cursor: "cur-4",
		limit:  25,
		status: "abandoned",
	})
	if err != nil {
		t.Fatalf("apiWebhookDeliveriesPath() error = %v", err)
	}
	u, err := url.Parse(got)
	if err != nil {
		t.Fatalf("parse path: %v", err)
	}
	if u.Path != "/api/v1/webhooks/7/deliveries" {
		t.Fatalf("path = %q, want deliveries path", u.Path)
	}
	for key, want := range map[string]string{
		"cursor": "cur-4",
		"limit":  "25",
		"status": "abandoned",
	} {
		if got := u.Query().Get(key); got != want {
			t.Fatalf("query %s = %q, want %q", key, got, want)
		}
	}
}

// TestAPIWebhookDeliveriesPathRejectsBadStatus checks that an unknown status
// filter value is rejected.
func TestAPIWebhookDeliveriesPathRejectsBadStatus(t *testing.T) {
	_, err := apiWebhookDeliveriesPath("7", apiWebhookDeliveriesFilters{status: "waiting"})
	if err == nil {
		t.Fatal("apiWebhookDeliveriesPath() error = nil, want bad status error")
	}
}

// assertStringArray fails the test unless got is a JSON array ([]any) whose
// elements equal want, in order.
func assertStringArray(t *testing.T, got any, want []string) {
	t.Helper()
	items, ok := got.([]any)
	if !ok {
		t.Fatalf("value = %#v, want JSON array", got)
	}
	if len(items) != len(want) {
		t.Fatalf("array len = %d, want %d: %#v", len(items), len(want), items)
	}
	for i, wantItem := range want {
		if items[i] != wantItem {
			t.Fatalf("array[%d] = %#v, want %q", i, items[i], wantItem)
		}
	}
}

// assertNumberArray is assertStringArray for numeric arrays; JSON numbers
// decode to float64, so each element is compared against float64(want[i]).
func assertNumberArray(t *testing.T, got any, want []int64) {
	t.Helper()
	items, ok := got.([]any)
	if !ok {
		t.Fatalf("value = %#v, want JSON array", got)
	}
	if len(items) != len(want) {
		t.Fatalf("array len = %d, want %d: %#v", len(items), len(want), items)
	}
	for i, wantItem := range want {
		if items[i] != float64(wantItem) {
			t.Fatalf("array[%d] = %#v, want %d", i, items[i], wantItem)
		}
	}
}
// Defaults and fixed identifiers for the `api smoke` workflow.
const (
	// apiCLIBatchHeader tags smoke-created sites so they are identifiable
	// as CLI test resources.
	apiCLIBatchHeader      = "X-Jetmon-CLI-Batch"
	apiSmokeDefaultURL     = "https://example.com/"
	apiSmokeDefaultKeyword = "Example Domain"
	// The reserved .invalid TLD guarantees test alerts never reach a real mailbox.
	apiSmokeAlertTestEmail  = "jetmon-api-cli@example.invalid"
	apiSmokeDefaultExercise = "alert-contact"
	apiSmokeWebhookEvent    = "event.opened"
	apiSmokeWebhookState    = "Seems Down"
	apiSmokeWebhookMode     = "http-500"

	// Docker-local webhook fixture endpoints; overridable via
	// JETMON_API_WEBHOOK_FIXTURE_* environment variables (see cmdAPISmoke).
	defaultAPIFixtureWebhookURL         = "http://api-fixture:8091/webhook"
	defaultAPIFixtureWebhookRequestsURL = "http://localhost:18091/webhook/requests"
)

// apiSmokeOptions holds the parsed `api smoke` flag values.
type apiSmokeOptions struct {
	batch                string        // stable label for generated test resources
	blogID               int64         // explicit blog_id; 0 derives a slot from batch
	url                  string        // site monitor URL to create
	cleanup              bool          // delete smoke-created resources before exit
	exercise             string        // alert-contact, webhook, or none
	idempotencyKeyPrefix string        // prefix for POST Idempotency-Key headers; empty disables
	webhookURL           string        // receiver URL registered for --exercise=webhook
	webhookRequestsURL   string        // local fixture endpoint polled for received deliveries
	webhookWait          time.Duration // max wait for webhook delivery
	webhookPollInterval  time.Duration // poll cadence for delivery checks
	fixtureURL           string        // Docker fixture monitor URL, auto, or off
	fixtureProbeURL      string        // probe URL used when fixtureURL is auto
	allowExternalWebhook bool          // permit non-local webhook receiver URLs
}

// apiSmokeSummary is the JSON report written at the end of a smoke run
// (also on step failure, with the failed step recorded in Steps).
type apiSmokeSummary struct {
	Batch             string                         `json:"batch"`
	BlogID            int64                          `json:"blog_id"`
	BaseURL           string                         `json:"base_url"`
	Cleanup           bool                           `json:"cleanup"`
	Steps             []apiSmokeStep                 `json:"steps"`
	Site              json.RawMessage                `json:"site,omitempty"`
	TriggerNow        json.RawMessage                `json:"trigger_now,omitempty"`
	Events            json.RawMessage                `json:"events,omitempty"`
	AlertContact      json.RawMessage                `json:"alert_contact,omitempty"`
	AlertTest         json.RawMessage                `json:"alert_test,omitempty"`
	Webhook           *apiSmokeWebhookSummary        `json:"webhook,omitempty"`
	WebhookDelivery   json.RawMessage                `json:"webhook_delivery,omitempty"`
	WebhookFixture    *apiSmokeWebhookFixtureSummary `json:"webhook_fixture,omitempty"`
	FailureSimulation *apiSimulatedSiteResult        `json:"failure_simulation,omitempty"`
	CleanupResults    []apiSmokeCleanupResult        `json:"cleanup_results,omitempty"`
}

// apiSmokeStep records one workflow stage outcome ("ok" or "failed").
type apiSmokeStep struct {
	Name   string `json:"name"`
	Status string `json:"status"`
	Detail string `json:"detail,omitempty"`
}

// apiSmokeCleanupResult records the deletion attempt for one created resource.
type apiSmokeCleanupResult struct {
	Resource string `json:"resource"`
	ID       int64  `json:"id"`
	Status   string `json:"status"`
	Error    string `json:"error,omitempty"`
}

// apiSmokeWebhookSummary is the redacted webhook view placed in the summary
// (URL secrets stripped; only a secret preview is kept).
type apiSmokeWebhookSummary struct {
	ID            int64    `json:"id"`
	URL           string   `json:"url"`
	Active        bool     `json:"active"`
	Events        []string `json:"events,omitempty"`
	SecretPreview string   `json:"secret_preview,omitempty"`
}
// apiSmokeWebhookFixtureSummary reports what the local webhook fixture saw:
// total requests received plus the first delivery that matched the smoke
// site with a verified signature.
type apiSmokeWebhookFixtureSummary struct {
	Requests          int    `json:"requests"`
	MatchedDeliveryID string `json:"matched_delivery_id,omitempty"`
	MatchedEvent      string `json:"matched_event,omitempty"`
	SignatureVerified bool   `json:"signature_verified"`
}

// apiSmokeFixtureResponse mirrors the fixture's /webhook/requests payload.
type apiSmokeFixtureResponse struct {
	Count    int                         `json:"count"`
	Requests []apiSmokeFixtureWebhookHit `json:"requests"`
}

// apiSmokeFixtureWebhookHit is one webhook request recorded by the fixture.
type apiSmokeFixtureWebhookHit struct {
	ID             int    `json:"id"`
	Event          string `json:"event,omitempty"`
	Delivery       string `json:"delivery,omitempty"`
	Signature      string `json:"signature,omitempty"`
	SignatureValid *bool  `json:"signature_valid,omitempty"`
	Body           string `json:"body"`
}

// apiWorkflowHTTPError describes a non-success API response, carrying the
// request method, target path, status line, and raw body for the message.
type apiWorkflowHTTPError struct {
	Method string
	Target string
	Status string
	Body   []byte
}

// Error renders "METHOD target returned STATUS[: body]". The body detail is
// whitespace-trimmed and capped at 300 bytes (with a "..." suffix) so large
// responses stay readable; an empty body omits the detail entirely.
func (e apiWorkflowHTTPError) Error() string {
	var msg strings.Builder
	fmt.Fprintf(&msg, "%s %s returned %s", e.Method, e.Target, e.Status)
	detail := strings.TrimSpace(string(e.Body))
	if len(detail) > 300 {
		detail = detail[:300] + "..."
	}
	if detail != "" {
		msg.WriteString(": ")
		msg.WriteString(detail)
	}
	return msg.String()
}
// cmdAPISmoke parses the `jetmon2 api smoke` flags and runs the smoke
// workflow. Fixture-related defaults may be overridden through the
// JETMON_API_* environment variables before flags are applied.
func cmdAPISmoke(args []string) error {
	opts := defaultAPIOptions()
	fs := newAPIFlagSet("api smoke", &opts)
	smoke := apiSmokeOptions{
		url:      apiSmokeDefaultURL,
		cleanup:  true,
		exercise: apiSmokeDefaultExercise,
		webhookURL: envOrDefault(
			"JETMON_API_WEBHOOK_FIXTURE_URL",
			defaultAPIFixtureWebhookURL,
		),
		webhookRequestsURL: envOrDefault(
			"JETMON_API_WEBHOOK_FIXTURE_REQUESTS_URL",
			defaultAPIFixtureWebhookRequestsURL,
		),
		webhookWait:         60 * time.Second,
		webhookPollInterval: 2 * time.Second,
		fixtureURL:          envOrDefault("JETMON_API_FIXTURE_URL", apiFixtureAuto),
		fixtureProbeURL: envOrDefault(
			"JETMON_API_FIXTURE_PROBE_URL",
			defaultAPIFixtureProbeURL,
		),
	}
	fs.StringVar(&smoke.batch, "batch", "", "stable batch label for generated test resources")
	fs.Int64Var(&smoke.blogID, "blog-id", 0, "specific blog_id to create; default derives from --batch")
	fs.StringVar(&smoke.url, "url", smoke.url, "site monitor URL to create")
	fs.BoolVar(&smoke.cleanup, "cleanup", smoke.cleanup, "delete smoke-created resources before exit")
	fs.StringVar(&smoke.exercise, "exercise", smoke.exercise, "extra path to exercise: alert-contact, webhook, or none")
	fs.StringVar(&smoke.idempotencyKeyPrefix, "idempotency-key-prefix", "", "prefix for smoke POST Idempotency-Key headers")
	fs.StringVar(&smoke.webhookURL, "webhook-url", smoke.webhookURL, "receiver URL to register when --exercise=webhook")
	fs.StringVar(&smoke.webhookRequestsURL, "webhook-requests-url", smoke.webhookRequestsURL, "local fixture requests URL to poll when --exercise=webhook")
	fs.DurationVar(&smoke.webhookWait, "webhook-wait", smoke.webhookWait, "maximum wait for webhook delivery when --exercise=webhook")
	fs.DurationVar(&smoke.webhookPollInterval, "webhook-poll-interval", smoke.webhookPollInterval, "poll interval for webhook delivery checks")
	fs.StringVar(&smoke.fixtureURL, "fixture-url", smoke.fixtureURL, "Docker fixture monitor URL, auto, or off when --exercise=webhook")
	fs.StringVar(&smoke.fixtureProbeURL, "fixture-probe-url", smoke.fixtureProbeURL, "URL used when --fixture-url=auto")
	fs.BoolVar(&smoke.allowExternalWebhook, "allow-external-webhook-url", false, "allow --exercise=webhook to register a receiver URL outside localhost, loopback, or api-fixture")
	if err := parseAPIFlags(fs, args); err != nil {
		return err
	}
	if fs.NArg() != 0 {
		return errors.New("usage: jetmon2 api smoke [flags]")
	}
	return runAPISmoke(context.Background(), nil, opts, smoke)
}

// runAPISmoke executes the end-to-end smoke workflow:
// health -> me -> create site -> trigger-now -> events, then the optional
// alert-contact or webhook exercise. Every step is recorded in an
// apiSmokeSummary written to opts.out as the final (or failure) output, and
// when cleanup is enabled all created resources are deleted — on success and
// on step failure alike.
//
// client is forwarded to doAPIRequest; cmdAPISmoke passes nil, so the
// underlying helper presumably supplies a default client — TODO confirm.
func runAPISmoke(ctx context.Context, client *http.Client, opts apiCLIOptions, smoke apiSmokeOptions) error {
	if opts.out == nil {
		opts.out = io.Discard
	}
	// Remote (non-local) API targets require an explicit --batch so test
	// resources remain identifiable.
	remote, err := requireAPILocalOrAllowRemote(opts, opts.allowRemote, "api smoke")
	if err != nil {
		return err
	}
	if remote && strings.TrimSpace(smoke.batch) == "" {
		return errors.New("api smoke requires --batch when --allow-remote targets a non-local API")
	}
	// Fill in defaults for any zero-valued option so runAPISmoke can also be
	// called directly (e.g. from tests) with a sparse apiSmokeOptions.
	if smoke.batch == "" {
		smoke.batch = apiCLINewBatchID("smoke")
	}
	if smoke.blogID == 0 {
		smoke.blogID = apiCLIBatchBlogIDStart(smoke.batch)
	}
	if smoke.url == "" {
		smoke.url = apiSmokeDefaultURL
	}
	if smoke.exercise == "" {
		smoke.exercise = apiSmokeDefaultExercise
	}
	if smoke.webhookURL == "" {
		smoke.webhookURL = defaultAPIFixtureWebhookURL
	}
	if smoke.webhookRequestsURL == "" {
		smoke.webhookRequestsURL = defaultAPIFixtureWebhookRequestsURL
	}
	if smoke.webhookWait == 0 {
		smoke.webhookWait = 60 * time.Second
	}
	if smoke.webhookPollInterval == 0 {
		smoke.webhookPollInterval = 2 * time.Second
	}
	if smoke.fixtureURL == "" {
		smoke.fixtureURL = apiFixtureAuto
	}
	if smoke.fixtureProbeURL == "" {
		smoke.fixtureProbeURL = defaultAPIFixtureProbeURL
	}
	if smoke.exercise != "alert-contact" && smoke.exercise != "webhook" && smoke.exercise != "none" {
		return errors.New("exercise must be one of: alert-contact, webhook, none")
	}
	// The webhook exercise is restricted to Docker-local setups with a
	// local/allow-listed receiver URL.
	if remote && smoke.exercise == "webhook" {
		return errors.New("api smoke --exercise webhook is Docker-local only and refuses non-local API targets")
	}
	if smoke.exercise == "webhook" {
		if err := requireAPIWebhookFixtureURLAllowed(smoke.webhookURL, smoke.allowExternalWebhook); err != nil {
			return err
		}
		if err := requireAPIWebhookFixtureRequestsLocal(smoke.webhookRequestsURL); err != nil {
			return err
		}
	}
	// These run after defaulting, so they only catch explicitly negative values.
	if smoke.webhookWait <= 0 {
		return errors.New("webhook-wait must be positive")
	}
	if smoke.webhookPollInterval <= 0 {
		return errors.New("webhook-poll-interval must be positive")
	}

	summary := apiSmokeSummary{
		Batch:   smoke.batch,
		BlogID:  smoke.blogID,
		BaseURL: opts.baseURL,
		Cleanup: smoke.cleanup,
	}
	// Closure state shared between steps and cleanup.
	var createdContactID int64
	var createdWebhookID int64
	siteCreated := false

	// cleanup best-effort-deletes created resources in reverse dependency
	// order (webhook, alert contact, site); failures are recorded in the
	// summary rather than returned.
	cleanup := func() {
		if !smoke.cleanup {
			return
		}
		if createdWebhookID > 0 {
			target := "/api/v1/webhooks/" + strconv.FormatInt(createdWebhookID, 10)
			err := apiWorkflowDelete(ctx, client, opts, target)
			result := apiSmokeCleanupResult{Resource: "webhook", ID: createdWebhookID, Status: "deleted"}
			if err != nil {
				result.Status = "failed"
				result.Error = err.Error()
			}
			summary.CleanupResults = append(summary.CleanupResults, result)
		}
		if createdContactID > 0 {
			target := "/api/v1/alert-contacts/" + strconv.FormatInt(createdContactID, 10)
			err := apiWorkflowDelete(ctx, client, opts, target)
			result := apiSmokeCleanupResult{Resource: "alert_contact", ID: createdContactID, Status: "deleted"}
			if err != nil {
				result.Status = "failed"
				result.Error = err.Error()
			}
			summary.CleanupResults = append(summary.CleanupResults, result)
		}
		if siteCreated {
			target := "/api/v1/sites/" + strconv.FormatInt(smoke.blogID, 10)
			err := apiWorkflowDelete(ctx, client, opts, target)
			result := apiSmokeCleanupResult{Resource: "site", ID: smoke.blogID, Status: "deleted"}
			if err != nil {
				result.Status = "failed"
				result.Error = err.Error()
			}
			summary.CleanupResults = append(summary.CleanupResults, result)
		}
	}

	// step runs one named stage; on failure it records the step, runs
	// cleanup, writes the summary (best-effort; write error ignored), and
	// wraps the underlying error.
	step := func(name string, fn func() error) error {
		if err := fn(); err != nil {
			summary.Steps = append(summary.Steps, apiSmokeStep{Name: name, Status: "failed", Detail: err.Error()})
			cleanup()
			_ = writeAPIValueOutput(opts.out, summary, opts)
			return fmt.Errorf("smoke %s failed: %w", name, err)
		}
		summary.Steps = append(summary.Steps, apiSmokeStep{Name: name, Status: "ok"})
		return nil
	}

	if err := step("health", func() error {
		_, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodGet, "/api/v1/health", nil, "")
		return err
	}); err != nil {
		return err
	}
	if err := step("me", func() error {
		_, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodGet, "/api/v1/me", nil, "")
		return err
	}); err != nil {
		return err
	}
	if err := step("create_site", func() error {
		keyword := apiSmokeDefaultKeyword
		redirectPolicy := "follow"
		checkInterval := 5
		// The batch header marks the site as a CLI smoke resource.
		headers := map[string]string{apiCLIBatchHeader: smoke.batch}
		site, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodPost, "/api/v1/sites", apiSiteCreateRequest{
			BlogID:         smoke.blogID,
			MonitorURL:     smoke.url,
			CheckKeyword:   &keyword,
			RedirectPolicy: &redirectPolicy,
			CheckInterval:  &checkInterval,
			CustomHeaders:  &headers,
		}, apiSmokeIDKey(smoke, "create-site"))
		if err != nil {
			return err
		}
		siteCreated = true
		summary.Site = site
		return nil
	}); err != nil {
		return err
	}
	if err := step("trigger_now", func() error {
		body, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodPost, fmt.Sprintf("/api/v1/sites/%d/trigger-now", smoke.blogID), nil, apiSmokeIDKey(smoke, "trigger-now"))
		if err != nil {
			return err
		}
		summary.TriggerNow = body
		return nil
	}); err != nil {
		return err
	}
	if err := step("events", func() error {
		body, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodGet, fmt.Sprintf("/api/v1/sites/%d/events?limit=5", smoke.blogID), nil, "")
		if err != nil {
			return err
		}
		summary.Events = body
		return nil
	}); err != nil {
		return err
	}
	if smoke.exercise == "alert-contact" {
		if err := step("create_alert_contact", func() error {
			contact, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodPost, "/api/v1/alert-contacts", apiAlertContactCreateRequest{
				Label:       "api-cli-smoke-" + smoke.batch,
				Transport:   "email",
				Destination: json.RawMessage(`{"address":"` + apiSmokeAlertTestEmail + `"}`),
				SiteFilter:  apiAlertContactSiteFilter{SiteIDs: []int64{smoke.blogID}},
			}, apiSmokeIDKey(smoke, "create-alert-contact"))
			if err != nil {
				return err
			}
			id, err := apiJSONInt64(contact, "id")
			if err != nil {
				return err
			}
			createdContactID = id
			summary.AlertContact = contact
			return nil
		}); err != nil {
			return err
		}
		if err := step("alert_contact_test", func() error {
			body, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodPost, fmt.Sprintf("/api/v1/alert-contacts/%d/test", createdContactID), nil, apiSmokeIDKey(smoke, "alert-contact-test"))
			if err != nil {
				return err
			}
			summary.AlertTest = body
			return nil
		}); err != nil {
			return err
		}
	}
	if smoke.exercise == "webhook" {
		var webhookSecret string
		// Reset the fixture so only this run's deliveries are observed.
		if err := step("webhook_clear_fixture", func() error {
			return clearAPIWebhookFixtureRequests(ctx, client, opts, smoke.webhookRequestsURL)
		}); err != nil {
			return err
		}
		// Create the webhook inactive first; it is activated only after the
		// fixture URL has been rewritten to carry the signing secret.
		if err := step("create_webhook", func() error {
			active := false
			hook, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodPost, "/api/v1/webhooks", apiWebhookCreateRequest{
				URL:    strings.TrimSpace(smoke.webhookURL),
				Active: &active,
				Events: []string{apiSmokeWebhookEvent},
				SiteFilter: apiWebhookSiteFilter{
					SiteIDs: []int64{smoke.blogID},
				},
				StateFilter: apiWebhookStateFilter{
					States: []string{apiSmokeWebhookState},
				},
			}, apiSmokeIDKey(smoke, "create-webhook"))
			if err != nil {
				return err
			}
			id, err := apiJSONInt64(hook, "id")
			if err != nil {
				return err
			}
			secret, err := apiJSONString(hook, "secret")
			if err != nil {
				return err
			}
			createdWebhookID = id
			webhookSecret = secret
			summary.Webhook = redactedAPIWebhookSummary(hook)
			return nil
		}); err != nil {
			return err
		}
		if err := step("activate_webhook_signature_fixture", func() error {
			signedURL, err := apiWebhookFixtureURLWithSecret(smoke.webhookURL, webhookSecret)
			if err != nil {
				return err
			}
			active := true
			body, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodPatch, fmt.Sprintf("/api/v1/webhooks/%d", createdWebhookID), apiWebhookUpdateRequest{
				URL:    &signedURL,
				Active: &active,
			}, "")
			if err != nil {
				// The PATCH body contains the secret; scrub it from errors.
				return redactAPISecretError(err, webhookSecret)
			}
			summary.Webhook = redactedAPIWebhookSummary(body)
			return nil
		}); err != nil {
			return err
		}
		if err := step("simulate_failure_for_webhook", func() error {
			result, err := runAPISmokeWebhookFailureSimulation(ctx, client, opts, smoke)
			if err != nil {
				return err
			}
			summary.FailureSimulation = &result
			return nil
		}); err != nil {
			return err
		}
		if err := step("webhook_fixture_delivery", func() error {
			fixture, err := waitForAPIWebhookFixtureDelivery(ctx, client, opts, smoke)
			if err != nil {
				return err
			}
			summary.WebhookFixture = fixture
			return nil
		}); err != nil {
			return err
		}
		if err := step("webhook_delivery_row", func() error {
			body, err := waitForAPIWebhookDeliveredRow(ctx, client, opts, createdWebhookID, smoke, summary.WebhookFixture.MatchedDeliveryID)
			if err != nil {
				return err
			}
			summary.WebhookDelivery = body
			return nil
		}); err != nil {
			return err
		}
	}

	cleanup()
	return writeAPIValueOutput(opts.out, summary, opts)
}

// apiWorkflowRequestJSON marshals body (when non-nil), performs the API
// request with the given idempotency key, and returns the trimmed response
// body as raw JSON (or the literal `null` for an empty body). Responses with
// status >= 400 return both the body and an apiWorkflowHTTPError.
func apiWorkflowRequestJSON(ctx context.Context, client *http.Client, opts apiCLIOptions, method, target string, body any, idempotencyKey string) (json.RawMessage, error) {
	var payload []byte
	var err error
	if body != nil {
		payload, err = json.Marshal(body)
		if err != nil {
			return nil, err
		}
	}
	requestOpts := opts
	requestOpts.idempotencyKey = idempotencyKey
	resp, err := doAPIRequest(ctx, client, requestOpts, method, target, payload)
	if err != nil {
		return nil, err
	}
	trimmed := json.RawMessage(strings.TrimSpace(string(resp.Body)))
	if len(trimmed) == 0 {
		trimmed = json.RawMessage(`null`)
	}
	if resp.StatusCode >= 400 {
		return trimmed, apiWorkflowHTTPError{Method: method, Target: target, Status: resp.Status, Body: resp.Body}
	}
	return trimmed, nil
}

// apiWorkflowDelete performs a DELETE against the API and converts any
// status >= 400 into an apiWorkflowHTTPError.
func apiWorkflowDelete(ctx context.Context, client *http.Client, opts apiCLIOptions, target string) error {
	resp, err := doAPIRequest(ctx, client, opts, http.MethodDelete, target, nil)
	if err != nil {
		return err
	}
	if resp.StatusCode >= 400 {
		return apiWorkflowHTTPError{Method: http.MethodDelete, Target: target, Status: resp.Status, Body: resp.Body}
	}
	return nil
}
// runAPISmokeWebhookFailureSimulation drives a single http-500 failure
// simulation for the smoke site so the (already-activated) webhook fires.
// It requires the Docker API fixture and expects the site to transition to
// the "Seems Down" state with an "opened" reason and severity 3.
func runAPISmokeWebhookFailureSimulation(ctx context.Context, client *http.Client, opts apiCLIOptions, smoke apiSmokeOptions) (apiSimulatedSiteResult, error) {
	sim := apiSitesSimulateFailureOptions{
		mode:                   apiSmokeWebhookMode,
		batch:                  smoke.batch,
		count:                  1,
		blogIDStart:            smoke.blogID,
		createMissing:          false,
		trigger:                true,
		wait:                   smoke.webhookWait,
		pollInterval:           smoke.webhookPollInterval,
		idempotencyKeyPrefix:   smoke.idempotencyKeyPrefix,
		fixtureURL:             smoke.fixtureURL,
		fixtureProbeURL:        smoke.fixtureProbeURL,
		expectEventState:       apiSmokeWebhookState,
		requireTransition:      true,
		expectTransitionReason: "opened",
	}
	sim.expectEventSeverity.set = true
	sim.expectEventSeverity.value = 3
	fixtureURL := apiSimulationFixtureURL(ctx, sim)
	if fixtureURL == "" {
		return apiSimulatedSiteResult{}, errors.New("Docker API fixture is required for --exercise=webhook; start api-fixture or pass --fixture-url")
	}
	def, err := apiFailureMode(sim.mode, fixtureURL)
	if err != nil {
		return apiSimulatedSiteResult{}, err
	}
	return runAPISiteSimulation(ctx, client, opts, sim, def, smoke.blogID, 0)
}

// clearAPIWebhookFixtureRequests DELETEs the fixture's recorded requests so
// a run only observes its own deliveries. The error body read is capped at
// 300 bytes to match apiWorkflowHTTPError's display limit.
func clearAPIWebhookFixtureRequests(ctx context.Context, client *http.Client, opts apiCLIOptions, requestsURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodDelete, strings.TrimSpace(requestsURL), nil)
	if err != nil {
		return err
	}
	resp, err := apiExternalHTTPClient(client, opts).Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		body, _ := io.ReadAll(io.LimitReader(resp.Body, 300))
		return apiWorkflowHTTPError{Method: http.MethodDelete, Target: requestsURL, Status: resp.Status, Body: body}
	}
	return nil
}

// waitForAPIWebhookFixtureDelivery polls the fixture until it has recorded a
// signature-verified delivery for the smoke site, or webhookWait elapses.
// Context cancellation aborts the poll between attempts.
func waitForAPIWebhookFixtureDelivery(ctx context.Context, client *http.Client, opts apiCLIOptions, smoke apiSmokeOptions) (*apiSmokeWebhookFixtureSummary, error) {
	deadline := time.Now().Add(smoke.webhookWait)
	for {
		fixture, err := getAPIWebhookFixtureRequests(ctx, client, opts, smoke.webhookRequestsURL)
		if err != nil {
			return nil, err
		}
		if summary := matchingAPIWebhookFixtureDelivery(fixture, smoke.blogID); summary != nil {
			return summary, nil
		}
		// Deadline is checked after each attempt, so at least one poll
		// always happens.
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("timed out waiting for verified webhook fixture delivery for site %d", smoke.blogID)
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(smoke.webhookPollInterval):
		}
	}
}

// getAPIWebhookFixtureRequests GETs and decodes the fixture's recorded
// webhook requests.
func getAPIWebhookFixtureRequests(ctx context.Context, client *http.Client, opts apiCLIOptions, requestsURL string) (apiSmokeFixtureResponse, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, strings.TrimSpace(requestsURL), nil)
	if err != nil {
		return apiSmokeFixtureResponse{}, err
	}
	resp, err := apiExternalHTTPClient(client, opts).Do(req)
	if err != nil {
		return apiSmokeFixtureResponse{}, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return apiSmokeFixtureResponse{}, err
	}
	if resp.StatusCode >= 400 {
		return apiSmokeFixtureResponse{}, apiWorkflowHTTPError{Method: http.MethodGet, Target: requestsURL, Status: resp.Status, Body: body}
	}
	var fixture apiSmokeFixtureResponse
	if err := json.Unmarshal(body, &fixture); err != nil {
		return apiSmokeFixtureResponse{}, err
	}
	return fixture, nil
}

// matchingAPIWebhookFixtureDelivery returns a summary for the first recorded
// request that (a) has a verified signature, (b) carries the expected event
// type for the given site, and (c) has a non-empty delivery ID. Returns nil
// when nothing matches yet.
func matchingAPIWebhookFixtureDelivery(fixture apiSmokeFixtureResponse, siteID int64) *apiSmokeWebhookFixtureSummary {
	for _, req := range fixture.Requests {
		if req.SignatureValid == nil || !*req.SignatureValid {
			continue
		}
		var body struct {
			Type   string `json:"type"`
			SiteID int64  `json:"site_id"`
		}
		if err := json.Unmarshal([]byte(req.Body), &body); err != nil {
			continue
		}
		if body.Type != apiSmokeWebhookEvent || body.SiteID != siteID {
			continue
		}
		if strings.TrimSpace(req.Delivery) == "" {
			continue
		}
		return &apiSmokeWebhookFixtureSummary{
			Requests:          fixture.Count,
			MatchedDeliveryID: req.Delivery,
			MatchedEvent:      req.Event,
			SignatureVerified: true,
		}
	}
	return nil
}
// waitForAPIWebhookDeliveredRow polls the API's delivery log until it shows
// a delivered row for the smoke site (and, when non-empty, the specific
// delivery ID observed by the fixture), or webhookWait elapses.
func waitForAPIWebhookDeliveredRow(ctx context.Context, client *http.Client, opts apiCLIOptions, webhookID int64, smoke apiSmokeOptions, expectedDeliveryID string) (json.RawMessage, error) {
	deadline := time.Now().Add(smoke.webhookWait)
	target := fmt.Sprintf("/api/v1/webhooks/%d/deliveries?status=delivered&limit=10", webhookID)
	for {
		body, err := apiWorkflowRequestJSON(ctx, client, opts, http.MethodGet, target, nil, "")
		if err != nil {
			return nil, err
		}
		if apiDeliveredWebhookRowsIncludeSite(body, smoke.blogID, expectedDeliveryID) {
			return body, nil
		}
		if time.Now().After(deadline) {
			if strings.TrimSpace(expectedDeliveryID) != "" {
				return nil, fmt.Errorf("timed out waiting for delivered webhook row %s for webhook %d and site %d", expectedDeliveryID, webhookID, smoke.blogID)
			}
			return nil, fmt.Errorf("timed out waiting for delivered webhook row for webhook %d and site %d", webhookID, smoke.blogID)
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(smoke.webhookPollInterval):
		}
	}
}

// apiDeliveredWebhookRowsIncludeSite reports whether the deliveries envelope
// contains a delivered row whose payload matches the smoke event type and
// site, and whose row ID matches the expected delivery ID (when given).
// Undecodable envelopes or payloads are treated as "no match".
func apiDeliveredWebhookRowsIncludeSite(body json.RawMessage, siteID int64, expectedDeliveryID string) bool {
	var envelope struct {
		Data []struct {
			ID      int64           `json:"id"`
			Status  string          `json:"status"`
			Payload json.RawMessage `json:"payload"`
		} `json:"data"`
	}
	if err := json.Unmarshal(body, &envelope); err != nil {
		return false
	}
	for _, row := range envelope.Data {
		if row.Status != "delivered" {
			continue
		}
		var payload struct {
			Type   string `json:"type"`
			SiteID int64  `json:"site_id"`
		}
		if err := json.Unmarshal(row.Payload, &payload); err != nil {
			continue
		}
		if payload.Type == apiSmokeWebhookEvent && payload.SiteID == siteID && apiDeliveryIDMatches(row.ID, expectedDeliveryID) {
			return true
		}
	}
	return false
}

// apiDeliveryIDMatches reports whether rowID equals the expected delivery
// ID. An empty expected ID matches any row; a non-numeric one matches none.
func apiDeliveryIDMatches(rowID int64, expectedDeliveryID string) bool {
	expectedDeliveryID = strings.TrimSpace(expectedDeliveryID)
	if expectedDeliveryID == "" {
		return true
	}
	expected, err := strconv.ParseInt(expectedDeliveryID, 10, 64)
	if err != nil {
		return false
	}
	return rowID == expected
}

// apiWebhookFixtureURLWithSecret returns rawURL with the signing secret set
// as the "secret" query parameter; the URL must be absolute and the secret
// non-blank.
func apiWebhookFixtureURLWithSecret(rawURL, secret string) (string, error) {
	if strings.TrimSpace(secret) == "" {
		return "", errors.New("webhook secret is empty")
	}
	u, err := url.Parse(strings.TrimSpace(rawURL))
	if err != nil {
		return "", err
	}
	if !u.IsAbs() || u.Host == "" {
		return "", errors.New("webhook-url must be absolute")
	}
	q := u.Query()
	q.Set("secret", secret)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// apiExternalHTTPClient returns the injected client when present (tests),
// otherwise a fresh client whose timeout mirrors the CLI timeout (defaulting
// to 10s). Used for fixture traffic, which bypasses the API helpers.
func apiExternalHTTPClient(client *http.Client, opts apiCLIOptions) *http.Client {
	if client != nil {
		return client
	}
	timeout := opts.timeout
	if timeout <= 0 {
		timeout = 10 * time.Second
	}
	return &http.Client{Timeout: timeout}
}

// requireAPIWebhookFixtureURLAllowed enforces that the webhook receiver URL
// is absolute and points at localhost/loopback or the "api-fixture" host,
// unless the caller explicitly allowed external receivers.
func requireAPIWebhookFixtureURLAllowed(rawURL string, allowExternal bool) error {
	u, err := url.Parse(strings.TrimSpace(rawURL))
	if err != nil {
		return fmt.Errorf("invalid webhook-url: %w", err)
	}
	if !u.IsAbs() || u.Host == "" {
		return errors.New("webhook-url must be absolute")
	}
	if allowExternal {
		return nil
	}
	// Normalize the host (lowercase, drop a trailing FQDN dot) before the
	// api-fixture comparison.
	host := strings.ToLower(strings.TrimSuffix(u.Hostname(), "."))
	if host == "api-fixture" {
		return nil
	}
	local, err := isLocalAPIURL(rawURL)
	if err != nil {
		return fmt.Errorf("invalid webhook-url: %w", err)
	}
	if local {
		return nil
	}
	return fmt.Errorf("webhook-url must be localhost, loopback, or api-fixture for api smoke --exercise webhook; pass --allow-external-webhook-url to register %q", rawURL)
}

// requireAPIWebhookFixtureRequestsLocal enforces that the fixture polling
// endpoint is local; unlike the receiver URL there is no external override.
func requireAPIWebhookFixtureRequestsLocal(rawURL string) error {
	local, err := isLocalAPIURL(rawURL)
	if err != nil {
		return fmt.Errorf("invalid webhook-requests-url: %w", err)
	}
	if !local {
		return fmt.Errorf("webhook-requests-url must be local for api smoke --exercise webhook: %q", rawURL)
	}
	return nil
}
// apiJSONInt64 extracts a top-level integer field from a JSON object
// response body.
//
// Fix: decode with json.Number instead of the default float64 mapping.
// The float64 path silently truncated fractional values (1.5 -> 1) and lost
// precision for integers above 2^53; json.Number preserves full int64
// precision and rejects non-integers explicitly.
func apiJSONInt64(body json.RawMessage, field string) (int64, error) {
	dec := json.NewDecoder(strings.NewReader(string(body)))
	dec.UseNumber() // keep numbers as json.Number; avoids float64 rounding
	var obj map[string]any
	if err := dec.Decode(&obj); err != nil {
		return 0, err
	}
	raw, ok := obj[field]
	if !ok {
		return 0, fmt.Errorf("response missing %q", field)
	}
	num, ok := raw.(json.Number)
	if !ok {
		return 0, fmt.Errorf("response field %q is %T, want number", field, raw)
	}
	value, err := num.Int64()
	if err != nil {
		return 0, fmt.Errorf("response field %q is not a valid integer: %w", field, err)
	}
	return value, nil
}

// apiJSONString extracts a top-level string field from a JSON object
// response body, erroring when the field is missing or not a string.
func apiJSONString(body json.RawMessage, field string) (string, error) {
	var obj map[string]any
	if err := json.Unmarshal(body, &obj); err != nil {
		return "", err
	}
	raw, ok := obj[field]
	if !ok {
		return "", fmt.Errorf("response missing %q", field)
	}
	value, ok := raw.(string)
	if !ok {
		return "", fmt.Errorf("response field %q is %T, want string", field, raw)
	}
	return value, nil
}

// redactAPISecretError scrubs the secret (raw and query-escaped) from an
// error message. When a replacement happened, a flattened errors.New is
// returned — the wrap chain is intentionally dropped so the secret cannot
// resurface through errors.Unwrap/Is/As.
func redactAPISecretError(err error, secret string) error {
	if err == nil {
		return nil
	}
	secret = strings.TrimSpace(secret)
	if secret == "" {
		return err
	}
	msg := err.Error()
	msg = strings.ReplaceAll(msg, secret, "redacted")
	msg = strings.ReplaceAll(msg, url.QueryEscape(secret), "redacted")
	if msg == err.Error() {
		return err
	}
	return errors.New(msg)
}
q.Set("secret", "redacted") + u.RawQuery = q.Encode() + } + return u.String() +} + +func apiSmokeIDKey(smoke apiSmokeOptions, suffix string) string { + if smoke.idempotencyKeyPrefix == "" { + return "" + } + return smoke.idempotencyKeyPrefix + "-" + suffix +} + +func apiCLINewBatchID(prefix string) string { + return fmt.Sprintf("%s-%s", prefix, time.Now().UTC().Format("20060102T150405Z")) +} + +func apiCLIBatchBlogIDStart(batch string) int64 { + h := fnv.New32a() + _, _ = h.Write([]byte(batch)) + // Reserve a deterministic 1,000-id slot in the high local-test range. + return 910000000 + int64(h.Sum32()%90000)*1000 +} diff --git a/cmd/jetmon2/api_cli_workflows_test.go b/cmd/jetmon2/api_cli_workflows_test.go new file mode 100644 index 00000000..75137f25 --- /dev/null +++ b/cmd/jetmon2/api_cli_workflows_test.go @@ -0,0 +1,477 @@ +package main + +import ( + "bytes" + "context" + "crypto/hmac" + "crypto/sha256" + "encoding/hex" + "encoding/json" + "fmt" + "io" + "net/http" + "net/http/httptest" + "strings" + "sync" + "testing" + "time" +) + +func TestRunAPISmokeHappyPath(t *testing.T) { + var calls []string + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + calls = append(calls, r.Method+" "+r.URL.Path) + if r.URL.Path != "/api/v1/health" && r.Header.Get("Authorization") != "Bearer token-123" { + t.Fatalf("missing auth for %s %s", r.Method, r.URL.Path) + } + switch { + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/health": + writeTestJSON(t, w, map[string]string{"status": "ok"}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/me": + writeTestJSON(t, w, map[string]any{"consumer_name": "api-cli-test", "scope": "admin"}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/sites": + var body map[string]any + decodeTestJSON(t, r, &body) + if body["blog_id"] != float64(910) { + t.Fatalf("blog_id = %#v, want 910", body["blog_id"]) + } + headers := body["custom_headers"].(map[string]any) + if 
headers[apiCLIBatchHeader] != "smoke-test" { + t.Fatalf("batch header = %#v, want smoke-test", headers[apiCLIBatchHeader]) + } + writeTestStatusJSON(t, w, http.StatusCreated, map[string]any{"id": 910, "blog_id": 910}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/sites/910/trigger-now": + writeTestJSON(t, w, map[string]any{"result": map[string]any{"success": true}}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/910/events": + writeTestJSON(t, w, map[string]any{"data": []any{}, "page": map[string]any{"limit": 5}}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/alert-contacts": + writeTestStatusJSON(t, w, http.StatusCreated, map[string]any{"id": 77, "label": "api-cli-smoke-smoke-test"}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/alert-contacts/77/test": + writeTestJSON(t, w, map[string]any{"contact_id": 77, "delivered": true}) + case r.Method == http.MethodDelete && r.URL.Path == "/api/v1/alert-contacts/77": + w.WriteHeader(http.StatusNoContent) + case r.Method == http.MethodDelete && r.URL.Path == "/api/v1/sites/910": + w.WriteHeader(http.StatusNoContent) + default: + t.Fatalf("unexpected request: %s %s", r.Method, r.URL.Path) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISmoke(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + token: "token-123", + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSmokeOptions{ + batch: "smoke-test", + blogID: 910, + url: "https://example.com/", + cleanup: true, + exercise: "alert-contact", + }) + if err != nil { + t.Fatalf("runAPISmoke() error = %v\nstdout=%s", err, stdout.String()) + } + + var summary apiSmokeSummary + if err := json.Unmarshal(stdout.Bytes(), &summary); err != nil { + t.Fatalf("unmarshal summary: %v\n%s", err, stdout.String()) + } + if summary.Batch != "smoke-test" || summary.BlogID != 910 { + t.Fatalf("summary batch/id = %q/%d", summary.Batch, summary.BlogID) + } + if 
len(summary.Steps) != 7 { + t.Fatalf("steps = %#v, want 7 steps", summary.Steps) + } + for _, step := range summary.Steps { + if step.Status != "ok" { + t.Fatalf("step %#v, want ok", step) + } + } + if len(summary.CleanupResults) != 2 { + t.Fatalf("cleanup results = %#v, want contact and site cleanup", summary.CleanupResults) + } + wantCalls := []string{ + "GET /api/v1/health", + "GET /api/v1/me", + "POST /api/v1/sites", + "POST /api/v1/sites/910/trigger-now", + "GET /api/v1/sites/910/events", + "POST /api/v1/alert-contacts", + "POST /api/v1/alert-contacts/77/test", + "DELETE /api/v1/alert-contacts/77", + "DELETE /api/v1/sites/910", + } + if strings.Join(calls, "\n") != strings.Join(wantCalls, "\n") { + t.Fatalf("calls:\n%s\nwant:\n%s", strings.Join(calls, "\n"), strings.Join(wantCalls, "\n")) + } +} + +func TestRunAPISmokeWebhookExercise(t *testing.T) { + const webhookSecret = "whsec_TESTSMOKESECRET" + + fixture := newSmokeWebhookFixture(t) + defer fixture.Close() + + var ( + calls []string + triggerCalls int + registeredURL string + ) + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + calls = append(calls, r.Method+" "+r.URL.Path) + if r.URL.Path != "/api/v1/health" && r.Header.Get("Authorization") != "Bearer token-123" { + t.Fatalf("missing auth for %s %s", r.Method, r.URL.Path) + } + switch { + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/health": + writeTestJSON(t, w, map[string]string{"status": "ok"}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/me": + writeTestJSON(t, w, map[string]any{"consumer_name": "api-cli-test", "scope": "admin"}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/sites": + writeTestStatusJSON(t, w, http.StatusCreated, map[string]any{"id": 910, "blog_id": 910}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/sites/910/trigger-now": + triggerCalls++ + if triggerCalls == 2 { + postSignedSmokeWebhook(t, registeredURL, webhookSecret, 
[]byte(`{"type":"event.opened","site_id":910}`)) + writeTestJSON(t, w, map[string]any{"result": map[string]any{"success": false, "http_code": 500}}) + return + } + writeTestJSON(t, w, map[string]any{"result": map[string]any{"success": true}}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/910/events" && r.URL.RawQuery == "limit=5": + writeTestJSON(t, w, map[string]any{"data": []any{}, "page": map[string]any{"limit": 5}}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/910" && r.URL.Query().Get("include_cli_metadata") == "true": + writeTestJSON(t, w, map[string]any{"id": 910, "blog_id": 910, "cli_batch": "smoke-webhook"}) + case r.Method == http.MethodPost && r.URL.Path == "/api/v1/webhooks": + var body map[string]any + decodeTestJSON(t, r, &body) + if body["url"] != fixture.URL+"/webhook" { + t.Fatalf("webhook url = %#v", body["url"]) + } + if body["active"] != false { + t.Fatalf("webhook active = %#v, want false until secret is registered", body["active"]) + } + writeTestStatusJSON(t, w, http.StatusCreated, map[string]any{ + "id": 88, + "url": fixture.URL + "/webhook", + "active": false, + "events": []string{apiSmokeWebhookEvent}, + "secret_preview": "whsec_TEST...", + "secret": webhookSecret, + }) + case r.Method == http.MethodPatch && r.URL.Path == "/api/v1/webhooks/88": + var body map[string]any + decodeTestJSON(t, r, &body) + registeredURL = body["url"].(string) + if !strings.Contains(registeredURL, "secret="+webhookSecret) { + t.Fatalf("registered URL did not include fixture secret: %q", registeredURL) + } + if body["active"] != true { + t.Fatalf("webhook active = %#v, want true", body["active"]) + } + writeTestJSON(t, w, map[string]any{ + "id": 88, + "url": registeredURL, + "active": true, + "events": []string{apiSmokeWebhookEvent}, + "secret_preview": "whsec_TEST...", + }) + case r.Method == http.MethodPatch && r.URL.Path == "/api/v1/sites/910": + var body map[string]any + decodeTestJSON(t, r, &body) + if 
!strings.Contains(fmt.Sprint(body["monitor_url"]), "/status/500") { + t.Fatalf("monitor_url = %#v, want fixture failure URL", body["monitor_url"]) + } + writeTestJSON(t, w, map[string]any{"id": 910, "blog_id": 910}) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/910/events" && r.URL.RawQuery == "active=true&limit=10": + writeTestJSON(t, w, map[string]any{ + "data": []any{ + map[string]any{"id": 321, "state": apiSmokeWebhookState, "severity": 3}, + }, + "page": map[string]any{"limit": 10}, + }) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/sites/910/events/321/transitions": + writeTestJSON(t, w, map[string]any{ + "data": []any{ + map[string]any{"id": 654, "event_id": 321, "reason": "opened", "state_after": apiSmokeWebhookState, "severity_after": 3}, + }, + "page": map[string]any{"limit": 50}, + }) + case r.Method == http.MethodGet && r.URL.Path == "/api/v1/webhooks/88/deliveries": + writeTestJSON(t, w, map[string]any{ + "data": []any{ + map[string]any{ + "id": 776, + "status": "delivered", + "event_id": 321, + "event_type": apiSmokeWebhookEvent, + "payload": map[string]any{"type": apiSmokeWebhookEvent, "site_id": 910}, + }, + map[string]any{ + "id": 777, + "status": "delivered", + "event_id": 321, + "event_type": apiSmokeWebhookEvent, + "payload": map[string]any{"type": apiSmokeWebhookEvent, "site_id": 910}, + }, + }, + "page": map[string]any{"limit": 10}, + }) + case r.Method == http.MethodDelete && r.URL.Path == "/api/v1/webhooks/88": + w.WriteHeader(http.StatusNoContent) + case r.Method == http.MethodDelete && r.URL.Path == "/api/v1/sites/910": + w.WriteHeader(http.StatusNoContent) + default: + t.Fatalf("unexpected request: %s %s?%s", r.Method, r.URL.Path, r.URL.RawQuery) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISmoke(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + token: "token-123", + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSmokeOptions{ + 
batch: "smoke-webhook", + blogID: 910, + url: "https://example.com/", + cleanup: true, + exercise: "webhook", + webhookURL: fixture.URL + "/webhook", + webhookRequestsURL: fixture.URL + "/webhook/requests", + webhookWait: 2 * time.Second, + webhookPollInterval: 10 * time.Millisecond, + fixtureURL: fixture.URL, + fixtureProbeURL: fixture.URL + "/health", + }) + if err != nil { + t.Fatalf("runAPISmoke() error = %v\nstdout=%s", err, stdout.String()) + } + + var summary apiSmokeSummary + if err := json.Unmarshal(stdout.Bytes(), &summary); err != nil { + t.Fatalf("unmarshal summary: %v\n%s", err, stdout.String()) + } + if summary.Webhook == nil || summary.Webhook.ID != 88 { + t.Fatalf("webhook summary = %#v, want webhook id 88", summary.Webhook) + } + if strings.Contains(summary.Webhook.URL, webhookSecret) { + t.Fatalf("webhook summary URL leaked raw secret: %q", summary.Webhook.URL) + } + if summary.WebhookFixture == nil || !summary.WebhookFixture.SignatureVerified { + t.Fatalf("fixture summary = %#v, want verified signature", summary.WebhookFixture) + } + if summary.FailureSimulation == nil || summary.FailureSimulation.TransitionCount != 1 { + t.Fatalf("failure simulation = %#v, want one transition", summary.FailureSimulation) + } + if len(summary.CleanupResults) != 2 { + t.Fatalf("cleanup results = %#v, want webhook and site cleanup", summary.CleanupResults) + } + + wantCalls := []string{ + "GET /api/v1/health", + "GET /api/v1/me", + "POST /api/v1/sites", + "POST /api/v1/sites/910/trigger-now", + "GET /api/v1/sites/910/events", + "POST /api/v1/webhooks", + "PATCH /api/v1/webhooks/88", + "GET /api/v1/sites/910", + "PATCH /api/v1/sites/910", + "POST /api/v1/sites/910/trigger-now", + "GET /api/v1/sites/910/events", + "GET /api/v1/sites/910/events/321/transitions", + "GET /api/v1/webhooks/88/deliveries", + "DELETE /api/v1/webhooks/88", + "DELETE /api/v1/sites/910", + } + if strings.Join(calls, "\n") != strings.Join(wantCalls, "\n") { + t.Fatalf("calls:\n%s\nwant:\n%s", 
strings.Join(calls, "\n"), strings.Join(wantCalls, "\n")) + } +} + +func TestRunAPISmokeWritesFailureSummary(t *testing.T) { + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch r.URL.Path { + case "/api/v1/health": + writeTestJSON(t, w, map[string]string{"status": "ok"}) + case "/api/v1/me": + writeTestStatusJSON(t, w, http.StatusUnauthorized, map[string]string{"error": "missing token"}) + default: + t.Fatalf("unexpected request: %s", r.URL.Path) + } + })) + defer srv.Close() + + var stdout bytes.Buffer + err := runAPISmoke(context.Background(), srv.Client(), apiCLIOptions{ + baseURL: srv.URL, + timeout: time.Second, + out: &stdout, + errOut: ioDiscard{}, + }, apiSmokeOptions{ + batch: "smoke-failure", + blogID: 911, + cleanup: true, + exercise: "none", + }) + if err == nil { + t.Fatal("runAPISmoke() error = nil, want auth failure") + } + var summary apiSmokeSummary + if err := json.Unmarshal(stdout.Bytes(), &summary); err != nil { + t.Fatalf("unmarshal summary: %v\n%s", err, stdout.String()) + } + if len(summary.Steps) != 2 { + t.Fatalf("steps = %#v, want health + failed me", summary.Steps) + } + if summary.Steps[1].Name != "me" || summary.Steps[1].Status != "failed" { + t.Fatalf("failed step = %#v, want me failed", summary.Steps[1]) + } +} + +func TestRedactAPISecretError(t *testing.T) { + err := redactAPISecretError( + fmt.Errorf(`PATCH /api/v1/webhooks/88 returned 400 Bad Request: {"url":"http://api-fixture:8091/webhook?secret=whsec_TEST"}`), + "whsec_TEST", + ) + if err == nil { + t.Fatal("redactAPISecretError() = nil, want error") + } + if strings.Contains(err.Error(), "whsec_TEST") { + t.Fatalf("redactAPISecretError() leaked secret: %v", err) + } + if !strings.Contains(err.Error(), "secret=redacted") { + t.Fatalf("redactAPISecretError() = %v, want redacted query value", err) + } +} + +func TestAPIDeliveredWebhookRowsIncludeSiteRequiresExpectedDeliveryID(t *testing.T) { + body := json.RawMessage(`{ + "data": [ 
+ {"id": 776, "status": "delivered", "payload": {"type": "event.opened", "site_id": 910}}, + {"id": 778, "status": "delivered", "payload": {"type": "event.opened", "site_id": 911}} + ] + }`) + if apiDeliveredWebhookRowsIncludeSite(body, 910, "777") { + t.Fatal("apiDeliveredWebhookRowsIncludeSite() = true for wrong delivery id") + } + if !apiDeliveredWebhookRowsIncludeSite(body, 910, "776") { + t.Fatal("apiDeliveredWebhookRowsIncludeSite() = false for expected delivery id") + } +} + +func TestAPICLIBatchBlogIDStartStable(t *testing.T) { + first := apiCLIBatchBlogIDStart("batch-a") + second := apiCLIBatchBlogIDStart("batch-a") + if first != second { + t.Fatalf("batch id start not stable: %d != %d", first, second) + } + if first < 910000000 || first >= 1000000000 { + t.Fatalf("batch id start = %d, want high local-test range", first) + } +} + +type smokeWebhookFixture struct { + *httptest.Server + mu sync.Mutex + requests []apiSmokeFixtureWebhookHit +} + +func newSmokeWebhookFixture(t *testing.T) *smokeWebhookFixture { + t.Helper() + fixture := &smokeWebhookFixture{} + fixture.Server = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodGet && r.URL.Path == "/health": + writeTestJSON(t, w, map[string]string{"status": "ok"}) + case r.Method == http.MethodDelete && r.URL.Path == "/webhook/requests": + fixture.mu.Lock() + fixture.requests = nil + fixture.mu.Unlock() + w.WriteHeader(http.StatusNoContent) + case r.Method == http.MethodGet && r.URL.Path == "/webhook/requests": + fixture.mu.Lock() + requests := append([]apiSmokeFixtureWebhookHit(nil), fixture.requests...) 
+ fixture.mu.Unlock() + writeTestJSON(t, w, map[string]any{"count": len(requests), "requests": requests}) + case r.Method == http.MethodPost && r.URL.Path == "/webhook": + body, err := io.ReadAll(r.Body) + if err != nil { + t.Fatalf("read webhook body: %v", err) + } + valid := smokeTestSignatureValid(r.Header.Get("X-Jetmon-Signature"), body, r.URL.Query().Get("secret")) + fixture.mu.Lock() + fixture.requests = append(fixture.requests, apiSmokeFixtureWebhookHit{ + ID: len(fixture.requests) + 1, + Event: r.Header.Get("X-Jetmon-Event"), + Delivery: r.Header.Get("X-Jetmon-Delivery"), + Signature: r.Header.Get("X-Jetmon-Signature"), + SignatureValid: &valid, + Body: string(body), + }) + fixture.mu.Unlock() + w.WriteHeader(http.StatusNoContent) + default: + t.Fatalf("unexpected fixture request: %s %s", r.Method, r.URL.Path) + } + })) + return fixture +} + +func postSignedSmokeWebhook(t *testing.T, target, secret string, body []byte) { + t.Helper() + req, err := http.NewRequest(http.MethodPost, target, bytes.NewReader(body)) + if err != nil { + t.Fatalf("build webhook request: %v", err) + } + req.Header.Set("X-Jetmon-Event", apiSmokeWebhookEvent) + req.Header.Set("X-Jetmon-Delivery", "777") + req.Header.Set("X-Jetmon-Signature", smokeTestSignature(1700000000, body, secret)) + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post webhook: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusNoContent { + t.Fatalf("post webhook status = %s", resp.Status) + } +} + +func smokeTestSignature(ts int64, body []byte, secret string) string { + mac := hmac.New(sha256.New, []byte(secret)) + _, _ = mac.Write([]byte(fmt.Sprintf("%d.", ts))) + _, _ = mac.Write(body) + return fmt.Sprintf("t=%d,v1=%s", ts, hex.EncodeToString(mac.Sum(nil))) +} + +func smokeTestSignatureValid(signature string, body []byte, secret string) bool { + return signature == smokeTestSignature(1700000000, body, secret) +} + +func decodeTestJSON(t *testing.T, r 
*http.Request, v any) { + t.Helper() + if err := json.NewDecoder(r.Body).Decode(v); err != nil { + t.Fatalf("decode request body: %v", err) + } +} + +func writeTestJSON(t *testing.T, w http.ResponseWriter, v any) { + t.Helper() + writeTestStatusJSON(t, w, http.StatusOK, v) +} + +func writeTestStatusJSON(t *testing.T, w http.ResponseWriter, status int, v any) { + t.Helper() + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(status) + if err := json.NewEncoder(w).Encode(v); err != nil { + t.Fatalf("encode response: %v", err) + } +} diff --git a/cmd/jetmon2/main.go b/cmd/jetmon2/main.go new file mode 100644 index 00000000..0ff13d0a --- /dev/null +++ b/cmd/jetmon2/main.go @@ -0,0 +1,1421 @@ +package main + +import ( + "context" + "database/sql" + "flag" + "fmt" + "io" + "log" + "net" + "net/http" + "os" + "os/signal" + "path/filepath" + "sort" + "strconv" + "strings" + "sync" + "sync/atomic" + "syscall" + "time" + + "github.com/Automattic/jetmon/internal/alerting" + "github.com/Automattic/jetmon/internal/api" + "github.com/Automattic/jetmon/internal/apikeys" + "github.com/Automattic/jetmon/internal/audit" + "github.com/Automattic/jetmon/internal/checker" + "github.com/Automattic/jetmon/internal/config" + "github.com/Automattic/jetmon/internal/dashboard" + "github.com/Automattic/jetmon/internal/db" + "github.com/Automattic/jetmon/internal/deliverer" + "github.com/Automattic/jetmon/internal/fleethealth" + "github.com/Automattic/jetmon/internal/metrics" + "github.com/Automattic/jetmon/internal/orchestrator" + "github.com/Automattic/jetmon/internal/processmetrics" + "github.com/Automattic/jetmon/internal/veriflier" + "github.com/Automattic/jetmon/internal/wpcom" +) + +const processHealthWriteTimeout = 2 * time.Second + +// Injected at build time via -ldflags. 
+var ( + version = "dev" + buildDate = "unknown" + goVersion = "unknown" +) + +func main() { + if len(os.Args) < 2 { + runServe() + return + } + + if isVersionCommand(os.Args[1]) { + printVersion(os.Stdout) + return + } + + switch os.Args[1] { + case "migrate": + cmdMigrate() + case "validate-config": + cmdValidateConfig() + case "status": + cmdStatus() + case "audit": + cmdAudit() + case "drain": + cmdDrain() + case "reload": + cmdReload() + case "keys": + cmdKeys(os.Args[2:]) + case "api": + cmdAPI(os.Args[2:]) + case "site-tenants": + cmdSiteTenants(os.Args[2:]) + case "telemetry": + cmdTelemetry(os.Args[2:]) + case "verifliers": + cmdVerifliers(os.Args[2:]) + case "rollout": + cmdRollout(os.Args[2:]) + default: + runServe() + } +} + +func isVersionCommand(arg string) bool { + switch arg { + case "version", "--version", "-version": + return true + default: + return false + } +} + +func printVersion(w io.Writer) { + fmt.Fprintf(w, "jetmon2 %s (built %s with %s)\n", version, buildDate, goVersion) +} + +// runServe is the main entry point for the monitoring service. 
+func runServe() { + configPath := envOrDefault("JETMON_CONFIG", "config/config.json") + + if err := config.Load(configPath); err != nil { + log.Fatalf("load config: %v", err) + } + cfg := config.Get() + if err := checker.ConfigureResolverServers(cfg.CheckDNSResolvers); err != nil { + log.Fatalf("configure check DNS resolvers: %v", err) + } + log.Printf("config: legacy_status_projection=%s", enabledLabel(cfg.LegacyStatusProjectionEnable)) + log.Printf("config: bucket_ownership=%s", bucketOwnershipLabel(cfg)) + log.Printf("config: scheduler=%s", schedulerConfigLabel(cfg)) + log.Printf("config: default_check_policy=method:%s profile:%s", cfg.DefaultCheckMethod, cfg.DefaultDetectionProfile) + log.Printf("config: check_dns_resolvers=%s", checkDNSResolversLabel(checker.ConfiguredResolverServers())) + log.Printf("config: wpcom_notify=%s", enabledLabel(cfg.WPCOMNotifyEnable)) + log.Printf("config: email_transport=%s", emailTransportLabel(cfg)) + if !emailTransportDelivers(cfg) { + log.Printf("WARN: email_transport=%s — alert-contact emails will be logged but not delivered", emailTransportLabel(cfg)) + } + if cfg.DashboardPort > 0 { + if msg := dashboardBindWarning(cfg.DashboardBindAddr); msg != "" { + log.Printf("WARN: %s", msg) + } + } + + config.LoadDB() + if err := db.ConnectWithRetry(10); err != nil { + log.Fatalf("db connect: %v", err) + } + + pidPath := envOrDefault("JETMON_PID_FILE", "/run/jetmon2/jetmon2.pid") + if err := writePIDFile(pidPath); err != nil { + log.Printf("warning: could not write PID file %s: %v", pidPath, err) + } else { + defer removePIDFile(pidPath) + } + + audit.Init(db.DB()) + + if err := metrics.Init("statsd:8125", db.Hostname()); err != nil { + log.Printf("warning: statsd init failed: %v", err) + } + + hostname := db.Hostname() + processStartedAt := time.Now().UTC() + processID := fleethealth.ProcessID(hostname, fleethealth.ProcessMonitor) + + wp := wpcom.New(cfg.AuthToken, hostname) + + orch := orchestrator.New(cfg, wp) + if err := 
orch.ClaimBuckets(); err != nil { + log.Fatalf("claim buckets: %v", err) + } + + var dash *dashboard.Server + if cfg.DashboardPort > 0 { + dash = dashboard.New(hostname) + dash.SetFleetSource(newFleetDashboardStore(cfg)) + go func() { + addr := dashboardListenAddr(cfg) + if err := dash.Listen(addr); err != nil { + log.Printf("dashboard: %v", err) + } + }() + } + + // pprof on localhost only — never expose this on a public interface. + if cfg.DebugPort > 0 { + go func() { + addr := fmt.Sprintf("127.0.0.1:%d", cfg.DebugPort) + if err := dashboard.ListenDebug(addr); err != nil { + log.Printf("debug server: %v", err) + } + }() + } + + // Internal API server. Disabled when API_PORT is 0. Bears auth via + // jetmon_api_keys; key management is CLI-only (`./jetmon2 keys`). + var apiSrv *api.Server + if cfg.APIPort > 0 { + apiSrv = api.New(fmt.Sprintf(":%d", cfg.APIPort), db.DB(), hostname) + go func() { + if err := apiSrv.Listen(); err != nil && !api.IsServerClosed(err) { + log.Printf("api: %v", err) + } + }() + } + + if level, msg := deliveryOwnerStatus(cfg, hostname); msg != "" { + if level == "WARN" { + log.Printf("WARN: %s", msg) + } else { + log.Printf("config: %s", msg) + } + } + deliveryWorkersEnabled := deliveryWorkersShouldStart(cfg, hostname) + + var alertDispatchers map[alerting.Transport]alerting.Dispatcher + if cfg.APIPort > 0 { + alertDispatchers = deliverer.BuildAlertDispatchers(cfg) + if apiSrv != nil { + apiSrv.SetAlertDispatchers(alertDispatchers) + } + } + + // Embedded outbound delivery workers. Disabled when API_PORT is 0 + // (no API to manage webhooks or alert contacts) or when + // DELIVERY_OWNER_HOST names another host. 
+ var deliveryRuntime *deliverer.Runtime + if deliveryWorkersEnabled { + deliveryRuntime = deliverer.Start(deliverer.Config{ + DB: db.DB(), + InstanceID: hostname, + Dispatchers: alertDispatchers, + }) + } + + var healthMu sync.RWMutex + var publishMu sync.Mutex + var shuttingDown atomic.Bool + var lastHealth []dashboard.HealthEntry + publishHostSnapshot := func(state string, refreshDependencies bool) { + publishMu.Lock() + defer publishMu.Unlock() + if shuttingDown.Load() && state == fleethealth.StateRunning { + return + } + currentCfg := config.Get() + if currentCfg == nil { + currentCfg = cfg + } + checkedAt := time.Now().UTC() + var health []dashboard.HealthEntry + if refreshDependencies { + health = dashboardHealthEntries(context.Background(), currentCfg, db.DB(), wp, metrics.Global() != nil, checkedAt) + healthMu.Lock() + lastHealth = append([]dashboard.HealthEntry(nil), health...) + healthMu.Unlock() + } else { + healthMu.RLock() + health = append([]dashboard.HealthEntry(nil), lastHealth...) 
+ healthMu.RUnlock() + } + bMin, bMax := orch.BucketRange() + sitesPerSec, roundDuration := orch.LastRoundStats() + mem := processmetrics.CurrentMemory() + deliveryConfigEligible := deliveryWorkersShouldStart(currentCfg, hostname) + st := dashboard.State{ + WorkerCount: orch.WorkerCount(), + ActiveChecks: orch.ActiveChecks(), + QueueDepth: orch.QueueDepth(), + RetryQueueSize: orch.RetryQueueSize(), + SitesPerSec: sitesPerSec, + RoundDurationMs: roundDuration.Milliseconds(), + WPCOMCircuitOpen: wp.IsCircuitOpen(), + WPCOMQueueDepth: wp.QueueDepth(), + GoSysMemMB: mem.GoSysMemMB, + RSSMemMB: mem.RSSMemMB, + BucketMin: bMin, + BucketMax: bMax, + BucketOwnership: bucketOwnershipLabel(currentCfg), + LegacyStatusProjectionEnabled: currentCfg.LegacyStatusProjectionEnable, + DeliveryWorkersEnabled: deliveryWorkersEnabled, + DeliveryConfigEligible: deliveryConfigEligible, + DeliveryOwnerHost: currentCfg.DeliveryOwnerHost, + RolloutPreflightCommand: rolloutPreflightCommand(currentCfg), + RolloutCutoverCommand: cutoverCheckCommand(currentCfg), + RolloutActivityCommand: rolloutActivityCommand(), + RolloutRollbackCommand: rollbackCheckCommand(currentCfg), + RolloutStateReportCommand: stateReportCommand(), + ProjectionDriftCommand: projectionDriftCommand(), + } + st.Hostname = hostname + st.UpdatedAt = checkedAt + if dash != nil { + if refreshDependencies { + dash.UpdateHealth(health) + } + dash.Update(st) + } + ctx, cancel := context.WithTimeout(context.Background(), processHealthWriteTimeout) + if err := fleethealth.Upsert(ctx, db.DB(), monitorProcessHealthSnapshot(hostname, processStartedAt, state, currentCfg, st, health)); err != nil { + log.Printf("process health: %v", err) + } + cancel() + } + + // Publish both host-dashboard state and the durable fleet-health heartbeat. 
+ publishHostSnapshot(fleethealth.StateRunning, false) + stopHostPublisher := make(chan struct{}) + var stopHostPublisherOnce sync.Once + go func() { + ticker := time.NewTicker(time.Duration(cfg.StatsUpdateIntervalMS) * time.Millisecond) + defer ticker.Stop() + publishHostSnapshot(fleethealth.StateRunning, true) + for { + select { + case <-ticker.C: + publishHostSnapshot(fleethealth.StateRunning, true) + case <-stopHostPublisher: + return + } + } + }() + + // Signal handling. + sigCh := make(chan os.Signal, 1) + signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM, syscall.SIGHUP) + + go func() { + for sig := range sigCh { + switch sig { + case syscall.SIGHUP: + log.Println("received SIGHUP, reloading config") + if err := config.Reload(); err != nil { + log.Printf("config reload failed: %v", err) + } else { + if dash != nil { + dash.SetFleetSource(newFleetDashboardStore(config.Get())) + } + log.Println("config reloaded; CHECK_DNS_RESOLVERS changes require restart") + } + case syscall.SIGINT, syscall.SIGTERM: + log.Println("received shutdown signal, draining") + shuttingDown.Store(true) + stopHostPublisherOnce.Do(func() { close(stopHostPublisher) }) + publishHostSnapshot(fleethealth.StateStopping, false) + if apiSrv != nil { + ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) + if err := apiSrv.Shutdown(ctx); err != nil { + log.Printf("api: shutdown error: %v", err) + } + cancel() + } + if deliveryRuntime != nil { + deliveryRuntime.Stop() + } + orch.Stop() + // Hard kill if drain takes too long (e.g. a stalled HTTP check). 
+ time.AfterFunc(30*time.Second, func() { + log.Println("jetmon2: shutdown timeout exceeded, forcing exit") + os.Exit(1) + }) + } + } + }() + + orch.Run() + shuttingDown.Store(true) + stopHostPublisherOnce.Do(func() { close(stopHostPublisher) }) + publishHostSnapshot(fleethealth.StateStopping, false) + ctx, cancel := context.WithTimeout(context.Background(), processHealthWriteTimeout) + if err := fleethealth.MarkStopped(ctx, db.DB(), processID, time.Now().UTC()); err != nil { + log.Printf("process health: %v", err) + } + cancel() + log.Println("jetmon2: shutdown complete") +} + +func cmdMigrate() { + config.LoadDB() + if err := db.ConnectWithRetry(5); err != nil { + log.Fatalf("db connect: %v", err) + } + if err := db.Migrate(); err != nil { + log.Fatalf("migrate: %v", err) + } + fmt.Println("migrations applied successfully") +} + +func cmdValidateConfig() { + configPath := envOrDefault("JETMON_CONFIG", "config/config.json") + if err := config.Load(configPath); err != nil { + fmt.Fprintf(os.Stderr, "FAIL config parse: %v\n", err) + os.Exit(1) + } + fmt.Println("PASS config parse") + + config.LoadDB() + if err := db.ConnectWithRetry(3); err != nil { + fmt.Fprintf(os.Stderr, "FAIL db connect: %v\n", err) + os.Exit(1) + } + fmt.Println("PASS db connect") + + cfg := config.Get() + fmt.Printf("INFO legacy_status_projection=%s\n", enabledLabel(cfg.LegacyStatusProjectionEnable)) + fmt.Printf("INFO bucket_ownership=%s\n", bucketOwnershipLabel(cfg)) + fmt.Printf("INFO scheduler=%s\n", schedulerConfigLabel(cfg)) + fmt.Printf("INFO default_check_policy=method:%s profile:%s\n", cfg.DefaultCheckMethod, cfg.DefaultDetectionProfile) + fmt.Printf("INFO wpcom_notify=%s\n", enabledLabel(cfg.WPCOMNotifyEnable)) + for _, line := range rolloutAdviceLines(cfg) { + fmt.Println(line) + } + fmt.Printf("INFO email_transport=%s\n", emailTransportLabel(cfg)) + if !emailTransportDelivers(cfg) { + fmt.Printf("WARN email_transport=%s — alert-contact emails will be logged but not delivered\n", 
emailTransportLabel(cfg)) + } + if cfg.DashboardPort > 0 { + if msg := dashboardBindWarning(cfg.DashboardBindAddr); msg != "" { + fmt.Printf("WARN %s\n", msg) + } + } + if level, msg := deliveryOwnerStatus(cfg, db.Hostname()); msg != "" { + fmt.Printf("%s %s\n", level, msg) + } + readiness := probeConfiguredVerifliers(context.Background(), cfg, dashboardHealthTimeout) + readinessLines, readinessFailed := renderVeriflierReadiness(readiness) + for _, line := range readinessLines { + fmt.Println(line) + } + discoverySnapshot, discoveryErr := veriflierDiscoverySnapshotForConfig(context.Background(), cfg) + discoveryLines, discoveryFailed := renderVeriflierDiscoveryReadiness(cfg.VeriflierDiscoveryModeOrDefault(), discoverySnapshot, discoveryErr, readiness) + for _, line := range discoveryLines { + fmt.Println(line) + } + if readinessFailed || discoveryFailed { + os.Exit(1) + } + + fmt.Println("\nvalidation passed") +} + +type veriflierReadinessResult struct { + Name string + Addr string + Status *veriflier.StatusV2Response + Err error + Latency time.Duration +} + +func probeConfiguredVerifliers(ctx context.Context, cfg *config.Config, timeout time.Duration) []veriflierReadinessResult { + if cfg == nil || len(cfg.Verifiers) == 0 { + return nil + } + if ctx == nil { + ctx = context.Background() + } + out := make([]veriflierReadinessResult, 0, len(cfg.Verifiers)) + for i, v := range cfg.Verifiers { + name := configuredVeriflierName(v, i) + addr := fmt.Sprintf("%s:%s", v.Host, v.TransportPort()) + result := veriflierReadinessResult{Name: name, Addr: addr} + if v.Host == "" || v.TransportPort() == "" { + result.Err = fmt.Errorf("host or port is not configured") + out = append(out, result) + continue + } + + probeCtx, cancel := context.WithTimeout(ctx, timeout) + start := time.Now() + status, err := veriflier.NewVeriflierClient(addr, v.AuthToken).Status(probeCtx) + cancel() + result.Latency = time.Since(start) + result.Status = status + result.Err = err + out = append(out, 
result) + } + return out +} + +func renderVeriflierReadiness(results []veriflierReadinessResult) ([]string, bool) { + if len(results) == 0 { + return nil, false + } + vantageCounts := duplicateVantageCounts(results) + lines := make([]string, 0, len(results)*2) + failed := false + for _, result := range results { + lines = append(lines, fmt.Sprintf("INFO veriflier %q at %s", result.Name, result.Addr)) + if result.Err != nil { + lines = append(lines, fmt.Sprintf("WARN veriflier_status name=%q addr=%q error=%q", result.Name, result.Addr, result.Err.Error())) + continue + } + if result.Status == nil { + lines = append(lines, fmt.Sprintf("WARN veriflier_status name=%q addr=%q error=%q", result.Name, result.Addr, "empty status response")) + continue + } + if !statusSupportsProtocol(result.Status, veriflier.ProtocolV2) { + lines = append(lines, fmt.Sprintf("WARN veriflier_contract name=%q addr=%q protocol=%s version=%q", result.Name, result.Addr, veriflier.ProtocolLegacy, result.Status.Version)) + continue + } + vantageID := strings.TrimSpace(result.Status.Vantage.ID) + if vantageID == "" { + failed = true + lines = append(lines, fmt.Sprintf("FAIL veriflier_vantage_missing name=%q addr=%q", result.Name, result.Addr)) + continue + } + if vantageCounts[vantageID] > 1 { + failed = true + lines = append(lines, fmt.Sprintf("FAIL veriflier_vantage_duplicate id=%q name=%q addr=%q", vantageID, result.Name, result.Addr)) + continue + } + lines = append(lines, fmt.Sprintf("PASS veriflier_contract name=%q addr=%q protocol=%s vantage_id=%q agent_id=%q capacity=%q", + result.Name, result.Addr, veriflier.ProtocolV2, vantageID, result.Status.Agent.ID, verifierCapacitySummary(result.Status.Capacity))) + } + return lines, failed +} + +func veriflierDiscoverySnapshotForConfig(ctx context.Context, cfg *config.Config) (db.VeriflierDiscoverySnapshot, error) { + if cfg == nil || cfg.VeriflierDiscoveryModeOrDefault() == config.VeriflierDiscoveryModeStatic { + return 
db.VeriflierDiscoverySnapshot{}, nil + } + queryCtx, cancel := context.WithTimeout(ctx, dashboardHealthTimeout) + defer cancel() + return db.ListVeriflierDiscoverySnapshot(queryCtx, db.VeriflierDiscoveryDefaultStaleAfter) +} + +func renderVeriflierDiscoveryReadiness(mode string, snapshot db.VeriflierDiscoverySnapshot, err error, staticResults []veriflierReadinessResult) ([]string, bool) { + mode = (&config.Config{VeriflierDiscoveryMode: mode}).VeriflierDiscoveryModeOrDefault() + if mode == config.VeriflierDiscoveryModeStatic { + return []string{"INFO veriflier_discovery=static"}, false + } + + failed := false + if err != nil { + line := fmt.Sprintf("WARN veriflier_discovery mode=%s error=%q", mode, err.Error()) + if mode == config.VeriflierDiscoveryModeActive { + line = fmt.Sprintf("FAIL veriflier_discovery mode=%s error=%q", mode, err.Error()) + failed = true + } + return []string{line}, failed + } + + enabled, usable := 0, 0 + for _, vantage := range snapshot.Vantages { + if !vantage.Enabled { + continue + } + enabled++ + if vantage.Usable() { + usable++ + } + } + + lines := []string{fmt.Sprintf( + "INFO veriflier_discovery mode=%s enabled_vantages=%d usable_vantages=%d recent_agents=%d", + mode, enabled, usable, len(snapshot.Agents), + )} + for _, vantage := range snapshot.Vantages { + if !vantage.Enabled || vantage.Usable() { + continue + } + level := "WARN" + if mode == config.VeriflierDiscoveryModeActive { + level = "FAIL" + failed = true + } + lines = append(lines, fmt.Sprintf("%s veriflier_discovery_incomplete vantage_id=%q endpoint_host=%q endpoint_port=%q auth_token_present=%t", + level, vantage.VantageID, vantage.EndpointHost, vantage.EndpointPort, strings.TrimSpace(vantage.AuthToken) != "")) + } + if mode == config.VeriflierDiscoveryModeActive && usable == 0 { + lines = append(lines, "FAIL veriflier_discovery_active usable_vantages=0") + failed = true + } + if mode == config.VeriflierDiscoveryModeShadow { + lines = append(lines, 
veriflierDiscoveryDriftLines(snapshot, staticResults)...) + } + return lines, failed +} + +func veriflierDiscoveryDriftLines(snapshot db.VeriflierDiscoverySnapshot, staticResults []veriflierReadinessResult) []string { + staticVantages := make(map[string]struct{}) + for _, result := range staticResults { + if result.Err != nil || result.Status == nil || !statusSupportsProtocol(result.Status, veriflier.ProtocolV2) { + continue + } + id := strings.TrimSpace(result.Status.Vantage.ID) + if id != "" { + staticVantages[id] = struct{}{} + } + } + discovered := make(map[string]struct{}) + for _, vantage := range snapshot.Vantages { + if vantage.Enabled { + discovered[strings.TrimSpace(vantage.VantageID)] = struct{}{} + } + } + + var lines []string + for id := range discovered { + if id == "" { + continue + } + if _, ok := staticVantages[id]; !ok { + lines = append(lines, fmt.Sprintf("WARN veriflier_discovery_extra vantage_id=%q", id)) + } + } + for id := range staticVantages { + if _, ok := discovered[id]; !ok { + lines = append(lines, fmt.Sprintf("WARN veriflier_discovery_missing vantage_id=%q", id)) + } + } + sort.Strings(lines) + if len(lines) == 0 { + lines = append(lines, "PASS veriflier_discovery_shadow static_vantages_match_registry") + } + return lines +} + +func configuredVeriflierName(v config.VerifierConfig, index int) string { + if strings.TrimSpace(v.Name) != "" { + return v.Name + } + return fmt.Sprintf("veriflier-%d", index+1) +} + +func statusSupportsProtocol(status *veriflier.StatusV2Response, protocol string) bool { + if status == nil { + return false + } + for _, p := range status.Protocols { + if p == protocol { + return true + } + } + return false +} + +func duplicateVantageCounts(results []veriflierReadinessResult) map[string]int { + counts := make(map[string]int) + for _, result := range results { + if result.Err != nil || result.Status == nil || !statusSupportsProtocol(result.Status, veriflier.ProtocolV2) { + continue + } + vantageID := 
strings.TrimSpace(result.Status.Vantage.ID) + if vantageID == "" { + continue + } + counts[vantageID]++ + } + return counts +} + +func verifierCapacitySummary(c veriflier.Capacity) string { + return fmt.Sprintf("active=%d in_flight=%d max_concurrency=%d queue=%d/%d", + c.Active, c.InFlight, c.MaxConcurrency, c.QueueDepth, c.QueueCapacity) +} + +func enabledLabel(b bool) string { + if b { + return "enabled" + } + return "disabled" +} + +func checkDNSResolversLabel(servers []string) string { + if len(servers) == 0 { + return "system" + } + return "configured [" + strings.Join(servers, ",") + "]" +} + +func bucketOwnershipLabel(cfg *config.Config) string { + if min, max, ok := cfg.PinnedBucketRange(); ok { + return fmt.Sprintf("pinned range=%d-%d", min, max) + } + return "dynamic jetmon_hosts" +} + +func rolloutAdviceLines(cfg *config.Config) []string { + lines := []string{} + if _, _, ok := cfg.PinnedBucketRange(); ok { + lines = append(lines, "INFO rollout_static_plan="+staticPlanCheckCommand()) + } + lines = append(lines, + "INFO rollout_preflight="+rolloutPreflightCommand(cfg), + "INFO rollout_activity_check="+rolloutActivityCommand(), + ) + if cmd := cutoverCheckCommand(cfg); cmd != "" { + lines = append(lines, "INFO rollout_cutover_check="+cmd) + } + if cmd := rollbackCheckCommand(cfg); cmd != "" { + lines = append(lines, "INFO rollout_rollback_check="+cmd) + } + lines = append(lines, "INFO rollout_state_report="+stateReportCommand()) + lines = append(lines, "INFO rollout_drift_report="+projectionDriftCommand()) + return lines +} + +func staticPlanCheckCommand() string { + return "./jetmon2 rollout static-plan-check --file=" +} + +func rolloutPreflightCommand(cfg *config.Config) string { + if minBucket, maxBucket, ok := cfg.PinnedBucketRange(); ok { + cmd := fmt.Sprintf("./jetmon2 rollout host-preflight --file= --host= --runtime-host= --bucket-min=%d --bucket-max=%d", minBucket, maxBucket) + if cfg.BucketTotal > 0 { + cmd += fmt.Sprintf(" --bucket-total=%d", 
cfg.BucketTotal) + } + return cmd + } + return "./jetmon2 rollout dynamic-check" +} + +func rolloutActivityCommand() string { + return "./jetmon2 rollout activity-check --since=15m" +} + +func cutoverCheckCommand(cfg *config.Config) string { + if _, _, ok := cfg.PinnedBucketRange(); ok { + return "./jetmon2 rollout cutover-check --since=15m" + } + return "" +} + +func rollbackCheckCommand(cfg *config.Config) string { + if _, _, ok := cfg.PinnedBucketRange(); ok { + return "./jetmon2 rollout rollback-check" + } + return "" +} + +func projectionDriftCommand() string { + return "./jetmon2 rollout projection-drift" +} + +func stateReportCommand() string { + return "./jetmon2 rollout state-report --since=15m" +} + +func dashboardListenAddr(cfg *config.Config) string { + bindAddr := "127.0.0.1" + port := 0 + if cfg != nil { + if strings.TrimSpace(cfg.DashboardBindAddr) != "" { + bindAddr = strings.TrimSpace(cfg.DashboardBindAddr) + } + port = cfg.DashboardPort + } + return net.JoinHostPort(bindAddr, strconv.Itoa(port)) +} + +func dashboardBindWarning(bindAddr string) string { + bindAddr = strings.TrimSpace(bindAddr) + if bindAddr == "" { + bindAddr = "127.0.0.1" + } + host := strings.Trim(bindAddr, "[]") + host = strings.TrimSuffix(strings.ToLower(host), ".") + if host == "localhost" || strings.HasSuffix(host, ".localhost") { + return "" + } + if ip := net.ParseIP(host); ip != nil && ip.IsLoopback() { + return "" + } + return fmt.Sprintf("DASHBOARD_BIND_ADDR=%q exposes unauthenticated operator dashboards; restrict access to trusted operator networks", bindAddr) +} + +func newFleetDashboardStore(cfg *config.Config) *dashboard.FleetStore { + if cfg == nil { + cfg = config.Get() + } + bucketTotal := 0 + heartbeatGrace := 0 + if cfg != nil { + bucketTotal = cfg.BucketTotal + heartbeatGrace = cfg.BucketHeartbeatGraceSec + } + return dashboard.NewFleetStore(db.DB(), dashboard.FleetStoreOptions{ + BucketTotal: bucketTotal, + HeartbeatGrace: time.Duration(heartbeatGrace) * 
time.Second, + }) +} + +const dashboardHealthTimeout = 2 * time.Second + +func dashboardHealthEntries(ctx context.Context, cfg *config.Config, sqlDB *sql.DB, wp *wpcom.Client, statsdReady bool, checkedAt time.Time) []dashboard.HealthEntry { + entries := []dashboard.HealthEntry{ + mysqlHealthEntry(ctx, sqlDB, checkedAt), + wpcomHealthEntry(wp, checkedAt), + statsdHealthEntry(statsdReady, checkedAt), + diskHealthEntry("logs", checkedAt), + diskHealthEntry("stats", checkedAt), + } + entries = append(entries, veriflierHealthEntries(ctx, cfg, checkedAt)...) + if entry, ok := veriflierDiscoveryHealthEntry(ctx, cfg, checkedAt); ok { + entries = append(entries, entry) + } + return entries +} + +func monitorProcessHealthSnapshot(hostname string, startedAt time.Time, state string, cfg *config.Config, st dashboard.State, health []dashboard.HealthEntry) fleethealth.Snapshot { + if st.UpdatedAt.IsZero() { + st.UpdatedAt = time.Now().UTC() + } + bucketMin, bucketMax := st.BucketMin, st.BucketMax + apiPort, dashboardPort := cfg.APIPort, cfg.DashboardPort + healthStatus := dashboard.SummarizeHost(st, health).Status + if state == fleethealth.StateStopping || state == fleethealth.StateStopped { + healthStatus = fleethealth.HealthAmber + } + return fleethealth.Snapshot{ + HostID: hostname, + ProcessType: fleethealth.ProcessMonitor, + PID: os.Getpid(), + Version: version, + BuildDate: buildDate, + GoVersion: goVersion, + State: state, + HealthStatus: healthStatus, + StartedAt: startedAt, + UpdatedAt: time.Now().UTC(), + BucketMin: &bucketMin, + BucketMax: &bucketMax, + BucketOwnership: st.BucketOwnership, + APIPort: &apiPort, + DashboardPort: &dashboardPort, + DeliveryWorkersEnabled: st.DeliveryWorkersEnabled, + DeliveryOwnerHost: st.DeliveryOwnerHost, + WorkerCount: st.WorkerCount, + ActiveChecks: st.ActiveChecks, + QueueDepth: st.QueueDepth, + RetryQueueSize: st.RetryQueueSize, + WPCOMCircuitOpen: st.WPCOMCircuitOpen, + WPCOMQueueDepth: st.WPCOMQueueDepth, + GoSysMemMB: 
st.GoSysMemMB, + RSSMemMB: st.RSSMemMB, + DependencyHealth: dashboardHealthToFleet(health), + } +} + +func dashboardHealthToFleet(entries []dashboard.HealthEntry) []fleethealth.DependencyHealth { + out := make([]fleethealth.DependencyHealth, 0, len(entries)) + for _, entry := range entries { + out = append(out, fleethealth.DependencyHealth{ + Name: entry.Name, + Status: entry.Status, + LatencyMS: entry.Latency, + LastError: entry.LastError, + CheckedAt: entry.CheckedAt, + }) + } + return out +} + +func mysqlHealthEntry(ctx context.Context, sqlDB *sql.DB, checkedAt time.Time) dashboard.HealthEntry { + entry := dashboard.HealthEntry{Name: "mysql", CheckedAt: checkedAt} + if sqlDB == nil { + entry.Status = "red" + entry.LastError = "database pool is not initialized" + return entry + } + + pingCtx, cancel := context.WithTimeout(ctx, dashboardHealthTimeout) + defer cancel() + + start := time.Now() + if err := sqlDB.PingContext(pingCtx); err != nil { + entry.Status = "red" + entry.Latency = time.Since(start).Milliseconds() + entry.LastError = err.Error() + return entry + } + entry.Status = "green" + entry.Latency = time.Since(start).Milliseconds() + return entry +} + +func veriflierHealthEntries(ctx context.Context, cfg *config.Config, checkedAt time.Time) []dashboard.HealthEntry { + if cfg == nil || len(cfg.Verifiers) == 0 { + return []dashboard.HealthEntry{{ + Name: "verifliers", + Status: "amber", + LastError: "no verifliers configured", + CheckedAt: checkedAt, + }} + } + + results := probeConfiguredVerifliers(ctx, cfg, dashboardHealthTimeout) + vantageCounts := duplicateVantageCounts(results) + entries := make([]dashboard.HealthEntry, 0, len(results)) + for _, result := range results { + entry := dashboard.HealthEntry{ + Name: "veriflier:" + result.Name, + Latency: result.Latency.Milliseconds(), + CheckedAt: checkedAt, + } + if result.Err != nil { + entry.Status = "red" + entry.LastError = result.Err.Error() + entries = append(entries, entry) + continue + } + if 
result.Status == nil { + entry.Status = "red" + entry.LastError = "empty status response" + entries = append(entries, entry) + continue + } + if !statusSupportsProtocol(result.Status, veriflier.ProtocolV2) { + entry.Status = "amber" + entry.LastError = "legacy verifier status endpoint; v2 status metadata unavailable" + if result.Status.Version != "" { + entry.Name = fmt.Sprintf("%s (%s)", entry.Name, result.Status.Version) + } + entries = append(entries, entry) + continue + } + vantageID := strings.TrimSpace(result.Status.Vantage.ID) + if vantageID == "" { + entry.Status = "red" + entry.LastError = "v2 verifier status did not report a vantage id" + entries = append(entries, entry) + continue + } + if vantageCounts[vantageID] > 1 { + entry.Status = "red" + entry.LastError = fmt.Sprintf("duplicate v2 verifier vantage id %q", vantageID) + entries = append(entries, entry) + continue + } + entry.Status = "green" + entry.Name = fmt.Sprintf("%s (%s vantage=%s %s)", entry.Name, result.Status.Version, vantageID, verifierCapacitySummary(result.Status.Capacity)) + entries = append(entries, entry) + } + return entries +} + +func veriflierDiscoveryHealthEntry(ctx context.Context, cfg *config.Config, checkedAt time.Time) (dashboard.HealthEntry, bool) { + mode := cfg.VeriflierDiscoveryModeOrDefault() + if mode == config.VeriflierDiscoveryModeStatic { + return dashboard.HealthEntry{}, false + } + entry := dashboard.HealthEntry{Name: "veriflier-discovery", CheckedAt: checkedAt} + start := time.Now() + snapshot, err := veriflierDiscoverySnapshotForConfig(ctx, cfg) + entry.Latency = time.Since(start).Milliseconds() + if err != nil { + if mode == config.VeriflierDiscoveryModeActive { + entry.Status = "red" + } else { + entry.Status = "amber" + } + entry.LastError = err.Error() + return entry, true + } + enabled, usable := 0, 0 + for _, vantage := range snapshot.Vantages { + if !vantage.Enabled { + continue + } + enabled++ + if vantage.Usable() { + usable++ + } + } + entry.Name = 
fmt.Sprintf("veriflier-discovery:%s enabled=%d usable=%d agents=%d", mode, enabled, usable, len(snapshot.Agents)) + entry.Status = "green" + if mode == config.VeriflierDiscoveryModeActive && usable == 0 { + entry.Status = "red" + entry.LastError = "active discovery has no usable enabled vantages" + } else if mode == config.VeriflierDiscoveryModeShadow && enabled == 0 { + entry.Status = "amber" + entry.LastError = "shadow discovery registry has no enabled vantages" + } + return entry, true +} + +func wpcomHealthEntry(wp *wpcom.Client, checkedAt time.Time) dashboard.HealthEntry { + entry := dashboard.HealthEntry{Name: "wpcom", CheckedAt: checkedAt} + if wp == nil { + entry.Status = "red" + entry.LastError = "wpcom client is not initialized" + return entry + } + queueDepth := wp.QueueDepth() + if wp.IsCircuitOpen() { + entry.Status = "red" + entry.LastError = fmt.Sprintf("circuit open, queued notifications=%d", queueDepth) + return entry + } + if queueDepth > 0 { + entry.Status = "amber" + entry.LastError = fmt.Sprintf("queued notifications=%d", queueDepth) + return entry + } + entry.Status = "green" + return entry +} + +func statsdHealthEntry(ready bool, checkedAt time.Time) dashboard.HealthEntry { + entry := dashboard.HealthEntry{Name: "statsd", CheckedAt: checkedAt} + if !ready { + entry.Status = "amber" + entry.LastError = "statsd client is not initialized" + return entry + } + entry.Status = "green" + return entry +} + +func diskHealthEntry(dir string, checkedAt time.Time) dashboard.HealthEntry { + entry := dashboard.HealthEntry{Name: "disk:" + dir, CheckedAt: checkedAt} + if err := checkWritableDir(dir); err != nil { + entry.Status = "red" + entry.LastError = err.Error() + return entry + } + entry.Status = "green" + return entry +} + +func checkWritableDir(dir string) error { + info, err := os.Stat(dir) + if err != nil { + return err + } + if !info.IsDir() { + return fmt.Errorf("%s is not a directory", dir) + } + f, err := os.CreateTemp(dir, ".jetmon-health-*") 
+ if err != nil { + return err + } + name := f.Name() + if err := f.Close(); err != nil { + _ = os.Remove(name) + return err + } + if err := os.Remove(name); err != nil { + return err + } + return nil +} + +// emailTransportLabel collapses an empty EMAIL_TRANSPORT to its compatibility +// alias ("stub") so startup output and validate-config show a single canonical +// name regardless of which form an operator wrote in config. +func emailTransportLabel(cfg *config.Config) string { + if cfg.EmailTransport == "" { + return "stub" + } + return cfg.EmailTransport +} + +// emailTransportDelivers reports whether the configured email transport +// actually delivers mail. The stub transport (and the empty-string alias for +// it) only logs, so any alert-contact configured with transport="email" will +// silently disappear into the logs in that mode. +func emailTransportDelivers(cfg *config.Config) bool { + return cfg.EmailTransport == "smtp" || cfg.EmailTransport == "wpcom" +} + +func schedulerConfigLabel(cfg *config.Config) string { + if cfg.SchedulerEngine == "streaming" { + return fmt.Sprintf( + "streaming reload=%s legacy_projection=%s worker_floor=%d fetch_page_size=%d", + time.Duration(cfg.StreamingTargetReloadSec)*time.Second, + time.Duration(cfg.StreamingLegacyProjectionIntervalMin)*time.Minute, + cfg.NumWorkers, + cfg.DatasetSize, + ) + } + if cfg.UseVariableCheckIntervals { + return fmt.Sprintf( + "variable_intervals fetch_page_size=%d idle_poll=%s", + cfg.DatasetSize, + orchestrator.VariableIntervalPollInterval(), + ) + } + return fmt.Sprintf( + "fixed_rounds fetch_page_size=%d min_round_interval=%s", + cfg.DatasetSize, + time.Duration(cfg.MinTimeBetweenRoundsSec)*time.Second, + ) +} + +func deliveryWorkersShouldStart(cfg *config.Config, hostname string) bool { + if cfg.APIPort <= 0 { + return false + } + owner := strings.TrimSpace(cfg.DeliveryOwnerHost) + return owner == "" || owner == hostname +} + +func deliveryOwnerStatus(cfg *config.Config, hostname string) 
(string, string) { + owner := strings.TrimSpace(cfg.DeliveryOwnerHost) + if cfg.APIPort <= 0 { + if owner == "" { + return "INFO", "delivery_workers=disabled api_port=disabled" + } + return "INFO", fmt.Sprintf("delivery_owner_host=%q ignored because API_PORT is disabled", owner) + } + if owner == "" { + return "WARN", fmt.Sprintf("delivery_owner_host is unset; host %q will run delivery workers because API_PORT is enabled", hostname) + } + if owner == hostname { + return "INFO", fmt.Sprintf("delivery_owner_host=%q matched; delivery workers enabled on this host", owner) + } + return "INFO", fmt.Sprintf("delivery_owner_host=%q; delivery workers disabled on host %q", owner, hostname) +} + +func cmdStatus() { + // Connect to the running instance's internal API. + port := envOrDefault("DASHBOARD_PORT", "8080") + host := envOrDefault("DASHBOARD_HOST", envOrDefault("DASHBOARD_BIND_ADDR", "localhost")) + if host == "0.0.0.0" || host == "::" { + host = "localhost" + } + resp, err := httpGet(fmt.Sprintf("http://%s/api/state", net.JoinHostPort(host, port))) + if err != nil { + log.Fatalf("status: %v", err) + } + fmt.Println(resp) +} + +func cmdAudit() { + fs := flag.NewFlagSet("audit", flag.ExitOnError) + blogID := fs.Int64("blog-id", 0, "blog ID to query") + since := fs.String("since", "", "start time (RFC3339 or duration like 24h)") + until := fs.String("until", "", "end time (RFC3339)") + _ = fs.Parse(os.Args[2:]) + + if *blogID == 0 { + fmt.Fprintln(os.Stderr, "usage: jetmon2 audit --blog-id [--since