Open
300 commits
29b380c
Document send-test idempotency and v2 polish fixes
Apr 25, 2026
a06ee1d
Fix dns_ms / tcp_ms / tls_ms overflow on partially-failed checks
Apr 25, 2026
63bb091
Refresh AGENTS.md architecture for the v2 health-platform shape
Apr 25, 2026
f9ad4d7
Backfill architecture decision records for the v2 branch
Apr 25, 2026
bfca180
Honor future revoked_at timestamps as API key grace windows
Apr 27, 2026
2e73f25
Replace the unsafe DB update flag with legacy status projection config
Apr 27, 2026
3bb1c3a
Derive API site state from open v2 events
Apr 27, 2026
e873adc
Document the shadow-state rollout and related migration constraints
Apr 27, 2026
9905d84
Make active-event rollup MySQL 5.7-compatible and align API key cutoffs
Apr 27, 2026
531e1a5
Clean up unused symbols and gopls-flagged inefficiencies
Apr 27, 2026
ca10fd8
Apply gofmt to pre-existing nonconforming files
Apr 27, 2026
712aaa0
Drop the dead nil-siteID branch from listEvents
Apr 27, 2026
5519c74
Bring architecture docs back in line with the current code
Apr 27, 2026
74628df
Align the API reference with implemented routes
Apr 27, 2026
8e6c780
Validate email transport configuration
Apr 27, 2026
a9afebf
Return only delivery rows that win the soft lock
Apr 27, 2026
df6ebcf
Warn at startup when email transport will not deliver
Apr 27, 2026
2b05f9f
Document the in-place filter idiom in delivery claim loops
Apr 27, 2026
cba7a54
Cover email transport startup helpers
Apr 27, 2026
4a4928c
Document email transport warning behavior
Apr 27, 2026
ed22772
Align dashboard docs with the current health surface
Apr 27, 2026
4430811
Handle health checks without a database handle
Apr 27, 2026
b85e302
Capture post-v2 probe-agent architecture options
Apr 27, 2026
c660e6f
Add a top-level docs index
Apr 27, 2026
99e3281
Clarify public API roadmap scope
Apr 27, 2026
9043cb8
Align Veriflier transport wording
Apr 27, 2026
714aa72
Keep make all independent of code generation
Apr 27, 2026
cc65332
Document the Makefile build path
Apr 27, 2026
1d18619
Record recent docs and tooling polish
Apr 27, 2026
7f485e2
Make Go resolution explicit in build targets
Apr 27, 2026
8d8977a
Preserve request ids in rejected API audit rows
Apr 27, 2026
d77f3c0
Use a writable Go build cache for Makefile targets
Apr 27, 2026
947ec05
Pass request id explicitly to the audit helper
Apr 27, 2026
4e69766
Unify health-check error wording across nil-DB and ping failure
Apr 27, 2026
7469c62
Document the SMTP-style reply code in the stub email test
Apr 27, 2026
1557d29
Cover audit rows on the success and rate-limit paths
Apr 27, 2026
3cc50ef
Thread context.Context through audit.Log
Apr 27, 2026
b17fdd2
Refresh README to lead with the v2 health-platform story
Apr 27, 2026
68932dc
Keep filtered site-list pagination advancing
Apr 27, 2026
9790ffb
Expand coverage across core packages
Apr 27, 2026
0cd7c6e
Cover list handlers and lifecycle helpers
Apr 27, 2026
653a1c8
Prioritize remaining roadmap work
Apr 27, 2026
4a4d036
Guard delivery workers behind an owner host
Apr 28, 2026
93b836e
Measure the v2 downtime decision flow
Apr 28, 2026
b518ab7
Detect legacy projection drift during rounds
Apr 28, 2026
edb3abf
Introduce a standalone deliverer entrypoint
Apr 28, 2026
7b44f42
Claim deliveries with row locks
Apr 28, 2026
18d753d
Bless JSON-over-HTTP for Veriflier transport
Apr 28, 2026
b79d3e9
Publish route-driven OpenAPI contract
Apr 28, 2026
9a31f88
Plan outbound credential encryption
Apr 28, 2026
839da6c
Document deliverer rollout policy
Apr 28, 2026
8f2ea05
Publish OpenAPI component schemas
Apr 28, 2026
5cec3ea
Validate OpenAPI client generation smoke
Apr 28, 2026
0a4752b
Define public API tenant boundary
Apr 28, 2026
ade7a2f
Add tenant ownership hooks for outbound resources
Apr 28, 2026
f8f8ca1
Enforce gateway tenant context for outbound API resources
Apr 28, 2026
3d600cf
Enforce gateway tenant ownership for site APIs
Apr 28, 2026
9ccc903
Clean up gateway docs and lint compatibility
Apr 28, 2026
768723d
Import gateway site tenant mappings
Apr 28, 2026
d5faf10
Add pinned bucket rollout mode
Apr 28, 2026
481c8a2
Add pinned rollout preflight command
Apr 28, 2026
95e7e4b
Add dynamic rollout ownership check
Apr 28, 2026
edf51fe
Split detection metrics by outcome source
Apr 28, 2026
68ceaf4
List legacy projection drift details
Apr 28, 2026
25f5ced
Track WPCOM notification parity metrics
Apr 28, 2026
58ad6f9
Surface rollout checks in validate-config
Apr 28, 2026
c108b2d
Surface rollout health in dashboard
Apr 28, 2026
6531b54
Simplify Docker env port overrides
Apr 28, 2026
37a7dd3
Document Docker env sample groups
Apr 28, 2026
d6e905f
Clarify Docker API port binding
Apr 28, 2026
7f6857d
Handle Docker runtime permissions
Apr 28, 2026
9439dd4
Refine Docker development setup
Apr 28, 2026
81cb306
Wait for Docker MySQL TCP readiness
Apr 28, 2026
3de5c3b
Expose Docker API bind separately
Apr 28, 2026
a758770
Add Docker Mailpit and healthchecks
Apr 28, 2026
b02d6f1
Expose site scheduling fields in API responses
Apr 28, 2026
67cf7f5
Document site soft-delete contract
Apr 28, 2026
5bd3f45
Package standalone deliverer service
Apr 28, 2026
c0866e8
Cover gateway tenant event access paths
Apr 28, 2026
faaef24
Document completed roadmap work
Apr 28, 2026
f0be708
Merge stacked PR 1: Event-sourced state and internal API foundation
chrisbliss18 Apr 28, 2026
a67ffc3
Merge stacked PR 2: Add webhook registry and delivery worker
chrisbliss18 Apr 28, 2026
b83bee8
Merge stacked PR 3: Add alert contacts and managed delivery
chrisbliss18 Apr 28, 2026
4a8ffc6
Merge stacked PR 4: Harden shadow-state projection and delivery paths
chrisbliss18 Apr 28, 2026
d95c8f8
Merge stacked PR 5: Refresh docs, tooling, and coverage
chrisbliss18 Apr 28, 2026
3c76947
Merge stacked PR 6: Add deliverer, OpenAPI, and v2 decision metrics
chrisbliss18 Apr 28, 2026
e7ce3d9
Merge stacked PR 7: Add gateway tenant ownership enforcement
chrisbliss18 Apr 28, 2026
3617fdf
Merge stacked PR 8: Add production rollout hardening
chrisbliss18 Apr 28, 2026
e68b7e4
Merge stacked PR 9: Polish Docker, site API docs, and deliverer packa…
chrisbliss18 Apr 28, 2026
8784568
Add API CLI request foundation
Apr 28, 2026
f4f7983
Add typed API site commands
Apr 28, 2026
e8fb55e
Add typed API event commands
Apr 28, 2026
5e3f63a
Add typed API webhook commands
Apr 28, 2026
8eb416a
Add typed API alert contact commands
Apr 28, 2026
68db114
Add API site bulk-add fixture
Apr 28, 2026
5751c70
Add API smoke workflow
Apr 28, 2026
9651128
Add API failure simulation workflow
Apr 28, 2026
f8c231b
Add API CLI table output and examples
Apr 28, 2026
e6ef45d
Add API CLI batch cleanup
Apr 28, 2026
5b53bb5
Add deterministic API failure simulation
Apr 28, 2026
07b216e
Mark API CLI roadmap complete
Apr 28, 2026
1be4c6c
Add API CLI feature guide
Apr 28, 2026
1afdbab
Fix API CLI guide flag ordering
Apr 28, 2026
119ad9c
Fix API CLI help flag rendering
Apr 28, 2026
2a43d02
Allow API CLI flags after positional args
Apr 28, 2026
272f7bd
Add API CLI live validation target
Apr 28, 2026
fff8456
Improve API CLI workflow table summaries
Apr 28, 2026
833abf2
Add Docker API CLI token helpers
Apr 28, 2026
6ed20ce
Add API CLI command catalog
Apr 28, 2026
1c6b887
Verify API CLI batch ownership before mutation
Apr 28, 2026
921cdac
Add webhook receiver to API fixture
Apr 28, 2026
335efbf
Refresh Jetmon 2 README and docs
Apr 28, 2026
2981589
Add v1 to v2 migration runbook
Apr 28, 2026
ca1be50
Address API CLI review security feedback
Apr 28, 2026
98e9677
Add remote guardrails for API CLI workflows
Apr 28, 2026
c531ae7
Require explicit smoke batch for remote API CLI runs
Apr 28, 2026
c4e8e10
Guard all remote API CLI writes
Apr 28, 2026
4b1bd4f
Add static bucket rollout preflight
Apr 28, 2026
4262642
Reject dynamic ownership overlap during pinned rollout
Apr 28, 2026
00aaece
Add rollout activity preflight
Apr 28, 2026
1a5907d
Add rollback safety preflight
Apr 28, 2026
535e9a3
Surface rollout safety commands in operator views
Apr 28, 2026
5e3b295
Align rollout docs with safety command set
Apr 28, 2026
fbd6337
Reject negative rollback ranges
Apr 28, 2026
c407740
Merge pull request #82 from Automattic/feature/api-cli
chrisbliss18 Apr 28, 2026
5a7c694
Clarify rollout preflight limits
Apr 28, 2026
19e4a43
Merge v2 into rollout preflight hardening
Apr 28, 2026
b052f75
Track follow-up branch ideas in roadmap
Apr 28, 2026
ac42ee6
Merge pull request #83 from Automattic/feature/rollout-preflight-hard…
chrisbliss18 Apr 29, 2026
6c880ed
Update audience labels in README.md
chrisbliss18 Apr 29, 2026
34bf7cd
Harden deliverer rollout validation
Apr 29, 2026
505436f
Add deliverer delivery backlog checks
Apr 29, 2026
e0231eb
Harden deliverer backlog reporting
Apr 29, 2026
b62f4ee
Organize project docs under docs directory
Apr 29, 2026
b777354
Merge pull request #84 from Automattic/feature/deliverer-rollout-hard…
chrisbliss18 Apr 29, 2026
a77f00e
Add webhook fixture validation to API CLI smoke
Apr 29, 2026
7a5a337
Keep webhook smoke local-only
Apr 29, 2026
44a73fc
Tighten API CLI webhook smoke assertions
Apr 29, 2026
8cc7f41
Merge pull request #85 from Automattic/feature/api-cli-fixture-workflows
chrisbliss18 Apr 29, 2026
2d0a3db
Rehearse v2 rollout docs against current tooling
Apr 29, 2026
3121623
Add rollout rehearsal planning and docs verification
Apr 29, 2026
ad9e4e4
Bundle post-cutover rollout checks
Apr 29, 2026
92e07a6
Add JSON output for rollout gates
Apr 29, 2026
feeca66
Add rollout quick reference
Apr 29, 2026
cfcd08b
Add rollout state report
Apr 29, 2026
0ef5505
Merge pull request #86 from Automattic/feature/v2-rollout-docs-rehearsal
chrisbliss18 Apr 29, 2026
20c133d
Add rollout host preflight gate
Apr 29, 2026
a958663
Clarify rollout preflight rehearsal flow
Apr 29, 2026
92c221c
Add guided rollout walkthrough
Apr 29, 2026
7c50b33
Harden guided rollout rehearsals
Apr 29, 2026
50c0712
Expand guided rollout flow coverage
Apr 29, 2026
a3e266b
Clarify guided rollout run origin
Apr 29, 2026
a7e632b
Cover fresh-server guided happy paths
Apr 29, 2026
3616695
Add rollout VM lab harness
Apr 29, 2026
44365b5
Complete rollout VM lab flow coverage
Apr 29, 2026
5f7b4a7
Expand rollout VM lab failure coverage
Apr 29, 2026
43779b6
Harden rollout service state verification
Apr 29, 2026
8d1bf3f
Keep VM snapshot smokes on current artifacts
Apr 30, 2026
b3af788
Merge pull request #88 from Automattic/feature/rollout-host-preflight…
chrisbliss18 Apr 30, 2026
5ec6db3
Merge latest v2 into rollout VM lab branch
Apr 30, 2026
ed77bee
Correct rollout VM lab host requirements
Apr 30, 2026
4b8a864
Merge pull request #89 from Automattic/feature/rollout-vm-lab
chrisbliss18 Apr 30, 2026
a3dcbcd
Build host dashboard fleet-health foundation
Apr 30, 2026
c3e64f1
Track host dashboard polish and fleet dashboard follow-ups
Apr 30, 2026
25934b6
Harden host dashboard health reporting
Apr 30, 2026
1b80ba0
Tighten host dashboard edge cases
Apr 30, 2026
256beb8
Mark host dashboard plumbing complete in roadmap
Apr 30, 2026
31bf48a
Merge pull request #90 from Automattic/feature/host-dashboard-fleet-p…
chrisbliss18 Apr 30, 2026
1210350
Add initial fleet dashboard
Apr 30, 2026
391b8fc
Refine fleet dashboard operator signals
Apr 30, 2026
49d98d6
Refresh fleet dashboard roadmap status
Apr 30, 2026
79e99ed
Document fleet dashboard operations
Apr 30, 2026
5fb1292
Tighten fleet dashboard review findings
Apr 30, 2026
73d9977
Harden fleet dashboard edge cases
Apr 30, 2026
8f03916
Reduce fleet dashboard delivery queries
Apr 30, 2026
2d06e4a
Merge pull request #91 from Automattic/feature/fleet-dashboard
chrisbliss18 Apr 30, 2026
a82fc77
Add rollout rehearsal verification
Apr 30, 2026
4969b0a
Record rollout VM rehearsal pass
Apr 30, 2026
8b93e05
Harden rollout rehearsal verifier
Apr 30, 2026
95e9655
Guard rollout rehearsal plan writes
Apr 30, 2026
1a402e3
Merge pull request #92 from Automattic/feature/v2-rollout-docs-rehear…
chrisbliss18 Apr 30, 2026
440cf25
Expand projection drift diagnostics
Apr 30, 2026
5d16a77
Document projection drift repair caution
Apr 30, 2026
e75594c
Harden projection drift review findings
Apr 30, 2026
9e0a650
Clarify projection drift cause totals
Apr 30, 2026
b970444
Merge pull request #93 from Automattic/feature/projection-drift-tooling
chrisbliss18 Apr 30, 2026
8a4c0dc
Add true RSS reporting to dashboards
Apr 30, 2026
96c6c4a
Merge pull request #94 from Automattic/feature/dashboard-true-rss
chrisbliss18 Apr 30, 2026
cae6646
Add production telemetry report command
Apr 30, 2026
8e6f165
Harden production telemetry report output
Apr 30, 2026
27a4166
Clarify telemetry report health signals
Apr 30, 2026
f3c0568
Add telemetry report window-edge guidance
Apr 30, 2026
e4bc45f
Merge pull request #95 from Automattic/feature/production-telemetry-r…
chrisbliss18 Apr 30, 2026
d934f64
Harden rollout VM lab startup
May 1, 2026
def4f6a
Clarify rollout VM lab auto-start guards
May 1, 2026
5614616
Make rollout VM lab startup fail atomically
May 1, 2026
22af147
Merge pull request #96 from Automattic/feature/rollout-vm-lab-self-start
chrisbliss18 May 1, 2026
f9e3c18
Make Jetmon v2 scheduler drain due work
May 2, 2026
24d6414
Improve scheduler retest visibility
May 2, 2026
c3faf96
Clarify scheduler config documentation
May 2, 2026
42394a7
Improve Jetmon v2 capacity write efficiency
May 3, 2026
af623b7
Add scheduler selection index migration
May 3, 2026
c687df3
Update capacity test configuration defaults
May 3, 2026
a90cee6
Remove obsolete capacity handoff document
May 3, 2026
14d7eaa
Capture Jetmon v2 capacity retest results
May 3, 2026
0d6184b
Merge pull request #97 from Automattic/feature/jetmon-v2-capacity-han…
chrisbliss18 May 3, 2026
966b406
Add indexed next-check scheduler timestamp
May 3, 2026
9a5c30a
Sample broad scheduler reports on cadence
May 3, 2026
43b1309
Reuse bounded HTTP transport for checks
May 3, 2026
2d9a5c2
Batch changed SSL expiry updates
May 3, 2026
b554333
Document scalability test instrumentation
May 3, 2026
74d9d24
Add Jetmon v2 prelaunch readiness tracker
May 3, 2026
49e1be7
Start resolving v2 prelaunch readiness gaps
May 3, 2026
60f7450
Expand Jetmon v2 prelaunch parity coverage
May 3, 2026
a77c03e
Split WPCOM parity reporting by transition type
May 3, 2026
558e81b
Record local rollout rehearsal evidence
May 3, 2026
2e182c0
Add telemetry evidence to rollout procedures
May 3, 2026
026aaf9
Improve capacity failure diagnostics
May 3, 2026
a120e5f
Keep VM lab rehearsal current
May 3, 2026
596ca47
Reuse checker HTTP transport
May 3, 2026
36a4a1b
Sample broad scheduler reports on a cadence
May 3, 2026
e1b1cd1
Materialize variable-interval due times
May 3, 2026
810a6da
Merge scalability efficiency work for capacity stress testing
May 3, 2026
ead0911
Merge checker transport pooling for capacity stress testing
May 3, 2026
69aad58
Merge scheduler reporting cadence work for capacity stress testing
May 3, 2026
fc016f6
Merge materialized next-check scheduling for capacity stress testing
May 3, 2026
eddb929
Merge pull request #98 from Automattic/integration/jetmon-v2-capacity…
chrisbliss18 May 3, 2026
a5fd459
Merge uptime-bench Jetmon v2 detection and capacity fixes (#99)
chrisbliss18 May 4, 2026
2e536b5
Merge latest v2 into prelaunch recommendations branch
May 5, 2026
952eeac
Improve incident observation metadata
May 5, 2026
952cb35
checker: detect truncated/partial GET responses via strict EOF valida…
heydemoura May 5, 2026
b9670a0
Classify DNS and deprecated TLS probe evidence
May 6, 2026
b25c433
Use probe-cleared for advisory event recovery
May 6, 2026
fbe08ff
Merge latest v2 strict body-read checks
May 7, 2026
28b262b
Enrich incident metadata for body-read failures
May 7, 2026
da20180
Fix telemetry suppression parity accounting
May 7, 2026
6bd977e
Address Jetmon v2 prelaunch readiness recommendations (#100)
chrisbliss18 May 7, 2026
6994249
Merge remote-tracking branch 'origin/v2' into feature/jetmon-v2-incid…
May 7, 2026
d599e86
Improve Jetmon v2 incident observability for TLS and DNS evidence (#102)
chrisbliss18 May 7, 2026
bf42033
Add the streaming monitor engine for high-scale checks (#104)
chrisbliss18 May 12, 2026
a0b54e7
Add GitHub Actions workflow to publish Docker images to GHCR (#106)
heydemoura May 12, 2026
ec1a26d
Harden DNS-related post-recovery false positives (#108)
chrisbliss18 May 13, 2026
f316f57
Add staged HEAD/GET rollout check modes
chrisbliss18 May 13, 2026
edfe7cb
Avoid repeated Veriflier confirmation for down sites (#111)
chrisbliss18 May 13, 2026
51f2767
Track PR 101 follow-up hardening work
May 13, 2026
8a98783
Harden WPCOM failure handling from PR 101
May 13, 2026
9a05c93
Harden PR 101 WPCOM and streaming follow-ups (#110)
chrisbliss18 May 13, 2026
1caa330
Add production data rollout readiness checks (#112)
chrisbliss18 May 14, 2026
ba09fe7
Preserve endpoint identity for duplicate monitor rows (#113)
chrisbliss18 May 14, 2026
55cc0fa
Rebuild Veriflier v2 contract and discovery (#105)
chrisbliss18 May 14, 2026
29 changes: 29 additions & 0 deletions .agents/skills/handoff/SKILL.md
@@ -0,0 +1,29 @@
---
name: handoff
description: Create a self-contained Jetmon handoff for another agent.
---

# Jetmon Handoff

Use this when Chris asks for a handoff doc or wants another agent to continue a
Jetmon thread.

## Include

- Repo path, branch, and relevant commit IDs.
- Whether the work affects Jetmon v1, Jetmon v2, Veriflier, bridge, support
services, or uptime-bench.
- Active test locks and what must not be changed.
- Problem statement, evidence, and current hypothesis.
- Relevant logs, reports, metrics, PRs, and file paths.
- Commands already run and their outcome.
- Next recommended actions and approvals needed.

## Placement

During active tests, prefer `.agents` or global memory for agent-only handoffs.
Ask before editing non-agent project docs.

## Secrets

Do not include tokens, passwords, private keys, or unredacted service configs.
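
One way to apply the rule above is to pass pasted logs or config excerpts through a small redaction filter before saving a handoff. The sketch below is an illustration only and not part of the skill: `redact_secrets` and its keyword list are hypothetical. It blanks the value on any line whose key looks like a credential. It does not catch multi-line private keys, so the output still needs a manual review.

```bash
# Hypothetical helper (not part of the Jetmon repo): blank out values on
# lines whose key name looks like a credential.
redact_secrets() {
  awk '
    {
      lower = tolower($0)
      # Match a credential-like key followed by "=" or ":".
      if (match(lower, /(token|password|secret|api_key)[a-z0-9_"]*[ \t]*[=:]/)) {
        # Keep the key and separator, replace the rest of the line.
        print substr($0, 1, RSTART + RLENGTH - 1) " [REDACTED]"
        next
      }
      print
    }'
}

# Usage: redact_secrets < raw-notes.txt > handoff-notes.txt
```

The filter deliberately over-redacts: a harmless line such as `token count: 5` is also blanked, which is the safer failure mode for a handoff.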
44 changes: 44 additions & 0 deletions .agents/skills/jetmon-test-fleet/SKILL.md
@@ -0,0 +1,44 @@
---
name: jetmon-test-fleet
description: Work safely with Jetmon services used by uptime-bench capacity tests.
---

# Jetmon Test Fleet

Use this when Chris asks about Jetmon v1/v2 test services, Verifliers, support
services, Prometheus capacity data, or whether a Jetmon branch is ready for
uptime-bench tests.

## Safety First

- If tests are running, do not restart services, change config, move support
services, deploy binaries, mutate databases, or alter target/provider state
without explicit permission.
- Prefer read-only inspection and report analysis during active tests.
- State which repo is being acted on before making changes.

## Common Context

- Uptime-bench canonical repo:
`/home/gaarai/code/uptime-bench`.
- Current Prometheus for Jetmon capacity work:
`http://10.0.0.67:9091`.
- Service hosts:
`jetmon-service-host-1`/`jetmon-v1`,
`jetmon-service-host-2`/`jetmon-v2`,
`jetmon-service-host-3`,
`jetmon-service-host-4`.
- Support/monitoring hosts:
`jetmon-vm-host-1`,
`jetmon-vm-host-2`,
`jetmon-vm-host-3`.

## Output Expectations

When answering readiness or risk questions, include:

- Branch and commit under discussion.
- What is deployed versus only local.
- Which checks were read-only.
- Whether changes are safe during an active uptime-bench run.
- Recommended next action and any approval needed.
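
As an example of the read-only inspection preferred above, an instant query against the Prometheus HTTP API (`GET /api/v1/query`) reads data without touching any service. This is a minimal sketch under assumptions: `prom_query` and the `DRY_RUN` switch are hypothetical names, and `up` is Prometheus's built-in scrape-health series rather than a Jetmon-specific metric, since the source does not list metric names.

```bash
# Hypothetical helper: run a read-only Prometheus instant query.
# With DRY_RUN=1 it prints the curl command instead of executing it.
prom_query() {
  local base="$1" query="$2"
  # -G sends the url-encoded data as a GET query string.
  set -- curl -sS -G --data-urlencode "query=${query}" "${base}/api/v1/query"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage:
#   DRY_RUN=1 prom_query http://10.0.0.67:9091 up   # show the command only
#   prom_query http://10.0.0.67:9091 up             # run the query
```

Printing the command first makes it easy to state in a report exactly which check was run and that it was read-only.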
31 changes: 31 additions & 0 deletions .agents/skills/safe-background-work/SKILL.md
@@ -0,0 +1,31 @@
---
name: safe-background-work
description: Pick useful Jetmon work that cannot affect active uptime-bench or Jetmon tests.
---

# Safe Background Work

Use this when tests are running and Chris asks what can be done without
interrupting them.

## Allowed By Default

- Local code review and static analysis.
- Agent-specific files.
- Branch inspection and commit comparison.
- Handoff writing.
- Local-only planning for changes that will not be deployed.

## Ask First

- Deploying binaries or configs.
- Restarting `jetmon2`, Jetmon v1, bridge, Veriflier, database, StatsD, or
monitoring services.
- Moving support services between hosts.
- Changing bucket ownership, pinned bucket ranges, or test fleet data.
- Running smoke tests that create, delete, or modify sites/providers.

## Blocker Policy

If a safe task becomes blocked on approval, record the blocker and move to the
next safe task.
93 changes: 31 additions & 62 deletions .claude/commands/debug-memory.md
@@ -1,97 +1,66 @@
# Debug Memory Issues

Debug memory issues in Jetmon workers and identify leaks.
Debug memory growth and goroutine leaks in the Jetmon 2 Go binary.

## Instructions

Help the user diagnose memory problems in Jetmon workers. Memory leaks are a known pitfall because workers are long-running processes.
Help the user diagnose memory problems in Jetmon 2. Unlike the old Node.js/worker architecture,
Jetmon 2 is a single Go binary. Memory pressure does not cause worker crashes — instead the
orchestrator drains the goroutine pool when RSS exceeds `WORKER_MAX_MEM_MB`.

### 1. Check Current Memory Status

First, see current memory usage of all Jetmon processes:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon ps aux --sort=-%mem | grep -E '(node|PID)' | head -20
cd docker && docker compose exec jetmon ps aux
```

Check worker memory limits in config:
Check memory config:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon cat config/config.json | grep -E '(WORKER_MAX_MEM|WORKER_MAX_CHECK)'
docker compose exec jetmon cat config/config.json | grep -E '(WORKER_MAX_MEM|NUM_WORKERS)'
```

### 2. Monitor Memory Over Time
### 2. Use pprof for Deep Analysis

Watch memory growth in real-time:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon bash -c 'while true; do echo "=== $(date) ==="; ps aux --sort=-%mem | grep node | head -10; sleep 10; done'
```

Let this run for a few minutes to observe trends. Look for:
- Workers steadily increasing memory without recycling
- Workers approaching or exceeding `WORKER_MAX_MEM_MB` (default 53MB)
- Memory not dropping after worker recycle

### 3. Check Worker Recycling
The operator dashboard exposes pprof endpoints at http://localhost:8080/debug/pprof/

Verify workers are being recycled when hitting limits:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose logs jetmon 2>&1 | grep -E '(memory|recycle|spawn|die|limit)' | tail -30
```

### 4. Force Aggressive Recycling (Testing)
# Count goroutines
curl -s 'http://localhost:8080/debug/pprof/goroutine?debug=1' | head -n 1

To test worker recycling behavior, temporarily set low limits:

```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon bash -c 'cat > /tmp/test-config.json << EOF
{
"WORKER_MAX_CHECKS": 50,
"WORKER_MAX_MEM_MB": 20
}
EOF
cat /tmp/test-config.json'
# Heap profile
curl http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof
```
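
To track goroutine growth over time, it helps to extract just the number. With `debug=1`, Go's goroutine profile begins with a summary line of the form `goroutine profile: total N`. The sketch below reads that line; `goroutine_total` is a hypothetical helper name, and the dashboard address is the one used in this guide.

```bash
# Hypothetical helper: read a debug=1 goroutine profile on stdin and print
# the total from its "goroutine profile: total N" header line.
goroutine_total() {
  awk '/^goroutine profile: total / { print $4; exit }'
}

# Usage (sample every 10 seconds to spot steady growth):
#   while true; do
#     curl -s 'http://localhost:8080/debug/pprof/goroutine?debug=1' | goroutine_total
#     sleep 10
#   done
```

A count that keeps climbing between rounds without settling is the signal worth investigating; a single high reading is not.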

Tell the user to manually update `config/config.json` with these values, then reload:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon sh -c 'kill -HUP $(pgrep -f "node lib/jetmon.js" | head -1)'
```
### 3. Monitor Memory Over Time

Watch for recycling:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose logs -f jetmon 2>&1 | grep -E '(spawn|die|recycle|memory|limit)'
docker compose exec jetmon bash -c 'while true; do ps -o pid,rss,vsz,comm -p $(pgrep jetmon2); sleep 10; done'
```

### 5. Check for Known Memory Issues
Enable detailed StatsD metrics by setting `STATSD_SEND_MEM_USAGE: true` in `config/config.json`,
then reload config: `docker compose exec jetmon ./jetmon2 reload`

**Retry queue growth:** If retry queues aren't being processed, they can grow unbounded:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose logs jetmon 2>&1 | grep -i retry | tail -20
```
### 4. Check Retry Queue Size

Large retry queues indicate many sites are down and being tracked. This is expected behaviour.

**StatsD buffer:** Check if metrics buffer is growing:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon bash -c 'cat stats/* 2>/dev/null'
curl http://localhost:8080/api/state | python3 -m json.tool
```

### 6. Analyze with Node.js Tools (Advanced)

If deeper analysis is needed, suggest:

1. **Heap snapshots:** Would require code changes to expose `v8.writeHeapSnapshot()`
2. **--inspect flag:** Could attach Chrome DevTools, but requires exposing debug port
3. **Process stats:** Check `/proc/<pid>/status` for detailed memory breakdown
Look at `RetryQueueSize`.
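
For scripted checks, the field can be pulled out of the response directly. This is a minimal sketch that assumes `RetryQueueSize` appears in the `/api/state` JSON as a plain non-negative integer; the helper name `retry_queue_size` is hypothetical, and the rest of the response shape is not specified in the source.

```bash
# Hypothetical helper: print the integer value of "RetryQueueSize" from
# JSON on stdin. Works on compact or pretty-printed output.
retry_queue_size() {
  grep -o '"RetryQueueSize"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$'
}

# Usage:
#   curl -s http://localhost:8080/api/state | retry_queue_size
```

This is text matching, not JSON parsing, so it is only suitable for a quick look; if the key is missing the function prints nothing and returns a non-zero status.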

### 7. Common Memory Issues in Jetmon
### 5. Common Issues

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Workers never recycle | `WORKER_MAX_MEM_MB` set to 0 or very high | Set reasonable limit (53MB default) |
| Memory spikes during rounds | Too many concurrent checks | Reduce `NUM_TO_PROCESS` |
| Gradual leak over hours | Retry queue not draining | Check Veriflier connectivity |
| Sudden OOM | Node.js version regression | Test with previous Node version |
| Goroutine count grows | Context not cancelled on shutdown | Verify `orch.Stop()` called |
| Memory never drops | Pool drain not triggered | Check `WORKER_MAX_MEM_MB` value |
| Retry queue unbounded | Veriflier unreachable | Check veriflier connectivity |
| High allocations | Keyword-check body reads | Reduce `NUM_WORKERS` |

### 8. Restore Normal Settings
### 6. Restore Normal Settings

Remind user to restore normal config values after testing:
- `WORKER_MAX_MEM_MB`: 53
- `WORKER_MAX_CHECKS`: 10000
After testing, remind the user to restore:
- `STATSD_SEND_MEM_USAGE`: false (avoid extra StatsD traffic in production)
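
The comparison behind the drain rule in the introduction (RSS exceeding `WORKER_MAX_MEM_MB`) can be reproduced by hand from the `ps` output. The sketch below is only an illustration of that comparison, not the orchestrator's implementation: `rss_exceeds_limit` is a hypothetical helper, and it assumes RSS is given in KiB, which is how `ps` reports it on Linux.

```bash
# Hypothetical helper: succeed (exit 0) when rss_kb exceeds limit_mb.
# The limit is converted from MiB to KiB before comparing.
rss_exceeds_limit() {
  local rss_kb="$1" limit_mb="$2"
  [ "$rss_kb" -gt $((limit_mb * 1024)) ]
}

# Usage (61440 KiB is 60 MiB; 53 is only an example limit):
#   rss_exceeds_limit 61440 53 && echo "over limit"
```

Feeding it the `rss` column from the monitoring loop, together with the configured `WORKER_MAX_MEM_MB` value, shows whether a drain should have been triggered.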
46 changes: 29 additions & 17 deletions .claude/commands/docker-test.md
@@ -1,79 +1,91 @@
# Docker Test Environment

Run, debug, and test Jetmon using the Docker development environment.
Run, debug, and test Jetmon 2 using the Docker development environment.

## Instructions

Help the user test Jetmon in the Docker environment. Follow these steps:
Help the user test Jetmon 2 in the Docker environment. Follow these steps:

### 1. Check Docker Status
First, check if the Docker environment is already running:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose ps
cd docker && docker compose ps
```

### 2. Start Services (if needed)
If services aren't running, start them:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose up -d
cd docker && docker compose up -d
```

Wait a few seconds for services to initialize, then verify:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose ps
docker compose ps
```
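
Instead of waiting a fixed few seconds, the verification step can poll until a readiness command succeeds. This is a minimal sketch: `wait_for` is a hypothetical helper, and the readiness command in the usage line is an assumption that reuses the Veriflier status URL from this guide.

```bash
# Hypothetical helper: run a command until it succeeds, up to `attempts`
# times, sleeping `delay` seconds between tries.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (up to 30 tries, 1 second apart):
#   wait_for 30 1 docker compose exec -T jetmon curl -fsS http://veriflier:7803/status
```

Bounding the number of attempts keeps a service that never comes up from hanging the session; the non-zero return makes the failure visible.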

### 3. Ask User What They Want to Test

Present these options:
- **View logs** - Watch Jetmon or Veriflier logs in real-time
- **Check worker status** - See worker activity and stats
- **Operator dashboard** - Open http://localhost:8080 in a browser
- **Test with sample sites** - Insert test URLs into database
- **Test configuration reload** - Send SIGHUP to master process
- **Test graceful shutdown** - Verify shutdown behavior
- **Test configuration reload** - Send SIGHUP to reload config
- **Test graceful drain** - Verify drain/shutdown behaviour
- **Test Veriflier connectivity** - Check Veriflier is responding
- **View audit log** - Query the audit log for a specific blog
- **View metrics** - Check StatsD/Graphite dashboard

### 4. Execute Based on Selection

**View logs:**
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose logs -f jetmon
docker compose logs -f jetmon
```

**Check worker status:**
**Check process and stats:**
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon ps auxf
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon cat stats/sitespersec
docker compose exec jetmon ps aux
docker compose exec jetmon cat stats/sitespersec
docker compose exec jetmon cat stats/sitesqueue
```

**Test with sample sites:**
First check if table exists and has data:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec mysqldb mysql -u root -p123456 jetmon_db -e "SELECT COUNT(*) as count FROM jetpack_monitor_sites;" 2>/dev/null
docker compose exec mysqldb mysql -u root -p123456 jetmon_db -e "SELECT COUNT(*) as count FROM jetpack_monitor_sites;" 2>/dev/null
```

If the table is empty or doesn't exist, offer to create test data per `running-tests.md`.

**Test configuration reload:**
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon sh -c 'kill -HUP $(pgrep -f "node lib/jetmon.js" | head -1)'
docker compose exec jetmon ./jetmon2 reload
```

**Test drain/graceful shutdown:**
```bash
docker compose exec jetmon ./jetmon2 drain
```

**Test Veriflier connectivity:**
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose exec jetmon curl -k https://veriflier:7801/get/status
docker compose exec jetmon curl http://veriflier:7803/status
```

**View audit log:**
```bash
docker compose exec jetmon ./jetmon2 audit --blog-id 1 --since 1h
```

**View metrics:**
Tell the user to open http://localhost:8088 and navigate to `Metrics > stats > com > jetpack > jetmon > docker > jetmon`

### 5. Cleanup (if requested)
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose down
docker compose down
```

Or to fully reset with fresh database:
```bash
cd /Users/rdcoll/Code/a8c/jetmon/docker && docker compose down -v
docker compose down -v
```