fix: measure memory while containers are running, not after teardown#1765

Merged

Mossaka merged 2 commits intomainfrom

fix/1758-memory-benchmark

Apr 7, 2026

Collaborator

Mossaka commented Apr 7, 2026

Summary

Fixes benchmarkMemory() in scripts/ci/benchmark-performance.ts which always reported 0 MB because containers were already stopped before docker stats ran
Spawns awf with sleep 30 in the background, polls until containers are healthy, then samples docker stats while containers are alive
Adds proper cleanup (killBackground + process group kill) to handle edge cases

Details

The previous implementation ran echo measuring_memory as the AWF command, which completed instantly. Even with --keep-containers, by the time docker stats executed the containers were gone.

The fix:

Starts awf ... sleep 30 as a detached background process via child_process.spawn
Polls docker ps at 500ms intervals until both awf-squid and awf-agent are running (30s timeout)
Waits 2s for memory to stabilize
Samples docker stats while containers are alive
Kills the background process group and runs cleanup in a finally block

Closes #1758

Test plan

npm run build succeeds
npm test passes (1311 tests)
npm run lint passes on modified file
Script compiles and starts executing with npx tsx
CI passes

🤖 Generated with Claude Code


          fix: measure memory while containers are running, not after teardown

dc69f41

Previously benchmarkMemory() ran `echo measuring_memory` which completed
instantly, so containers were already stopped before docker stats ran,
always reporting 0 MB. Now spawns awf with `sleep 30` in the background,
polls until containers are healthy, then samples docker stats while
containers are alive.

Closes #1758

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings

April 7, 2026 22:11

Copilot started reviewing on behalf of Mossaka

April 7, 2026 22:12

Contributor

github-actions bot commented Apr 7, 2026 •

edited

Loading

✅ Coverage Check Passed

Overall Coverage

Metric	Base	PR	Delta
Lines	86.14%	86.23%	📈 +0.09%
Statements	86.02%	86.11%	📈 +0.09%
Functions	87.45%	87.45%	➡️ +0.00%
Branches	78.81%	78.86%	📈 +0.05%

📁 Per-file Coverage Changes (1 files)

File	Lines (Before → After)	Statements (Before → After)
`src/docker-manager.ts`	86.3% → 86.7% (+0.37%)	85.9% → 86.2% (+0.36%)

Coverage comparison generated by scripts/ci/compare-coverage.ts

Copilot AI reviewed

View reviewed changes

Contributor

Copilot AI left a comment

Pull request overview

Fixes the CI performance benchmark’s memory measurement so it samples docker stats while AWF containers are still running (instead of after AWF teardown), addressing the “always 0 MB” issue.

Changes:

Reworks benchmarkMemory() to run awf ... sleep 30 in a background process and measure memory while containers are alive.
Adds polling (waitForContainers) plus a short stabilization delay before sampling docker stats.
Adds best-effort background process cleanup (killBackground) and ensures cleanup runs in finally.

Show a summary per file

File	Description
`scripts/ci/benchmark-performance.ts`	Updates the memory benchmark to keep containers running during measurement and adds background process lifecycle management.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Files reviewed: 1/1 changed files
Comments generated: 5

scripts/ci/benchmark-performance.ts Outdated

Comment on lines +177 to +180

+                          `sudo docker ps --filter name=${name} --filter status=running --format '{{.Names}}' 2>/dev/null`,
+                          { encoding: "utf-8", timeout: 5_000 }
+                        ).trim();
+                        return result.includes(name);

Copilot AI Apr 7, 2026

waitForContainers() uses docker ps --filter name=${name} and then result.includes(name). Docker's name filter is a regex/substring match, so this can return true for containers like awf-squid-old, leading to false positives. Use an anchored name filter (e.g., name=^${name}$) and/or require an exact match on the returned name(s).

Suggested change

      
                        `sudo docker ps --filter name=${name} --filter status=running --format '{{.Names}}' 2>/dev/null`,
          
                        { encoding: "utf-8", timeout: 5_000 }
          
                      ).trim();
          
                      return result.includes(name);
          
                        `sudo docker ps --filter name=^${name}$ --filter status=running --format '{{.Names}}' 2>/dev/null`,
          
                        { encoding: "utf-8", timeout: 5_000 }
          
                      )
          
                        .trim()
          
                        .split("\n")
          
                        .map((containerName) => containerName.trim())
          
                        .filter(Boolean);
          
                      return result.some((containerName) => containerName === name);

Copilot uses AI. Check for mistakes.

scripts/ci/benchmark-performance.ts

Comment on lines +245 to +250

+                    // Wait for both containers to be running (up to 30s)
+                    await waitForContainers(["awf-squid", "awf-agent"], 30_000);
+                    // Give containers a moment to stabilize memory usage
+                    await new Promise((resolve) => setTimeout(resolve, 2000));

Copilot AI Apr 7, 2026

The benchmark proceeds once containers are merely status=running. Since Squid has a Docker healthcheck (and the agent depends on it), sampling memory before Squid is healthy can make the metric noisy and can mask startup failures/restarts. Consider waiting for Squid to be healthy (e.g., via docker ps --filter health=healthy or docker inspect health status) before taking the memory sample.

Copilot uses AI. Check for mistakes.

scripts/ci/benchmark-performance.ts

+                        detached: true,
+                        stdio: "ignore",
+                      }
                     );

Copilot AI Apr 7, 2026

Because the child process is spawned with detached: true and stdio: 'ignore', it’s typically a good idea to call child.unref() after spawning. This prevents the parent benchmark script from being kept alive if cleanup/kill logic fails in an edge case.

Suggested change

      
                  );
          
                  );
          
                  child.unref();

Copilot uses AI. Check for mistakes.

scripts/ci/benchmark-performance.ts

Comment on lines +212 to +220

+                try {
+                  if (child.pid) {
+                    // Kill the process group (negative PID) to catch child processes
+                    process.kill(-child.pid, "SIGTERM");
+                  }
+                } catch {
+                  // Process may have already exited
+                }
+                try {

Copilot AI Apr 7, 2026

killBackground() sends SIGTERM to the process group but then only SIGKILLs the direct child PID. If the group spawns additional processes that ignore SIGTERM, they can survive. Consider sending SIGKILL to the process group as well (optionally after a short grace period), instead of only killing the child process.

Suggested change

      
              try {
          
                if (child.pid) {
          
                  // Kill the process group (negative PID) to catch child processes
          
                  process.kill(-child.pid, "SIGTERM");
          
                }
          
              } catch {
          
                // Process may have already exited
          
              }
          
              try {
          
              const pid = child.pid;
          
              if (!pid) {
          
                return;
          
              }
          
              try {
          
                // Kill the process group (negative PID) to catch child processes
          
                process.kill(-pid, "SIGTERM");
          
              } catch {
          
                // Process group may have already exited
          
              }
          
              try {
          
                // Force-kill the entire process group so descendant processes do not survive
          
                process.kill(-pid, "SIGKILL");
          
              } catch {
          
                // Process group may have already exited
          
              }
          
              try {
          
                // Best-effort fallback for the direct child process

Copilot uses AI. Check for mistakes.

scripts/ci/benchmark-performance.ts

Comment on lines +236 to +239

+                    child = spawn(
+                      "sudo",
+                      ["awf", "--allow-domains", ALLOWED_DOMAIN, "--log-level", "error", "--", "sleep", "30"],
+                      {

Copilot AI Apr 7, 2026

This path hard-codes sudo + awf in spawn(...) even though the script already defines AWF_CMD = "sudo awf" and uses it elsewhere. To avoid future drift (e.g., if AWF_CMD changes), consider deriving the spawned command/args from AWF_CMD or otherwise reusing that constant.

Copilot uses AI. Check for mistakes.

This comment has been minimized.

Sign in to view

This comment has been minimized.

Sign in to view

github-actions bot added smoke-copilot smoke-claude labels

github-actions bot mentioned this pull request

[aw] No-Op Runs #1668

Open

This comment has been minimized.

Sign in to view

github-actions bot added the smoke-services label

This comment has been minimized.

Sign in to view

This comment has been minimized.

Sign in to view

Mossaka commented

View reviewed changes

Collaborator Author

Mossaka left a comment

Security Review: PR #1765 — Memory Benchmark Fix

Overall Assessment

Risk: Low. This PR modifies a CI-only benchmark script (scripts/ci/benchmark-performance.ts) that runs in controlled CI environments with hardcoded inputs. No user-controlled data reaches shell execution. The changes are functionally sound and improve the correctness of memory measurement.

Detailed Findings

1. Process Group Cleanup — Low

The killBackground() function sends SIGTERM to the process group (-child.pid) followed by SIGKILL to the child. This is a solid pattern.

Minor gap: There is no timeout between SIGTERM and SIGKILL. The SIGTERM to the process group may not have propagated before SIGKILL fires on the direct child, potentially leaving grandchildren (the awf process tree includes Docker containers). However, this is mitigated by the cleanup() call in the finally block, which explicitly tears down containers via docker compose down -v and docker rm -f.

Verdict: Acceptable. The finally block's cleanup() is the real safety net for container teardown, and the killBackground() is best-effort process cleanup. No resource leak risk in practice.

2. No Command Injection Risk — Info

All arguments to spawn() are hardcoded constants ("sudo", "awf", "--allow-domains", ALLOWED_DOMAIN, etc.) passed as an array — not interpolated into a shell string. The spawn call does not use shell: true, so no shell metacharacter injection is possible even if constants were somehow modified. This is a good pattern.

Similarly, waitForContainers() uses execSync with the container name interpolated from a hardcoded array ["awf-squid", "awf-agent"] — no user input reaches these commands.

3. Detached Process with `stdio: "ignore"` — Low

The spawn call uses detached: true and stdio: "ignore". The detached flag creates a new process group, which is required for the -child.pid group kill in killBackground() to work. The stdio: "ignore" prevents the child from holding the parent's streams open. This is correct.

Edge case: If the Node.js process is killed with SIGKILL (uncatchable), the detached sudo awf ... sleep 30 process group will survive as an orphan. The sleep 30 command inside the container will naturally exit after 30 seconds, which will trigger AWF's normal container teardown. The worst case is a 30-second container leak, which is acceptable for a CI benchmark script and is further mitigated by CI's always() cleanup jobs.

4. Polling Loop Timeout — Low

waitForContainers() has a 30-second timeout, preventing indefinite hangs. Each execSync call inside the poll also has its own 5-second timeout. This is well-bounded.

Note: The overall execution per iteration is bounded by: 30s container wait + 2s stabilization + docker stats calls (each with the exec() helper's 120s default timeout). In the worst case, a single iteration could take ~2.5 minutes. With 5 iterations, total memory benchmark time could reach ~12.5 minutes in a pathological scenario. This is unlikely but worth noting for CI timeout planning.

5. `sudo` Usage — Info

The script uses sudo for awf, docker stats, and docker ps. This is consistent with the existing codebase pattern — AWF requires sudo for iptables manipulation. The script is only invoked in CI contexts where sudo is available without a password. No new privilege escalation surface is introduced.

6. Container Name Filter in `docker ps` — Low

`sudo docker ps --filter name=${name} --filter status=running --format '{{.Names}}' 2>/dev/null`

The name variable comes from the hardcoded array ["awf-squid", "awf-agent"]. If these names ever contained shell metacharacters, the execSync call would interpret them. Since they are hardcoded alphanumeric-plus-hyphen strings, this is safe. If this function were ever generalized to accept external input, it would need parameterization.

7. Cleanup in `finally` Block — Info (Positive)

Moving cleanup() into the finally block is strictly better than the previous pattern where cleanup() was called after the try/catch. The finally block guarantees cleanup runs even if an unhandled exception occurs within the try. This is a security improvement.

Summary

#	Finding	Severity
1	No grace period between SIGTERM and SIGKILL in `killBackground()`	Low
2	No command injection — hardcoded args, array-based spawn	Info
3	Detached orphan survives SIGKILL, self-heals via sleep timeout	Low
4	Polling loop properly bounded with 30s timeout	Low
5	`sudo` usage consistent with existing patterns	Info
6	Container name filter safe due to hardcoded values	Low
7	`finally`-based cleanup is an improvement over previous pattern	Info (Positive)

Recommendation: Approve. No blocking security concerns. The only minor hardening suggestion would be adding a small delay (e.g., 500ms) between SIGTERM and SIGKILL in killBackground() to give the process group time to terminate gracefully, but this is cosmetic given the cleanup() safety net.

— Security Review Agent


          fix: exact name matching, process group cleanup

096f284

- Use anchored regex (^name$) in docker ps filter to avoid substring
  false positives from similarly-named containers
- Send SIGKILL to the entire process group (not just the direct child)
  so descendant processes like sudo/awf/docker don't survive
- Add child.unref() to prevent parent from hanging if cleanup fails
- Derive spawn args from AWF_CMD constant to avoid drift

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Collaborator Author

Mossaka commented Apr 7, 2026

Addressed all Copilot review feedback in 096f284:

Exact name matching -- Using anchored regex (^name$) in docker ps --filter plus exact string comparison on results to avoid substring false positives
Process group SIGKILL -- Now sends both SIGTERM and SIGKILL to the entire process group (-pid) so descendant processes don't survive
child.unref() -- Added to prevent the parent from hanging if cleanup fails
AWF_CMD reuse -- Now derives spawn args from the AWF_CMD constant instead of hardcoding

Regarding the suggestion to wait for Squid health status: the 2-second stabilization delay after containers are running serves a similar purpose and keeps the implementation simpler. The benchmark is measuring steady-state memory, not startup memory, so the delay is sufficient.

Contributor

github-actions bot commented Apr 7, 2026

Smoke Test Results

✅ GitHub MCP: "test: add CLI proxy sidecar integration tests" / "chore: upgrade workflows to gh-aw-actions v0.67.2"
✅ Playwright: GitHub page title verified
✅ File write: /tmp/gh-aw/agent/smoke-test-claude-24107839261.txt created
✅ Bash: File content verified

Overall: PASS

💥 [THE END] — Illustrated by Smoke Claude

Contributor

github-actions bot commented Apr 7, 2026

🔥 Smoke Test Results

Test	Status
GitHub MCP connectivity	✅
GitHub.com HTTP connectivity	✅
File write/read	✅ `/tmp/gh-aw/agent/smoke-test-copilot-24107839277.txt`

PR: fix: measure memory while containers are running, not after teardown
Author: @Mossaka

Overall: PASS

📰 BREAKING: Report filed by Smoke Copilot

Contributor

github-actions bot commented Apr 7, 2026

Smoke Test: GitHub Actions Services Connectivity ✅

Service	Check	Result
Redis `host.docker.internal:6379`	PING	✅ `PONG`
PostgreSQL `host.docker.internal:5432`	`pg_isready`	✅ accepting connections
PostgreSQL `smoketest` DB	`SELECT 1`	✅ returned `1`

All checks passed.

🔌 Service connectivity validated by Smoke Services

github-actions bot added the build-test label

Contributor

github-actions bot commented Apr 7, 2026

🏗️ Build Test Suite Results

Ecosystem	Project	Build/Install	Tests	Status
Bun	elysia	✅	1/1 passed	✅ PASS
Bun	hono	✅	1/1 passed	✅ PASS
C++	fmt	✅	N/A	✅ PASS
C++	json	✅	N/A	✅ PASS
Deno	oak	N/A	1/1 passed	✅ PASS
Deno	std	N/A	1/1 passed	✅ PASS
.NET	hello-world	✅	N/A	✅ PASS
.NET	json-parse	✅	N/A	✅ PASS
Go	color	✅	all passed	✅ PASS
Go	env	✅	all passed	✅ PASS
Go	uuid	✅	all passed	✅ PASS
Java	gson	✅	1/1 passed	✅ PASS
Java	caffeine	✅	1/1 passed	✅ PASS
Node.js	clsx	✅	all passed	✅ PASS
Node.js	execa	✅	all passed	✅ PASS
Node.js	p-limit	✅	all passed	✅ PASS
Rust	fd	✅	1/1 passed	✅ PASS
Rust	zoxide	✅	1/1 passed	✅ PASS

Overall: 8/8 ecosystems passed — ✅ PASS

Generated by Build Test Suite for issue #1765 · ● 658.8K · ◷

Contributor

github-actions bot commented Apr 7, 2026

Smoke Test (run 24107839303)
PR titles: "fix: chown gh-aw config dirs to agent user before privilege drop in entrypoint (#1463)", "test: add CLI proxy sidecar integration tests", "feat: add historical benchmark storage and relative regression detection", "fix: measure memory while containers are running, not after teardown"

GitHub MCP merged PR review ✅
safeinputs-gh PR query ❌ (tool unavailable; used GitHub MCP fallback)
Playwright GitHub title check ✅
Tavily web search ❌ (Tavily tool unavailable)
File write /tmp/gh-aw/agent/smoke-test-codex-24107839303.txt ✅
Bash cat verify file ✅
Discussion oracle comment ✅
npm ci && npm run build ✅
Overall: FAIL

🔮 The oracle has spoken through Smoke Codex

Mossaka merged commit b31ec28 into main

58 of 59 checks passed

Mossaka deleted the fix/1758-memory-benchmark branch

April 7, 2026 23:14

This was referenced Apr 7, 2026

feat: add unit tests for benchmark statistics and threshold logic #1766

Merged

feat: add historical benchmark storage and relative regression detection #1767

Open

fix: guard null execSync return in benchmark exec helper #1772

Merged

fix: add cli-proxy container image to release workflow #1774

Merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build-test smoke-claude smoke-copilot smoke-services