Skip to content

fix: measure memory while containers are running, not after teardown#1765

Merged
Mossaka merged 2 commits intomainfrom
fix/1758-memory-benchmark
Apr 7, 2026
Merged

fix: measure memory while containers are running, not after teardown#1765
Mossaka merged 2 commits intomainfrom
fix/1758-memory-benchmark

Conversation

@Mossaka
Copy link
Copy Markdown
Collaborator

@Mossaka Mossaka commented Apr 7, 2026

Summary

  • Fixes benchmarkMemory() in scripts/ci/benchmark-performance.ts which always reported 0 MB because containers were already stopped before docker stats ran
  • Spawns awf with sleep 30 in the background, polls until containers are healthy, then samples docker stats while containers are alive
  • Adds proper cleanup (killBackground + process group kill) to handle edge cases

Details

The previous implementation ran echo measuring_memory as the AWF command, which completed instantly. Even with --keep-containers, by the time docker stats executed the containers were gone.

The fix:

  1. Starts awf ... sleep 30 as a detached background process via child_process.spawn
  2. Polls docker ps at 500ms intervals until both awf-squid and awf-agent are running (30s timeout)
  3. Waits 2s for memory to stabilize
  4. Samples docker stats while containers are alive
  5. Kills the background process group and runs cleanup in a finally block

Closes #1758

Test plan

  • npm run build succeeds
  • npm test passes (1311 tests)
  • npm run lint passes on modified file
  • Script compiles and starts executing with npx tsx
  • CI passes

🤖 Generated with Claude Code

Previously benchmarkMemory() ran `echo measuring_memory` which completed
instantly, so containers were already stopped before docker stats ran,
always reporting 0 MB. Now spawns awf with `sleep 30` in the background,
polls until containers are healthy, then samples docker stats while
containers are alive.

Closes #1758

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 7, 2026 22:11
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

✅ Coverage Check Passed

Overall Coverage

Metric Base PR Delta
Lines 86.14% 86.23% 📈 +0.09%
Statements 86.02% 86.11% 📈 +0.09%
Functions 87.45% 87.45% ➡️ +0.00%
Branches 78.81% 78.86% 📈 +0.05%
📁 Per-file Coverage Changes (1 files)
File Lines (Before → After) Statements (Before → After)
src/docker-manager.ts 86.3% → 86.7% (+0.37%) 85.9% → 86.2% (+0.36%)

Coverage comparison generated by scripts/ci/compare-coverage.ts

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes the CI performance benchmark’s memory measurement so it samples docker stats while AWF containers are still running (instead of after AWF teardown), addressing the “always 0 MB” issue.

Changes:

  • Reworks benchmarkMemory() to run awf ... sleep 30 in a background process and measure memory while containers are alive.
  • Adds polling (waitForContainers) plus a short stabilization delay before sampling docker stats.
  • Adds best-effort background process cleanup (killBackground) and ensures cleanup runs in finally.
Show a summary per file
File Description
scripts/ci/benchmark-performance.ts Updates the memory benchmark to keep containers running during measurement and adds background process lifecycle management.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 1/1 changed files
  • Comments generated: 5

Comment on lines +177 to +180
`sudo docker ps --filter name=${name} --filter status=running --format '{{.Names}}' 2>/dev/null`,
{ encoding: "utf-8", timeout: 5_000 }
).trim();
return result.includes(name);
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waitForContainers() uses docker ps --filter name=${name} and then result.includes(name). Docker's name filter is a regex/substring match, so this can return true for containers like awf-squid-old, leading to false positives. Use an anchored name filter (e.g., name=^${name}$) and/or require an exact match on the returned name(s).

Suggested change
`sudo docker ps --filter name=${name} --filter status=running --format '{{.Names}}' 2>/dev/null`,
{ encoding: "utf-8", timeout: 5_000 }
).trim();
return result.includes(name);
`sudo docker ps --filter name=^${name}$ --filter status=running --format '{{.Names}}' 2>/dev/null`,
{ encoding: "utf-8", timeout: 5_000 }
)
.trim()
.split("\n")
.map((containerName) => containerName.trim())
.filter(Boolean);
return result.some((containerName) => containerName === name);

Copilot uses AI. Check for mistakes.
Comment on lines +245 to +250
// Wait for both containers to be running (up to 30s)
await waitForContainers(["awf-squid", "awf-agent"], 30_000);

// Give containers a moment to stabilize memory usage
await new Promise((resolve) => setTimeout(resolve, 2000));

Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark proceeds once containers are merely status=running. Since Squid has a Docker healthcheck (and the agent depends on it), sampling memory before Squid is healthy can make the metric noisy and can mask startup failures/restarts. Consider waiting for Squid to be healthy (e.g., via docker ps --filter health=healthy or docker inspect health status) before taking the memory sample.

Copilot uses AI. Check for mistakes.
detached: true,
stdio: "ignore",
}
);
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the child process is spawned with detached: true and stdio: 'ignore', it’s typically a good idea to call child.unref() after spawning. This prevents the parent benchmark script from being kept alive if cleanup/kill logic fails in an edge case.

Suggested change
);
);
child.unref();

Copilot uses AI. Check for mistakes.
Comment on lines +212 to +220
try {
if (child.pid) {
// Kill the process group (negative PID) to catch child processes
process.kill(-child.pid, "SIGTERM");
}
} catch {
// Process may have already exited
}
try {
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

killBackground() sends SIGTERM to the process group but then only SIGKILLs the direct child PID. If the group spawns additional processes that ignore SIGTERM, they can survive. Consider sending SIGKILL to the process group as well (optionally after a short grace period), instead of only killing the child process.

Suggested change
try {
if (child.pid) {
// Kill the process group (negative PID) to catch child processes
process.kill(-child.pid, "SIGTERM");
}
} catch {
// Process may have already exited
}
try {
const pid = child.pid;
if (!pid) {
return;
}
try {
// Kill the process group (negative PID) to catch child processes
process.kill(-pid, "SIGTERM");
} catch {
// Process group may have already exited
}
try {
// Force-kill the entire process group so descendant processes do not survive
process.kill(-pid, "SIGKILL");
} catch {
// Process group may have already exited
}
try {
// Best-effort fallback for the direct child process

Copilot uses AI. Check for mistakes.
Comment on lines +236 to +239
child = spawn(
"sudo",
["awf", "--allow-domains", ALLOWED_DOMAIN, "--log-level", "error", "--", "sleep", "30"],
{
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path hard-codes sudo + awf in spawn(...) even though the script already defines AWF_CMD = "sudo awf" and uses it elsewhere. To avoid future drift (e.g., if AWF_CMD changes), consider deriving the spawned command/args from AWF_CMD or otherwise reusing that constant.

Copilot uses AI. Check for mistakes.
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

Copy link
Copy Markdown
Collaborator Author

@Mossaka Mossaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Review: PR #1765 — Memory Benchmark Fix

Overall Assessment

Risk: Low. This PR modifies a CI-only benchmark script (scripts/ci/benchmark-performance.ts) that runs in controlled CI environments with hardcoded inputs. No user-controlled data reaches shell execution. The changes are functionally sound and improve the correctness of memory measurement.


Detailed Findings

1. Process Group Cleanup — Low

The killBackground() function sends SIGTERM to the process group (-child.pid) followed by SIGKILL to the child. This is a solid pattern.

Minor gap: There is no timeout between SIGTERM and SIGKILL. The SIGTERM to the process group may not have propagated before SIGKILL fires on the direct child, potentially leaving grandchildren (the awf process tree includes Docker containers). However, this is mitigated by the cleanup() call in the finally block, which explicitly tears down containers via docker compose down -v and docker rm -f.

Verdict: Acceptable. The finally block's cleanup() is the real safety net for container teardown, and the killBackground() is best-effort process cleanup. No resource leak risk in practice.

2. No Command Injection Risk — Info

All arguments to spawn() are hardcoded constants ("sudo", "awf", "--allow-domains", ALLOWED_DOMAIN, etc.) passed as an array — not interpolated into a shell string. The spawn call does not use shell: true, so no shell metacharacter injection is possible even if constants were somehow modified. This is a good pattern.

Similarly, waitForContainers() uses execSync with the container name interpolated from a hardcoded array ["awf-squid", "awf-agent"] — no user input reaches these commands.

3. Detached Process with stdio: "ignore"Low

The spawn call uses detached: true and stdio: "ignore". The detached flag creates a new process group, which is required for the -child.pid group kill in killBackground() to work. The stdio: "ignore" prevents the child from holding the parent's streams open. This is correct.

Edge case: If the Node.js process is killed with SIGKILL (uncatchable), the detached sudo awf ... sleep 30 process group will survive as an orphan. The sleep 30 command inside the container will naturally exit after 30 seconds, which will trigger AWF's normal container teardown. The worst case is a 30-second container leak, which is acceptable for a CI benchmark script and is further mitigated by CI's always() cleanup jobs.

4. Polling Loop Timeout — Low

waitForContainers() has a 30-second timeout, preventing indefinite hangs. Each execSync call inside the poll also has its own 5-second timeout. This is well-bounded.

Note: The overall execution per iteration is bounded by: 30s container wait + 2s stabilization + docker stats calls (each with the exec() helper's 120s default timeout). In the worst case, a single iteration could take ~2.5 minutes. With 5 iterations, total memory benchmark time could reach ~12.5 minutes in a pathological scenario. This is unlikely but worth noting for CI timeout planning.

5. sudo Usage — Info

The script uses sudo for awf, docker stats, and docker ps. This is consistent with the existing codebase pattern — AWF requires sudo for iptables manipulation. The script is only invoked in CI contexts where sudo is available without a password. No new privilege escalation surface is introduced.

6. Container Name Filter in docker psLow

`sudo docker ps --filter name=${name} --filter status=running --format '{{.Names}}' 2>/dev/null`

The name variable comes from the hardcoded array ["awf-squid", "awf-agent"]. If these names ever contained shell metacharacters, the execSync call would interpret them. Since they are hardcoded alphanumeric-plus-hyphen strings, this is safe. If this function were ever generalized to accept external input, it would need parameterization.

7. Cleanup in finally Block — Info (Positive)

Moving cleanup() into the finally block is strictly better than the previous pattern where cleanup() was called after the try/catch. The finally block guarantees cleanup runs even if an unhandled exception occurs within the try. This is a security improvement.


Summary

# Finding Severity
1 No grace period between SIGTERM and SIGKILL in killBackground() Low
2 No command injection — hardcoded args, array-based spawn Info
3 Detached orphan survives SIGKILL, self-heals via sleep timeout Low
4 Polling loop properly bounded with 30s timeout Low
5 sudo usage consistent with existing patterns Info
6 Container name filter safe due to hardcoded values Low
7 finally-based cleanup is an improvement over previous pattern Info (Positive)

Recommendation: Approve. No blocking security concerns. The only minor hardening suggestion would be adding a small delay (e.g., 500ms) between SIGTERM and SIGKILL in killBackground() to give the process group time to terminate gracefully, but this is cosmetic given the cleanup() safety net.

Security Review Agent

- Use anchored regex (^name$) in docker ps filter to avoid substring
  false positives from similarly-named containers
- Send SIGKILL to the entire process group (not just the direct child)
  so descendant processes like sudo/awf/docker don't survive
- Add child.unref() to prevent parent from hanging if cleanup fails
- Derive spawn args from AWF_CMD constant to avoid drift

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Mossaka
Copy link
Copy Markdown
Collaborator Author

Mossaka commented Apr 7, 2026

Addressed all Copilot review feedback in 096f284:

  1. Exact name matching -- Using anchored regex (^name$) in docker ps --filter plus exact string comparison on results to avoid substring false positives
  2. Process group SIGKILL -- Now sends both SIGTERM and SIGKILL to the entire process group (-pid) so descendant processes don't survive
  3. child.unref() -- Added to prevent the parent from hanging if cleanup fails
  4. AWF_CMD reuse -- Now derives spawn args from the AWF_CMD constant instead of hardcoding

Regarding the suggestion to wait for Squid health status: the 2-second stabilization delay after containers are running serves a similar purpose and keeps the implementation simpler. The benchmark is measuring steady-state memory, not startup memory, so the delay is sufficient.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Smoke Test Results

  • ✅ GitHub MCP: "test: add CLI proxy sidecar integration tests" / "chore: upgrade workflows to gh-aw-actions v0.67.2"
  • ✅ Playwright: GitHub page title verified
  • ✅ File write: /tmp/gh-aw/agent/smoke-test-claude-24107839261.txt created
  • ✅ Bash: File content verified

Overall: PASS

💥 [THE END] — Illustrated by Smoke Claude

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

🔥 Smoke Test Results

Test Status
GitHub MCP connectivity
GitHub.com HTTP connectivity
File write/read /tmp/gh-aw/agent/smoke-test-copilot-24107839277.txt

PR: fix: measure memory while containers are running, not after teardown
Author: @Mossaka

Overall: PASS

📰 BREAKING: Report filed by Smoke Copilot

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Smoke Test: GitHub Actions Services Connectivity ✅

Service Check Result
Redis host.docker.internal:6379 PING PONG
PostgreSQL host.docker.internal:5432 pg_isready ✅ accepting connections
PostgreSQL smoketest DB SELECT 1 ✅ returned 1

All checks passed.

🔌 Service connectivity validated by Smoke Services

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

🏗️ Build Test Suite Results

Ecosystem Project Build/Install Tests Status
Bun elysia 1/1 passed ✅ PASS
Bun hono 1/1 passed ✅ PASS
C++ fmt N/A ✅ PASS
C++ json N/A ✅ PASS
Deno oak N/A 1/1 passed ✅ PASS
Deno std N/A 1/1 passed ✅ PASS
.NET hello-world N/A ✅ PASS
.NET json-parse N/A ✅ PASS
Go color all passed ✅ PASS
Go env all passed ✅ PASS
Go uuid all passed ✅ PASS
Java gson 1/1 passed ✅ PASS
Java caffeine 1/1 passed ✅ PASS
Node.js clsx all passed ✅ PASS
Node.js execa all passed ✅ PASS
Node.js p-limit all passed ✅ PASS
Rust fd 1/1 passed ✅ PASS
Rust zoxide 1/1 passed ✅ PASS

Overall: 8/8 ecosystems passed — ✅ PASS

Generated by Build Test Suite for issue #1765 · ● 658.8K ·

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

Smoke Test (run 24107839303)
PR titles: "fix: chown gh-aw config dirs to agent user before privilege drop in entrypoint (#1463)", "test: add CLI proxy sidecar integration tests", "feat: add historical benchmark storage and relative regression detection", "fix: measure memory while containers are running, not after teardown"

  1. GitHub MCP merged PR review ✅
  2. safeinputs-gh PR query ❌ (tool unavailable; used GitHub MCP fallback)
  3. Playwright GitHub title check ✅
  4. Tavily web search ❌ (Tavily tool unavailable)
  5. File write /tmp/gh-aw/agent/smoke-test-codex-24107839303.txt
  6. Bash cat verify file ✅
  7. Discussion oracle comment ✅
  8. npm ci && npm run build
    Overall: FAIL

🔮 The oracle has spoken through Smoke Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[plan] Fix memory benchmark measurement — containers already stopped before docker stats runs

2 participants