feat(st): execute test cases via task-submit to free NPU during compile #1674
luohuan19 wants to merge 2 commits into
…tion Golden computation runs inside compile-pool worker threads. Without a cap, each torch op defaults to ~nproc intra-op threads, so every compile worker grabs the full core count and they thrash the process-wide pool. Bound torch intra-op threads to cores // compile_workers so outer x intra ~= cores. Env-overridable via PYPTO_GOLDEN_THREADS.
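The cap described in this commit message can be sketched as below. This is a minimal illustration and not the PR's code: the helper name `golden_thread_cap` is hypothetical, and it assumes that `PYPTO_GOLDEN_THREADS`, when set, holds a positive integer that takes precedence over the computed default.

```python
import os
from typing import Optional


def golden_thread_cap(compile_workers: int, cores: Optional[int] = None) -> int:
    """Return an intra-op thread cap so that workers x threads ~= cores.

    Hypothetical helper for illustration; name and signature are not from the PR.
    """
    override = os.environ.get("PYPTO_GOLDEN_THREADS")
    if override:
        # Assumption: the variable holds a positive integer and wins over the default.
        return max(1, int(override))
    if cores is None:
        cores = os.cpu_count() or 1
    # Floor division, clamped to 1 so an oversubscribed host still gets one thread.
    return max(1, cores // max(1, compile_workers))
```

Inside each compile-pool worker the returned value would presumably be applied with `torch.set_num_threads(cap)`, which bounds PyTorch's intra-op parallelism.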
Code Review
This pull request introduces a new task-submit execution mode that decouples on-device execution from the compilation job by running compiled artifacts via a subprocess on a borrowed NPU. Key changes include adding the execute_artifact.py CLI entry point, introducing new pytest options and validation in conftest.py, implementing the task-submit invocation and marker parsing in test_runner.py, and adding comprehensive unit tests. The review feedback suggests robustifying the subprocess execution by adding a defensive timeout to subprocess.run, improving stderr and empty-output handling, and validating the presence of the task-submit CLI tool during pytest configuration.
proc = subprocess.run(argv, capture_output=True, text=True)  # noqa: PLW1510 — rc handled below
result, device = _parse_exec_marker(proc.stdout)
To prevent the CI pipeline from hanging indefinitely if task-submit gets stuck or fails to respect its internal timeout, it is highly recommended to set a defensive timeout on subprocess.run.
try:
    timeout_val = _pipeline_ctx.get("task_max_time", 600) + 30
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_val)
except subprocess.TimeoutExpired as e:
    return RunResult(
        passed=False,
        test_name=name,
        error=f"task-submit timed out after {timeout_val} seconds.\nStdout:\n{e.stdout}\nStderr:\n{e.stderr}",
        execution_time=time.time() - start,
    )
result, device = _parse_exec_marker(proc.stdout)

if passed:
    if proc.stdout:
        with _print_lock:
            print(proc.stdout, end="")  # surface into pytest's per-item capture
    return RunResult(passed=True, test_name=name, execution_time=time.time() - start)
When running external processes, handle cases where the process fails with a non-zero exit code but produces no output by logging an informative error message. Additionally, when the subprocess execution succeeds, ensure any warnings or diagnostic messages printed to stderr are not completely swallowed.
passed = (result == "PASS") if result is not None else (proc.returncode == 0)
if not passed:
    if proc.returncode != 0 and not proc.stdout and not proc.stderr:
        import logging
        logging.error(f"Process failed with exit code {proc.returncode} but produced no output.")
else:
    if proc.stdout or proc.stderr:
        with _print_lock:
            if proc.stdout:
                print(proc.stdout, end="")
            if proc.stderr:
                import sys
                print(proc.stderr, file=sys.stderr, end="")

References
- When running external processes, handle cases where the process fails with a non-zero exit code but produces no output. Log an informative error message to prevent silent failures and aid debugging.
if execute_via_task_submit:
    if config.getoption("--precompile-workers") is None:
        raise pytest.UsageError(
            "--execute-via-task-submit requires --precompile-workers (the task-submit "
            "execution path runs inside the precompile pipeline)."
        )
    if not config.getoption("--save-kernels"):
        raise pytest.UsageError(
            "--execute-via-task-submit requires --save-kernels so the compiled artifact "
            "directory lands on a shared mount the borrowed-card subprocess can read "
            "(optionally add --kernels-dir <shared-path>). A private /tmp dir is "
            "unreachable from the host task-submit context."
        )
To fail fast and provide a clear error message, we should validate that the task-submit CLI tool is actually installed and available in the system PATH before running the tests.
if execute_via_task_submit:
    if shutil.which("task-submit") is None:
        raise pytest.UsageError(
            "--execute-via-task-submit requires the 'task-submit' CLI tool to be installed "
            "and available in the system PATH."
        )
    if config.getoption("--precompile-workers") is None:
        raise pytest.UsageError(
            "--execute-via-task-submit requires --precompile-workers (the task-submit "
            "execution path runs inside the precompile pipeline)."
        )
    if not config.getoption("--save-kernels"):
        raise pytest.UsageError(
            "--execute-via-task-submit requires --save-kernels so the compiled artifact "
            "directory lands on a shared mount the borrowed-card subprocess can read "
            "(optionally add --kernels-dir <shared-path>). A private /tmp dir is "
            "unreachable from the host task-submit context."
        )

0b847d2 to b1ed05a
A tests/st CI job currently pins one NPU card for its entire run, but the dominant phase (compiling IR to kernel/orchestration C++ and on to device binaries) is pure CPU work that needs no card. The card sits idle-but-held during compilation, so a host's cards cannot be shared across concurrently queued CI jobs, capping parallelism at (cards / cards-per-job).

This decouples execution from card ownership: compilation and golden generation stay cardless, and only the brief device run + verify borrows a card on demand via `task-submit --device auto` (host-level root queue that hands out a free NPU as $TASK_DEVICE and releases it on subprocess exit).

- New entrypoint `python/pypto/runtime/execute_artifact.py`: a thin CLI that rebuilds the ChipCallable from work_dir (cache hit, no device recompile) and runs golden.py on the assigned card. Returns 0 on pass, 1 on failure.
- Harness (`tests/st/harness/core/test_runner.py`): `_fused_execute_task` branches to `_execute_via_task_submit` in task-submit mode; sim platforms always stay in-process. `start_pipeline` sizes the execute pool by --execute-concurrency rather than card count.
- conftest options: --execute-via-task-submit, --execute-concurrency, --task-max-time; --device becomes optional in task-submit mode.
- ci.yml: system-tests job can run on cardless runners and borrow via task-submit.
- Tests: tests/ut/runtime/test_execute_artifact.py (arg parsing, DFX passthrough, exit codes) and harness helper tests (task-submit argv, sim guard, RunResult mapping).
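The branching and pool-sizing rules in the harness bullet can be sketched as two small functions. Both function names are hypothetical and the real logic lives inside `_fused_execute_task` and `start_pipeline`; only the `endswith("sim")` check and the choice between `--execute-concurrency` and card count come from the PR text.

```python
def uses_task_submit(execute_via_task_submit: bool, resolved_platform: str) -> bool:
    """Decide whether a case should borrow a card through task-submit.

    Hypothetical helper; only the endswith("sim") guard is taken from the PR.
    """
    if resolved_platform.endswith("sim"):
        # Simulator platforms need no NPU, so they always run in-process.
        return False
    return execute_via_task_submit


def execute_pool_size(task_submit_mode: bool, execute_concurrency: int, card_count: int) -> int:
    """Size the execute pool by --execute-concurrency in task-submit mode, else by card count."""
    return execute_concurrency if task_submit_mode else card_count
```

With these definitions, `uses_task_submit(True, "a2a3sim")` is `False`, so a simulator run never invokes `task-submit` even when the option is set. The non-sim platform name `"a2a3"` used in the test below is an assumption inferred from `a2a3sim`.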
b1ed05a to 3cac56c
Intent
A `tests/st` CI job currently holds one NPU card for its entire run, but the time-dominant phase (compiling IR → kernel/orchestration C++ → device binaries) is pure CPU work that needs no card. The card sits idle-but-held during compilation, so a host's cards can't be shared across concurrently queued CI jobs. This caps parallelism at cards ÷ cards-per-job.

Goal: stop binding a job to a card at GitHub-scheduling time. Instead:

- … `task-submit`. Each case's execution is launched as a subprocess via `task-submit --device auto`, a host-level root execution queue that allocates a free NPU (exposed as `$TASK_DEVICE`) and releases it on subprocess exit.

Net effect: CI jobs can run on cardless runners; cards are multiplexed by `task-submit` across all queued jobs, and parallelism scales with the host's total free cards rather than per-job reservation.

Why it works
- `.o`/`.so` is the cross-process cache. `compile_and_assemble` writes each kernel's `.o`/`.so` beside the source on first call; a second call in another process hits `_load_binary` and rebuilds the `ChipCallable` with no device recompile and no card, provided `work_dir` is on a shared filesystem.
- `compute_expected` is the case's PyTorch reference (CPU); results land in `data/out/` during compile. The device path only reads `data/out/`, so the card-borrow window is just device run + allclose.

Changes
- `python/pypto/runtime/execute_artifact.py`: a thin CLI: read `work_dir` → rebuild `ChipCallable` (cache hit) → run `golden.py` on `--device-id`. Returns `0` on pass, `1` on failure (traceback to stderr). Parses `--work-dir` / `--platform` / `--device-id` / `--pto-isa-commit` + DFX switches.
- `tests/st/harness/core/test_runner.py`: `_fused_execute_task` branches to a new `_execute_via_task_submit` helper in task-submit mode; sim platforms always stay in-process (`resolved_platform.endswith("sim")` never borrows a card). `start_pipeline` sizes the execute pool by `--execute-concurrency` instead of card count (the device pool no longer gates concurrency in this mode).
- `tests/st/conftest.py`: new options `--execute-via-task-submit`, `--execute-concurrency`, `--task-max-time`; `--device` becomes optional in task-submit mode.
- `.github/workflows/ci.yml`: `system-tests` can run on cardless runners and borrow cards via `task-submit`.

Testing
- `tests/ut/runtime/test_execute_artifact.py`: mocks `compile_and_assemble` + `_execute_on_device`; asserts arg parsing, DFX passthrough, exit codes (0 pass / 1 on exception).
- … `subprocess.run`; assert `task-submit` argv contains `--device auto`, `$TASK_DEVICE`, correct DFX flags; assert `RunResult` reflects the return code.
- … `--execute-via-task-submit --platform=a2a3sim` must not call `task-submit` (stays in-process).
- … `--execute-via-task-submit`: confirm borrow/release, PASS reporting, `[DEVICE]` line.

Open questions (from the design)
- `cache_dir` location: force into a workspace bind-mount path so the subprocess can re-read `work_dir` across processes. Most critical; must be validated before landing.
- Does `task-submit --run` propagate the inner exit code? Determines whether failure is detected by return code or by parsing a `PASS`/`FAIL` marker.
- … `task-submit` too, or disable with a clear error.

Out of scope
Distributed tests (`tests/st/distributed`, L2/L3): multi-card, separate pytest invocations, `--device-num N`. Unchanged for now; the entrypoint can later grow `--device-num`.
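For orientation, the entrypoint's argument surface and exit-code contract could look roughly like the sketch below. It is not the actual `execute_artifact.py`: which flags are required, the `int` type of `--device-id`, and the injected `run` callable (standing in for "rebuild the `ChipCallable`, then run `golden.py`") are all assumptions, and the DFX switches are omitted because their names are not given in this PR.

```python
import argparse
import traceback
from typing import Callable, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Flags named in the PR description; required-ness and types are assumptions."""
    parser = argparse.ArgumentParser(prog="execute_artifact")
    parser.add_argument("--work-dir", required=True)
    parser.add_argument("--platform", required=True)
    parser.add_argument("--device-id", type=int, required=True)
    parser.add_argument("--pto-isa-commit", default=None)
    return parser


def main(argv: Optional[Sequence[str]], run: Callable[[argparse.Namespace], None]) -> int:
    """Return 0 when run() completes, 1 when it raises (traceback goes to stderr)."""
    args = build_parser().parse_args(argv)
    try:
        run(args)  # hypothetical hook: rebuild from work_dir, then execute and verify
    except Exception:
        traceback.print_exc()  # print_exc writes to sys.stderr by default
        return 1
    return 0
```

Keeping the device work behind a single callable mirrors the unit-test approach described under Testing, where the compile and device steps are mocked and only parsing and exit codes are asserted. A later `--device-num` flag would be one more `add_argument` call.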