feat(st): execute test cases via task-submit to free NPU during compile by luohuan19 · Pull Request #1674 · hw-native-sys/pypto

luohuan19 · 2026-06-04T03:23:17Z

Draft / stacked PR. This branch is stacked on #1666 (perf(st): cap golden torch threads). Until #1666 merges, the diff here also contains that commit. Review the feat(st): … task-submit commit; flip to ready once #1666 lands and the diff is clean.

Intent

A tests/st CI job currently holds one NPU card for its entire run, but the time-dominant phase — compiling IR → kernel/orchestration C++ → device binaries — is pure CPU work that needs no card. The card sits idle-but-held during compilation, so a host's cards can't be shared across concurrently queued CI jobs. This caps parallelism at cards ÷ cards-per-job.

Goal: stop binding a job to a card at GitHub-scheduling time. Instead:

Separate compile from execute. Compilation (and golden generation) is fully cardless. Only the brief device run + verify needs a card.
Borrow a card on demand via task-submit. Each case's execution is launched as a subprocess via task-submit --device auto — a host-level root execution queue that allocates a free NPU (exposed as $TASK_DEVICE) and releases it on subprocess exit.

Net effect: CI jobs can run on cardless runners; cards are multiplexed by task-submit across all queued jobs, and parallelism scales with the host's total free cards rather than per-job reservation.
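
As a rough illustration of the borrow-on-demand idea, the sketch below builds the argv for one case. It assumes that task-submit takes the wrapped command as a single shell string after `--run` and expands `$TASK_DEVICE` when running it. The argv layout, the `python -m pypto.runtime.execute_artifact` module path, and the helper name `build_task_submit_argv` are assumptions for illustration, not taken from the diff.

```python
import shlex


def build_task_submit_argv(work_dir: str, platform: str) -> list[str]:
    """Wrap the execute_artifact CLI in a task-submit invocation.

    Assumption: task-submit accepts the wrapped command as one shell string
    after --run and sets $TASK_DEVICE to the borrowed card before running it.
    """
    inner = (
        "python -m pypto.runtime.execute_artifact"
        f" --work-dir {shlex.quote(work_dir)}"
        f" --platform {shlex.quote(platform)}"
        ' --device-id "$TASK_DEVICE"'  # expanded by the shell task-submit spawns
    )
    return ["task-submit", "--device", "auto", "--run", inner]
```

Keeping the argv construction in a pure function makes it easy to assert on in unit tests without touching a real queue.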

Why it works

On-disk .o/.so is the cross-process cache. compile_and_assemble writes each kernel's .o/.so beside the source on first call; a second call in another process hits _load_binary and rebuilds the ChipCallable with no device recompile, no card — provided work_dir is on a shared filesystem.
Golden never touches a card. compute_expected is the case's PyTorch reference (CPU); results land in data/out/ during compile. The device path only reads data/out/, so the card-borrow window is just device run + allclose.
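
The cross-process cache can be pictured with a minimal sketch like the one below. It is not the real `compile_and_assemble` / `_load_binary` code: `load_or_compile` and `compile_fn` are hypothetical names, and the `.o`-beside-the-source layout is simplified from the description above.

```python
from pathlib import Path
from typing import Callable


def load_or_compile(source: Path, compile_fn: Callable[[Path], bytes]) -> bytes:
    """Return the binary for `source`, compiling only if no cached .o exists."""
    cached = source.with_suffix(".o")
    if cached.exists():
        # Cache hit: a second process reaches this branch, so it needs
        # neither the compiler nor a card.
        return cached.read_bytes()
    binary = compile_fn(source)
    cached.write_bytes(binary)  # written beside the source for later processes
    return binary
```

The second call only finds the file if both processes see the same directory, which is why the shared-filesystem condition matters.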

Changes

New entrypoint python/pypto/runtime/execute_artifact.py — a thin CLI: read work_dir → rebuild ChipCallable (cache hit) → run golden.py on --device-id. Returns 0 on pass, 1 on failure (traceback to stderr). Parses --work-dir/--platform/--device-id/--pto-isa-commit + DFX switches.
tests/st/harness/core/test_runner.py — _fused_execute_task branches to a new _execute_via_task_submit helper in task-submit mode; sim platforms always stay in-process (resolved_platform.endswith("sim") never borrows a card). start_pipeline sizes the execute pool by --execute-concurrency instead of card count (the device pool no longer gates concurrency in this mode).
tests/st/conftest.py — new options --execute-via-task-submit, --execute-concurrency, --task-max-time; --device becomes optional in task-submit mode.
.github/workflows/ci.yml — system-tests can run on cardless runners and borrow cards via task-submit.
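
For orientation, the entrypoint's shape could look roughly like the sketch below. Only the four flag names and the 0/1 exit-code contract come from the description above. Which flags are required, the integer type of `--device-id`, and the injected `run` callable (a stand-in for "rebuild ChipCallable, then run golden.py") are assumptions; the DFX switches are omitted.

```python
import argparse
import sys
import traceback
from typing import Callable, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="execute_artifact")
    parser.add_argument("--work-dir", required=True)
    parser.add_argument("--platform", required=True)
    parser.add_argument("--device-id", type=int, required=True)
    parser.add_argument("--pto-isa-commit", default=None)
    return parser.parse_args(argv)


def main(argv: Sequence[str], run: Callable[[argparse.Namespace], None]) -> int:
    """Return 0 when `run` completes, 1 when it raises (traceback to stderr)."""
    args = parse_args(argv)
    try:
        run(args)  # stand-in for the real device run + verification
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return 1
    return 0
```

Injecting `run` mirrors how the unit tests are described: the device call is mocked and only argument parsing and exit codes are asserted.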

Testing

tests/ut/runtime/test_execute_artifact.py — mocks compile_and_assemble + _execute_on_device; asserts arg parsing, DFX passthrough, exit codes (0 pass / 1 on exception).
Harness helper tests — mock subprocess.run; assert task-submit argv contains --device auto, $TASK_DEVICE, correct DFX flags; assert RunResult reflects the return code.
sim regression — --execute-via-task-submit --platform=a2a3sim must not call task-submit (stays in-process).
On-hardware smoke — a matmul case with --execute-via-task-submit: confirm borrow/release, PASS reporting, [DEVICE] line.
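
The sim regression check can be sketched with `unittest.mock` as below. `execute_case` is a hypothetical stand-in for the branch inside `_fused_execute_task`; only the `endswith("sim")` guard is taken from the description.

```python
from unittest import mock


def execute_case(platform: str, via_task_submit: bool, in_process, submit):
    """Pick the execution path; simulator platforms never borrow a card."""
    if via_task_submit and not platform.endswith("sim"):
        return submit(platform)
    return in_process(platform)


def check_sim_stays_in_process() -> None:
    in_process, submit = mock.Mock(return_value="PASS"), mock.Mock()
    assert execute_case("a2a3sim", True, in_process, submit) == "PASS"
    submit.assert_not_called()  # task-submit must not be invoked for sim
    in_process.assert_called_once_with("a2a3sim")
```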

Open questions (from the design)

cache_dir location — force into a workspace bind-mount path so the subprocess can re-read work_dir across processes. Most critical; must be validated before landing.
Does task-submit --run propagate the inner exit code? Determines whether failure is detected by return code or by parsing a PASS/FAIL marker.
inline fallback in task-submit mode — route dynamically-constructed cases through task-submit too, or disable with a clear error.
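
One way to stay robust whichever way the exit-code question resolves is to prefer an explicit marker in stdout and fall back to the return code only when no marker is present. The sketch below assumes a marker line of the form `[RESULT] PASS` or `[RESULT] FAIL`; that format and both function names are hypothetical, since the real marker is not shown in this description.

```python
import re
from typing import Optional

# Hypothetical marker format; the harness's actual marker may differ.
_MARKER = re.compile(r"^\[RESULT\] (PASS|FAIL)$", re.MULTILINE)


def parse_marker(stdout: str) -> Optional[str]:
    """Return "PASS"/"FAIL" from the last marker line, or None if absent."""
    found = _MARKER.findall(stdout)
    return found[-1] if found else None


def decide_passed(stdout: str, returncode: int) -> bool:
    """Trust an explicit marker; fall back to the exit code without one."""
    marker = parse_marker(stdout)
    if marker is not None:
        return marker == "PASS"
    return returncode == 0
```

With this ordering, a wrapper that swallows the inner exit code still reports failures correctly as long as the marker reaches stdout.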

Out of scope

Distributed tests (tests/st/distributed, L2/L3): multi-card, separate pytest invocations, --device-num N. Unchanged for now; the entrypoint can later grow --device-num.

…tion Golden computation runs inside compile-pool worker threads. Without a cap, each torch op defaults to ~nproc intra-op threads, so every compile worker grabs the full core count and they thrash the process-wide pool. Bound torch intra-op threads to cores // compile_workers so outer x intra ~= cores. Env-overridable via PYPTO_GOLDEN_THREADS.
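
The bound described in this commit message amounts to the small calculation sketched below. The function name, the floor of 1, and the handling of an empty override are assumptions; only the `cores // compile_workers` rule and the `PYPTO_GOLDEN_THREADS` override come from the text.

```python
import os
from typing import Optional


def golden_thread_cap(compile_workers: int, cores: Optional[int] = None) -> int:
    """Intra-op thread budget per compile worker: cores // compile_workers, min 1."""
    override = os.environ.get("PYPTO_GOLDEN_THREADS")
    if override:
        return max(1, int(override))  # explicit override wins
    cores = cores or os.cpu_count() or 1
    return max(1, cores // max(1, compile_workers))
```

For example, 64 cores with 8 compile workers gives 8 intra-op threads per worker, so outer x intra stays at about 64. The result would typically be handed to `torch.set_num_threads`; whether the commit uses exactly that call is not confirmed here.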

coderabbitai · 2026-06-04T03:23:24Z

Important

Review skipped

Draft detected.


gemini-code-assist

Code Review

This pull request introduces a new task-submit execution mode that decouples on-device execution from the compilation job by running compiled artifacts via a subprocess on a borrowed NPU. Key changes include adding the execute_artifact.py CLI entry point, introducing new pytest options and validation in conftest.py, implementing the task-submit invocation and marker parsing in test_runner.py, and adding comprehensive unit tests. The review feedback suggests robustifying the subprocess execution by adding a defensive timeout to subprocess.run, improving stderr and empty-output handling, and validating the presence of the task-submit CLI tool during pytest configuration.

gemini-code-assist · 2026-06-04T03:25:47Z

+    proc = subprocess.run(argv, capture_output=True, text=True)  # noqa: PLW1510 — rc handled below
+    result, device = _parse_exec_marker(proc.stdout)


To prevent the CI pipeline from hanging indefinitely if task-submit gets stuck or fails to respect its internal timeout, it is highly recommended to set a defensive timeout on subprocess.run.

    try:
        timeout_val = _pipeline_ctx.get("task_max_time", 600) + 30
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_val)
    except subprocess.TimeoutExpired as e:
        return RunResult(
            passed=False,
            test_name=name,
            error=f"task-submit timed out after {timeout_val} seconds.\nStdout:\n{e.stdout}\nStderr:\n{e.stderr}",
            execution_time=time.time() - start,
        )
    result, device = _parse_exec_marker(proc.stdout)

gemini-code-assist · 2026-06-04T03:25:47Z

+    if passed:
+        if proc.stdout:
+            with _print_lock:
+                print(proc.stdout, end="")  # surface into pytest's per-item capture
+        return RunResult(passed=True, test_name=name, execution_time=time.time() - start)


When running external processes, handle cases where the process fails with a non-zero exit code but produces no output by logging an informative error message. Additionally, when the subprocess execution succeeds, ensure any warnings or diagnostic messages printed to stderr are not completely swallowed.

    passed = (result == "PASS") if result is not None else (proc.returncode == 0)
    if not passed:
        if proc.returncode != 0 and not proc.stdout and not proc.stderr:
            import logging
            logging.error(f"Process failed with exit code {proc.returncode} but produced no output.")
    else:
        if proc.stdout or proc.stderr:
            with _print_lock:
                if proc.stdout:
                    print(proc.stdout, end="")
                if proc.stderr:
                    import sys
                    print(proc.stderr, file=sys.stderr, end="")

References

When running external processes, handle cases where the process fails with a non-zero exit code but produces no output. Log an informative error message to prevent silent failures and aid debugging.

gemini-code-assist · 2026-06-04T03:25:47Z

+    if execute_via_task_submit:
+        if config.getoption("--precompile-workers") is None:
+            raise pytest.UsageError(
+                "--execute-via-task-submit requires --precompile-workers (the task-submit "
+                "execution path runs inside the precompile pipeline)."
+            )
+        if not config.getoption("--save-kernels"):
+            raise pytest.UsageError(
+                "--execute-via-task-submit requires --save-kernels so the compiled artifact "
+                "directory lands on a shared mount the borrowed-card subprocess can read "
+                "(optionally add --kernels-dir <shared-path>). A private /tmp dir is "
+                "unreachable from the host task-submit context."
+            )


To fail fast and provide a clear error message, we should validate that the task-submit CLI tool is actually installed and available in the system PATH before running the tests.

    if execute_via_task_submit:
        if shutil.which("task-submit") is None:
            raise pytest.UsageError(
                "--execute-via-task-submit requires the 'task-submit' CLI tool to be installed "
                "and available in the system PATH."
            )
        if config.getoption("--precompile-workers") is None:
            raise pytest.UsageError(
                "--execute-via-task-submit requires --precompile-workers (the task-submit "
                "execution path runs inside the precompile pipeline)."
            )
        if not config.getoption("--save-kernels"):
            raise pytest.UsageError(
                "--execute-via-task-submit requires --save-kernels so the compiled artifact "
                "directory lands on a shared mount the borrowed-card subprocess can read "
                "(optionally add --kernels-dir <shared-path>). A private /tmp dir is "
                "unreachable from the host task-submit context."
            )

A tests/st CI job currently pins one NPU card for its entire run, but the dominant phase (compiling IR to kernel/orchestration C++ and on to device binaries) is pure CPU work that needs no card. The card sits idle-but-held during compilation, so a host's cards cannot be shared across concurrently queued CI jobs, capping parallelism at (cards / cards-per-job).

This decouples execution from card ownership: compilation and golden generation stay cardless, and only the brief device run + verify borrows a card on demand via `task-submit --device auto` (host-level root queue that hands out a free NPU as $TASK_DEVICE and releases it on subprocess exit).

- New entrypoint `python/pypto/runtime/execute_artifact.py`: a thin CLI that rebuilds the ChipCallable from work_dir (cache hit, no device recompile) and runs golden.py on the assigned card. Returns 0 on pass, 1 on failure.
- Harness (`tests/st/harness/core/test_runner.py`): `_fused_execute_task` branches to `_execute_via_task_submit` in task-submit mode; sim platforms always stay in-process. `start_pipeline` sizes the execute pool by --execute-concurrency rather than card count.
- conftest options: --execute-via-task-submit, --execute-concurrency, --task-max-time; --device becomes optional in task-submit mode.
- ci.yml: system-tests job can run on cardless runners and borrow via task-submit.
- Tests: tests/ut/runtime/test_execute_artifact.py (arg parsing, DFX passthrough, exit codes) and harness helper tests (task-submit argv, sim guard, RunResult mapping).

github-project-automation Bot added this to pto project Jun 4, 2026

gemini-code-assist Bot reviewed Jun 4, 2026

luohuan19 force-pushed the feat/st-execute-via-task-submit branch 3 times, most recently from 0b847d2 to b1ed05a · June 4, 2026 06:41

luohuan19 force-pushed the feat/st-execute-via-task-submit branch from b1ed05a to 3cac56c · June 4, 2026 07:03

luohuan19 wants to merge 2 commits into hw-native-sys:main from luohuan19:feat/st-execute-via-task-submit
