feat(st): execute test cases via task-submit to free NPU during compile#1674

Draft
luohuan19 wants to merge 2 commits into
hw-native-sys:mainfrom
luohuan19:feat/st-execute-via-task-submit

@luohuan19

Contributor

Draft / stacked PR. This branch is stacked on #1666 (perf(st): cap golden torch threads). Until #1666 merges, the diff here also contains that commit. Review the feat(st): … task-submit commit; flip to ready once #1666 lands and the diff is clean.

Intent

A tests/st CI job currently holds one NPU card for its entire run, but the time-dominant phase — compiling IR → kernel/orchestration C++ → device binaries — is pure CPU work that needs no card. The card sits idle-but-held during compilation, so a host's cards can't be shared across concurrently queued CI jobs. This caps parallelism at cards ÷ cards-per-job.

Goal: stop binding a job to a card at GitHub-scheduling time. Instead:

  1. Separate compile from execute. Compilation (and golden generation) is fully cardless. Only the brief device run + verify needs a card.
  2. Borrow a card on demand via task-submit. Each case's execution is launched as a subprocess via task-submit --device auto — a host-level root execution queue that allocates a free NPU (exposed as $TASK_DEVICE) and releases it on subprocess exit.

Net effect: CI jobs can run on cardless runners; cards are multiplexed by task-submit across all queued jobs, and parallelism scales with the host's total free cards rather than per-job reservation.
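
The borrow step can be sketched roughly as below. This is an illustration rather than the harness code: the helper name `build_task_submit_argv`, passing the inner command to `--run` as a single shell string, the `python -m pypto.runtime.execute_artifact` invocation, and the example path and platform values are all assumptions. Only `task-submit --device auto`, `$TASK_DEVICE`, and the `--work-dir`/`--platform`/`--device-id` flags come from this description.

```python
import shlex


def build_task_submit_argv(work_dir: str, platform: str) -> list[str]:
    """Build a hypothetical task-submit command line for one test case.

    Assumption: task-submit runs the string given to --run through a shell,
    so $TASK_DEVICE is expanded only after a card has been allocated.
    """
    inner = (
        "python -m pypto.runtime.execute_artifact"
        f" --work-dir {shlex.quote(work_dir)}"
        f" --platform {shlex.quote(platform)}"
        ' --device-id "$TASK_DEVICE"'
    )
    return ["task-submit", "--device", "auto", "--run", inner]


# Example values are placeholders, not paths or platforms taken from the PR.
argv = build_task_submit_argv("/shared/ws/case_matmul", "a2a3")
print(argv)
```

Keeping `$TASK_DEVICE` unexpanded in the argv matters here: the pytest process never learns which card was picked, so the variable has to be resolved on the task-submit side.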

Why it works

  • On-disk .o/.so is the cross-process cache. compile_and_assemble writes each kernel's .o/.so beside the source on first call; a second call in another process hits _load_binary and rebuilds the ChipCallable with no device recompile, no card — provided work_dir is on a shared filesystem.
  • Golden never touches a card. compute_expected is the case's PyTorch reference (CPU); results land in data/out/ during compile. The device path only reads data/out/, so the card-borrow window is just device run + allclose.
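
The cross-process cache idea can be illustrated with a generic sketch. `load_or_build` and the `<name>.o` layout are hypothetical stand-ins; they are not the real `compile_and_assemble` or `_load_binary`, whose signatures are not shown in this PR.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_build(work_dir: Path, name: str, build: Callable[[], bytes]) -> tuple[bytes, bool]:
    """Return (binary, cache_hit) for a hypothetical on-disk artifact cache."""
    artifact = work_dir / f"{name}.o"
    if artifact.exists():
        # A later caller (possibly another process) reuses the on-disk binary.
        return artifact.read_bytes(), True
    binary = build()  # expensive, CPU-only compile step
    artifact.write_bytes(binary)
    return binary, False


with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    first = load_or_build(work_dir, "kernel", lambda: b"\x7fELF-demo")
    second = load_or_build(work_dir, "kernel", lambda: b"never-built")
    print(first[1], second[1])  # False True
```

The second call never invokes its build function, which is the property the card-borrowing subprocess relies on. It only holds if both callers see the same `work_dir`, hence the shared-filesystem requirement.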

Changes

  • New entrypoint python/pypto/runtime/execute_artifact.py — a thin CLI: read work_dir → rebuild ChipCallable (cache hit) → run golden.py on --device-id. Returns 0 on pass, 1 on failure (traceback to stderr). Parses --work-dir/--platform/--device-id/--pto-isa-commit + DFX switches.
  • tests/st/harness/core/test_runner.py: _fused_execute_task branches to a new _execute_via_task_submit helper in task-submit mode; sim platforms always stay in-process (resolved_platform.endswith("sim") never borrows a card). start_pipeline sizes the execute pool by --execute-concurrency instead of card count (the device pool no longer gates concurrency in this mode).
  • tests/st/conftest.py — new options --execute-via-task-submit, --execute-concurrency, --task-max-time; --device becomes optional in task-submit mode.
  • .github/workflows/ci.yml: system-tests can run on cardless runners and borrow cards via task-submit.
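
A minimal sketch of the entrypoint's contract (parse flags, run, map the outcome to an exit code) might look like this. The flag names are the ones listed above; which flags are required, their types, and the injected `run` callable are assumptions made for illustration, and the DFX switches are omitted.

```python
import argparse
import sys
import traceback
from typing import Callable, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="execute_artifact")
    parser.add_argument("--work-dir", required=True)
    parser.add_argument("--platform", required=True)
    parser.add_argument("--device-id", type=int, required=True)  # type is an assumption
    parser.add_argument("--pto-isa-commit", default=None)
    return parser.parse_args(argv)


def main(argv: Optional[Sequence[str]], run: Callable[[argparse.Namespace], None]) -> int:
    """Return 0 if `run` completes, 1 if it raises (traceback goes to stderr)."""
    args = parse_args(argv)
    try:
        run(args)  # hypothetical: rebuild the callable, run on device, verify
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return 1
    return 0
```

Injecting `run` keeps the exit-code mapping testable without a device, which is the same seam the unit tests below exercise by mocking.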

Testing

  • tests/ut/runtime/test_execute_artifact.py — mocks compile_and_assemble + _execute_on_device; asserts arg parsing, DFX passthrough, exit codes (0 pass / 1 on exception).
  • Harness helper tests — mock subprocess.run; assert task-submit argv contains --device auto, $TASK_DEVICE, correct DFX flags; assert RunResult reflects the return code.
  • sim regression — --execute-via-task-submit --platform=a2a3sim must not call task-submit (stays in-process).
  • On-hardware smoke — a matmul case with --execute-via-task-submit: confirm borrow/release, PASS reporting, [DEVICE] line.
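
The helper and sim-regression checks could be shaped roughly as follows. `execute_case` is a hypothetical stand-in for the harness helper, and the `run-case` command string and the `a2a3` platform value are placeholders. The mocking pattern is the point, not the names.

```python
import subprocess
from unittest import mock


def execute_case(platform: str, argv: list[str], run_in_process) -> bool:
    """Hypothetical dispatcher: sim platforms stay in-process, others run argv."""
    if platform.endswith("sim"):
        return run_in_process()
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    return proc.returncode == 0


def test_hardware_platform_uses_task_submit():
    argv = ["task-submit", "--device", "auto", "--run", 'run-case --device-id "$TASK_DEVICE"']
    fake = subprocess.CompletedProcess(argv, returncode=0, stdout="", stderr="")
    with mock.patch("subprocess.run", return_value=fake) as run:
        assert execute_case("a2a3", argv, run_in_process=lambda: False) is True
    called_argv = run.call_args.args[0]
    assert called_argv[:3] == ["task-submit", "--device", "auto"]
    assert "$TASK_DEVICE" in called_argv[-1]


def test_sim_platform_never_calls_task_submit():
    with mock.patch("subprocess.run") as run:
        assert execute_case("a2a3sim", [], run_in_process=lambda: True) is True
    run.assert_not_called()
```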

Open questions (from the design)

  1. cache_dir location — force into a workspace bind-mount path so the subprocess can re-read work_dir across processes. Most critical; must be validated before landing.
  2. Does task-submit --run propagate the inner exit code? Determines whether failure is detected by return code or by parsing a PASS/FAIL marker.
  3. inline fallback in task-submit mode — route dynamically-constructed cases through task-submit too, or disable with a clear error.
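
Until question 2 is settled, one option is to accept either signal: trust an explicit marker when present and fall back to the exit code otherwise. The sketch below assumes a marker format (`[RESULT] PASS|FAIL` and `[DEVICE] <n>` on their own lines) that is invented for illustration; the real output format is not specified here.

```python
import re
from typing import Optional, Tuple

# Hypothetical marker format; the real harness may print something different.
_RESULT_RE = re.compile(r"^\[RESULT\]\s+(PASS|FAIL)\s*$", re.MULTILINE)
_DEVICE_RE = re.compile(r"^\[DEVICE\]\s+(\d+)\s*$", re.MULTILINE)


def parse_exec_marker(stdout: str) -> Tuple[Optional[str], Optional[int]]:
    """Extract (result, device) from captured stdout; None where a marker is absent."""
    result = _RESULT_RE.search(stdout)
    device = _DEVICE_RE.search(stdout)
    return (
        result.group(1) if result else None,
        int(device.group(1)) if device else None,
    )


def case_passed(stdout: str, returncode: int) -> bool:
    """Prefer the explicit marker; fall back to the exit code if it is absent."""
    result, _device = parse_exec_marker(stdout)
    return (result == "PASS") if result is not None else (returncode == 0)
```

With this shape, a wrapper that swallows the inner exit code still reports failures correctly as long as the marker is printed, and a missing marker degrades to the return code instead of to a silent pass.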

Out of scope

Distributed tests (tests/st/distributed, L2/L3): multi-card, separate pytest invocations, --device-num N. Unchanged for now; the entrypoint can later grow --device-num.

…tion

Golden computation runs inside compile-pool worker threads. Without a cap,
each torch op defaults to ~nproc intra-op threads, so every compile worker
grabs the full core count and they thrash the process-wide pool. Bound
torch intra-op threads to cores // compile_workers so outer x intra ~= cores.
Env-overridable via PYPTO_GOLDEN_THREADS.
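
The cap arithmetic can be sketched as below. The function name, the floor of one thread, and the way the override is parsed are assumptions; only the `cores // compile_workers` rule and the `PYPTO_GOLDEN_THREADS` variable come from the commit message.

```python
import os
from typing import Mapping, Optional


def golden_thread_cap(
    compile_workers: int,
    cores: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
) -> int:
    """Intra-op threads per compile worker, so that workers * threads ~= cores."""
    override = env.get("PYPTO_GOLDEN_THREADS")
    if override:
        return max(1, int(override))
    cores = cores or os.cpu_count() or 1
    return max(1, cores // max(1, compile_workers))


print(golden_thread_cap(compile_workers=8, cores=64, env={}))  # 8
print(golden_thread_cap(compile_workers=8, cores=64, env={"PYPTO_GOLDEN_THREADS": "2"}))  # 2
# With PyTorch installed, the cap would be applied via torch.set_num_threads(n).
```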
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 42a696f0-5346-4f0e-ba49-9dd281285f7c

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new task-submit execution mode that decouples on-device execution from the compilation job by running compiled artifacts via a subprocess on a borrowed NPU. Key changes include adding the execute_artifact.py CLI entry point, introducing new pytest options and validation in conftest.py, implementing the task-submit invocation and marker parsing in test_runner.py, and adding comprehensive unit tests. The review feedback suggests hardening the subprocess execution by adding a defensive timeout to subprocess.run, improving stderr and empty-output handling, and validating the presence of the task-submit CLI tool during pytest configuration.

Comment on lines +440 to +441
proc = subprocess.run(argv, capture_output=True, text=True) # noqa: PLW1510 — rc handled below
result, device = _parse_exec_marker(proc.stdout)

medium

To prevent the CI pipeline from hanging indefinitely if task-submit gets stuck or fails to respect its internal timeout, it is highly recommended to set a defensive timeout on subprocess.run.

    try:
        timeout_val = _pipeline_ctx.get("task_max_time", 600) + 30
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_val)
    except subprocess.TimeoutExpired as e:
        return RunResult(
            passed=False,
            test_name=name,
            error=f"task-submit timed out after {timeout_val} seconds.\nStdout:\n{e.stdout}\nStderr:\n{e.stderr}",
            execution_time=time.time() - start,
        )
    result, device = _parse_exec_marker(proc.stdout)

Comment on lines +447 to +451
    if passed:
        if proc.stdout:
            with _print_lock:
                print(proc.stdout, end="")  # surface into pytest's per-item capture
        return RunResult(passed=True, test_name=name, execution_time=time.time() - start)

medium

When running external processes, handle cases where the process fails with a non-zero exit code but produces no output by logging an informative error message. Additionally, when the subprocess execution succeeds, ensure any warnings or diagnostic messages printed to stderr are not completely swallowed.

    passed = (result == "PASS") if result is not None else (proc.returncode == 0)
    if not passed:
        if proc.returncode != 0 and not proc.stdout and not proc.stderr:
            import logging
            logging.error(f"Process failed with exit code {proc.returncode} but produced no output.")
    else:
        if proc.stdout or proc.stderr:
            with _print_lock:
                if proc.stdout:
                    print(proc.stdout, end="")
                if proc.stderr:
                    import sys
                    print(proc.stderr, file=sys.stderr, end="")
References
  1. When running external processes, handle cases where the process fails with a non-zero exit code but produces no output. Log an informative error message to prevent silent failures and aid debugging.

Comment thread tests/st/conftest.py
Comment on lines +519 to +531
    if execute_via_task_submit:
        if config.getoption("--precompile-workers") is None:
            raise pytest.UsageError(
                "--execute-via-task-submit requires --precompile-workers (the task-submit "
                "execution path runs inside the precompile pipeline)."
            )
        if not config.getoption("--save-kernels"):
            raise pytest.UsageError(
                "--execute-via-task-submit requires --save-kernels so the compiled artifact "
                "directory lands on a shared mount the borrowed-card subprocess can read "
                "(optionally add --kernels-dir <shared-path>). A private /tmp dir is "
                "unreachable from the host task-submit context."
            )

medium

To fail fast and provide a clear error message, we should validate that the task-submit CLI tool is actually installed and available in the system PATH before running the tests.

    if execute_via_task_submit:
        if shutil.which("task-submit") is None:
            raise pytest.UsageError(
                "--execute-via-task-submit requires the 'task-submit' CLI tool to be installed "
                "and available in the system PATH."
            )
        if config.getoption("--precompile-workers") is None:
            raise pytest.UsageError(
                "--execute-via-task-submit requires --precompile-workers (the task-submit "
                "execution path runs inside the precompile pipeline)."
            )
        if not config.getoption("--save-kernels"):
            raise pytest.UsageError(
                "--execute-via-task-submit requires --save-kernels so the compiled artifact "
                "directory lands on a shared mount the borrowed-card subprocess can read "
                "(optionally add --kernels-dir <shared-path>). A private /tmp dir is "
                "unreachable from the host task-submit context."
            )

@luohuan19 luohuan19 force-pushed the feat/st-execute-via-task-submit branch 3 times, most recently from 0b847d2 to b1ed05a Compare June 4, 2026 06:41
A tests/st CI job currently pins one NPU card for its entire run, but the
dominant phase — compiling IR to kernel/orchestration C++ and on to device
binaries — is pure CPU work that needs no card. The card sits idle-but-held
during compilation, so a host's cards cannot be shared across concurrently
queued CI jobs, capping parallelism at (cards / cards-per-job).

This decouples execution from card ownership: compilation and golden
generation stay cardless, and only the brief device run + verify borrows a
card on demand via `task-submit --device auto` (host-level root queue that
hands out a free NPU as $TASK_DEVICE and releases it on subprocess exit).

- New entrypoint `python/pypto/runtime/execute_artifact.py`: a thin CLI that
  rebuilds the ChipCallable from work_dir (cache hit — no device recompile)
  and runs golden.py on the assigned card. Returns 0 on pass, 1 on failure.
- Harness (`tests/st/harness/core/test_runner.py`): `_fused_execute_task`
  branches to `_execute_via_task_submit` in task-submit mode; sim platforms
  always stay in-process. `start_pipeline` sizes the execute pool by
  --execute-concurrency rather than card count.
- conftest options: --execute-via-task-submit, --execute-concurrency,
  --task-max-time; --device becomes optional in task-submit mode.
- ci.yml: system-tests job can run on cardless runners and borrow via
  task-submit.
- Tests: tests/ut/runtime/test_execute_artifact.py (arg parsing, DFX
  passthrough, exit codes) and harness helper tests (task-submit argv, sim
  guard, RunResult mapping).
@luohuan19 luohuan19 force-pushed the feat/st-execute-via-task-submit branch from b1ed05a to 3cac56c Compare June 4, 2026 07:03