ci(perf): add daily Qwen3-14B decode performance monitoring (warn-only) by luohuan19 · Pull Request #1766 · hw-native-sys/pypto

luohuan19 · 2026-06-12T10:40:30Z

Summary

Adds advisory, non-failing performance monitoring for models/qwen3/14b/decode_layer.py. Following review feedback, the perf check runs in the scheduled daily CI (not on every PR) and samples the model multiple times, averaging the result — perf is noisy (~8.5% run-to-run jitter), so per-PR gating would be flaky and slow.

Per-PR ci.yml keeps only the functional correctness run of the Qwen3 decode example; all perf measurement moves to daily_ci.yml.
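
The sampling step described above (run the benchmark several times, append every run's stdout to one log) can be sketched roughly as follows. This is an illustrative sketch, not the workflow's actual shell step: `sample_runs` is a hypothetical helper, and the stand-in command only prints a fake `Total Test Time` line instead of invoking `decode_layer.py` on a device.

```python
import subprocess
import sys
from pathlib import Path


def sample_runs(cmd, runs, log_path):
    """Run `cmd` `runs` times and append each run's stdout to one log file.

    Hypothetical helper. check=True lets a hard failure (non-zero exit of
    the benchmark itself) propagate as CalledProcessError instead of being
    silently averaged away.
    """
    log = Path(log_path)
    log.write_text("", encoding="utf-8")  # start from an empty log
    for _ in range(runs):
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
        with log.open("a", encoding="utf-8") as fh:
            fh.write(proc.stdout)
    return log


if __name__ == "__main__":
    # Stand-in for the real benchmark command (assumption for illustration).
    fake_cmd = [sys.executable, "-c", "print('Total Test Time: 1000.0 us')"]
    print(sample_runs(fake_cmd, runs=5, log_path="perf.log").read_text())
```

Collecting all samples in a single log keeps the guard's input simple: it only needs one file, regardless of how many runs were taken.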

What changed

.github/workflows/daily_ci.yml — new qwen3-perf job. Runs on the same a2a3 / npu-1-device container as the former per-PR job (so the committed baseline stays comparable). It runs decode_layer.py --enable-l2-swimlane --max-seq --seed 1234 5×, appending each run's stdout to one log, then invokes the guard. Wired into notify-on-failure so a hard failure (build/device) still opens the daily issue; a perf regression does not (it is warn-only).
.github/scripts/perf_guard.py (new, stdlib-only): parses all Total Test Time: <X> us lines from the log and averages them (instead of using only the first match), reports the per-run samples plus the min/max spread, and compares the average against the baseline. On a regression or missing data it emits a ::warning:: annotation and a Step Summary row and exits non-zero, while continue-on-error keeps the job green.
.github/perf_baselines/qwen3_14b_decode.json (new) — value=1016.66us (median of 3 on-device a2a3 runs), threshold_pct=15 to absorb the ~8.5% jitter, pypto_ref pinned to the capturing commit. Refresh via PR when a perf change is intentional.
.github/workflows/ci.yml — drop the perf run + guard steps from pypto-lib-model; keep the functional Qwen3-14B decode example run.
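
As a rough illustration of the parse-average-compare logic described for perf_guard.py, here is a minimal sketch. It is not the committed script: the regex, the function names, and the rule that a regression means "average exceeds the baseline by more than threshold_pct" are assumptions inferred from the description above, and the sample values are made up.

```python
import re
from statistics import mean

# Assumed pattern, based on the log line quoted in this PR:
# "Total Test Time: <X> us".
TIME_RE = re.compile(r"Total Test Time:\s*([0-9]+(?:\.[0-9]+)?)\s*us")


def parse_samples(log_text):
    """Return every makespan sample (microseconds) found in the log."""
    return [float(m) for m in TIME_RE.findall(log_text)]


def compare(samples, baseline_us, threshold_pct):
    """Average the samples and compare the mean against the baseline."""
    avg = mean(samples)
    delta_pct = (avg - baseline_us) / baseline_us * 100.0
    return {
        "avg": avg,
        "min": min(samples),
        "max": max(samples),
        "delta_pct": delta_pct,
        # Assumption: only a slowdown beyond the threshold counts.
        "regressed": delta_pct > threshold_pct,
    }


# Hypothetical five-run log; the numbers are illustrative only.
log = "\n".join(
    f"Total Test Time: {v} us" for v in (1000.0, 1040.0, 980.0, 1020.0, 1010.0)
)
result = compare(parse_samples(log), baseline_us=1016.66, threshold_pct=15.0)
print(result)
```

With a baseline of 1016.66 us and a 15% threshold, the average would have to exceed roughly 1169 us before this sketch flags a regression, which leaves headroom above the ~8.5% jitter mentioned in the description.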

Why warn-only

GitHub Actions has no native per-job "yellow/neutral" conclusion. The guard exits non-zero on regression and the step carries continue-on-error: true, so the step renders yellow ⚠ while the job conclusion stays green. Perf is a signal, not a gate.
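
The warn-only reporting can be sketched as below. The `::warning` workflow command and the `GITHUB_STEP_SUMMARY` environment variable are standard GitHub Actions mechanisms; the function name, message wording, and summary row layout are assumptions for illustration and not the actual output of perf_guard.py.

```python
import os
import sys


def report(avg_us, baseline_us, delta_pct, regressed, out=sys.stdout):
    """Print the comparison, annotate on regression, and return an exit code."""
    verdict = "REGRESSION" if regressed else "OK"
    line = (
        f"measured={avg_us:.2f}us baseline={baseline_us:.2f}us "
        f"delta={delta_pct:+.2f}% verdict={verdict}"
    )
    print(line, file=out)
    if regressed:
        # Workflow command: GitHub renders this as a warning annotation.
        print(f"::warning title=Qwen3-14B decode perf::{line}", file=out)

    # Append a row to the Step Summary when running inside GitHub Actions.
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(
                f"| qwen3_14b_decode | {avg_us:.2f} | {baseline_us:.2f} "
                f"| {delta_pct:+.2f}% | {verdict} |\n"
            )

    # Non-zero on regression; the workflow step is expected to carry
    # continue-on-error: true so the job conclusion stays green.
    return 1 if regressed else 0


if __name__ == "__main__":
    sys.exit(report(1200.0, 1016.66, 18.03, regressed=True))
```

Returning the exit code rather than calling `sys.exit` inside the function keeps the reporting logic testable outside CI.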

Baseline maintenance

The baseline is updated manually via PR only when a perf change is intentional — lower it to lock in an optimization, never raise it to silence a real regression. The 15% threshold absorbs normal jitter, so routine PRs never need to touch it.
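
To make the baseline handling concrete, here is a small sketch of loading the baseline and deciding between "ok", "regression", and "report-only". The field names value, threshold_pct, and pypto_ref come from the description above, but the flat JSON layout and the convention that an unseeded baseline is represented by a null value are assumptions; the committed file may be structured differently.

```python
import json


def load_baseline(text):
    """Parse baseline JSON; a null/missing value is treated as unseeded (assumption)."""
    data = json.loads(text)
    value = data.get("value")
    return {
        "value": None if value is None else float(value),
        "threshold_pct": float(data.get("threshold_pct", 15.0)),
        "pypto_ref": data.get("pypto_ref"),
    }


def verdict(avg_us, baseline):
    """Classify a measured average against the baseline."""
    if baseline["value"] is None:
        return "report-only"  # nothing to compare against yet
    limit = baseline["value"] * (1 + baseline["threshold_pct"] / 100)
    return "regression" if avg_us > limit else "ok"


# Illustrative inputs; "<commit>" is a placeholder, not a real ref.
seeded = load_baseline(
    '{"value": 1016.66, "threshold_pct": 15, "pypto_ref": "<commit>"}'
)
unseeded = load_baseline('{"value": null, "threshold_pct": 15}')

print(verdict(1100.0, seeded))    # within the 15% band
print(verdict(1200.0, seeded))    # beyond the 15% band
print(verdict(1200.0, unseeded))  # no baseline value to compare against
```

Lowering value after an intentional optimization tightens the limit proportionally, which is why the baseline should move only through a reviewed PR.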

Testing

perf_guard.py verified across paths: multi-run averaging (5 samples → mean + spread), within-threshold → green, regression → yellow + exit 1, unseeded baseline → report-only, missing/empty log → warning + exit 1.
decode_layer.py -p a2a3 --enable-l2-swimlane --max-seq --seed 1234 confirmed to run, PASS correctness on device, and emit the Total Test Time line.
Both workflow YAMLs parse; qwen3-perf job structure and notify-on-failure wiring validated.
pre-commit (check-headers, ruff, pyright, check-yaml) green.

coderabbitai · 2026-06-12T10:40:46Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

This PR adds a non-blocking performance regression detection system for the Qwen3-14B model decode benchmark. It defines a baseline metric, implements a CI guard script that compares run results against baselines, and integrates the guard into the workflow to warn on regressions without failing builds.

Changes

Qwen3-14B Performance Regression Guard

Layer / File(s)	Summary
Baseline performance metric `.github/perf_baselines/qwen3_14b_decode.json`	Establishes a `makespan_us` baseline of 1016.66 µs with 15% threshold, reference commit, and documentation of capture methodology.
Regression detection script `.github/scripts/perf_guard.py`	Parses run logs to extract performance metrics, loads baseline JSON with "unseeded" support, computes percentage delta, and reports via GitHub Actions warnings and step summaries without exiting with failure when baseline is absent or metric is within threshold.
Workflow integration `.github/workflows/ci.yml`	Qwen3-14B decode job now includes a perf run step (with swimlane flags, output logged to file) and a perf guard step (runs in `always()` mode to compare log against baseline, both steps set to `continue-on-error`).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A guard hops in to watch the perf,
Baseline locked, no build-time dirge,
Just yellow warnings, never red—
Swimlane logs dance ahead.
Regression caught with gentle care,
The baseline baseline, beyond compare!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly summarizes the main change: adding daily Qwen3-14B decode performance monitoring with warn-only behavior.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, covering the three new/modified files, implementation details, and testing verification.

gemini-code-assist

Code Review

This pull request introduces a performance-regression guard for CI model runs. It adds a new Python script, perf_guard.py, which parses execution logs to extract performance metrics (such as makespan) and compares them against committed baselines. It also adds an initial baseline configuration file for the qwen3_14b_decode model. No review comments were provided, and there is no additional feedback to address.

Run decode_layer.py with --enable-l2-swimlane to capture the device makespan ('Total Test Time: <X> us') and compare it against a committed baseline. A regression beyond the threshold (or missing perf data) emits a GitHub warning annotation and a Step Summary row, and exits non-zero so the step renders yellow while continue-on-error keeps the job green.

- .github/scripts/perf_guard.py: stdlib-only guard parsing the run log
- .github/perf_baselines/qwen3_14b_decode.json: baseline 1016.66us (median of 3 device runs), threshold 15% to absorb ~8.5% jitter
- ci.yml: perf run + guard steps in the pypto-lib-model job, both continue-on-error so the perf signal is advisory only

- perf_guard.py: replace single-line copyright with the full CANN license header and drop the shebang (repo convention: header at line 1, no shebang); fixes the check-headers hook
- apply ruff-format
- echo the baseline comparison (measured/baseline/delta/verdict) to stdout so the CI log is self-explanatory, not only the Step Summary

…samples

Address review feedback on hw-native-sys#1766: per-PR CI should only check functional correctness; perf is noisy (~8.5% jitter) and belongs in a scheduled job that samples multiple times.

- ci.yml: drop the perf run + perf_guard steps from pypto-lib-model; keep the functional Qwen3-14B decode example run
- daily_ci.yml: add a 'qwen3-perf' job (a2a3 npu-1-device container, mirroring the ci.yml setup) that runs decode 5x and feeds the guard; wire it into notify-on-failure so hard failures still open an issue
- perf_guard.py: average the makespan across all sampled runs in the log (findall instead of first match), report per-run samples and spread

Baseline stays on a2a3 hardware, so the committed 1016.66us value remains comparable; the guard now compares the 5-run average against it.

github-project-automation Bot added this to pto project Jun 12, 2026

gemini-code-assist Bot reviewed Jun 12, 2026

luohuan19 force-pushed the profilling-decode branch 2 times, most recently from bef0723 to 89a626b Compare June 15, 2026 01:27

luohuan19 force-pushed the profilling-decode branch from 89a626b to aec8f8d Compare June 15, 2026 01:53

luohuan19 force-pushed the profilling-decode branch from b5fbe40 to b952e55 Compare June 15, 2026 08:29

luohuan19 changed the title from "ci(pypto-lib-model): add warn-only perf guard for Qwen3-14B decode" to "ci(perf): add daily Qwen3-14B decode performance monitoring (warn-only)" on Jun 15, 2026

luohuan19 force-pushed the profilling-decode branch 3 times, most recently from 215d0f1 to 69b65af Compare June 15, 2026 08:59

chore: drop stray runtime submodule change from this branch

69b65af

The branch was cut from an older main; the runtime gitlink had drifted to an unrelated commit. Restore it to main's pointer so this PR only contains the Qwen3 perf-monitoring change.

lyfne123 approved these changes Jun 15, 2026

lyfne123 merged commit 90de055 into hw-native-sys:main Jun 15, 2026
10 checks passed

lyfne123 merged 4 commits into hw-native-sys:main from luohuan19:profilling-decode
