ci(perf): add daily Qwen3-14B decode performance monitoring (warn-only) #1766
Walkthrough
This PR adds a non-blocking performance regression detection system for the Qwen3-14B model decode benchmark. It defines a baseline metric, implements a CI guard script that compares run results against baselines, and integrates the guard into the workflow to warn on regressions without failing builds.
Changes
Qwen3-14B Performance Regression Guard
Estimated code review effort: 2 (Simple), ~12 minutes
Pre-merge checks: 5 passed
Code Review
This pull request introduces a performance-regression guard for CI model runs. It adds a new Python script, perf_guard.py, which parses execution logs to extract performance metrics (such as makespan) and compares them against committed baselines. It also adds an initial baseline configuration file for the qwen3_14b_decode model. No review comments were provided, and there is no additional feedback to address.
bef0723 to 89a626b
Run decode_layer.py with --enable-l2-swimlane to capture the device
makespan ('Total Test Time: <X> us') and compare it against a committed
baseline. A regression beyond the threshold (or missing perf data) emits
a GitHub warning annotation and a Step Summary row, and exits non-zero so
the step renders yellow while continue-on-error keeps the job green.
- .github/scripts/perf_guard.py: stdlib-only guard parsing the run log
- .github/perf_baselines/qwen3_14b_decode.json: baseline 1016.66us
(median of 3 device runs), threshold 15% to absorb ~8.5% jitter
- ci.yml: perf run + guard steps in the pypto-lib-model job, both
continue-on-error so the perf signal is advisory only
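A minimal sketch of the parsing step this commit describes, not the actual perf_guard.py. The function names and the regex are assumptions; only the 'Total Test Time: <X> us' line format, the 1016.66us baseline, and the 15% threshold come from the commit message.

```python
import re

# Assumed pattern for the log line 'Total Test Time: <X> us'.
MAKESPAN_RE = re.compile(r"Total Test Time:\s*([0-9]+(?:\.[0-9]+)?)\s*us")


def parse_makespan(log_text):
    """Return the first makespan found in the log (microseconds), or None."""
    match = MAKESPAN_RE.search(log_text)
    return float(match.group(1)) if match else None


def delta_pct(measured_us, baseline_us):
    """Signed change relative to the baseline in percent (positive = slower)."""
    return (measured_us - baseline_us) / baseline_us * 100.0


if __name__ == "__main__":
    # Hypothetical log excerpt; the value is illustrative, not a real measurement.
    log = "setup...\nTotal Test Time: 1100.00 us\nteardown...\n"
    measured = parse_makespan(log)
    print(f"measured={measured} us, delta={delta_pct(measured, 1016.66):+.2f}%")
```

With a 1016.66us baseline and a 15% threshold, the warning limit works out to 1016.66 × 1.15 ≈ 1169.16us, so a run would have to be roughly 150us slower than the baseline before it is flagged.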
89a626b to aec8f8d
- perf_guard.py: replace single-line copyright with the full CANN license header and drop the shebang (repo convention: header at line 1, no shebang); fixes the check-headers hook
- apply ruff-format
- echo the baseline comparison (measured/baseline/delta/verdict) to stdout so the CI log is self-explanatory, not only the Step Summary
…samples
Address review feedback on hw-native-sys#1766: per-PR CI should only check functional correctness; perf is noisy (~8.5% jitter) and belongs in a scheduled job that samples multiple times.
- ci.yml: drop the perf run + perf_guard steps from pypto-lib-model; keep the functional Qwen3-14B decode example run
- daily_ci.yml: add a 'qwen3-perf' job (a2a3 npu-1-device container, mirroring the ci.yml setup) that runs decode 5x and feeds the guard; wire it into notify-on-failure so hard failures still open an issue
- perf_guard.py: average the makespan across all sampled runs in the log (findall instead of first match), report per-run samples and spread
Baseline stays on a2a3 hardware, so the committed 1016.66us value remains comparable; the guard now compares the 5-run average against it.
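The switch from first-match to multi-sample averaging could look roughly like the sketch below. It is an illustration, not the committed script: the function name, the returned dictionary keys, and the definition of spread as (max - min) / mean are assumptions.

```python
import re
from statistics import mean

# Assumed pattern for the log line 'Total Test Time: <X> us'.
MAKESPAN_RE = re.compile(r"Total Test Time:\s*([0-9]+(?:\.[0-9]+)?)\s*us")


def summarize_samples(log_text):
    """Collect every makespan in the log; return mean/min/max, or None if empty."""
    # findall returns the captured group of each match, one per sampled run.
    samples = [float(value) for value in MAKESPAN_RE.findall(log_text)]
    if not samples:
        return None
    avg = mean(samples)
    return {
        "samples": samples,
        "mean": avg,
        "min": min(samples),
        "max": max(samples),
        # Hypothetical spread metric: range relative to the mean, in percent.
        "spread_pct": (max(samples) - min(samples)) / avg * 100.0,
    }
```

Because each run appends its stdout to the same log, a single pass over the combined log yields all five samples without the guard needing to know how many runs were made.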
b5fbe40 to b952e55
215d0f1 to 69b65af
The branch was cut from an older main; the runtime gitlink had drifted to an unrelated commit. Restore it to main's pointer so this PR only contains the Qwen3 perf-monitoring change.
Summary
Adds advisory, non-failing performance monitoring for models/qwen3/14b/decode_layer.py. Following review feedback, the perf check runs in the scheduled daily CI (not on every PR) and samples the model multiple times, averaging the result. Perf is noisy (~8.5% run-to-run jitter), so per-PR gating would be flaky and slow.
Per-PR ci.yml keeps only the functional correctness run of the Qwen3 decode example; all perf measurement moves to daily_ci.yml.
What changed
- .github/workflows/daily_ci.yml: new qwen3-perf job. Runs on the same a2a3 / npu-1-device container as the former per-PR job (so the committed baseline stays comparable). It runs decode_layer.py --enable-l2-swimlane --max-seq --seed 1234 5 times, appending each run's stdout to one log, then invokes the guard. Wired into notify-on-failure so a hard failure (build/device) still opens the daily issue; a perf regression does not (it is warn-only).
- .github/scripts/perf_guard.py (new, stdlib-only): parses all Total Test Time: <X> us lines from the log and averages them (vs. the first match), reports the per-run samples + min/max spread, and compares the average against the baseline. Emits a ::warning:: annotation + Step Summary row and exits non-zero on regression / missing data, while continue-on-error keeps the job green.
- .github/perf_baselines/qwen3_14b_decode.json (new): value=1016.66us (median of 3 on-device a2a3 runs), threshold_pct=15 to absorb the ~8.5% jitter, pypto_ref pinned to the capturing commit. Refresh via PR when a perf change is intentional.
- .github/workflows/ci.yml: drop the perf run + guard steps from pypto-lib-model; keep the functional Qwen3-14B decode example run.
Why warn-only
GitHub Actions has no native per-job "yellow/neutral" conclusion. The guard exits non-zero on regression and the step carries continue-on-error: true, so the step renders yellow ⚠ while the job conclusion stays green. Perf is a signal, not a gate.
Baseline maintenance
The baseline is updated manually via PR only when a perf change is intentional: lower it to lock in an optimization, never raise it to silence a real regression. The 15% threshold absorbs normal jitter, so routine PRs never need to touch it.
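The warn-only mechanics can be sketched as follows. This is an assumed shape, not the code in perf_guard.py: the function name, the annotation title, and the summary row format are hypothetical. The ::warning:: workflow command and the GITHUB_STEP_SUMMARY file are standard GitHub Actions features.

```python
import os


def report(verdict_ok, message, summary_row):
    """Emit a warning annotation on failure, append a summary row, return the exit code."""
    if not verdict_ok:
        # Workflow command: GitHub Actions renders this stdout line as a warning annotation.
        print(f"::warning title=perf guard::{message}")
    # GITHUB_STEP_SUMMARY holds the path of a Markdown file shown on the run page;
    # it is unset outside GitHub Actions, in which case the row is skipped.
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as handle:
            handle.write(summary_row + "\n")
    # Non-zero marks the step as failed; continue-on-error on the step keeps the job green.
    return 0 if verdict_ok else 1
```

A caller would pass the return value to sys.exit. The non-zero exit is what makes the step stand out in the UI, and continue-on-error: true on that step is what stops it from failing the job.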
Testing
- perf_guard.py verified across paths: multi-run averaging (5 samples → mean + spread), within-threshold → green, regression → yellow + exit 1, unseeded baseline → report-only, missing/empty log → warning + exit 1.
- decode_layer.py -p a2a3 --enable-l2-swimlane --max-seq --seed 1234 confirmed to run, PASS correctness on device, and emit the Total Test Time line.
- qwen3-perf job structure and notify-on-failure wiring validated.
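The verified paths map onto a small decision function. The sketch below is one plausible reading, with several assumptions: the function and verdict names are hypothetical, an unseeded baseline is represented as None, report-only is assumed to exit 0, and a regression is assumed to mean the average strictly exceeds baseline × (1 + threshold_pct / 100).

```python
def classify(samples, baseline_us, threshold_pct):
    """Map guard inputs to a (verdict, exit_code) pair for the tested paths."""
    if not samples:
        # Missing or empty log: warn and exit non-zero.
        return "no-data", 1
    average = sum(samples) / len(samples)
    if baseline_us is None:
        # Assumed representation of an unseeded baseline: report, do not judge.
        return "report-only", 0
    limit = baseline_us * (1 + threshold_pct / 100.0)
    if average > limit:
        # Makespan is a duration, so only increases count as regressions.
        return "regression", 1
    return "ok", 0


if __name__ == "__main__":
    # Illustrative inputs only; these are not measured values.
    for samples, baseline in ([], 1016.66), ([1000.0] * 5, None), ([1200.0] * 5, 1016.66):
        print(classify(samples, baseline, 15))
```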