
ci(perf): add daily Qwen3-14B decode performance monitoring (warn-only) #1766

Merged
lyfne123 merged 4 commits into hw-native-sys:main from luohuan19:profilling-decode
Jun 15, 2026

@luohuan19 luohuan19 commented Jun 12, 2026

Contributor

Summary

Adds advisory, non-failing performance monitoring for models/qwen3/14b/decode_layer.py. Following review feedback, the perf check runs in the scheduled daily CI (not on every PR) and samples the model multiple times, averaging the result — perf is noisy (~8.5% run-to-run jitter), so per-PR gating would be flaky and slow.

Per-PR ci.yml keeps only the functional correctness run of the Qwen3 decode example; all perf measurement moves to daily_ci.yml.

What changed

  • .github/workflows/daily_ci.yml: new qwen3-perf job. Runs on the same a2a3 / npu-1-device container as the former per-PR job (so the committed baseline stays comparable). It runs decode_layer.py --enable-l2-swimlane --max-seq --seed 1234 five times, appending each run's stdout to one log, then invokes the guard. Wired into notify-on-failure so a hard failure (build/device) still opens the daily issue; a perf regression does not (it is warn-only).
  • .github/scripts/perf_guard.py (new, stdlib-only): parses all Total Test Time: <X> us lines from the log and averages them (rather than using only the first match), reports the per-run samples plus the min/max spread, and compares the average against the baseline. Emits a ::warning:: annotation and a Step Summary row, and exits non-zero on regression or missing data, while continue-on-error keeps the job green.
  • .github/perf_baselines/qwen3_14b_decode.json (new): value=1016.66us (median of 3 on-device a2a3 runs), threshold_pct=15 to absorb the ~8.5% jitter, pypto_ref pinned to the capturing commit. Refresh via PR when a perf change is intentional.
  • .github/workflows/ci.yml: drop the perf run + guard steps from pypto-lib-model; keep the functional Qwen3-14B decode example run.
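
As a rough illustration of the averaging step described above, here is a minimal sketch. It is not the actual perf_guard.py: the regex, the function names (summarize_runs, delta_pct), and the definition of spread as (max - min) / min are assumptions made for this example. Only the "Total Test Time: <X> us" line format comes from the PR text.

```python
import re
from statistics import mean
from typing import Optional

# Assumed pattern for the log line quoted in the PR: "Total Test Time: <X> us".
TOTAL_TIME_RE = re.compile(r"Total Test Time:\s*([0-9]+(?:\.[0-9]+)?)\s*us")


def summarize_runs(log_text: str) -> Optional[dict]:
    """Collect every makespan sample in the log; return None if there are none."""
    samples = [float(x) for x in TOTAL_TIME_RE.findall(log_text)]
    if not samples:
        return None
    lo, hi = min(samples), max(samples)
    return {
        "samples": samples,
        "mean": mean(samples),
        "min": lo,
        "max": hi,
        # Spread relative to the fastest run (an assumption of this sketch).
        "spread_pct": (hi - lo) / lo * 100.0,
    }


def delta_pct(measured_us: float, baseline_us: float) -> float:
    """Positive result means the measured value is slower than the baseline."""
    return (measured_us - baseline_us) / baseline_us * 100.0
```

Using findall rather than a single search is what lets one concatenated log from several runs feed a single average, so an outlier run moves the result less than it would with first-match parsing.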

Why warn-only

GitHub Actions has no native per-job "yellow/neutral" conclusion. The guard exits non-zero on regression and the step carries continue-on-error: true, so the step renders yellow ⚠ while the job conclusion stays green. Perf is a signal, not a gate.
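
The warn-only reporting might look roughly like the sketch below. The `::warning` workflow command and the GITHUB_STEP_SUMMARY environment variable are standard GitHub Actions mechanisms; the function name report, the message layout, and the choice to write a summary row on every run are assumptions of this example, not the PR's code.

```python
import os


def report(measured_us: float, baseline_us: float, threshold_pct: float) -> int:
    """Print the comparison, annotate on regression, and return an exit code."""
    delta = (measured_us - baseline_us) / baseline_us * 100.0
    regressed = delta > threshold_pct
    verdict = "REGRESSION" if regressed else "OK"
    line = (
        f"measured={measured_us:.2f}us baseline={baseline_us:.2f}us "
        f"delta={delta:+.2f}% verdict={verdict}"
    )
    print(line)  # keeps the plain CI log self-explanatory

    if regressed:
        # Workflow command: the runner turns this stdout line into a warning annotation.
        print(f"::warning title=Perf regression::{line}")

    # The runner sets GITHUB_STEP_SUMMARY; it is usually absent in local runs.
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(
                f"| qwen3_14b_decode | {measured_us:.2f} | {baseline_us:.2f} "
                f"| {delta:+.2f}% | {verdict} |\n"
            )

    # Non-zero marks the step as failed; continue-on-error on the step keeps the job green.
    return 1 if regressed else 0
```

A caller would pass the return value to sys.exit, leaving the job-level outcome entirely to the workflow's continue-on-error setting.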

Baseline maintenance

The baseline is updated manually via PR only when a perf change is intentional — lower it to lock in an optimization, never raise it to silence a real regression. The 15% threshold absorbs normal jitter, so routine PRs never need to touch it.
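
To make the threshold arithmetic concrete, the sketch below computes the slowest average that still passes. The field names value and threshold_pct follow the PR description; representing an unseeded baseline as a null value, the inline JSON string, and the helper name regression_limit are assumptions for illustration only.

```python
import json
from typing import Optional

# Hypothetical baseline content mirroring the fields named in the PR description.
# The pypto_ref value is a placeholder, not a real commit.
BASELINE_JSON = '{"value": 1016.66, "threshold_pct": 15, "pypto_ref": "<commit>"}'


def regression_limit(baseline: dict) -> Optional[float]:
    """Return the slowest acceptable average in us, or None for an unseeded baseline."""
    value = baseline.get("value")
    if value is None:
        # Assumed convention: no value means report-only, nothing to compare against.
        return None
    return value * (1.0 + baseline["threshold_pct"] / 100.0)


limit = regression_limit(json.loads(BASELINE_JSON))
print(f"warn above {limit:.2f} us")
```

With the committed numbers, 1016.66 us times 1.15 is roughly 1169.16 us. A run that is 8.5% slower than baseline lands near 1103 us, comfortably under that limit, which is why routine jitter should not trigger the warning.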

Testing

  • perf_guard.py verified across paths: multi-run averaging (5 samples → mean + spread), within-threshold → green, regression → yellow + exit 1, unseeded baseline → report-only, missing/empty log → warning + exit 1.
  • decode_layer.py -p a2a3 --enable-l2-swimlane --max-seq --seed 1234 confirmed to run, PASS correctness on device, and emit the Total Test Time line.
  • Both workflow YAMLs parse; qwen3-perf job structure and notify-on-failure wiring validated.
  • pre-commit (check-headers, ruff, pyright, check-yaml) green.

coderabbitai Bot commented Jun 12, 2026


Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

This PR adds a non-blocking performance regression detection system for the Qwen3-14B model decode benchmark. It defines a baseline metric, implements a CI guard script that compares run results against baselines, and integrates the guard into the workflow to warn on regressions without failing builds.

Changes

Qwen3-14B Performance Regression Guard

| Layer / File(s) | Summary |
|---|---|
| Baseline performance metric (.github/perf_baselines/qwen3_14b_decode.json) | Establishes a makespan_us baseline of 1016.66 µs with 15% threshold, reference commit, and documentation of capture methodology. |
| Regression detection script (.github/scripts/perf_guard.py) | Parses run logs to extract performance metrics, loads baseline JSON with "unseeded" support, computes percentage delta, and reports via GitHub Actions warnings and step summaries without exiting with failure when baseline is absent or metric is within threshold. |
| Workflow integration (.github/workflows/ci.yml) | Qwen3-14B decode job now includes a perf run step (with swimlane flags, output logged to file) and a perf guard step (runs in always() mode to compare log against baseline, both steps set to continue-on-error). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A guard hops in to watch the perf,
Baseline locked, no build-time dirge,
Just yellow warnings, never red—
Swimlane logs dance ahead.
Regression caught with gentle care,
The baseline baseline, beyond compare!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding daily Qwen3-14B decode performance monitoring with warn-only behavior. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering the three new/modified files, implementation details, and testing verification. |


@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a performance-regression guard for CI model runs. It adds a new Python script, perf_guard.py, which parses execution logs to extract performance metrics (such as makespan) and compares them against committed baselines. It also adds an initial baseline configuration file for the qwen3_14b_decode model. No review comments were provided, and there is no additional feedback to address.


@luohuan19 luohuan19 force-pushed the profilling-decode branch 2 times, most recently from bef0723 to 89a626b Compare June 15, 2026 01:27
Run decode_layer.py with --enable-l2-swimlane to capture the device
makespan ('Total Test Time: <X> us') and compare it against a committed
baseline. A regression beyond the threshold (or missing perf data) emits
a GitHub warning annotation and a Step Summary row, and exits non-zero so
the step renders yellow while continue-on-error keeps the job green.

- .github/scripts/perf_guard.py: stdlib-only guard parsing the run log
- .github/perf_baselines/qwen3_14b_decode.json: baseline 1016.66us
  (median of 3 device runs), threshold 15% to absorb ~8.5% jitter
- ci.yml: perf run + guard steps in the pypto-lib-model job, both
  continue-on-error so the perf signal is advisory only
@luohuan19 luohuan19 force-pushed the profilling-decode branch from 89a626b to aec8f8d Compare June 15, 2026 01:53
- perf_guard.py: replace single-line copyright with the full CANN
  license header and drop the shebang (repo convention: header at
  line 1, no shebang) — fixes the check-headers hook
- apply ruff-format
- echo the baseline comparison (measured/baseline/delta/verdict) to
  stdout so the CI log is self-explanatory, not only the Step Summary
luohuan19 added a commit to luohuan19/pypto that referenced this pull request Jun 15, 2026
…samples

Address review feedback on hw-native-sys#1766: per-PR CI should only check functional
correctness; perf is noisy (~8.5% jitter) and belongs in a scheduled job
that samples multiple times.

- ci.yml: drop the perf run + perf_guard steps from pypto-lib-model;
  keep the functional Qwen3-14B decode example run
- daily_ci.yml: add a 'qwen3-perf' job (a2a3 npu-1-device container,
  mirroring the ci.yml setup) that runs decode 5x and feeds the guard;
  wire it into notify-on-failure so hard failures still open an issue
- perf_guard.py: average the makespan across all sampled runs in the log
  (findall instead of first match), report per-run samples and spread

Baseline stays on a2a3 hardware, so the committed 1016.66us value remains
comparable; the guard now compares the 5-run average against it.
@luohuan19 luohuan19 force-pushed the profilling-decode branch from b5fbe40 to b952e55 Compare June 15, 2026 08:29
@luohuan19 luohuan19 changed the title from "ci(pypto-lib-model): add warn-only perf guard for Qwen3-14B decode" to "ci(perf): add daily Qwen3-14B decode performance monitoring (warn-only)" Jun 15, 2026
@luohuan19 luohuan19 force-pushed the profilling-decode branch 3 times, most recently from 215d0f1 to 69b65af Compare June 15, 2026 08:59
The branch was cut from an older main; the runtime gitlink had drifted
to an unrelated commit. Restore it to main's pointer so this PR only
contains the Qwen3 perf-monitoring change.
@lyfne123 lyfne123 merged commit 90de055 into hw-native-sys:main Jun 15, 2026
10 checks passed
2 participants