perf(dsv4 ratio4): coarsen softmax_pool to POOL_GROUP=2 batches per task by wangqin1723-max · Pull Request #497 · hw-native-sys/pypto-lib

wangqin1723-max · 2026-06-11T02:29:31Z

What

Coarsen the ratio-4 compressor softmax_pool loop from one task per batch (pl.spmd(B) = 64 tasks) to POOL_GROUP=2 batches per task (pl.spmd(B // POOL_GROUP) = 32 tasks), with an inner pl.unroll(POOL_GROUP) over the batches.

pl.unroll (NOT pl.range): trace-time unroll gives each batch its own independent AST values for the online-softmax accumulators (mi/li/oi). A runtime pl.range would loop-carry them across batches → intermittent NaN/507018.
Each batch keeps its own data-dependent gate (pos_b + S >= COMPRESS_RATIO) and window positions, so the unroll is bit-identical to the per-batch form.
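The accumulator argument above can be illustrated with a minimal plain-Python sketch. It does not use the pypto `pl` API: ordinary `for` loops stand in for `pl.spmd` and `pl.unroll`, the batch size and data are made up, and the function names (`online_softmax_pool`, `per_batch`, `grouped`) are hypothetical. The data-dependent gate is omitted. The toy only shows cross-batch contamination when accumulators are carried; it does not reproduce the NaN/507018 failure.

```python
import math

POOL_GROUP = 2  # batches per task, as in this PR


def online_softmax_pool(scores, values, state=None):
    """One online-softmax pass; returns the (mi, li, oi) accumulators."""
    mi, li, oi = state if state is not None else (-math.inf, 0.0, 0.0)
    for s, v in zip(scores, values):
        m_new = max(mi, s)
        scale = math.exp(mi - m_new)  # rescales the old sums; 0.0 on the first step
        p = math.exp(s - m_new)
        li = li * scale + p
        oi = oi * scale + p * v
        mi = m_new
    return mi, li, oi


def per_batch(scores, values):
    """Baseline shape: one task per batch, fresh accumulators each time."""
    out = []
    for b in range(len(scores)):
        _, li, oi = online_softmax_pool(scores[b], values[b])
        out.append(oi / li)
    return out


def grouped(scores, values, carry=False):
    """POOL_GROUP batches per task; carry=True mimics loop-carried accumulators."""
    out = [0.0] * len(scores)
    for g in range(len(scores) // POOL_GROUP):  # stands in for the grouped tasks
        state = None
        for u in range(POOL_GROUP):  # stands in for the unrolled inner loop
            b = g * POOL_GROUP + u
            start = state if carry else None  # fresh accumulators unless carrying
            state = online_softmax_pool(scores[b], values[b], start)
            out[b] = state[2] / state[1]
    return out


# Hypothetical toy data: 4 batches, 2 window positions each.
SCORES = [[0.0, 1.0], [2.0, 0.5], [1.0, 1.0], [0.3, 0.7]]
VALUES = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

print("per-batch:", per_batch(SCORES, VALUES))
print("grouped  :", grouped(SCORES, VALUES))
print("carried  :", grouped(SCORES, VALUES, carry=True))
```

With fresh accumulators the grouped form runs exactly the same arithmetic per batch as the baseline, so the results compare equal; with carried accumulators the second batch of each group is pooled together with the first.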

Why / Tuning

softmax_pool was a fine-grained scope (64 tiny online-softmax tasks, Exec% 58%, tail OH 8.1µs) that floods the scheduler. Grouping batches per task amortizes the per-task overhead. A/B sweep on standalone decode_compressor_ratio4.py (a2a3, kv/compress_state/cmp_kv_cache all PASS):

| POOL_GROUP | softmax_pool tasks | Exec% | tail OH | Total Test Time |
|---|---|---|---|---|
| 1 (baseline) | 64 | 58.3% | 8.1 µs | 327.32 µs |
| 2 (this PR) | 32 | 85.0% | 3.0 µs | 261.22 µs (−20.2%) |
| 4 | 16 | 93.4% | 2.5 µs | 324.70 µs (over-coarsened, core-starved → back to baseline) |

POOL_GROUP=2 is the sweet spot — a clean U-curve. POOL_GROUP=4 over-coarsens (16 big serial tasks pinned to 16 cores) and collapses back to baseline.
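The task counts and relative changes in the sweep follow from simple arithmetic. A minimal sketch, using only the figures quoted in this PR (B = 64 and the three Total Test Time values); the helper names `task_count` and `delta_pct` are hypothetical:

```python
B = 64  # batch count, from "pl.spmd(B) = 64 tasks"

# Total Test Time in microseconds per POOL_GROUP, copied from the sweep table.
SWEEP_US = {1: 327.32, 2: 261.22, 4: 324.70}


def task_count(batch, pool_group):
    """Number of softmax_pool tasks when pool_group batches share one task."""
    if batch % pool_group != 0:
        raise ValueError("B must be divisible by POOL_GROUP")
    return batch // pool_group


def delta_pct(time_us, baseline_us):
    """Relative change against the baseline, in percent (negative is faster)."""
    return (time_us - baseline_us) / baseline_us * 100.0


baseline = SWEEP_US[1]
for group, time_us in SWEEP_US.items():
    print(f"POOL_GROUP={group}: {task_count(B, group)} tasks, "
          f"{time_us:.2f} us, {delta_pct(time_us, baseline):+.1f}% vs baseline")
```

The POOL_GROUP=4 row lands within about one percent of the baseline, which is what "back to baseline" refers to.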

Note on the full CSA orchestrator

In decode_attention_csa.py this scope is overlapped by qr_rope in the same window, so the standalone −20% is largely hidden at the CSA Total level (CSA wall-clock is dominated by qk_pv ~442µs and gather_kv ~332µs). The standalone gain is the scope-level signal; this change is bit-identical and a strict scope-level win, with no CSA regression.

gemini-code-assist

Code Review

This pull request coarsens the softmax_pool SPMD loop in decode_compressor_ratio4.py by grouping batches into POOL_GROUP tasks and unrolling them to prevent intermittent NaN issues. Feedback on these changes suggests replacing the newly added assert statement with a ValueError check, as assertions can be globally disabled in optimized Python environments.


gemini-code-assist · 2026-06-11T02:30:35Z

 HEAD_DIM_TILE = 128
 RMS_TILE = 16
+POOL_GROUP = 4   # batches per softmax_pool task (pl.unroll; B % POOL_GROUP == 0)
+assert B % POOL_GROUP == 0, "B must be divisible by POOL_GROUP"


Using assert statements for production configuration or runtime validation is discouraged because assertions can be globally disabled when Python is run with optimization flags (e.g., python -O). It is safer and more robust to raise a ValueError instead, which is also consistent with the validation pattern used in decode_compressor_ratio128.py.

Suggested change

-assert B % POOL_GROUP == 0, "B must be divisible by POOL_GROUP"
+if B % POOL_GROUP != 0:
+    raise ValueError("B must be divisible by POOL_GROUP")
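To make the reviewer's point concrete, here is a minimal sketch of the explicit check wrapped in a helper. The function name `check_pool_group` and the extra guard against a non-positive group size are assumptions for illustration, not part of the PR:

```python
def check_pool_group(batch: int, pool_group: int) -> int:
    """Validate the grouping and return the resulting task count.

    Unlike an assert, this check still runs under `python -O`.
    """
    # Assumption: also reject non-positive group sizes (not in the PR).
    if pool_group <= 0:
        raise ValueError(f"POOL_GROUP must be positive, got {pool_group}")
    if batch % pool_group != 0:
        raise ValueError(
            f"B ({batch}) must be divisible by POOL_GROUP ({pool_group})"
        )
    return batch // pool_group


print(check_pool_group(64, 2))

try:
    check_pool_group(64, 3)
except ValueError as err:
    print("rejected:", err)
```

Running the same file with `python -O` strips `assert` statements, so only the explicit `raise` form is guaranteed to catch a bad configuration.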

coderabbitai · 2026-06-11T02:30:44Z

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 4b24c1d and 6fe66fe.

📒 Files selected for processing (1)

models/deepseek/v4/decode_compressor_ratio4.py

Walkthrough

This PR refactors the softmax-pooling stage in the DeepSeek v4 decode compressor to use grouped batch parallelization. It introduces a POOL_GROUP constant with a divisibility assertion, then restructures the core pooling loop from single-batch SPMD tasks to grouped-batch execution, replacing per-batch loops with grouped allocation and inner batch unrolling.

Changes

Grouped Softmax Pooling

| Layer / File(s) | Summary |
|---|---|
| Pool grouping configuration `models/deepseek/v4/decode_compressor_ratio4.py` | Introduces `POOL_GROUP = 2` constant and adds assertion that batch dimension `B` is evenly divisible by `POOL_GROUP`. |
| Grouped softmax-pool computation `models/deepseek/v4/decode_compressor_ratio4.py` | Replaces per-batch `pl.spmd(B, ...)` loop with `pl.spmd(B // POOL_GROUP, ...)` plus inner `pl.unroll(POOL_GROUP)` iteration, maintaining identical window-boundary overlap logic and mi/li/oi state accumulation across front/back segments. |

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

hw-native-sys/pypto-lib#492: Both PRs modify softmax-pool SPMD loop scheduling in the same file, transitioning batch grouping strategy with POOL_GROUP-based parallelization.
hw-native-sys/pypto-lib#415: Both refactor compressor softmax-pooling stages to use grouped batching with POOL_GROUP divisibility checks and pl.spmd(...//POOL_GROUP) plus pl.unroll(POOL_GROUP) patterns.
hw-native-sys/pypto-lib#418: Both modify the same softmax-pooling stage in decode_compressor_ratio4.py, restructuring the SPMD parallelization and pooling phase logic.

Suggested Labels

enhancement


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: coarsening the softmax_pool loop by grouping batches with POOL_GROUP=2, which is the core performance optimization in this PR. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining both the technical implementation (pl.unroll vs pl.range, bit-identical behavior) and performance tuning results with detailed benchmarks. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


The softmax_pool loop ran one task per batch (pl.spmd(B)=64 tasks), each a small online-softmax over HEAD_DIM//HEAD_TILE head tiles -- fine-grained, high per-task tail overhead. Group POOL_GROUP=2 batches per task via pl.unroll (B//POOL_GROUP=32 tasks).

pl.unroll (NOT pl.range) is required: trace-time unroll gives each batch its own independent AST values for the online-softmax accumulators (mi/li/oi), whereas a runtime pl.range would loop-carry them across batches -> intermittent NaN/507018. Each batch keeps its own data-dependent gate and positions, so the unroll is bit-identical to the per-batch form.

Standalone decode_compressor_ratio4 on a2a3 (kv/compress_state/cmp_kv_cache all PASS), softmax_pool task count / Total Test Time A/B:

POOL_GROUP=1 (64 tasks): 327.32 us (Exec% 58.3, tail OH 8.1 us)
POOL_GROUP=2 (32 tasks): 261.22 us (-20.2%, Exec% 85.0, tail OH 3.0 us)
POOL_GROUP=4 (16 tasks): 324.70 us (over-coarsened, core-starved -> back to baseline)

POOL_GROUP=2 is the sweet spot. In the full CSA orchestrator this scope is overlapped by qr_rope so the win is largely hidden there; the standalone gain is the scope-level signal.

gemini-code-assist Bot reviewed Jun 11, 2026


wangqin1723-max force-pushed the perf/dsv4-compressor-softmax-pool-coarsen branch from 11194ad to 6fe66fe June 11, 2026 03:05

wangqin1723-max changed the title ~~perf(dsv4 ratio4): coarsen softmax_pool to POOL_GROUP batches per task~~ perf(dsv4 ratio4): coarsen softmax_pool to POOL_GROUP=2 batches per task Jun 11, 2026

wangqin1723-max marked this pull request as ready for review June 11, 2026 03:06

wangqin1723-max wants to merge 1 commit into hw-native-sys:main from wangqin1723-max:perf/dsv4-compressor-softmax-pool-coarsen
