perf(dsv4 ratio4): coarsen softmax_pool to POOL_GROUP=2 batches per task#497

Open
wangqin1723-max wants to merge 1 commit into
hw-native-sys:mainfrom
wangqin1723-max:perf/dsv4-compressor-softmax-pool-coarsen

Conversation

@wangqin1723-max wangqin1723-max commented Jun 11, 2026

Contributor

What

Coarsen the ratio-4 compressor softmax_pool loop from one task per batch (pl.spmd(B) = 64 tasks) to POOL_GROUP=2 batches per task (pl.spmd(B // POOL_GROUP) = 32 tasks), with an inner pl.unroll(POOL_GROUP) over the batches.

  • pl.unroll (NOT pl.range): trace-time unroll gives each batch its own independent AST values for the online-softmax accumulators (mi/li/oi). A runtime pl.range would loop-carry them across batches → intermittent NaN/507018.
  • Each batch keeps its own data-dependent gate (pos_b + S >= COMPRESS_RATIO) and window positions, so the unroll is bit-identical to the per-batch form.
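
The accumulator point above can be illustrated with a minimal pure-Python sketch. It does not use the pl API: the helper names (`online_softmax_pool`, `pool_per_batch`, `pool_grouped`) and the toy data are hypothetical, and a plain Python loop with a fresh (mi, li, oi) triple per batch stands in for the trace-time unroll. The `carry_state=True` path only mimics accumulators leaking from one batch into the next; it yields wrong values rather than the NaN/507018 failure seen on hardware.

```python
import math

POOL_GROUP = 2  # batches handled by one task, as in this PR


def online_softmax_pool(scores, values, chunk=2, state=None):
    """Softmax-weighted pool of `values`, streamed `chunk` elements at a time.

    `state` is the (mi, li, oi) accumulator triple; None means a fresh start.
    """
    mi, li, oi = state if state is not None else (float("-inf"), 0.0, 0.0)
    for start in range(0, len(scores), chunk):
        s = scores[start:start + chunk]
        v = values[start:start + chunk]
        m_new = max(mi, max(s))
        scale = math.exp(mi - m_new)  # rescale the old accumulators
        p = [math.exp(x - m_new) for x in s]
        li = li * scale + sum(p)
        oi = oi * scale + sum(pj * vj for pj, vj in zip(p, v))
        mi = m_new
    return oi / li, (mi, li, oi)


def pool_per_batch(scores, values):
    """Baseline: one task per batch."""
    return [online_softmax_pool(s, v)[0] for s, v in zip(scores, values)]


def pool_grouped(scores, values, carry_state=False):
    """POOL_GROUP batches per task; carry_state=True mimics a loop-carried bug."""
    out = []
    for task in range(len(scores) // POOL_GROUP):
        state = None
        for j in range(POOL_GROUP):
            b = task * POOL_GROUP + j
            pooled, last = online_softmax_pool(scores[b], values[b], state=state)
            out.append(pooled)
            state = last if carry_state else None  # fresh state unless leaking
    return out


# Hypothetical toy data: 4 batches, 4 window positions each.
scores = [[0.0, 1.0, 2.0, 3.0], [3.0, 2.0, 1.0, 0.0],
          [0.5, 0.5, 0.5, 0.5], [-1.0, 4.0, 0.0, 2.0]]
values = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0],
          [5.0, 6.0, 7.0, 8.0], [-3.0, 0.0, 3.0, 9.0]]

baseline = pool_per_batch(scores, values)
grouped = pool_grouped(scores, values)
leaky = pool_grouped(scores, values, carry_state=True)
```

With fresh accumulators per batch, `grouped` runs exactly the same floating-point operations in the same order as `baseline`, so the two lists compare equal element by element. In `leaky`, only the first batch of each group matches the baseline.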

Why / Tuning

softmax_pool was a fine-grained scope (64 tiny online-softmax tasks, Exec% 58.3%, tail OH 8.1 µs) that flooded the scheduler. Grouping batches per task amortizes the per-task overhead. A/B sweep on standalone decode_compressor_ratio4.py (a2a3; kv/compress_state/cmp_kv_cache all PASS):

| POOL_GROUP | softmax_pool tasks | Exec% | tail OH | Total Test Time |
| --- | --- | --- | --- | --- |
| 1 (baseline) | 64 | 58.3% | 8.1 µs | 327.32 µs |
| 2 (this PR) | 32 | 85.0% | 3.0 µs | 261.22 µs (−20.2%) |
| 4 | 16 | 93.4% | 2.5 µs | 324.70 µs (over-coarsened, core-starved → back to baseline) |

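POOL_GROUP=2 is the sweet spot: Total Test Time traces a U-curve across the sweep. POOL_GROUP=4 over-coarsens (16 big serial tasks pinned to 16 cores) and falls back to roughly baseline (324.70 µs vs 327.32 µs), even though its Exec% and tail OH are the best of the three.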
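
As a quick check of the arithmetic behind the sweep, the relative changes can be recomputed from the Total Test Time column. This is a small sketch that only restates the three measured totals reported above; the variable names are illustrative.

```python
# Total Test Time in microseconds, keyed by POOL_GROUP (values from the A/B sweep).
totals_us = {1: 327.32, 2: 261.22, 4: 324.70}

baseline_us = totals_us[1]

# Percent change relative to the POOL_GROUP=1 baseline, rounded to one decimal.
delta_pct = {
    group: round(100.0 * (total - baseline_us) / baseline_us, 1)
    for group, total in totals_us.items()
}

# The setting with the lowest measured total time.
best_group = min(totals_us, key=totals_us.get)

for group in sorted(totals_us):
    print(f"POOL_GROUP={group}: {totals_us[group]:.2f} us ({delta_pct[group]:+.1f}%)")
print("best:", best_group)
```

POOL_GROUP=4 comes out at about −0.8% against the baseline, which is why the sweep treats it as back to baseline rather than a win.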

Note on the full CSA orchestrator

In decode_attention_csa.py this scope is overlapped by qr_rope in the same window, so the standalone −20% is largely hidden at the CSA Total level (CSA wall-clock is dominated by qk_pv ~442µs and gather_kv ~332µs). The standalone gain is the scope-level signal; this change is bit-identical and a strict scope-level win, with no CSA regression.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request coarsens the softmax_pool SPMD loop in decode_compressor_ratio4.py by grouping POOL_GROUP batches into each task and unrolling the inner loop over those batches at trace time, which avoids the intermittent NaN issues a runtime loop would introduce. Feedback on these changes suggests replacing the newly added assert statement with a ValueError check, as assertions can be globally disabled in optimized Python environments.

HEAD_DIM_TILE = 128
RMS_TILE = 16
POOL_GROUP = 2  # batches per softmax_pool task (pl.unroll; B % POOL_GROUP == 0)
assert B % POOL_GROUP == 0, "B must be divisible by POOL_GROUP"

medium

Using assert statements for production configuration or runtime validation is discouraged because assertions can be globally disabled when Python is run with optimization flags (e.g., python -O). It is safer and more robust to raise a ValueError instead, which is also consistent with the validation pattern used in decode_compressor_ratio128.py.

Suggested change
- assert B % POOL_GROUP == 0, "B must be divisible by POOL_GROUP"
+ if B % POOL_GROUP != 0:
+     raise ValueError("B must be divisible by POOL_GROUP")
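
The difference matters because `python -O` strips assert statements, while an explicit raise always runs. A minimal sketch of the suggested check wrapped in a helper follows; the function name `softmax_pool_task_count` and the extra positivity check are illustrative assumptions, not code from this PR or from decode_compressor_ratio128.py.

```python
def softmax_pool_task_count(batch_size: int, pool_group: int) -> int:
    """Return the number of softmax_pool tasks, validating the grouping first.

    Hypothetical helper: raises ValueError instead of using assert so the
    check still runs when Python is started with -O.
    """
    if pool_group <= 0:
        raise ValueError("POOL_GROUP must be a positive integer")
    if batch_size % pool_group != 0:
        raise ValueError(
            f"B ({batch_size}) must be divisible by POOL_GROUP ({pool_group})"
        )
    return batch_size // pool_group


print(softmax_pool_task_count(64, 2))  # 64 batches, 2 per task

try:
    softmax_pool_task_count(64, 3)
except ValueError as exc:
    print("rejected:", exc)
```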

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
📥 Commits

Reviewing files that changed from the base of the PR and between 4b24c1d and 6fe66fe.

📒 Files selected for processing (1)
  • models/deepseek/v4/decode_compressor_ratio4.py

📝 Walkthrough

This PR refactors the softmax-pooling stage in the DeepSeek v4 decode compressor to use grouped batch parallelization. It introduces a POOL_GROUP constant with a divisibility assertion, then restructures the core pooling loop from single-batch SPMD tasks to grouped-batch execution, replacing per-batch loops with grouped allocation and inner batch unrolling.

Changes

Grouped Softmax Pooling

| Layer / File(s) | Summary |
| --- | --- |
| Pool grouping configuration (models/deepseek/v4/decode_compressor_ratio4.py) | Introduces POOL_GROUP = 2 constant and adds assertion that batch dimension B is evenly divisible by POOL_GROUP. |
| Grouped softmax-pool computation (models/deepseek/v4/decode_compressor_ratio4.py) | Replaces per-batch pl.spmd(B, ...) loop with pl.spmd(B // POOL_GROUP, ...) plus inner pl.unroll(POOL_GROUP) iteration, maintaining identical window-boundary overlap logic and mi/li/oi state accumulation across front/back segments. |
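
The index mapping implied by the grouped loop can be sketched in plain Python. This assumes task `t` handles batches `t * POOL_GROUP + j` for `j` in `range(POOL_GROUP)`; the exact indexing in decode_compressor_ratio4.py is not shown on this page, and `batches_for_task` is a hypothetical name.

```python
B = 64          # batch count from the PR description
POOL_GROUP = 2  # batches per softmax_pool task


def batches_for_task(task_id: int) -> list[int]:
    """Batch indices covered by one task, under the assumed contiguous mapping."""
    return [task_id * POOL_GROUP + j for j in range(POOL_GROUP)]


num_tasks = B // POOL_GROUP

# Flatten every task's batches and confirm each batch is visited exactly once.
covered = [b for t in range(num_tasks) for b in batches_for_task(t)]

print(num_tasks)                        # number of tasks after coarsening
print(batches_for_task(0))              # first task's batches
print(batches_for_task(num_tasks - 1))  # last task's batches
```

Under that assumption, the grouped loop visits the same batches in the same order as the per-batch loop, just with half as many tasks.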

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

  • hw-native-sys/pypto-lib#492: Both PRs modify softmax-pool SPMD loop scheduling in the same file, transitioning batch grouping strategy with POOL_GROUP-based parallelization.
  • hw-native-sys/pypto-lib#415: Both refactor compressor softmax-pooling stages to use grouped batching with POOL_GROUP divisibility checks and pl.spmd(...//POOL_GROUP) plus pl.unroll(POOL_GROUP) patterns.
  • hw-native-sys/pypto-lib#418: Both modify the same softmax-pooling stage in decode_compressor_ratio4.py, restructuring the SPMD parallelization and pooling phase logic.

Suggested Labels

enhancement

Poem

🐰 A rabbit hops through batches bright,
Grouped loops now pool with pooled might,
Where B // POOL_GROUP leads the way,
SPMD tasks dance every day,
Overlap windows hold their ground true!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: coarsening the softmax_pool loop by grouping batches with POOL_GROUP=2, which is the core performance optimization in this PR. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining both the technical implementation (pl.unroll vs pl.range, bit-identical behavior) and performance tuning results with detailed benchmarks. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@wangqin1723-max wangqin1723-max force-pushed the perf/dsv4-compressor-softmax-pool-coarsen branch from 11194ad to 6fe66fe Compare June 11, 2026 03:05
The softmax_pool loop ran one task per batch (pl.spmd(B)=64 tasks), each a small
online-softmax over HEAD_DIM//HEAD_TILE head tiles -- fine-grained, high per-task
tail overhead. Group POOL_GROUP=2 batches per task via pl.unroll (B//POOL_GROUP=32
tasks). pl.unroll (NOT pl.range) is required: trace-time unroll gives each batch
its own independent AST values for the online-softmax accumulators (mi/li/oi),
whereas a runtime pl.range would loop-carry them across batches -> intermittent
NaN/507018. Each batch keeps its own data-dependent gate and positions, so the
unroll is bit-identical to the per-batch form.

Standalone decode_compressor_ratio4 on a2a3 (kv/compress_state/cmp_kv_cache all
PASS), softmax_pool task count / Total Test Time A/B:
  POOL_GROUP=1 (64 tasks): 327.32 us  (Exec% 58.3, tail OH 8.1 us)
  POOL_GROUP=2 (32 tasks): 261.22 us  (-20.2%, Exec% 85.0, tail OH 3.0 us)
  POOL_GROUP=4 (16 tasks): 324.70 us  (over-coarsened, core-starved -> back to baseline)
POOL_GROUP=2 is the sweet spot. In the full CSA orchestrator this scope is
overlapped by qr_rope so the win is largely hidden there; the standalone gain is
the scope-level signal.
@wangqin1723-max wangqin1723-max changed the title perf(dsv4 ratio4): coarsen softmax_pool to POOL_GROUP batches per task perf(dsv4 ratio4): coarsen softmax_pool to POOL_GROUP=2 batches per task Jun 11, 2026
@wangqin1723-max wangqin1723-max marked this pull request as ready for review June 11, 2026 03:06