Cherry-pick FP4 acts + MXF4 + DG_USE_FP8_COMBINE; wrap mega_moe_pre_dispatch in tvm-ffi by Fridge003 · Pull Request #33 · sgl-project/DeepGEMM

Fridge003 · 2026-05-12T22:52:33Z

No description provided.

…el (#27) Two related additions for the DeepSeek-V4-Pro mega-MoE path: 1. **FP4 (E2M1) activations + `kind::mxf4` mainloop opt-in** for `fp8_fp4_mega_moe`. - `DG_USE_FP4_ACTS=1` halves the symm-buffer x-slot footprint (E2M1 nibbles vs E4M3 bytes); SF slot unchanged (still `hidden/32` UE8M0 bytes under gran_k=32). - `use_mxf4_kind=true` switches the L1+L2 mainloops to `cta_group::2 kind::mxf4` (2-CTA cluster) with dense FP4 smem layout (`_ALIGN8B`, 2 nibbles/byte). Per-stage A/B byte footprint halves → num_stages doubles for the same smem budget. - Threads `cumulative_local_expert_recv_stats` through the public mega-MoE API for per-rank expert counters used by sglang's expert-distribution recorder. - Block-m heuristic: under `use_mxf4_kind`, bumps `block_m=16 → 32` for the smallest-tokens-per-expert bucket so `load_block_m * block_k / 2` meets the 1024-byte smem alignment. - Multi-block_m support via `kCandidateBlockM` array + LCM-aligned pool padding; replaces the static `block_m=192` heuristic with token-density dispatch (8/16/32/64/96/128/192). 2. **`mega_moe_pre_dispatch` kernel**: BF16 → quant + topk-copy + pad-fill in one launch, gated on `kUseFp4Acts` + `kUsePDL`. Templated on `(kGroupSize, kUseFp4Acts, kUsePDL)`. Uses bucketize-style E2M1 encoder for byte-exact match against the `per_token_cast_to_fp4` host helper. - New: `deep_gemm.mega_moe_pre_dispatch(x, topk_idx, topk_weights, buf_x, buf_x_sf, buf_topk_idx, buf_topk_weights, num_tokens, group_size, use_fp4_acts)` - Test: `tests/test_mega_moe_pre_dispatch.py` — single-GPU bytewise check against host `per_token_cast_to_fp{8,4}` + pad-fill assertion. Validated end-to-end on 8× B300 with DeepSeek-V4-Pro at 8K input bench: - FP4 acts + MXF4 kind path produces matching tokens vs the FP8 baseline (rel-RMSE ≤ 0.5 sentinel; GSM8K accuracy parity within run-to-run variance). PR also includes existing FP4-mega-MoE supporting changes that are required by the kernel: - `cluster_sync_with_relaxed_arrive` helper (used twice in `sm100_fp8_fp4_mega_moe.cuh`). - `cvt_pack_f32_to_e2m1x2` / `cvt_pack_f32x4_to_e2m1x4` PTX wrappers. - `SM100_MMA_MXF4_2x1SM_SS` 2-CTA cluster MMA wrapper. - Generalized `red_add(int*, int)` for the `cumulative_local_expert_recv_stats` counter. - `st.L1::no_allocate.relaxed.sys.global.u64` (correctness fix: previous generic-address variant could miss the global state space). Co-authored-by: pranjalssh <adkz.photos@gmail.com> (cherry picked from commit bca278e)

…bine path) (#28) * Add DG_USE_FP8_COMBINE: FP8 + per-row UE8M0 SF on the second a2a (combine path) The mega-MoE second all-to-all (combine) currently ships BF16 over NVLink: each token, each topk slot = kHidden * 2 bytes. This commit adds an env- gated FP8 path that ships FP8 E4M3 + a per-(token, N=128) UE8M0 SF byte — kHidden + kHidden/128 bytes per token per slot, half the NVLink bytes. Wiring: - New `kUseFp8Combine` template flag (default false → keeps BF16 path byte-identical when off). - New `combine_sf_buffer` symm-buffer slot, sized kHidden/128 bytes per (token, slot) when on, zero when off. - Host: `DG_USE_FP8_COMBINE=1` env flag in `mega.hpp`. Independent of `DG_USE_FP4_ACTS` / `DG_USE_MXF4_KIND` (those control the dispatch a2a + mainloops; this controls the combine a2a only). Producer side (L2 epilogue write-back, sm100_fp8_fp4_mega_moe.cuh): - Read 8 BF16 from smem (existing STSM target). - Compute per-row amax via `__shfl_xor_sync` reduction over the 16 lanes that share each row tile. Use a 16-lane mask (NOT 0xffffffff) — the outer `if (m_idx_in_block >= valid_m) break` may cause the OTHER half- warp to exit on padding rows, and a full-warp shfl would deadlock. - Compute UE8M0 SF (E4M3 finfo_max=448, mirrors `get_e4m3_sf_and_sf_inv`). - Cast 8 BF16 → 8 FP8 via `__nv_fp8x4_e4m3(float4)` ×2; pack into uint64. - Write 8 FP8 bytes to remote (vs 16 BF16 bytes). Lane 0 of the 16-lane group writes the SF byte to `combine_sf_buffer`. Consumer side (combine reduce): - Per-slot SF base ptr cached at slot start. - TMA-load FP8 chunk (kNumChunkBytes / 2 bytes when kUseFp8Combine). - Per uint4 (16 FP8): __ldg the SF byte for the segment; FP8 → FP16x2 via `cvt.rn.f16x2.e4m3x2`, FP16 → FP32 via `cvt.f32.f16`, then `__fmaf_rn(val, sf, acc)` for the accumulate-with-dequant. - BF16 store-buffer layout for FP8 path: 2 BF16 uint4 per input uint4 (16 elements → 2 × 8 BF16 stripes), at indices (j*32+lane)*2 + {0,1}. Total store uint4/lane same as BF16 path (kNumChunkUint4Bf16 / 32). Validation: - Microbench (`ptx/d_combine_reduce_v{1,2}_*`): - v1 BF16 baseline: 6,895 cycles/token, max_abs=0 (perfect). - v2 FP8 + UE8M0 SF: correctness PASS (max_abs=0 vs host reference that uses the same FP8 quant), 50% NVLink bytes savings. - Single-GPU iso bench (8x B300, fp4_mxf4 vs fp4_mxf4+combine): - b=128: 364 us → 359 us (+1.5%) - b=512: 377 us → 386 us (-2.2%) - b=2048: 710 us → 739 us (-3.9%) Single-GPU is compute-bound (no NVLink saving); production is the point of the change. - E2E DeepSeek-V4-Pro on 8x B300 (b=8192 input, 1024 output): - b=512: 91.92 s (FP8) → 78.37 s (FP4+MXF4+FP8combine) — +17.3% - b=2048: 259.4 s (FP8) → 238.2 s — +8.9% - b=4096: 489.5 s (FP8) → 444.2 s — +10.2% Sentinel test (FP4 acts vs FP8 acts): rel-RMSE <= 0.5 still passes. Numerical: rel-RMSE on synthetic random init = 0.027 (combine FP8 vs BF16 baseline, w/o SwiGLU clamping → tail outliers). Real activations post-SwiGLU + topk-weighting are bounded; production accuracy parity preserved (same GSM8K results as FP4 baseline). * Combine reduce: HFMA path (FP16 accumulator + fma.f16x2) Switch the FP8 combine reduce inner loop from FP32 accumulator + scalar fma to FP16x2 accumulator + hfma.f16x2. Halves the per-element op count and halves the accumulator register pressure (94 regs vs 138 regs). Inner loop, before: cvt.rn.f16x2.e4m3x2 (FP8x2 → FP16x2) cvt.f32.f16 ×2 (FP16 → FP32) fma.rn.f32 ×2 (acc += sf_f32 * f32_val) = 5 ops per FP8x2 (= 2 elements) After: cvt.rn.f16x2.e4m3x2 (FP8x2 → FP16x2) fma.rn.f16x2 (acc_fp16x2 += sf_pair * f16x2) = 2 ops per FP8x2 SF in FP16: UE8M0 byte → 1.0 * 2^(byte-127), packed as FP16 with bias 15. Out-of-range SFs (byte < 112 or > 142) clamp to 0 / FP16-max — production activations post-SwiGLU + topk-weighting fit comfortably in FP16 range. End cast: FP16x2 → __half22float2 → __float22bfloat162_rn for the gmem write-back (BF16 output unchanged). Microbench (`ptx/d_combine_reduce_v3_fp8_hfma`): v1 BF16 baseline: 6,895 cycles/token v2 FP8 + FP32 acc: 10,797 cycles/token (+57% vs v1) v3 FP8 + FP16 HFMA: **5,799 cycles/token (-16% vs v1, -46% vs v2)** E2E DeepSeek-V4-Pro 8x B300, 8K input + 1024 output: | batch | FP4+MXF4 | combine FP32 | combine HFMA | |------:|---------:|-------------:|-------------:| | 512 | — | 7,526 | 7,350 | | 2048 | 9,814 | 9,903 | **9,992** | | 4096 | 10,418 | 10,622 | **10,699** | HFMA wins at 2048/4096; ~tie at 512. Worth keeping as the default. Numerical: v3 microbench correctness max_abs=0.0625, rel_rmse=3.8e-4 vs the FP32 reference. Production activations: still within sentinel tolerance (rel-RMSE ≤ 0.5 vs FP8 baseline). * Revert "Combine reduce: HFMA path (FP16 accumulator + fma.f16x2)" This reverts commit 48e8101. --------- Co-authored-by: pranjalssh <adkz.photos@gmail.com> (cherry picked from commit 8fc78b4)

pranjalssh and others added 3 commits May 12, 2026 15:45

Add tvm-ffi wrapper for w4a4 megamoe

7fc5b56

Fridge003 merged commit 8ab0096 into release-0426 May 12, 2026
2 of 3 checks passed

Fridge003 deleted the release-0426-upd branch May 12, 2026 22:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cherry-pick FP4 acts + MXF4 + DG_USE_FP8_COMBINE; wrap mega_moe_pre_dispatch in tvm-ffi#33

Cherry-pick FP4 acts + MXF4 + DG_USE_FP8_COMBINE; wrap mega_moe_pre_dispatch in tvm-ffi#33
Fridge003 merged 3 commits into
release-0426from
release-0426-upd

Fridge003 commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Fridge003 commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants