SM90 (Hopper) FP4 MegaMoE fused kernel with swapAB small-batch path #53

Open
qiushixiaoyu wants to merge 4 commits into sgl-project:dev from qiushixiaoyu:fp4_swapAB

qiushixiaoyu commented Jun 29, 2026

This PR adds an FP4-weight MegaMoE fused kernel that:

  • cuts expert-weight memory traffic ~2× vs FP8 by using packed FP4 (E2M1) weights, and
  • adds a swapAB tiling for small batches (weight on the WGMMA M dimension,
    tokens on N), which is more efficient when tokens-per-rank is small.
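As a rough illustration of why the swapped layout helps at small batch: Hopper WGMMA instructions fix the M extent at 64 but allow N in multiples of 8, so a handful of tokens wastes far less of the tile when they sit on N. Because (X·Wᵀ)ᵀ = W·Xᵀ, the swapped GEMM produces the same values, only transposed. The sketch below is a minimal model of the padding alone; the helper names are hypothetical and it does not reflect the kernel's actual tile shapes or dispatch logic.

```python
WGMMA_M = 64       # M extent of one WGMMA instruction on SM90 (fixed)
WGMMA_N_STEP = 8   # N can be chosen in multiples of 8


def ceil_to(x: int, multiple: int) -> int:
    """Round x up to the next multiple."""
    return -(-x // multiple) * multiple


def token_utilization(tokens: int, swap_ab: bool) -> float:
    """Fraction of the token-side tile extent that holds real tokens.

    Without swapAB the tokens sit on M and are padded to a multiple of 64;
    with swapAB they sit on N and are padded to a multiple of 8.
    """
    step = WGMMA_N_STEP if swap_ab else WGMMA_M
    return tokens / ceil_to(tokens, step)


for tokens in (1, 4, 8, 16, 64):
    print(tokens, token_utilization(tokens, False), token_utilization(tokens, True))
```

In this model, 8 tokens fill one eighth of the M extent in the normal layout but the whole N extent when swapped; the two layouts converge once the token count reaches a multiple of 64.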

Changes

  • New SM90 FP4 MegaMoE fused kernel sm90_fp8_fp4_mega_moe_impl
    (deep_gemm.fp8_fp4_mega_moe): FP8 (E4M3) activations × packed FP4 (E2M1)
    expert weights with per-32-K UE8M0 weight scales folded into the FP4→E4M3
    dequant; fused L1 GEMM → SwiGLU → per-token FP8 requant → L2 GEMM → combine.
  • swapAB small-batch path: for small per-rank token counts the grouped GEMM
    runs with A/B swapped, selected by the L1/L2 dispatch ladders.
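The weight format described above can be sketched as a plain-Python reference decoder. This is only an illustration of the number formats, not the kernel's code path: it decodes to Python floats rather than E4M3, the function names are hypothetical, and the nibble order inside each packed byte is an assumption.

```python
# Magnitudes of the eight non-negative E2M1 codes
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_K = 32  # one UE8M0 scale per 32 elements along K


def decode_e2m1(nibble: int) -> float:
    """Decode a 4-bit E2M1 code; bit 3 is the sign."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_ue8m0(byte: int) -> float:
    """UE8M0 is an unsigned power-of-two scale: 2 ** (byte - 127)."""
    return 2.0 ** (byte - 127)


def dequant_row(packed: bytes, scales: bytes) -> list[float]:
    """Dequantize one weight row of K = 2 * len(packed) elements.

    ASSUMPTION: the low nibble of each byte holds the even-indexed element.
    """
    out = []
    for i, byte in enumerate(packed):
        for k, nibble in ((2 * i, byte & 0xF), (2 * i + 1, byte >> 4)):
            out.append(decode_e2m1(nibble) * decode_ue8m0(scales[k // GROUP_K]))
    return out


# 16 packed bytes = 32 FP4 values sharing one scale of 2 ** (128 - 127) = 2.
row = dequant_row(bytes([0x21, 0xF7] + [0] * 14), bytes([128]))
print(row[:4])
```

At 4 bits per weight plus one 8-bit scale per 32 weights, storage works out to 4.25 bits per weight, against 8 bits for FP8 weights before their own scale overhead, which is in line with the roughly 2× traffic reduction quoted above.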

Accuracy (DeepSeek-V4-Flash, 8×H20, swapAB on)

SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 \
SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1 \
SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8192 \
SGLANG_DSV4_FP4_EXPERTS=1 \
GLOO_SOCKET_IFNAME=eth0 \
SGLANG_DEFAULT_THINKING=1 \
sglang serve \
  --trust-remote-code \
  --model-path /data00/models/DeepSeek-V4-Flash \
  --tp 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --enable-dp-lm-head \
  --ep-size 8 \
  --cuda-graph-max-bs 128 \
  --chunked-prefill-size 8192 \
  --mem-fraction-static 0.75 \
  --max-running-requests 128 \
  --tool-call-parser deepseekv4 \
  --reasoning-parser deepseek-v4 \
  --host 0.0.0.0 \
  --moe-runner-backend deep_gemm \
  --moe-a2a-backend deepep \
  --port 30000

sgl-eval run gpqa \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1 \
  2>&1 | tee /sgl-workspace/logs/gpqa_$(date +%Y%m%d_%H%M%S).console.log

== gpqa ==
198 examples x 16 repeats | 11789.7s | 2895 tok/s | 34.1M tokens

  • pass@1[avg-of-16] = 88.32% +/- 1.35% (SEM 0.34%)
    pass@16 = 96.46%
    majority@16 = 90.15%
    no_answer = 0.00%
    stop_rate = 100.00%
    truncated_rate = 0.00%
    error_rate = 0.00%

| Eval | FP4 MegaMoE (swapAB) |
| --- | --- |
| GSM8K (1319) | 0.951 (invalid 0.000) |
| GPQA-diamond (32, thinking) | 0.938 |
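
The GPQA summary figures can be sanity-checked with a little arithmetic. The sketch below assumes the "+/-" value is the standard deviation of pass@1 across the 16 repeats (the tool output does not say so explicitly), in which case the standard error is that value divided by the square root of the repeat count.

```python
import math

n_repeats = 16
std_pct = 1.35  # the "+/-" value; ASSUMED to be a standard deviation

# Standard error of the mean over the repeats.
sem_pct = std_pct / math.sqrt(n_repeats)
print(f"SEM = {sem_pct:.4f}%")

# Throughput implied by the reported token count and wall time.
total_tokens = 34.1e6   # "34.1M tokens", already rounded in the log
wall_seconds = 11789.7
tok_per_s = total_tokens / wall_seconds
print(f"throughput = {tok_per_s:.0f} tok/s")
```

Under that assumption the standard error comes out at about 0.3375%, matching the reported 0.34%, and the implied throughput is within rounding of the reported 2895 tok/s.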

Performance (single-op MegaMoE kernel, 8×H20, bench_kineto)

DeepSeekV4Flash

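| batch per GPU | FP4 (µs) | FP8-LL (µs) | speedup |
| --- | --- | --- | --- |
| 1 | 137.4 | 311.9 | 2.27× |
| 2 | 206.2 | 335.2 | 1.63× |
| 4 | 332.1 | 449.6 | 1.35× |
| 8 | 373.4 | 526.3 | 1.41× |
| 16 | 416.5 | 572.1 | 1.37× |
| 32 | 440.3 | 602.6 | 1.37× |
| 64 | 476.3 | 609.0 | 1.28× |
| 128 | 519.4 | 634.4 | 1.22× |
| 256 | 540.5 | 655.1 | 1.21× |

| batch per GPU | FP4 (µs) | FP8-normal (µs) | speedup |
| --- | --- | --- | --- |
| 1 | 154.7 | 501.1 | 3.24× |
| 8 | 375.1 | 965.8 | 2.57× |
| 32 | 512.7 | 1165.4 | 2.27× |
| 64 | 481.4 | 1150.1 | 2.39× |
| 128 | 518.4 | 1147.7 | 2.21× |
| 256 | 535.8 | 1197.5 | 2.24× |
| 512 | 966.4 | 1256.8 | 1.30× |
| 1024 | 1837.8 | 2241.3 | 1.22× |
| 2048 | 3166.6 | 3741.3 | 1.18× |
| 4096 | 5821.2 | 6771.4 | 1.16× |
| 8192 | 11213.5 | 12821.1 | 1.14× |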

DeepSeekV4Pro

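| batch per GPU | FP4 (µs) | FP8-LL (µs) | speedup |
| --- | --- | --- | --- |
| 1 | 373.0 | 556.4 | 1.49× |
| 2 | 539.6 | 742.1 | 1.38× |
| 4 | 827.7 | 1085.5 | 1.31× |
| 8 | 1213.5 | 1513.2 | 1.25× |
| 16 | 1485.2 | 1781.2 | 1.20× |
| 32 | 1482.5 | 1861.6 | 1.26× |
| 64 | 1521.6 | 1870.0 | 1.23× |
| 128 | 1711.8 | 1898.0 | 1.11× |
| 192 | 1900.8 | 1919.5 | 1.01× |
| 256 | 1805.0 | 1949.8 | 1.08× |

| batch per GPU | FP4 (µs) | FP8-normal (µs) | speedup |
| --- | --- | --- | --- |
| 1 | 384.7 | 969.5 | 2.52× |
| 8 | 1210.0 | 2948.8 | 2.44× |
| 32 | 1486.5 | 3560.3 | 2.40× |
| 64 | 1539.8 | 3542.3 | 2.30× |
| 128 | 1734.8 | 3596.0 | 2.07× |
| 256 | 1804.8 | 3629.3 | 2.01× |
| 512 | 3408.9 | 3726.4 | 1.09× |
| 1024 | 5056.5 | 5527.3 | 1.09× |
| 2048 | 8733.5 | 9310.6 | 1.07× |
| 4096 | 15052.6 | 15987.0 | 1.06× |
| 8192 | 28918.0 | 30540.9 | 1.06× |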
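
The speedup column appears to be the FP8 baseline latency divided by the FP4 latency. A quick check over a few rows copied from the results above (the row labels and the `speedup` helper are only for illustration):

```python
def speedup(fp4_us: float, fp8_us: float) -> float:
    """Baseline latency divided by FP4 latency; above 1.0 means FP4 is faster."""
    return fp8_us / fp4_us


# (label, FP4 µs, FP8 baseline µs, reported speedup)
rows = [
    ("Flash vs FP8-LL, batch 1", 137.4, 311.9, 2.27),
    ("Flash vs FP8-LL, batch 256", 540.5, 655.1, 1.21),
    ("Pro vs FP8-LL, batch 192", 1900.8, 1919.5, 1.01),
    ("Pro vs FP8-normal, batch 8192", 28918.0, 30540.9, 1.06),
]

for label, fp4_us, fp8_us, reported in rows:
    computed = speedup(fp4_us, fp8_us)
    print(f"{label}: computed {computed:.3f}, reported {reported:.2f}")
```

The computed ratios agree with the reported values to two decimals, with the smallest margin at Pro batch 192, where the two kernels are nearly level.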

yinding and others added 3 commits June 30, 2026 17:13
Add the SM90 FP8xFP4 MegaMoE runtime, kernel path, Python API, Hopper correctness and benchmark coverage, tuned runtime decode heuristics, swapAB support, synchronization/spill fixes, and the SM90 MegaMoE alignment export.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>