SM90 (Hopper) FP4 MegaMoE fused kernel with swapAB small-batch path by qiushixiaoyu · Pull Request #53 · sgl-project/DeepGEMM

qiushixiaoyu · 2026-06-29T03:16:34Z

This PR adds an FP4-weight MegaMoE fused kernel that:

- cuts expert-weight memory traffic ~2× vs FP8 by using packed FP4 (E2M1) weights, and
- adds a swapAB tiling for small batches (weights on the WGMMA M dimension, tokens on N), which is more efficient when tokens-per-rank is small.
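The ~2× traffic reduction follows from storage size: an E2M1 value takes 4 bits, so two weights fit in one byte, where FP8 needs one byte per weight. The sketch below illustrates that packing in NumPy. The helper names and the low-nibble-first order are assumptions made for illustration only; the kernel's actual packed layout is not described here.

```python
import numpy as np

# The 8 non-negative magnitudes of E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def pack_fp4(codes):
    """Pack 4-bit codes (0..15) two per byte. Low nibble first is an assumed order."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.shape[-1] % 2 == 0, "need an even number of codes per row"
    return (codes[..., 0::2] | (codes[..., 1::2] << 4)).astype(np.uint8)

def unpack_fp4(packed):
    """Inverse of pack_fp4: split each byte back into two 4-bit codes."""
    packed = np.asarray(packed, dtype=np.uint8)
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=-1).reshape(*packed.shape[:-1], -1)

def decode_e2m1(codes):
    """Map 4-bit E2M1 codes to float32 values (bit 3 is the sign)."""
    codes = np.asarray(codes, dtype=np.uint8)
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    return sign * E2M1_MAGNITUDES[codes & 0x7]

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(4, 64), dtype=np.uint8)  # one code per weight
packed = pack_fp4(codes)

fp8_bytes = codes.size        # FP8 stores one byte per weight
fp4_bytes = packed.nbytes     # packed FP4 stores two weights per byte
print(fp8_bytes, fp4_bytes)   # 256 128
```

Scale factors add a small overhead on top of the packed weights, which is why the saving is approximate rather than exactly 2×.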

Changes

- New SM90 FP4 MegaMoE fused kernel sm90_fp8_fp4_mega_moe_impl (deep_gemm.fp8_fp4_mega_moe): FP8 (E4M3) activations × packed FP4 (E2M1) expert weights, with per-32-K UE8M0 weight scales folded into the FP4→E4M3 dequant; fused L1 GEMM → SwiGLU → per-token FP8 requant → L2 GEMM → combine.
- swapAB small-batch path: for small per-rank token counts the grouped GEMM runs with A/B swapped, selected by the L1/L2 dispatch ladders.
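A NumPy reference may help make the two ideas above concrete. It is only a sketch under stated assumptions: it dequantizes to float32 rather than E4M3, uses unpacked 4-bit codes, and the function names are hypothetical, not part of deep_gemm. The UE8M0 scale is an exponent-only byte decoded as 2^(e − 127), applied once per block of 32 elements along K. The swapAB path relies on the identity X·Wᵀ = (W·Xᵀ)ᵀ, which lets the large weight dimension occupy the M side of the matrix instruction while the small token count sits on N.

```python
import numpy as np

E2M1 = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=np.float32)

def dequant_fp4(codes, scales_ue8m0, block_k=32):
    """Reference dequant (hypothetical helper, not the kernel's code path).

    codes:        (N, K) uint8 array of 4-bit E2M1 codes, one per element.
    scales_ue8m0: (N, K // block_k) uint8 biased exponents, one per K-block.
    """
    codes = np.asarray(codes, dtype=np.uint8)
    scales_ue8m0 = np.asarray(scales_ue8m0, dtype=np.uint8)
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    vals = sign * E2M1[codes & 0x7]
    scale = np.exp2(scales_ue8m0.astype(np.float32) - 127.0)  # UE8M0: 2**(e - 127)
    return vals * np.repeat(scale, block_k, axis=1)           # broadcast per K-block

def gemm_normal(x, w):
    """Tokens on M: (T, K) @ (K, N) -> (T, N)."""
    return x @ w.T

def gemm_swap_ab(x, w):
    """Weights on M: (N, K) @ (K, T) -> (N, T), transposed back to (T, N)."""
    return (w @ x.T).T

rng = np.random.default_rng(0)
T, N, K = 4, 128, 64                      # few tokens, many output channels
codes = rng.integers(0, 16, size=(N, K), dtype=np.uint8)
scales = rng.integers(125, 130, size=(N, K // 32), dtype=np.uint8)
x = rng.standard_normal((T, K))           # stand-in for FP8 activations

w = dequant_fp4(codes, scales)
out_normal = gemm_normal(x, w)
out_swapped = gemm_swap_ab(x, w)
print(np.allclose(out_normal, out_swapped))
```

Both paths compute the same product; which one is faster is a tiling question that the dispatch ladders decide at runtime.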

Accuracy (DeepSeek-V4-Flash, 8×H20, swapAB on)

SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1 \
SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1 \
SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=8192 \
SGLANG_DSV4_FP4_EXPERTS=1 \
GLOO_SOCKET_IFNAME=eth0 \
SGLANG_DEFAULT_THINKING=1 \
sglang serve \
  --trust-remote-code \
  --model-path /data00/models/DeepSeek-V4-Flash \
  --tp 8 \
  --dp-size 8 \
  --enable-dp-attention \
  --enable-dp-lm-head \
  --ep-size 8 \
  --cuda-graph-max-bs 128 \
  --chunked-prefill-size 8192 \
  --mem-fraction-static 0.75 \
  --max-running-requests 128 \
  --tool-call-parser deepseekv4 \
  --reasoning-parser deepseek-v4 \
  --host 0.0.0.0 \
  --moe-runner-backend deep_gemm \
  --moe-a2a-backend deepep \
  --port 30000

sgl-eval run gpqa \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1 \
  2>&1 | tee /sgl-workspace/logs/gpqa_$(date +%Y%m%d_%H%M%S).console.log

== gpqa ==
198 examples x 16 repeats | 11789.7s | 2895 tok/s | 34.1M tokens

pass@1[avg-of-16] = 88.32% +/- 1.35% (SEM 0.34%)
pass@16 = 96.46%
majority@16 = 90.15%
no_answer = 0.00%
stop_rate = 100.00%
truncated_rate = 0.00%
error_rate = 0.00%
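The two uncertainty figures on the pass@1 line appear to be related by the usual standard-error formula, SEM = σ / √n. The check below assumes that the "+/- 1.35%" value is the standard deviation across the 16 repeats; the eval tool's exact definition is not stated here.

```python
import math

def standard_error(std_pct: float, n_repeats: int) -> float:
    """Standard error of the mean over n_repeats independent runs."""
    return std_pct / math.sqrt(n_repeats)

# Figures taken from the gpqa summary above: +/- 1.35% over 16 repeats.
sem = standard_error(1.35, 16)
print(f"SEM = {sem:.2f}%")  # SEM = 0.34%
```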

Eval	FP4 MegaMoE (swapAB)
GSM8K (1319)	0.951 (invalid 0.000)
GPQA-diamond (32, thinking)	0.938

Performance (single-op MegaMoE kernel, 8×H20, bench_kineto)

DeepSeekV4Flash

batch/GPU	FP4 µs	FP8-LL µs	speedup
1	137.4	311.9	2.27×
2	206.2	335.2	1.63×
4	332.1	449.6	1.35×
8	373.4	526.3	1.41×
16	416.5	572.1	1.37×
32	440.3	602.6	1.37×
64	476.3	609.0	1.28×
128	519.4	634.4	1.22×
256	540.5	655.1	1.21×

batch/GPU	FP4 µs	FP8-normal µs	speedup
1	154.7	501.1	3.24×
8	375.1	965.8	2.57×
32	512.7	1165.4	2.27×
64	481.4	1150.1	2.39×
128	518.4	1147.7	2.21×
256	535.8	1197.5	2.24×
512	966.4	1256.8	1.30×
1024	1837.8	2241.3	1.22×
2048	3166.6	3741.3	1.18×
4096	5821.2	6771.4	1.16×
8192	11213.5	12821.1	1.14×

DeepSeekV4Pro

batch/GPU	FP4 µs	FP8-LL µs	speedup
1	373.0	556.4	1.49×
2	539.6	742.1	1.38×
4	827.7	1085.5	1.31×
8	1213.5	1513.2	1.25×
16	1485.2	1781.2	1.20×
32	1482.5	1861.6	1.26×
64	1521.6	1870.0	1.23×
128	1711.8	1898.0	1.11×
192	1900.8	1919.5	1.01×
256	1805.0	1949.8	1.08×

batch/GPU	FP4 µs	FP8-normal µs	speedup
1	384.7	969.5	2.52×
8	1210.0	2948.8	2.44×
32	1486.5	3560.3	2.40×
64	1539.8	3542.3	2.30×
128	1734.8	3596.0	2.07×
256	1804.8	3629.3	2.01×
512	3408.9	3726.4	1.09×
1024	5056.5	5527.3	1.09×
2048	8733.5	9310.6	1.07×
4096	15052.6	15987.0	1.06×
8192	28918.0	30540.9	1.06×
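The speedup column in the tables above is the baseline latency divided by the FP4 latency. The snippet below recomputes it for a few rows copied from those tables; the dictionary layout and function name are only illustrative.

```python
# (model, baseline, batch per GPU) -> (FP4 µs, baseline µs), copied from the tables above.
ROWS = {
    ("Flash", "FP8-LL", 1): (137.4, 311.9),
    ("Flash", "FP8-normal", 8192): (11213.5, 12821.1),
    ("Pro", "FP8-LL", 192): (1900.8, 1919.5),
    ("Pro", "FP8-normal", 1): (384.7, 969.5),
}

def speedup(fp4_us: float, baseline_us: float) -> float:
    """How many times faster the FP4 kernel is than the baseline."""
    return baseline_us / fp4_us

for (model, baseline, batch), (fp4_us, base_us) in ROWS.items():
    print(f"{model:5s} {baseline:10s} batch={batch:<5d} {speedup(fp4_us, base_us):.2f}x")
```

Across both models the gain is largest at small batches and shrinks toward roughly 1.1× at the largest batches measured.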

Add the SM90 FP8xFP4 MegaMoE runtime, kernel path, Python API, Hopper correctness and benchmark coverage, tuned runtime decode heuristics, swapAB support, synchronization/spill fixes, and the SM90 MegaMoE alignment export.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

qiushixiaoyu force-pushed the fp4_swapAB branch from e354e30 to 382bf64 on June 29, 2026, 08:37

Fridge003 force-pushed the dev branch from 77c9522 to 731e7c7 on June 30, 2026, 00:43

yinding and others added 3 commits June 30, 2026 17:13

Add SM90 FP4 MegaMoE implementation

60e77b5

Add the SM90 FP8xFP4 MegaMoE runtime, kernel path, Python API, Hopper correctness and benchmark coverage, tuned runtime decode heuristics, swapAB support, synchronization/spill fixes, and the SM90 MegaMoE alignment export.

Avoid L1 N32 swapAB bucket for FP4 MegaMoE

8e765eb

FP4 swapAB L1: align routing-weight fold order with non-swap path

2fa29dd

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

qiushixiaoyu force-pushed the fp4_swapAB branch from 382bf64 to 2fa29dd on June 30, 2026, 09:16

Fix SM90 FP4 MegaMoE build against post-deepseek-ai#364 signatures

c5920e9

qiushixiaoyu wants to merge 4 commits into sgl-project:dev from qiushixiaoyu:fp4_swapAB