SM90 (Hopper) FP4 MegaMoE fused kernel with swapAB small-batch path #53
Open
qiushixiaoyu wants to merge 4 commits into
Add the SM90 FP8xFP4 MegaMoE runtime, kernel path, Python API, Hopper correctness and benchmark coverage, tuned runtime decode heuristics, swapAB support, synchronization/spill fixes, and the SM90 MegaMoE alignment export.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This PR adds an FP4-weight MegaMoE fused kernel that:
tokens on N), which is more efficient when tokens-per-rank is small.
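As a minimal sketch of the swapAB idea (not the kernel's actual code), the snippet below checks the identity it relies on: computing the GEMM with weights as the first operand and tokens as the second yields the transpose of the usual result. The helper names `matmul_nt` and `transpose` are hypothetical and used only for illustration. On Hopper, the WGMMA M tile is fixed at 64 while N can be as small as 8, which is presumably why placing a small token count on N wastes less of the tile.

```python
def matmul_nt(a, b):
    """Return a @ b.T for row-major nested lists (both operands are K-major)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def transpose(m):
    """Transpose a nested-list matrix."""
    return [list(col) for col in zip(*m)]


# 2 tokens with K = 3 (small batch), 4 output features with K = 3.
tokens = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0, 1.0],
           [0.0, 1.0, 0.0],
           [1.0, 1.0, 1.0],
           [2.0, 0.0, 0.0]]

# Conventional layout: tokens on M, output features on N -> shape [2, 4].
normal = matmul_nt(tokens, weights)

# swapAB layout: output features on M, tokens on N -> shape [4, 2].
swapped = matmul_nt(weights, tokens)

# The swapped result is exactly the transpose of the conventional one.
assert transpose(swapped) == normal
```

The swapped result only needs a transposed store (or a transposed consumer) to match the conventional layout, so the choice between the two paths can be made purely on tile efficiency.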
Changes
sm90_fp8_fp4_mega_moe_impl (deep_gemm.fp8_fp4_mega_moe): FP8 (E4M3) activations × packed FP4 (E2M1) expert weights, with per-32-K UE8M0 weight scales folded into the FP4→E4M3 dequant; fused L1 GEMM → SwiGLU → per-token FP8 requant → L2 GEMM → combine.
runs with A/B swapped, selected by the L1/L2 dispatch ladders.
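The weight format described above can be illustrated with a small host-side reference, offered as a sketch under stated assumptions rather than the kernel's implementation. It decodes packed E2M1 nibbles and applies one UE8M0 scale (a power of two, `2 ** (byte - 127)`) per group of 32 elements along K. The function names and the low-nibble-first packing order are assumptions made for this example; the PR does not state the packing order.

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_ue8m0(byte):
    """Decode a UE8M0 scale: an unsigned 8-bit exponent with bias 127."""
    return 2.0 ** (byte - 127)


def dequant_row(packed, scales, group=32):
    """Dequantize one K-row of packed FP4 weights.

    packed: bytes, two FP4 values per byte (ASSUMPTION: low nibble first).
    scales: one UE8M0 byte per `group` consecutive K elements.
    """
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            k = len(out)  # index of this element along K
            out.append(decode_e2m1(nibble) * decode_ue8m0(scales[k // group]))
    return out


# 64 K elements in 32 bytes: the first 32 elements use scale 2^1, the rest 2^-1.
row = dequant_row(bytes([0x21] * 32), scales=[128, 126])
```

Because each scale is a power of two, applying it amounts to an exponent adjustment, which is one plausible reading of how the scale can be folded into the FP4→E4M3 conversion instead of being applied as a separate multiply.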
Accuracy (DeepSeek-V4-Flash, 8×H20, swapAB on)
sgl-eval run gpqa \
  --n-repeats 16 --max-tokens 200000 \
  --temperature 1.0 --top-p 1.0 --thinking \
  --out-dir /sgl-workspace/logs \
  --base-url http://localhost:30000/v1 \
  2>&1 | tee /sgl-workspace/logs/gpqa_$(date +%Y%m%d_%H%M%S).console.log
== gpqa ==
198 examples x 16 repeats | 11789.7s | 2895 tok/s | 34.1M tokens
pass@16 = 96.46%
majority@16 = 90.15%
no_answer = 0.00%
stop_rate = 100.00%
truncated_rate = 0.00%
error_rate = 0.00%
Performance (single-op MegaMoE kernel, 8×H20, bench_kineto)
DeepSeekV4Flash
DeepSeekV4Pro