feat: support sm90_fp8_fp4 kernel by zhangxiaolei123456 · Pull Request #332 · deepseek-ai/DeepGEMM

zhangxiaolei123456 · 2026-05-11T04:22:45Z

Contiguous kernel
Direct FP32 B-scale case: b.second shape = [groups, N, K/128]

groups	m/group	n	k	W4 us	W4 GB/s	FP8 us	FP8 GB/s	FP8 diff	Speedup
8	256	4096	7168	529	296	512	536	0.0335	0.97x
8	512	4096	7168	1036	182	925	331	0.0338	0.89x
8	1024	4096	7168	1978	128	1884	196	0.0338	0.95x
8	2048	4096	7168	3875	98	3643	137	0.0336	0.94x
16	256	4096	7168	1041	301	927	592	0.0339	0.89x
16	512	4096	7168	1988	190	1890	324	0.0335	0.95x
16	1024	4096	7168	3872	130	3657	202	0.0339	0.94x
16	2048	4096	7168	7689	99	7174	139	0.0337	0.93x
24	256	4096	7168	1489	316	1408	584	0.0336	0.95x
24	512	4096	7168	2947	192	2781	330	0.0337	0.94x
24	1024	4096	7168	5762	131	5426	205	0.0340	0.94x
24	2048	4096	7168	11379	100	10608	141	0.0337	0.93x
32	256	4096	7168	1979	317	1882	583	0.0339	0.95x
32	512	4096	7168	3852	196	3637	337	0.0338	0.94x
32	1024	4096	7168	7630	132	7126	208	0.0339	0.93x
32	2048	4096	7168	15137	100	14094	141	0.0338	0.93x
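
The Speedup column is not defined in the description. A minimal sketch, assuming it is the FP8 time divided by the W4 time (so values below 1.0 mean the W4 kernel is slower than the FP8 baseline); the rows are copied from the table above and the helper name `speedup` is only illustrative:

```python
# Rows copied from the contiguous-kernel table: (W4 us, FP8 us, reported speedup).
rows = [
    (529, 512, 0.97),
    (1036, 925, 0.89),
    (1978, 1884, 0.95),
    (3875, 3643, 0.94),
]


def speedup(w4_us: float, fp8_us: float) -> float:
    """Assumed definition: FP8 time over W4 time (< 1.0 means W4 is slower)."""
    return fp8_us / w4_us


for w4_us, fp8_us, reported in rows:
    computed = speedup(w4_us, fp8_us)
    print(f"W4 {w4_us} us, FP8 {fp8_us} us -> {computed:.2f}x (table: {reported:.2f}x)")
```

Under this assumption the contiguous rows reproduce to two decimals. Some masked-kernel rows can differ in the last digit, presumably because the listed timings are already rounded to whole microseconds.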

Masked kernel
Direct FP32 B-scale case: b.second shape = [groups, N, K/128]

groups	m/group	n	k	W4 us	W4 GB/s	FP8 us	FP8 GB/s	FP8 diff	Speedup
8	1	4096	7168	97	1289	134	1803	0.0344	1.39x
8	4	4096	7168	97	1287	134	1817	0.0350	1.37x
8	8	4096	7168	97	1301	133	1824	0.0344	1.38x
8	16	4096	7168	101	1251	134	1829	0.0337	1.32x
8	32	4096	7168	113	1142	133	1846	0.0344	1.18x
8	1	7168	2048	52	1195	79	1540	0.0363	1.50x
8	4	7168	2048	53	1196	79	1548	0.0345	1.49x
8	8	7168	2048	52	1219	79	1555	0.0340	1.51x
8	16	7168	2048	56	1161	79	1567	0.0348	1.42x
8	32	7168	2048	60	1110	79	1596	0.0347	1.31x

direct E8M0 B scale case: b.second shape = [groups, N, K/32]

groups	m/group	n	k	W4 us	W4 GB/s	FP8 us	FP8 GB/s	FP8 diff	Speedup
8	1	4096	7168	130	1127	132	1843	0.0344	1.01x
8	4	4096	7168	131	1122	131	1849	0.0351	1.00x
8	8	4096	7168	131	1125	132	1849	0.0344	1.00x
8	16	4096	7168	161	925	131	1858	0.0337	0.82x
8	1	7168	2048	68	1086	77	1566	0.0349	1.14x
8	4	7168	2048	68	1085	77	1573	0.0356	1.13x
8	8	7168	2048	68	1097	77	1584	0.0346	1.14x
8	16	7168	2048	80	947	78	1589	0.0352	0.97x
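
To make the two B-scale layouts above concrete, here is a small sketch that derives the scale-tensor shapes for the benchmarked n=4096, k=7168 case and decodes an E8M0 scale. It assumes the usual OCP MX convention for E8M0 (value 2^(e-127), 0xFF reserved for NaN) and one byte per E8M0 scale; the helper names are hypothetical and are not part of DeepGEMM:

```python
import math


def b_scale_shape(groups: int, n: int, k: int, gran_k: int) -> tuple:
    """Shape of the weight-scale tensor: one scale per gran_k elements along K."""
    assert k % gran_k == 0, "K must be a multiple of the scale granularity"
    return (groups, n, k // gran_k)


def e8m0_to_float(byte: int) -> float:
    """Decode an E8M0 scale (assumed OCP MX convention): 2 ** (byte - 127)."""
    if byte == 0xFF:
        return math.nan  # 0xFF is reserved for NaN
    return 2.0 ** (byte - 127)


fp32_shape = b_scale_shape(8, 4096, 7168, gran_k=128)  # FP32 scales, K/128
e8m0_shape = b_scale_shape(8, 4096, 7168, gran_k=32)   # E8M0 scales, K/32

fp32_bytes = math.prod(fp32_shape) * 4  # 4 bytes per FP32 scale
e8m0_bytes = math.prod(e8m0_shape) * 1  # 1 byte per E8M0 scale (assumption)

print(fp32_shape, fp32_bytes)
print(e8m0_shape, e8m0_bytes)
print(e8m0_to_float(127), e8m0_to_float(130))
```

With these assumptions the E8M0 layout holds four times as many scales, each a quarter of the size, so the two scale tensors have the same byte footprint.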


JoyFuture · 2026-06-22T04:26:10Z

Hi, thanks for the great work on the SM90 FP8xFP4 kernels.

I have a question about the contiguous grouped GEMM prefill path. Some MXFP4 MoE models, such as DeepSeek-V4 and MiMoV2.5, use FP4 weight scales with K-group size 32, while the current SM90 FP8xFP4 contiguous grouped GEMM seems to mainly target gran_k_b=128.

Is there any plan to support gran_k_b=32 for m_grouped_fp8_fp4_gemm_nt_contiguous_sm90_fused_wgmma?
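
For reference, the difference between the two granularities in the question can be sketched with a plain-Python dequantization of one weight row. This is an illustration only, not the kernel's code path: it assumes the standard FP4 E2M1 value set (0, 0.5, 1, 1.5, 2, 3, 4, 6 with a sign bit), unpacked 4-bit codes, and hypothetical helper names.

```python
# Magnitudes representable by FP4 E2M1, indexed by the low 3 bits of the code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 pick the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_row(codes, scales, gran_k):
    """Dequantize one weight row: element i is scaled by scales[i // gran_k]."""
    assert len(codes) == len(scales) * gran_k, "need one scale per gran_k elements"
    return [decode_e2m1(c) * scales[i // gran_k] for i, c in enumerate(codes)]


# Toy row with K = 128: 1.0 for the first 64 elements, -6.0 for the last 64.
codes = [0x2] * 64 + [0xF] * 64

# gran_k = 128: a single scale covers the whole row.
coarse = dequantize_row(codes, [0.5], gran_k=128)

# gran_k = 32: four scales, so each 32-element block can be scaled independently.
fine = dequantize_row(codes, [0.5, 0.5, 0.25, 0.25], gran_k=32)

print(coarse[0], coarse[127])
print(fine[0], fine[127])
```

The only change between the two calls is how many consecutive K elements share a scale, which is the `gran_k_b` distinction raised above.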

zhangxiaolei123456 added 28 commits May 11, 2026 11:15

Create test_sm90_fp8_fp4.py

6e12aeb

Create sm90_fp8_fp4_gemm_1d1d.cuh

b118d88

Create sm90_fp8_fp4_gemm_1d2d.cuh

561cf21

Update gemm.hpp

4122a7d

Update __init__.py

b3d92e0

Update __init__.py

825cbdb

Update test_sm90_fp8_fp4.py

9625f4f

Merge pull request #1 from zhangxiaolei123456/zhangxiaolei123456-patch-1

6a3b1ee

sm90 fp8 fp4 1d2d kernel

Create sm90_fp8_fp4_gemm_1d2d.hpp

043f007

Update sm90_fp8_fp4_gemm_1d2d.hpp

eebc4f4

Update ld_st.cuh

48dec94

Update sm90_fp8_fp4_gemm_1d2d.cuh

2520585

Update test_sm90_fp8_fp4.py

214baaf

Update test_sm90_fp8_fp4.py

4392c08

Update gemm.hpp

83cd196

Update gemm.hpp

02ff071

Update layout.hpp

2f98946

Update sm90_fp8_fp4_gemm_1d2d.hpp

d2e9012

Create sm90_fp8_fp4_gemm_1d2d_rs.hpp

307455a

Update smxx_layout.hpp

d7f3149

Update layout.hpp

79a5c8c

Update __init__.py

08d51f2

Update sm90_fp8_fp4_gemm_1d2d.cuh

8542680

Create sm90_fp8_fp4_gemm_1d2d_rs.cuh

698155f

Update sm90.cuh

fafbce3

Update math.py

53c15be

Update test_sm90_fp8_fp4.py

87debb1

Update test_sm90_fp8_fp4.py

f05e493

zhangxiaolei123456 mentioned this pull request May 20, 2026

[feat] DeepSeek V4 support W4A8(MXFP4FP8) on hopper sgl-project/sglang#25905

Open

5 tasks

Create test_sm90_int4_a8.py

f784be7

zhangxiaolei123456 added 12 commits May 27, 2026 13:42

Update gemm.hpp

b6aa5d7

Update layout.hpp

31a6971

Update sm90_fp8_fp4_gemm_1d2d.hpp

30b26f4

Update sm90_fp8_fp4_gemm_1d2d_rs.hpp

8d4894f

Update smxx_layout.hpp

29cfbf6

Update layout.hpp

8d8aaf6

Update math.py

0603791

Update test_sm90_fp8_fp4.py

48689a1

Update sm90_fp8_fp4_gemm_1d2d_rs.hpp

720da9f

Update sm90_fp8_fp4_gemm_1d2d_rs.cuh

180e71c

Support redundant expert groups in FP4 fast path

40c4fb2

Merge pull request #2 from zhangxiaolei123456/fix/dsv4-num-groups-36

a3646af

Support redundant expert groups in FP4 fast path

zhangxiaolei123456 wants to merge 41 commits into deepseek-ai:main from zhangxiaolei123456:main_hopper_fp8_fp4