Skip to content

[CUDA] Add SM120 NVF4 block-scale MMA support#2364

Open
qqq-tao wants to merge 1 commit into
tile-ai:mainfrom
qqq-tao:nvf4-block-scale-sm120-pr-clean
Open

[CUDA] Add SM120 NVF4 block-scale MMA support#2364
qqq-tao wants to merge 1 commit into
tile-ai:mainfrom
qqq-tao:nvf4-block-scale-sm120-pr-clean

Conversation

@qqq-tao

@qqq-tao qqq-tao commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR adds the SM120 NVFP4 block-scaled GEMM path to TileLang and wires the scale-source contract end to end:

BF16 A/B -> TileLang NVFP4 quantizer -> packed FP4 + blockscaled_chunk_kmajor SFA/SFB -> TileLang NVFP4 GEMM

The public scale layout name is blockscaled_chunk_kmajor, matching the CUTLASS BlockScaledBasicChunk K-major source layout. The same contract applies to SFA and SFB.

What Changed

  • Added SM120 NVFP4 block-scaled MMA lowering for E2M1 A/B operands, UE4M3 scale factors, and FP32 accumulation.
  • Added the TileLang-facing T.mma_gemm_blockscaled(...) path and the SM120 CUDA helper layer used by the GEMM emitter.
  • Added reusable NVFP4 quantization/packing utilities in tilelang.quantize, including BF16 activation quantization to packed FP4 plus uint32[rows, K/64] scale-source tensors.
  • Promoted blockscaled_chunk_kmajor as the only user-facing scale-source layout for this path, aligned with CUTLASS BlockScaledBasicChunk K-major.
  • Updated the SM120 NVFP4 GEMM example so bf16_quantized input consumes TileLang quantizer output directly instead of relying on a benchmark-only host swizzle path.
  • Added focused layout, quantizer, and GEMM correctness coverage for fixed offsets, random binary scale-byte roundtrip, UE4M3 edge cases, TileLang quantizer output, and GEMM verification.

Validation

Remote validation was run on RTX5090 SM120.

pytest testing/python/quantize/test_tilelang_quantize_nvfp4_scale_layout.py -q
30 passed

GEMM correctness smoke:

512^3, input-mode=bf16_quantized: TileLang correctness passed
512^3, input-mode=random, scale-mode=random_binary: TileLang correctness passed

CUTLASS alignment check for the quantizer-produced tensors:

CUTLASS verify: Passed
cutlass_vs_cutlass_host exact True
cutlass_vs_torch_semantic exact True

Performance Status

Current large-shape timing on RTX 5090 SM120:

M=N=K=8192
TileLang tile: 128x128x256
input-mode=random, scale-mode=constant
producer_regs=40, consumer_regs=224

TileLang latency: 0.8097 ms
TileLang FLOPS: 1357.84 TFLOPS

This updates the older PR-clean 8192^3 number of about 0.8724 ms / 1260.26 TFLOPS. It is a benchmark status update, not a claim that every varying-scale workload now matches CUTLASS/CuTe performance architecture.

Notes

The main user path is now intentionally small: use the TileLang NVFP4 quantizer and feed its blockscaled_chunk_kmajor scale tensors directly to the TileLang SM120 NVFP4 GEMM. Further mainloop and scale-copy performance work can stay inside the SM120 lowering/helper layer without adding new user-facing scale-layout names.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds SM120 NVF4 block-scale MMA support end-to-end: TL builtin intrinsic, PTX/mma template, CodeGen emission, 4-bit layout transforms and emitter wiring, CUTLASS CUDA reference kernel, TileLang correctness harness with packing/decoding utilities, and CUDA-gated tests.

Changes

NVF4 Block-Scale MMA Support

Layer / File(s) Summary
Compiler op and TIR intrinsic surface
src/op/builtin.{cc,h}, tilelang/language/ast/ir.py, tilelang/language/tir/ir.py, tilelang/language/tir/op.py
Register tl.ptx_mma_block_scale as a TL builtin with 17 inputs and expose dtype-forwarding TIR/AST wrappers plus an op wrapper that converts metadata into StringImm/IntImm before invoking the intrinsic.
PTX instruction template and config
src/tl_templates/cuda/gemm_sm120.h
Define SM120 enums and SM120MmaBlockScaledConfig with a specialization enabling exactly kMxf4nvf4/scale-vec-4/kUE4M3, add inline-PTX device implementation for mma.sync.aligned.m16n8k64, and provide a guarded sm120_mma_sync_blockscaled template wrapper.
CUDA codegen emission
src/cuda/codegen/codegen_cuda.cc
Extend CodeGenTileLangCUDA to recognize tl.ptx_mma_block_scale (17 args), validate the supported configuration, rewrite fp4-packed A/B buffer operands (halving offsets where packed), and emit tl::sm120_mma_sync_blockscaled with template argument substitution.
Layout transforms and MMA emitter
tilelang/cuda/intrinsics/layout/mma_layout.py, tilelang/cuda/intrinsics/macro/mma_macro_generator.py, tilelang/cuda/intrinsics/{macro/,}__init__.py, tilelang/intrinsics/__init__.py
Add shared↔MMA layout mappings for 4-bit A/B operands and inverse load helpers; introduce BlockScaleMmaConfig registry; implement TensorCoreIntrinEmitterWithBlockScale subclass that overrides ldmatrix_a/ldmatrix_b and emits per-warp ptx_mma_block_scale calls using explicit scale buffers and computed packed K indices; re-export emitter through package initializers.
CUTLASS reference kernel
maint/gemm/cutlass_nvf4_ref.cu
Add a PyTorch-bound CUTLASS SM120 block-scaled GEMM entry cutlass_nvf4_gemm_128x128x256 that validates tensor devices/contiguity and exact packed sizes, constructs CUTLASS problem/stride/layout views including SFA/SFB, checks can_implement, allocates workspace, runs the kernel, and synchronizes.
TileLang evaluation harness
maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py
Add a correctness-evaluation script: TileLang NVF4 primfunc generator with optional swizzled shared layouts, FP4/UE4M3 packing/repacing utilities for TileLang vs CUTLASS, FP4 decode to float32 reference, CUTLASS extension builder/loader, execution of both implementations, metric printing, and optional strict equality assertion.
Tests and correctness validation
testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py
Add CUDA-gated tests: FP4 decode table, swizzle layout helper, NVF4 primfunc generator, input/scale generators, fragment-layout coverage tests, lane-to-atom mapping tests, unsupported-config/type rejection tests, codegen source assertions, and end-to-end correctness tests comparing kernel outputs to a decoded float32 reference with zero tolerance.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • lucifer1004
  • LeiWang1999

Poem

🐰 In silken shared tiles I hop and pack,
Four-bit whispers stitched along each track,
Scales march in words, the warps all hum,
TileLang builds, CUTLASS comes to sum,
A rabbit nods — the matrices align.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.06% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding SM120 NVF4 block-scale MMA support on CUDA.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py`:
- Around line 253-256: The code sets cutlass_root using a hardcoded developer
path which is unsafe; update the logic in the block that defines repo and
cutlass_root (variables cutlass_root, repo and the use of
os.environ["CUTLASS_ROOT"]) to remove the hardcoded "/data/home/..." fallback
and instead: prefer the CUTLASS_ROOT environment variable if set, otherwise fall
back to repo / "3rdparty" / "cutlass"; ensure cutlass_root.exists() is checked
and raise a clear error or log if neither location exists so the failure is
explicit.

In `@tilelang/language/ast/ir.py`:
- Line 2140: The module's __all__ includes "ptx_mma_block_scale" but no symbol
by that name is defined or imported, which breaks wildcard imports; either
remove "ptx_mma_block_scale" from the __all__ list or add a proper
definition/import for ptx_mma_block_scale (e.g., define the function/class or
import it from its source) so the name is actually bound in this module; update
the __all__ entry near the existing list and ensure the symbol name matches
exactly the defined/imported identifier.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a5bd78f-7a51-446a-bd8a-0078295fbd05

📥 Commits

Reviewing files that changed from the base of the PR and between a3f7093 and a24f9f1.

📒 Files selected for processing (16)
  • maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py
  • maint/gemm/cutlass_nvf4_ref.cu
  • src/cuda/codegen/codegen_cuda.cc
  • src/cuda/codegen/codegen_cuda.h
  • src/op/builtin.cc
  • src/op/builtin.h
  • src/tl_templates/cuda/instruction/mma_block_scale.h
  • testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py
  • tilelang/cuda/intrinsics/__init__.py
  • tilelang/cuda/intrinsics/layout/mma_layout.py
  • tilelang/cuda/intrinsics/macro/__init__.py
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
  • tilelang/intrinsics/__init__.py
  • tilelang/language/ast/ir.py
  • tilelang/language/tir/ir.py
  • tilelang/language/tir/op.py

Comment thread maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py Outdated
Comment thread tilelang/language/ast/ir.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tilelang/cuda/intrinsics/macro/mma_macro_generator.py (2)

1402-1407: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve outer-dimension bases for plain scale buffers.

_scale_region_parts() treats Buffer differently from the A/B paths and drops every prefix dimension by returning ([], 0, 0). The accesses at Lines 1478-1482 and Line 1510 then only index the trailing two axes, which breaks direct N-D scale buffers and sliced views. Reuse _legalize_to_buffer_region() here so scale buffers follow the same region contract as A/B.

Suggested fix
     `@staticmethod`
     def _scale_region_parts(scale_buf: Buffer | BufferRegion):
-        if isinstance(scale_buf, BufferRegion):
-            return scale_buf.buffer, [r.min for r in scale_buf.region[:-2]], scale_buf.region[-2].min, scale_buf.region[-1].min
-        if isinstance(scale_buf, Buffer):
-            return scale_buf, [], 0, 0
-        raise ValueError(f"Unsupported scale buffer type: {type(scale_buf)}")
+        region = TensorCoreIntrinEmitter._legalize_to_buffer_region(scale_buf)
+        return (
+            region.buffer,
+            [r.min for r in region.region[:-2]],
+            region.region[-2].min,
+            region.region[-1].min,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py` around lines 1402 -
1407, _scale_region_parts currently returns (buffer, [], 0, 0) for plain Buffer
which drops outer-dimension bases and breaks N-D scale buffers; change it to
call and reuse _legalize_to_buffer_region(scale_buf) so both Buffer and
BufferRegion follow the same region contract as A/B. Update _scale_region_parts
to accept either type, call _legalize_to_buffer_region when scale_buf is a
Buffer to obtain the buffer and full region, and then extract the same tuples
(buffer, [r.min for r in region[:-2]], region[-2].min, region[-1].min) as for
BufferRegion; keep the existing ValueError for unsupported types. Ensure
references to _scale_region_parts usage (the indexing code that expects
preserved outer bases) continue to work unchanged.

1282-1333: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the fixed NVF4 contract in the constructor.

This emitter always lowers as mxf4nvf4 with k64, e2m1/e2m1, and the block-scale fragment layouts from this class, but it never rejects incompatible a_dtype, b_dtype, or accum_dtype inputs. A caller can currently instantiate this public API with, for example, float16 operands and still get NVF4 PTX emitted against mismatched fragment assumptions. Fail fast here instead of silently generating wrong code.

Suggested guard
     def __init__(
         self,
         a_dtype: str = T.float4_e2m1fn,
         b_dtype: str = T.float4_e2m1fn,
         accum_dtype: str = T.float32,
@@
         kind: str = "mxf4nvf4",
         scale_vec_size: int = 4,
         stype: str = "ue4m3",
     ):
+        if str(DataType(a_dtype)) != str(T.float4_e2m1fn) or str(DataType(b_dtype)) != str(T.float4_e2m1fn):
+            raise ValueError("SM120 block-scale MMA currently only supports float4_e2m1fn operands")
+        if str(DataType(accum_dtype)) != str(T.float32):
+            raise ValueError("SM120 block-scale MMA currently only supports float32 accumulation")
         self.block_scale_config = _get_block_scale_mma_config(kind, scale_vec_size, stype)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py` around lines 1282 -
1333, The constructor currently forces block-scale config to NVF4 but does not
validate the caller-provided dtypes, so callers can pass incompatible
a_dtype/b_dtype/accum_dtype and silently generate wrong code; after calling
_get_block_scale_mma_config(...) in __init__, add a guard that checks the
resolved self.block_scale_config.kind (and/or its expected dtype descriptors
from the config) against the incoming a_dtype, b_dtype, and accum_dtype
parameters (e.g., ensure a_dtype and b_dtype match the NVF4 fragment dtypes such
as T.float4_e2m1fn and accum_dtype matches the expected accumulator like
T.float32 or whatever the config exposes), and raise a ValueError with a clear
message if they mismatch; perform this validation before calling
super().__init__ so invalid combinations fail fast.
maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py (1)

342-354: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The FP32 reference diagnostics are missing the UE4M3 block scales.

ref is built from the decoded FP4 payloads only, so diff_tl_ref and diff_cutlass_ref are not checking the same computation as the block-scaled GEMMs. In the default NVF4_SCALE_MODE="varying" case, those numbers are inherently misleading. Either apply sfa_logical / sfb_logical per 16-K chunk when building ref, or drop these prints until that scale-aware reference exists.

Suggested minimal change
-    ref = _decode_rowmajor_fp4(a, M, K) @ _decode_rowmajor_fp4(b, N, K).T
-    diff_tl_ref = (c_tl - ref).abs()
-    diff_cutlass_ref = (c_cutlass - ref).abs()
     print("scale_mode:", scale_mode)
     print("input_mode:", input_mode)
     print("max_abs_diff:", diff.max().item())
     print("mean_abs_diff:", diff.mean().item())
     print("max_abs_diff_transposed:", diff_t.max().item())
     print("mean_abs_diff_transposed:", diff_t.mean().item())
-    print("max_abs_diff_tilelang_ref:", diff_tl_ref.max().item())
-    print("mean_abs_diff_tilelang_ref:", diff_tl_ref.mean().item())
-    print("max_abs_diff_cutlass_ref:", diff_cutlass_ref.max().item())
-    print("mean_abs_diff_cutlass_ref:", diff_cutlass_ref.mean().item())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py` around lines 342 - 354,
The FP32 reference `ref` is computed from raw decoded FP4 payloads so it doesn't
include the UE4M3 block scales (`sfa_logical`/`sfb_logical`), making
`diff_tl_ref` and `diff_cutlass_ref` invalid under NVF4_SCALE_MODE="varying";
fix by applying the per-block scales to the decoded tensors before matmul: after
calling `_decode_rowmajor_fp4(a, M, K)` and `_decode_rowmajor_fp4(b, N, K)`,
multiply each 16xK chunk of the decoded A by the corresponding entries in
`sfa_logical` and each 16xK chunk of the decoded B by `sfb_logical` (or apply
equivalent broadcasting per 16-K block) so `ref = scaled_decoded_a @
scaled_decoded_b.T` matches the block-scaled GEMM, otherwise remove the
`diff_tl_ref`/`diff_cutlass_ref` prints until the scale-aware reference is
implemented; reference symbols: `ref`, `_decode_rowmajor_fp4`, `sfa_logical`,
`sfb_logical`, `diff_tl_ref`, `diff_cutlass_ref`.
♻️ Duplicate comments (1)
maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py (1)

247-256: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast when the CUTLASS root is missing.

This now avoids the developer-local fallback, but it still passes non-existent include roots into load(). If CUTLASS_ROOT is unset or wrong and 3rdparty/cutlass is absent, the harness fails later with a compiler error instead of an explicit setup error.

Suggested fix
     repo = Path(__file__).resolve().parents[2]
     cutlass_root_env = os.environ.get("CUTLASS_ROOT")
     cutlass_root = Path(cutlass_root_env) if cutlass_root_env else repo / "3rdparty" / "cutlass"
+    if not cutlass_root.exists():
+        raise RuntimeError(
+            f"CUTLASS not found at {cutlass_root}. "
+            "Set CUTLASS_ROOT or ensure 3rdparty/cutlass exists."
+        )
 
     return load(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py` around lines 247 - 256,
Check that the computed cutlass_root (from cutlass_root_env or repo / "3rdparty"
/ "cutlass") actually exists before calling load; if it does not exist, raise a
clear RuntimeError instructing the developer to set CUTLASS_ROOT or populate
3rdparty/cutlass, and only pass existing include paths (cutlass_root / "include"
and cutlass_root / "tools" / "util" / "include") into the load(...) call instead
of blindly passing non-existent paths; refer to the variables cutlass_root_env,
cutlass_root, extra_include_paths and the load(...) call to locate where to add
the existence check and error raise.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py`:
- Around line 342-354: The FP32 reference `ref` is computed from raw decoded FP4
payloads so it doesn't include the UE4M3 block scales
(`sfa_logical`/`sfb_logical`), making `diff_tl_ref` and `diff_cutlass_ref`
invalid under NVF4_SCALE_MODE="varying"; fix by applying the per-block scales to
the decoded tensors before matmul: after calling `_decode_rowmajor_fp4(a, M, K)`
and `_decode_rowmajor_fp4(b, N, K)`, multiply each 16xK chunk of the decoded A
by the corresponding entries in `sfa_logical` and each 16xK chunk of the decoded
B by `sfb_logical` (or apply equivalent broadcasting per 16-K block) so `ref =
scaled_decoded_a @ scaled_decoded_b.T` matches the block-scaled GEMM, otherwise
remove the `diff_tl_ref`/`diff_cutlass_ref` prints until the scale-aware
reference is implemented; reference symbols: `ref`, `_decode_rowmajor_fp4`,
`sfa_logical`, `sfb_logical`, `diff_tl_ref`, `diff_cutlass_ref`.

In `@tilelang/cuda/intrinsics/macro/mma_macro_generator.py`:
- Around line 1402-1407: _scale_region_parts currently returns (buffer, [], 0,
0) for plain Buffer which drops outer-dimension bases and breaks N-D scale
buffers; change it to call and reuse _legalize_to_buffer_region(scale_buf) so
both Buffer and BufferRegion follow the same region contract as A/B. Update
_scale_region_parts to accept either type, call _legalize_to_buffer_region when
scale_buf is a Buffer to obtain the buffer and full region, and then extract the
same tuples (buffer, [r.min for r in region[:-2]], region[-2].min,
region[-1].min) as for BufferRegion; keep the existing ValueError for
unsupported types. Ensure references to _scale_region_parts usage (the indexing
code that expects preserved outer bases) continue to work unchanged.
- Around line 1282-1333: The constructor currently forces block-scale config to
NVF4 but does not validate the caller-provided dtypes, so callers can pass
incompatible a_dtype/b_dtype/accum_dtype and silently generate wrong code; after
calling _get_block_scale_mma_config(...) in __init__, add a guard that checks
the resolved self.block_scale_config.kind (and/or its expected dtype descriptors
from the config) against the incoming a_dtype, b_dtype, and accum_dtype
parameters (e.g., ensure a_dtype and b_dtype match the NVF4 fragment dtypes such
as T.float4_e2m1fn and accum_dtype matches the expected accumulator like
T.float32 or whatever the config exposes), and raise a ValueError with a clear
message if they mismatch; perform this validation before calling
super().__init__ so invalid combinations fail fast.

---

Duplicate comments:
In `@maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py`:
- Around line 247-256: Check that the computed cutlass_root (from
cutlass_root_env or repo / "3rdparty" / "cutlass") actually exists before
calling load; if it does not exist, raise a clear RuntimeError instructing the
developer to set CUTLASS_ROOT or populate 3rdparty/cutlass, and only pass
existing include paths (cutlass_root / "include" and cutlass_root / "tools" /
"util" / "include") into the load(...) call instead of blindly passing
non-existent paths; refer to the variables cutlass_root_env, cutlass_root,
extra_include_paths and the load(...) call to locate where to add the existence
check and error raise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5bd71dcd-3f86-40b2-b9d8-6c494431985b

📥 Commits

Reviewing files that changed from the base of the PR and between a24f9f1 and c3c51c9.

📒 Files selected for processing (6)
  • maint/gemm/correctness_evaluation_nvf4_vs_cutlass.py
  • maint/gemm/cutlass_nvf4_ref.cu
  • src/cuda/codegen/codegen_cuda.cc
  • src/tl_templates/cuda/instruction/mma_block_scale.h
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
  • tilelang/language/ast/ir.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/cuda/codegen/codegen_cuda.cc
  • src/tl_templates/cuda/instruction/mma_block_scale.h
  • maint/gemm/cutlass_nvf4_ref.cu

@Rachmanino

Rachmanino commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Really appreciate your contribution! I think it's better to provide high-level API for blockscaled mma on sm120, like existing T.tcgen05_gemm_blockscaled for sm100 tcgen5? Besides, it may be better to move blockscaled mma instructions to src/tl_templates/cuda/gemm_sm120.h? Would love to discuss if you have better suggestions!

@qqq-tao

qqq-tao commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Thanks, that makes sense.

This PR is currently a low-level SM120 NVF4 MMA emitter. It checks that the new mma.sync instruction works, and that the fragment layout and scale-factor lane mapping are correct.

I agree that users should have a higher-level API, similar to the existing SM100 blockscaled GEMM API. The SM120 version should not copy the TCGEN05/TMEM behavior directly, but it can provide a similar user experience on top of the SM120 warp-level mma.sync path.

For this PR, I can move the SM120 blockscaled MMA helper into gemm_sm120.h, so it sits with the SM120 GEMM code. Then we can add a higher-level tile op API as a follow-up, unless you prefer to include that in this PR.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tilelang/language/tir/op.py (1)

1079-1119: ⚠️ Potential issue | 🟠 Major

Fix tl.ptx_mma_block_scale intrinsic arity: set_num_inputs(21) doesn’t match CUDA lowering expectations.

  • src/op/builtin.cc:244-247 registers tl.ptx_mma_block_scale with .set_num_inputs(21).
  • src/cuda/codegen/codegen_cuda.cc:2852 lowers the intrinsic with ICHECK_EQ(op->args.size(), 17U) and consumes op->args[0..16] (accum_dtype/shape/layouts/k/vec_size/dtypes + A/B/C pointers & offsets + scale_a/scale_b).
  • The Python wrapper language/tir/op.py:1079-1119 passes exactly those 17 call_intrin args, so the registration count should be aligned (likely change .set_num_inputs(21)17).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tilelang/language/tir/op.py` around lines 1079 - 1119, The registered
intrinsic tl.ptx_mma_block_scale has a mismatched arity: the Python wrapper
function ptx_mma_block_scale builds 17 call_intrin arguments and the CUDA
lowering asserts ICHECK_EQ(..., 17), so update the intrinsic registration to use
.set_num_inputs(17) (replace the current .set_num_inputs(21)) so the
registration count matches the call_intrin arguments and the CUDA lowering
expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tilelang/language/tir/op.py`:
- Around line 1079-1119: The registered intrinsic tl.ptx_mma_block_scale has a
mismatched arity: the Python wrapper function ptx_mma_block_scale builds 17
call_intrin arguments and the CUDA lowering asserts ICHECK_EQ(..., 17), so
update the intrinsic registration to use .set_num_inputs(17) (replace the
current .set_num_inputs(21)) so the registration count matches the call_intrin
arguments and the CUDA lowering expectations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 82a9e9cc-83ba-44c4-acfa-79f6cc76f5fa

📥 Commits

Reviewing files that changed from the base of the PR and between cbc83fc and 17d2313.

📒 Files selected for processing (5)
  • src/cuda/codegen/codegen_cuda.cc
  • src/tl_templates/cuda/gemm_sm120.h
  • testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
  • tilelang/language/tir/op.py
💤 Files with no reviewable changes (1)
  • tilelang/cuda/intrinsics/macro/mma_macro_generator.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • testing/python/language/test_tilelang_language_nvf4_mma_block_scale.py

@Rachmanino

Rachmanino commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Nice work! May I ask whether we support TMA load for operands and SFs? If so, we may also illustrate that in our example. Also, I'm curious whether you've considered warp-specialization (either automatic version or handwritten) and its influence on the performance.

@qqq-tao

qqq-tao commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Updated status: this PR now includes TMA loading for packed FP4 A/B and SFA/SFB, plus a warp-specialized SM120a NVFP4 performance path. Please see the latest compact status note for current validation and performance numbers: #2364 (comment)

qqq-tao commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

Current SM120 NVFP4 block-scaled GEMM status after the latest cleanup/update:

This PR adds the first TileLang SM120a NVFP4 block-scaled GEMM path with:

  • TMA loading for packed FP4 A/B operands and SFA/SFB scaling factors.
  • Warp-specialized execution with 1 producer warp group and 2 consumer warp groups.
  • A fixed promoted performance strategy: tile 128x128x256, stages=2, scale_layout=blockscaled_chunk_kmajor.
  • C epilogue through the current reg -> smem -> gmem path.
  • CUTLASS BlockScaledBasicChunk K-major scale-source contract plus TileLang quantizer support for that packed scale layout.

Validation on dev5090_m2, head 1e1576fb:

pre-commit targeted files: passed
pytest focused SM120/quantizer/example set: 117 passed, 32 warnings
512^3 random A/B + random_binary scale correctness: passed
512^3 bf16_quantized input correctness: passed
8192^3 random_binary scale: 0.8498 ms, 1293.81 TFLOPS

Additional short-loop perf sanity check, using random A/B + random_binary scales, 5 warmup iterations and 50 measured iterations:

TileLang: about 1.30-1.31 PFLOPS
CUTLASS 79a K256 no-verify perf rebuild: about 1.44-1.46 PFLOPS
TileLang / CUTLASS: about 0.90x

The remaining gap is not from missing outer-level TMA or warp specialization support. Our current understanding is that the next optimization stage needs deeper CUTLASS/CuTe-style consumer-side modeling inside TileLang lowering: A/B/SFA/SFB copy_view, register-fragment packaging, scale slot/package modeling, and stronger OMMA.SF issue-continuity control.

Durable PR status observed for this head:

mergeable: true
pre-commit.ci - pr: success
CodeRabbit: success

I am not pinning GitHub Actions state here because it is time-varying; please use the PR checks tab for the current Actions run state.

@qqq-tao

qqq-tao commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

@Rachmanino
Hi, I am pleased to tell you that my tilelang NVFP4 GEMM has reached 90% of CUTLASS's performance. I hope you could review the PR and give me more advice to get it finally merged into the mainstream of Tilelang. Thank you for your time.

@qqq-tao qqq-tao force-pushed the nvf4-block-scale-sm120-pr-clean branch from 1e1576f to e736713 Compare July 3, 2026 03:19
@qqq-tao qqq-tao force-pushed the nvf4-block-scale-sm120-pr-clean branch from e736713 to c3596f5 Compare July 3, 2026 03:40
from .macro.mma_macro_generator import ( # noqa: F401
TensorCoreIntrinEmitter,
TensorCoreIntrinEmitterWithLadderTransform,
TensorCoreIntrinEmitterWithBlockScale,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to extend the original TensorCoreIntrinEmitter for mma to support blockscaling mode, which aligns with the impl of TCGEN05MMAIntricEmitter.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think these 2 examples are a bit too complicated.

  1. I think we can make use of the auto ws provided by TileLang instead of handwritting ws
  2. For an illustrative example, i think non-persistent is sufficient
  3. T.sm120_store_full_c_fragment_panel32_tma_bf16 it is not very proper to expose such procedure as an API under tilelang.language, which usually only includes essential operators or intrinisics. A better alternative way it to consider how to lower to it automatically
  4. For simplicity, kernel implementation, along with a simple benchmark result demonstration are sufficient for an example. We can remove stuff like finding nvcc/cmake/cutlass and so on, or consider moving them to the maint folder.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i understand correctly, multiple intrinsics exposed here that started with sm120_mma only differ in their micro pipeline strategies (i.e. how ldmatrix and atom mma are composed into a tiled blockscaled mma?). I think instead of exposing so many intrinsics, a unified sm120_mma_blockscaled API containing a parameter which controls the micro pipeline impl is cleaner and easier to maintain

namespace {

bool EnableSM120CompactUnpackedFP4Shared() {
const char *value = std::getenv("TL_SM120_COMPACT_UNPACKED_FP4_SHARED");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need TL_SM120_COMPACT_UNPACKED_FP4_SHARED and TL_SM120_MMA_RAW_UNPACKED_FP4_ACCESS_PTR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants