
[PyTorch] Add workaround for cuteDSL stride requirement for zero-token expert#2947

Merged
ksivaman merged 2 commits into NVIDIA:main from ksivaman:cutedsl_zero_token_stride_war on May 1, 2026

Conversation

@ksivaman
Member

Description

cudnn-frontend and cuteDSL do not relax their stride divisibility requirements for wgrad input tensors with 0 elements. This PR adds a workaround that can be removed once the proper fix lands in cuteDSL.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Create dummy empty tensors to pass to cuDNN for the case where we have zero tokens.
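The dummy-tensor trick can be sketched as follows. This is a minimal, standalone illustration using torch.empty_strided: the shapes, the uint8 stand-in dtype (so the sketch runs on CPU-only builds without FP8 support), and the leading stride of 16 are taken from or modeled on the PR, but this is not the actual code in backward_grouped_mlp.py.

```python
import torch

# Hypothetical sketch of the zero-token guard: allocate 0-element tensors
# whose reported strides still satisfy the frontend's stride checks.
out_features, in_features = 128, 256  # assumed example sizes
device = "cpu"  # the real path runs on CUDA; CPU keeps this sketch runnable

# Zero elements along the token dimension, but a leading stride of 16 so
# that stride validation in the kernel frontend passes (value from the PR).
a_tensor = torch.empty_strided(
    (out_features, 0), (16, 1), dtype=torch.uint8, device=device
)
b_tensor = torch.empty_strided(
    (0, in_features), (in_features, 1), dtype=torch.uint8, device=device
)

# The tensors hold no data, yet they advertise the compliant strides.
print(a_tensor.numel(), a_tensor.stride())  # 0 (16, 1)
print(b_tensor.numel(), b_tensor.stride())  # 0 (256, 1)
```

Because the tensors are empty, the wgrad kernel sees a valid layout but contributes nothing to the gradient, which is exactly the desired no-op behavior for a zero-token expert.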

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman ksivaman requested a review from timmoon10 April 30, 2026 21:22
@greptile-apps
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR adds a workaround in _cudnn_compute_wgrad to handle the zero-token expert case, where cuteDSL/cudnn-frontend enforces stride divisibility requirements even on 0-element tensors. The fix creates dummy empty tensors with compliant strides for a_tensor, b_tensor, sfa_tensor, and sfb_tensor when total_tokens == 0, bypassing the kernel's validation while producing a correct (no-op) result.

Confidence Score: 5/5

Safe to merge — the change is a narrowly scoped, no-op workaround that only activates when total_tokens == 0 and leaves the hot path untouched.

No P0 or P1 issues found. The zero-token guard correctly creates 0-element tensors with cuteDSL-compliant strides. The non-zero path is unchanged. The b_tensor stride of (in_features, 1) is consistent with the non-zero path and is always a multiple of the required alignment given MXFP8's block-size constraints. Previously flagged concerns (missing TODO, unexplained constant 16) are already tracked in existing review threads.

No files require special attention beyond the previously noted style concerns.

Important Files Changed

Filename Overview
transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py Adds zero-token guard in _cudnn_compute_wgrad; dummy empty tensors with cuteDSL-compliant strides are created and forwarded to the wgrad kernel, which is a no-op for 0 tokens. Refactors sfa_leading_dim/sfb_leading_dim to be computed before the branch.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_cudnn_compute_wgrad called"] --> B{"total_tokens == 0?"}
    B -- Yes --> C["Create dummy empty_strided tensors\n(a, b, sfa, sfb)\nwith cuteDSL-compliant strides"]
    B -- No --> D["Slice real data from\ngrouped_dy & grouped_x\n(unchanged path)"]
    C --> E["wgrad_kernel_fn\n(no-op: 0 tokens → no\ngradient contribution)"]
    D --> E
    E --> F{"single_grouped_weight?"}
    F -- Yes --> G["dense mode:\nwgrad_tensor written"]
    F -- No --> H["discrete mode:\nper-expert wgrad_ptrs written"]

Reviews (2): Last reviewed commit: "Merge branch 'main' into cutedsl_zero_to..."

Comment on lines +64 to +67
if total_tokens == 0:
    # A workaround for the case with zero-token experts.
    # Even for this case, cuteDSL still requires the same
    # stride requirements for the input and scale tensors.
Contributor


P2 Missing TODO for temporary workaround

The PR description states this workaround will be removed once the upstream fix lands in cutedsl, but the in-code comment has no corresponding TODO or issue-tracker reference. Without one, there's no actionable reminder to clean this up once the upstream fix is released.

Suggested change
if total_tokens == 0:
    # A workaround for the case with zero-token experts.
    # Even for this case, cuteDSL still requires the same
    # stride requirements for the input and scale tensors.

if total_tokens == 0:
    # TODO: Remove this workaround once cuteDSL relaxes stride
    # divisibility requirements for zero-element tensors (tracked in
    # <upstream issue link>).
    # A workaround for the case with zero-token experts.
    # Even for this case, cuteDSL still requires the same
    # stride requirements for the input and scale tensors.

Comment on lines +69 to +84
a_tensor = torch.empty_strided((out_features, 0), (16, 1), dtype=fp8_dtype, device=device)
b_tensor = torch.empty_strided(
    (0, in_features), (in_features, 1), dtype=fp8_dtype, device=device
)
sfa_tensor = torch.empty_strided(
    (sfa_leading_dim, 0),
    (16, 1),
    dtype=torch.float8_e8m0fnu,
    device=device,
)
sfb_tensor = torch.empty_strided(
    (sfb_leading_dim, 0),
    (16, 1),
    dtype=torch.float8_e8m0fnu,
    device=device,
)
Contributor

P2 Hardcoded stride 16 undocumented

The value 16 is used as the leading stride for both a_tensor and the scale tensors (sfa_tensor, sfb_tensor) in the zero-token path, but there is no comment explaining why 16 specifically satisfies cuteDSL's divisibility requirement. In the non-zero path the leading stride of a_tensor is 1 (column-major after transpose), so this value is not derived from the tensor layout. If the cuteDSL requirement ever changes (e.g. requires 32 or 128 alignment), this silent constant will be wrong without any indication of why it was chosen. A brief comment citing the minimum stride constraint would make future maintenance safer.
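One way to address this review point is to name the alignment and derive the stride from it instead of hardcoding 16. The helper below is a hypothetical sketch, not part of the PR; the constant name and the assumption that 16 is the required alignment come from the PR's usage, and the real value should be taken from the cuteDSL/cudnn-frontend documentation.

```python
# Hypothetical: name the alignment requirement instead of a bare 16.
# The value 16 is an assumption mirroring the PR; verify it against
# cuteDSL's documented stride divisibility requirement.
CUTEDSL_MIN_LEADING_STRIDE = 16


def compliant_leading_stride(
    natural_stride: int, alignment: int = CUTEDSL_MIN_LEADING_STRIDE
) -> int:
    """Round a leading stride up to the next multiple of `alignment`."""
    return ((natural_stride + alignment - 1) // alignment) * alignment


print(compliant_leading_stride(1))    # 16 (column-major a_tensor case)
print(compliant_leading_stride(256))  # 256 (already aligned, unchanged)
print(compliant_leading_stride(260))  # 272 (rounded up to next multiple)
```

With this, a future change in cuteDSL's requirement (say, to 32 or 128) becomes a one-constant edit, and the zero-token path stays consistent with the non-zero path by construction.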

@ksivaman
Member Author

/te-ci pytorch

Collaborator

@timmoon10 timmoon10 left a comment


LGTM

@ksivaman ksivaman merged commit 36fc336 into NVIDIA:main May 1, 2026
21 of 24 checks passed
@ksivaman ksivaman deleted the cutedsl_zero_token_stride_war branch May 1, 2026 06:23