[cuTile] Add rope/qwen2vl_mrope/kl_div/group_norm/multi_token_attention #1269

Open
xjmxyt wants to merge 1 commit into linkedin:main from xjmxyt:jinmanx/add_kernel_v2
xjmxyt (Contributor) commented Jun 26, 2026

Adds cuTile implementations of group_norm, kl_div, llama4_rope, qwen2vl_mrope, rope, sparsemax, tiled_mlp, and multi_token_attention, selected by setting LIGER_KERNEL_IMPL=cutile. Includes three-way speed and memory benchmark data (liger_triton / liger_cutile / torch|huggingface) collected on a B200 with the open-source pip package nvidia-cuda-tileiras 13.3.36.
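
For readers unfamiliar with environment-variable dispatch, here is a minimal sketch of how a backend switch like LIGER_KERNEL_IMPL could work. The registry, the `register` and `dispatch` helpers, the placeholder kernels, and the fallback to `triton` are all hypothetical and only illustrate the idea; they are not the dispatch code in this PR.

```python
import os

# Hypothetical registry mapping (op name, backend name) -> implementation.
# None of these names come from Liger-Kernel; they are for illustration only.
_IMPLS = {}


def register(op, backend):
    """Decorator that records an implementation for one op/backend pair."""
    def deco(fn):
        _IMPLS[(op, backend)] = fn
        return fn
    return deco


@register("rope", "triton")
def _rope_triton(x):
    # Placeholder body; a real kernel launch would go here.
    return ("triton", x)


@register("rope", "cutile")
def _rope_cutile(x):
    # Placeholder body; a real kernel launch would go here.
    return ("cutile", x)


def dispatch(op, *args, default="triton"):
    """Pick the backend named by LIGER_KERNEL_IMPL; fall back to `default`
    when the variable is unset or no implementation is registered (assumed policy)."""
    backend = os.environ.get("LIGER_KERNEL_IMPL", default).lower()
    fn = _IMPLS.get((op, backend))
    if fn is None:
        fn = _IMPLS[(op, default)]
    return fn(*args)
```

Reading the variable at call time rather than at import time means the backend can be switched between benchmark runs within one process, at the cost of one dictionary lookup per call.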

Summary

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

Notable kernel choices:
- rope/qwen2vl_mrope: stay in the input dtype, drop redundant .contiguous() copies
- kl_div: drop the scale constexpr to avoid a JIT recompile on every iteration
- group_norm: compute stats in fp32 for numerical parity
- multi_token_attention: conv-backward runs under the same cuDNN heuristic as
  the Triton path (no forced cudnn.benchmark) for an apples-to-apples comparison
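
The kl_div point above can be illustrated with a toy model. JIT compilers for GPU kernels typically cache compiled code keyed on compile-time constants, so a constant that changes between calls forces a fresh compile each time. The sketch below assumes the scale value varies across iterations; `ToyJit` and both kernel builders are hypothetical stand-ins, not the cuTile or Triton API.

```python
class ToyJit:
    """Toy compile cache keyed on the kernel name and its compile-time constants."""

    def __init__(self):
        self.cache = {}
        self.compiles = 0

    def launch(self, build, constexprs, *runtime_args):
        # The cache key includes every compile-time constant, so a new
        # constant value is a cache miss and triggers a "compile".
        key = (build.__name__, tuple(sorted(constexprs.items())))
        if key not in self.cache:
            self.compiles += 1
            self.cache[key] = build(**constexprs)
        return self.cache[key](*runtime_args)


def scaled_sum_constexpr(scale):
    # `scale` is baked into the kernel at compile time.
    def kernel(xs):
        return scale * sum(xs)
    return kernel


def scaled_sum_runtime():
    # `scale` is an ordinary runtime argument; one compile serves all values.
    def kernel(xs, scale):
        return scale * sum(xs)
    return kernel


def count_compiles(scales):
    """Return (compiles with constexpr scale, compiles with runtime scale)."""
    a, b = ToyJit(), ToyJit()
    for s in scales:
        a.launch(scaled_sum_constexpr, {"scale": s}, [1.0, 2.0])
        b.launch(scaled_sum_runtime, {}, [1.0, 2.0], s)
    return a.compiles, b.compiles
```

With three distinct scale values, the constexpr variant compiles three times while the runtime-argument variant compiles once.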

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
xjmxyt force-pushed the jinmanx/add_kernel_v2 branch from 13e5472 to 5fb9475 on June 26, 2026 08:37