perf: optimize grad_weight accumulation with addmm by maskyuanzh · Pull Request #1239 · linkedin/Liger-Kernel

maskyuanzh · 2026-05-26T10:56:56Z

Summary

This PR optimizes grad_weight accumulation in fused linear cross entropy by replacing:

grad_weight += torch.mm(...).float()

with an in-place torch.addmm(..., out=grad_weight)-based accumulation.

On CUDA with PyTorch >= 2.8, where torch.addmm accepts out_dtype, accumulating fp16/bf16 operands into an fp32 grad_weight uses:

torch.addmm(..., out_dtype=torch.float32, out=grad_weight)

This avoids materializing the full [V, H] intermediate from torch.mm(...).float().

Fixes #1232.

Memory Benchmark

I benchmarked a 128k-vocab case with V=131072, H=4096, chunk_size=2048, bf16 inputs, and fp32 grad_weight on an NVIDIA GeForce RTX 4090.

PyTorch 2.1.2+cu121:
  old mm(...).float():                     extra peak 3072 MiB

PyTorch 2.12.0+cu126:
  old mm(...).float():                     extra peak 3072 MiB
  addmm(out_dtype=torch.float32, out=...): extra peak 0 MiB

So on PyTorch >= 2.8, the out_dtype path removes the large [V, H] peak allocation in this configuration. On earlier PyTorch versions, the existing implementation is preserved.
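
The 3072 MiB figure is consistent with a simple size estimate. The sketch below assumes that, at peak, the old path holds both the bf16 result of torch.mm(...) and its fp32 copy from .float() before adding into grad_weight. The helper name tensor_mib is hypothetical and only illustrates the arithmetic; it is not part of the PR.

```python
# Back-of-the-envelope size of the [V, H] temporaries in the old path.
# Assumption: the bf16 mm output and its fp32 .float() copy are both alive
# at peak, just before the in-place add into grad_weight.

V = 131072  # vocabulary size used in the benchmark
H = 4096    # hidden size used in the benchmark
MIB = 1024 * 1024

BYTES_PER_ELEMENT = {"fp16": 2, "bf16": 2, "fp32": 4}


def tensor_mib(rows: int, cols: int, dtype: str) -> float:
    """Size in MiB of a dense rows x cols tensor of the given dtype."""
    return rows * cols * BYTES_PER_ELEMENT[dtype] / MIB


mm_output_mib = tensor_mib(V, H, "bf16")   # result of torch.mm on bf16 operands
float_copy_mib = tensor_mib(V, H, "fp32")  # result of .float()
old_path_extra_mib = mm_output_mib + float_copy_mib

print(mm_output_mib, float_copy_mib, old_path_extra_mib)
```

Under that assumption the two temporaries add up to the measured extra peak, and writing directly into the existing fp32 grad_weight buffer needs neither of them.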

Testing Done

Hardware Type: NVIDIA GeForce RTX 4090
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

Additional targeted testing:

pytest -q test/transformers/test_fused_linear_cross_entropy.py

Passed. This covers the fused linear cross entropy paths affected by this change.

tyler-romero · 2026-05-26T19:27:09Z

        if ce_weight.stride(-1) != 1:
            ce_weight = ce_weight.contiguous()
-
+    IS_TORCH2P12 = Version(torch.__version__.split("+")[0]) >= Version("2.12.0")


This could probably be located globally, so it doesn't need to run every forward pass.

Thanks for pointing this out. I’ll move the PyTorch version check to the module-level constants so it is only evaluated once instead of on every forward pass.

tyler-romero · 2026-05-26T19:29:47Z

+                    grad_weight,
+                    grad_logits_chunk.t(),
+                    _input_chunk.to(dtype=grad_logits_chunk.t().dtype),
+                    out_dtype=torch.float32,


I think out_dtype is technically available earlier than 2.12; I just didn't do the work to track down which version introduced it. I think I remember it existing in 2.10 as well.

Thanks for pointing this out. The exact version wasn’t carefully verified here. After checking the PyTorch docs and source tags, I found that out_dtype was added to torch.addmm in PyTorch 2.8.0 for fp16/bf16 CUDA inputs with fp32 output accumulation. I’ll lower the version guard from 2.12.0 to 2.8.0.
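
A version guard like this has to compare release numbers numerically, because "2.12.0" sorts before "2.8.0" as a plain string. The sketch below is a standard-library stand-in for the Version(...) comparison in the diff, not the PR's code; parse_release and addmm_supports_out_dtype are hypothetical names, and the 2.8.0 threshold is the one stated in this thread.

```python
import re


def parse_release(version: str) -> tuple:
    """Return the numeric release of a version string as a 3-tuple of ints.

    Drops a local suffix such as "+cu126" and ignores anything after the
    leading dotted digits. Note: unlike a full version parser, this treats a
    pre-release such as "2.8.0a0" the same as "2.8.0".
    """
    public = version.split("+")[0]
    match = re.match(r"\d+(?:\.\d+)*", public)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions like "2.8" so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def addmm_supports_out_dtype(torch_version: str) -> bool:
    """Hypothetical helper: True when the release is at least 2.8.0."""
    return parse_release(torch_version) >= (2, 8, 0)


# Evaluated once, for example at module import time, rather than per call.
for candidate in ("2.1.2+cu121", "2.8.0", "2.12.0+cu126"):
    print(candidate, addmm_supports_out_dtype(candidate))
```

Computing the flag once at module level also addresses the earlier review comment about not repeating the check on every forward pass.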

tyler-romero · 2026-05-26T19:30:40Z

+                torch.addmm(
+                    grad_weight,
+                    grad_logits_chunk.t(),
+                    _input_chunk.to(dtype=grad_logits_chunk.t().dtype),


Why is this cast now necessary?

Thanks for asking. The cast is needed because out_dtype only controls the output/accumulation dtype; torch.addmm still requires mat1 and mat2 to have the same input dtype.

I tested this on PyTorch 2.12.0 + CUDA. With out_dtype=torch.float32, addmm still fails for fp16 x fp32 and bf16 x fp32 inputs:

Half and Float -> RuntimeError: mat1 and mat2 must have the same dtype
BFloat16 and Float -> RuntimeError: mat1 and mat2 must have the same dtype

After casting mat2 to match mat1, both fp16 and bf16 paths succeed and write into a fp32 output buffer. In the AMP path here, grad_logits_chunk is the low-precision operand while _input_chunk can remain fp32, so this cast aligns _input_chunk with grad_logits_chunk and keeps the multiply in fp16/bf16 while accumulating into fp32.

Now that I've looked at it, I don't think we ever cast _input_chunk; it should always have the same dtype as grad_logits_chunk. .to is a no-op when both dtypes match, but I prefer keeping the code clean rather than guarding defensively against something that can never happen.

Thanks, that makes sense. I removed the redundant _input_chunk.to(...) and now pass _input_chunk directly to addmm.

tyler-romero · 2026-05-26T19:32:38Z

+                    grad_logits_chunk.t().to(grad_weight.dtype),
+                    _input_chunk.to(grad_weight.dtype),


So the input, chunk, and weight all need the same dtype on this path? The desired behavior is typically to multiply in bf16 and then accumulate in fp32. Doing the multiply in fp32 as well would be pretty slow, so I don't think this is advisable.

Thanks for pointing this out. I tested the fp32-operand fallback and found that your concern was correct: although it reduces peak memory compared with the old mm(...).float() path, it is slower because the matmul itself runs in fp32.

old:          mm(lowp, lowp).float()              23.5 ms, peak 5688 MiB
old fallback: addmm(fp32, fp32, out=fp32)         27.4 ms, peak 3640 MiB
fast:         addmm(lowp, lowp, out_dtype=fp32)   13.4 ms, peak 2632 MiB

I updated the logic so this path no longer promotes both operands to fp32. The addmm(..., out_dtype=torch.float32, out=grad_weight) path is now only used when out_dtype is supported, grad_weight is fp32, and grad_logits_chunk is fp16/bf16. Since addmm(..., out=...) does not autocast the operands, and out_dtype only controls the output dtype, _input_chunk is explicitly cast to grad_logits_chunk’s dtype. This keeps the matmul in fp16/bf16 while writing directly into the fp32 accumulation buffer. For unsupported cases, the code now falls back to the original mm(...).float() behavior.
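
The selection rule described here can be sketched as a small decision function. This is an illustration only: pick_accumulation_path, its parameters, and the returned labels are hypothetical, dtypes are passed as plain strings so the sketch runs without PyTorch, and the CUDA check is an assumption taken from the PR summary rather than from the diff.

```python
# Dtype names written as strings purely for illustration.
LOW_PRECISION_DTYPES = {"float16", "bfloat16"}


def pick_accumulation_path(
    out_dtype_supported: bool,
    device_type: str,
    grad_weight_dtype: str,
    grad_logits_dtype: str,
) -> str:
    """Return which grad_weight accumulation path the described rule selects.

    "addmm_out_dtype": addmm(..., out_dtype=torch.float32, out=grad_weight)
    "mm_float":        the original grad_weight += mm(...).float()
    """
    use_fast_path = (
        out_dtype_supported                            # version guard passed
        and device_type == "cuda"                      # assumption: CUDA only
        and grad_weight_dtype == "float32"             # fp32 accumulation buffer
        and grad_logits_dtype in LOW_PRECISION_DTYPES  # fp16/bf16 operands
    )
    return "addmm_out_dtype" if use_fast_path else "mm_float"


print(pick_accumulation_path(True, "cuda", "float32", "bfloat16"))
print(pick_accumulation_path(False, "cuda", "float32", "bfloat16"))
```

Keeping every unsupported combination on the original path means behavior only changes where the low-precision multiply with fp32 accumulation is actually available.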

maskyuanzh · 2026-06-09T08:23:38Z

@Tcc0403 This PR is ready for review. Thanks!

Tcc0403

LGTM, cc @Mecoli1219 and @vaibhavjindal for double check

maskyuanzh · 2026-06-16T07:38:57Z

@Tcc0403 Gentle ping on this PR.

Thanks again for the approval. As far as I can tell, the review comments have been addressed, and the PR is ready from my side. Since @Mecoli1219 and @vaibhavjindal were cc’ed for double check, I just wanted to ask whether there is any remaining concern or blocker before this can be merged.

Thanks!

vaibhavjindal · 2026-06-30T22:03:18Z

 # The optimal maximum block size depends on your hardware, your kernel, and your dtype
 MAX_FUSED_SIZE = 2048 if infer_device() == "npu" else 65536 // 2
+_TORCH_VERSION = Version(torch.__version__.split("+")[0])
+_SUPPORTS_ADDM_MIXED_PRECISION_OUT_DTYPE = _TORCH_VERSION >= Version("2.8.0")


There's a small typo, it should be 'ADDMM' instead of 'ADDM'.

Also, consider renaming this to _ADDMM_SUPPORTS_OUT_DTYPE to make it less verbose.

Thanks for the suggestion! I renamed it to _ADDMM_SUPPORTS_OUT_DTYPE as suggested.

vaibhavjindal · 2026-06-30T22:07:07Z

@maskyuanzh thanks for the PR. Two minor things:

1. I added a comment about renaming the constant.
2. Please update the PR body. It describes a fallback that "explicitly aligns operand dtypes with grad_weight.dtype before calling addmm(out=grad_weight)" and benchmarks it at 1056 MiB on torch 2.1.2, but the actual else branch is the unchanged mm().float() line, which the same benchmark lists at 3072 MiB. So on torch < 2.8 this PR gives no memory benefit, contrary to the description.

maskyuanzh · 2026-07-01T02:29:27Z

@vaibhavjindal Thanks for the review and the helpful suggestions!
I renamed the constant to _ADDMM_SUPPORTS_OUT_DTYPE and updated the PR description to match the current implementation.

maskyuanzh mentioned this pull request May 26, 2026

Mem and compute inefficiency in fused_linear_cross_entropy_foward #1232

Open

maskyuanzh changed the title ~~Optimize grad_weight accumulation with addmm~~ perf: optimize grad_weight accumulation with addmm May 26, 2026

tyler-romero reviewed May 26, 2026

Tcc0403 previously approved these changes Jun 9, 2026

maskyuanzh added 3 commits June 24, 2026 05:17

Optimize grad_weight accumulation with addmm

18f4c96

fix fused linear CE addmm fallback dtype handling

d497771

Remove redundant input chunk cast

ec69ffb

maskyuanzh dismissed Tcc0403’s stale review via ec69ffb June 24, 2026 05:30

maskyuanzh force-pushed the fix-fused-linear-ce-addmm branch from 02a39bd to ec69ffb June 24, 2026 05:30

Tcc0403 mentioned this pull request Jun 27, 2026

perf(flce): fuse grad_weight accumulation via addmm_ (cuBLAS beta=1) + memory cleanups #1270

Open

maskyuanzh requested a review from Tcc0403 June 29, 2026 04:11

Merge branch 'main' into fix-fused-linear-ce-addmm

69a1575

vaibhavjindal reviewed Jun 30, 2026

Rename addmm out_dtype support constant

4c21cc7

maskyuanzh requested a review from vaibhavjindal July 1, 2026 02:29
