Gefen: Optimized Stochastic Optimizer

Gefen is a drop-in replacement for the AdamW optimizer (and, via GefenMuon, for Muon; see below) for memory-efficient pre-training. It keeps the familiar AdamW training recipe while sharply reducing optimizer-state memory: an 8x reduction in AdamW's optimizer-state footprint, or about 6.5 GiB saved per billion parameters, while maintaining AdamW-level performance. The smaller footprint lets you train larger models or use larger batch sizes and, as a result, achieve higher training throughput. Switching takes two lines of code: import Gefen and replace the AdamW optimizer constructor. For fine-tuning, we introduce GefenMuon, which does not store second moments (see below).
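
As a rough check on the figures above, the following sketch reproduces the arithmetic. It rests on two assumptions that this README does not spell out: that AdamW keeps two fp32 states per parameter (8 bytes), and that an 8x reduction leaves about 1 byte per parameter. The helper name `optimizer_state_gib` is illustrative only and is not part of Gefen.

```python
GIB = 1024 ** 3  # bytes per GiB

def optimizer_state_gib(num_params, bytes_per_param):
    """Optimizer-state size in GiB for a given per-parameter cost."""
    return num_params * bytes_per_param / GIB

num_params = 1_000_000_000  # one billion parameters

# Assumption: AdamW stores two fp32 tensors per parameter (4 + 4 bytes).
adamw_gib = optimizer_state_gib(num_params, 8)
# Assumption: an 8x reduction leaves about 1 byte per parameter.
reduced_gib = optimizer_state_gib(num_params, 8 / 8)

saved_gib = adamw_gib - reduced_gib
print(f"AdamW state:   {adamw_gib:.2f} GiB")
print(f"Reduced state: {reduced_gib:.2f} GiB")
print(f"Saved:         {saved_gib:.2f} GiB per billion parameters")
```

Under these assumptions the saving comes out at roughly 6.5 GiB per billion parameters, in line with the number quoted above.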

Installation

Install from source to get the latest version:

git clone https://github.com/ndvbd/Gefen
cd Gefen
pip install -e .

Or, install from PyPI:

pip install gefen

On the first CUDA run, Gefen builds its fused CUDA kernels with PyTorch JIT and nvcc. This can take a few minutes. Later runs reuse the cached build for the same Python, PyTorch, CUDA version, and Gefen source checkout.

This keeps the source install lightweight, but it requires a CUDA toolkit and host compiler compatible with your PyTorch installation. In the future, we plan to make this smoother with prebuilt wheels for common PyTorch/CUDA combinations.
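
Before the first CUDA run, it can help to confirm that a compiler toolchain is visible at all. The sketch below is a presence check only: it does not verify that the versions match your PyTorch build, and it does not reproduce how Gefen or PyTorch actually locate the toolkit. The function name `find_cuda_toolchain` is hypothetical, and reading the `CUDA_HOME` / `CUDA_PATH` environment variables is an assumption about a typical setup.

```python
import os
import shutil

def find_cuda_toolchain():
    """Report where nvcc and a host C++ compiler are found, if anywhere."""
    # Assumption: the toolkit root, when set, is exposed via CUDA_HOME or CUDA_PATH.
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    nvcc = shutil.which("nvcc")
    if nvcc is None and cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate):
            nvcc = candidate
    # Look for any of the common host compilers on PATH.
    host_compiler = shutil.which("g++") or shutil.which("clang++") or shutil.which("cl")
    return {"nvcc": nvcc, "host_compiler": host_compiler}

toolchain = find_cuda_toolchain()
for tool, path in toolchain.items():
    print(f"{tool}: {path or 'not found'}")
```

If either entry prints "not found", the JIT build of the fused kernels is likely to fail until the toolkit or compiler is installed and on `PATH`.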

Quick Start

import torch
from gefen import Gefen

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)

# optimizer = torch.optim.AdamW(
optimizer = Gefen(  # Replace AdamW with Gefen:
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.0,
)

inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

logits = model(inputs)
loss = torch.nn.functional.cross_entropy(logits, targets)
loss.backward()

optimizer.step()
optimizer.zero_grad(set_to_none=True)

print('Finished successfully.')
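
To see how much memory an optimizer's state occupies on your own model, one option is to sum the tensors it keeps in `optimizer.state`. The sketch below does this for `torch.optim.AdamW` as a baseline. The helper `optimizer_state_bytes` is hypothetical, and whether Gefen exposes all of its state as plain tensors in `optimizer.state` is an assumption that should be checked before relying on the number for Gefen.

```python
import torch

def optimizer_state_bytes(optimizer):
    """Sum the bytes of every tensor held in optimizer.state."""
    total = 0
    for param_state in optimizer.state.values():
        for value in param_state.values():
            if torch.is_tensor(value):
                total += value.numel() * value.element_size()
    return total

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Optimizer state is created lazily, so take one step before measuring.
model(torch.randn(4, 128)).sum().backward()
optimizer.step()

num_params = sum(p.numel() for p in model.parameters())
state_bytes = optimizer_state_bytes(optimizer)
print(f"{state_bytes / num_params:.2f} state bytes per parameter")
```

For fp32 AdamW this should report roughly 8 bytes per parameter (two fp32 moments plus small step counters), which is the baseline the memory savings are measured against.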

Figure: GPT-2 pre-training validation loss, Gefen vs. AdamW. The curves are similar.

Hugging Face Trainer

Until native optim="gefen" support is released in Transformers, pass Gefen to the Trainer with optimizer_cls_and_kwargs:

from gefen import Gefen
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=1e-3,
    weight_decay=0.0,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizer_cls_and_kwargs=(
        Gefen,
        {
            "lr": training_args.learning_rate,
            "betas": (training_args.adam_beta1, training_args.adam_beta2),
            "eps": training_args.adam_epsilon,
            "fused": True,
        },
    ),
)

Distributed Training

Gefen is compatible with standard distributed training setups, including PyTorch DDP, PyTorch FSDP, and flavors of DeepSpeed ZeRO. In these workflows, it can be used like any other PyTorch optimizer.

Extension: Gefen-Muon

Based on the Gefen paradigm, a simple extension is to add a pseudo-orthogonalization step on the first moment, as Muon does, while skipping the second moment. This version, which is based on the PyTorch Muon implementation, immediately reduces Muon's optimizer-state footprint by 4x: the first moments are quantized to 8-bit using Gefen's Hessian-block-diagonal-inspired partitioning exact quantization, while performance remains similar to Muon.
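
To make the 4x figure concrete, here is a sketch of storing a momentum buffer as 8-bit codes instead of fp32. It uses generic block-wise absmax quantization, which is lossy and is NOT Gefen's partitioning or quantizer; the block size, the fp32 per-block scale, and both function names are assumptions chosen only for illustration.

```python
import random

def quantize_blockwise(values, block_size):
    """Absmax-quantize a flat list of floats to int8 codes, one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 127 or 1.0  # avoid a zero scale
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales

def dequantize_blockwise(codes, scales, block_size):
    """Invert quantize_blockwise up to rounding error."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]

random.seed(0)
momentum = [random.gauss(0.0, 0.01) for _ in range(4096)]  # stand-in first moment
block_size = 256  # illustrative value, not taken from Gefen

codes, scales = quantize_blockwise(momentum, block_size)
restored = dequantize_blockwise(codes, scales, block_size)

fp32_bytes = 4 * len(momentum)
int8_bytes = 1 * len(codes) + 4 * len(scales)  # codes plus one fp32 scale per block
max_error = max(abs(a - b) for a, b in zip(momentum, restored))
print(f"compression: {fp32_bytes / int8_bytes:.2f}x, max abs error: {max_error:.2e}")
```

With one scale per 256 values, the per-block overhead is small, so the stored size approaches one quarter of the fp32 buffer.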

You can use it exactly as you use Muon, with a simple constructor name replacement:

from gefen import GefenMuon

optimizer = GefenMuon(
    [muon_parameter for _, muon_parameter in muon_parameter_pairs],
    lr=lr,
)

Our experiments show similar performance to Muon, with 4x less persistent optimizer memory (when Muon stores fp32 momentum). Because Muon supports only 2D parameters, you can either apply the optimizer only to those parameters or use a simple wrapper to flatten all other parameters to 2D.
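
For the first option, a minimal sketch of selecting the 2D parameters of a model is shown below. The helper `split_params_by_ndim` is hypothetical and not part of Gefen; it relies only on `named_parameters()` and `Tensor.ndim` from PyTorch, and the toy model stands in for a real network.

```python
import torch

def split_params_by_ndim(model):
    """Separate 2D weight matrices from everything else (biases, norms, ...)."""
    matrix_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen parameters need no optimizer
        (matrix_params if param.ndim == 2 else other_params).append((name, param))
    return matrix_params, other_params

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.LayerNorm(32),
    torch.nn.Linear(32, 4),
)
matrix_params, other_params = split_params_by_ndim(model)
print("2D:", [name for name, _ in matrix_params])
print("other:", [name for name, _ in other_params])
```

The 2D list has the same (name, parameter) shape as `muon_parameter_pairs` in the snippet above, and the remaining parameters can be handed to a second optimizer such as AdamW or Gefen. Which 2D tensors to include (for example, whether to leave out embeddings) remains a modeling choice that this sketch does not make for you.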

Figure: training curves for fine-tuning Qwen3-1.7B, once with Muon and once with GefenMuon. The curves are similar. In GPT-2 pre-training, the Muon and GefenMuon loss curves are likewise similar.

GefenMuon supports DDP and FSDP.

Features

Gefen and GefenMuon support 32-bit and 16-bit training.

Pretraining versus Fine-tuning

Block-diagonal Hessian structure is stronger in pre-training than in fine-tuning. In fine-tuning, as the name suggests, the Hessian and second-moment structure is flatter and finer in most tensors. Therefore, use Gefen for pre-training; for fine-tuning, we recommend GefenMuon, since it neither stores nor shares a second moment. Because Gefen infers tensor structure from gradients at the beginning of training, it assumes a global batch of meaningful size, which in practice usually means larger than 1.

Case Studies

In pre-training experiments with a 31.6B-parameter Nemotron-3-style model using NVIDIA Megatron Bridge with full sharding on 8 H100 80 GiB GPUs, Gefen improved throughput by about 6x over AdamW: from 25 s/step and 86 tokens/s to 4.2 s/step and 520 tokens/s. AdamW required about 385 GB of CPU offload, while Gefen kept all tensors on GPU and used about 504 GiB of VRAM.
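
The speedup can be recomputed directly from the reported step times and token rates. The snippet below uses only the four numbers quoted above; the dictionary layout is just an illustrative way to hold them.

```python
# Figures as reported in the case study above.
adamw = {"sec_per_step": 25.0, "tokens_per_sec": 86.0}
gefen = {"sec_per_step": 4.2, "tokens_per_sec": 520.0}

# Step time shrinks, so the speedup is old / new.
step_speedup = adamw["sec_per_step"] / gefen["sec_per_step"]
# Token rate grows, so the speedup is new / old.
token_speedup = gefen["tokens_per_sec"] / adamw["tokens_per_sec"]

print(f"step-time speedup:  {step_speedup:.2f}x")   # 5.95x
print(f"throughput speedup: {token_speedup:.2f}x")  # 6.05x
```

Both ratios land close to 6, so the two measurements agree with each other.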

Testimonials

Have you tried Gefen and want to report your impressions privately or publicly? We would be happy to hear about your experience. With your permission, we can credit you and mention your work here. The repository is still a work in progress, so with a bit of patience, we will try to address the various intricacies of different frameworks and add more features over time.

Citation

If you found this library useful, please consider citing our work:

@article{benedek2026gefen,
  title={Gefen: Optimized Stochastic Optimizer},
  author={Benedek, Nadav and Koren, Tomer and Fried, Ohad},
  journal={arXiv preprint arXiv:2606.13894},
  year={2026}
}
