Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Chongyu Fan^† Gaowen Liu^‡ Mingyi Hong^¶ Ramana Rao Kompella^‡ Sijia Liu^†,§

^†Michigan State University ^‡Cisco ^¶University of Minnesota ^§IBM Research

This is the official code repository for the paper "Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR", which introduces Pion (sPectral hIgh-pass Optimization on momeNtum), a drop-in replacement for Muon designed for regimes such as vision-language-action (VLA) training and reinforcement learning with verifiable rewards (RLVR). See the project page for more details.


Figure panels: (a) Muon NS; (b) Promotion f_p; (c) Suppression f_s; (d) High-pass NS.

Visualization of f(σ) over σ ∈ [0, 1], with f(σ) = σ shown as the identity reference. (a) f_NS^t denotes Muon's NS iteration applied t times. (b) f_p^t denotes the Promotion polynomial f_p applied t times. (c) f_s^t denotes the Suppression polynomial f_s applied t times. (d) Pion's high-pass NS iteration: f_s^{k_s} ∘ f_p^{k_p} applies k_p Promotion steps followed by k_s = 5 - k_p Suppression steps.
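To make the caption concrete, here is a minimal scalar sketch of how an NS-style polynomial acts on a single singular value. The quintic coefficients are the ones used in the public Muon reference implementation; the coefficients of Pion's f_p and f_s are not reproduced in this README, so `high_pass` only shows the composition order and takes both maps as caller-supplied placeholders. The helper names `iterate` and `high_pass` are illustrative and are not taken from this repository.

```python
# Scalar view of one Muon Newton-Schulz step. On a singular value sigma, the
# matrix update X <- a*X + (b*G + c*G@G) @ X with G = X @ X.T acts as the odd
# quintic a*sigma + b*sigma**3 + c*sigma**5.
NS_A, NS_B, NS_C = 3.4445, -4.7750, 2.0315  # public Muon reference values


def f_ns(sigma: float) -> float:
    """One Muon NS step applied to a single singular value."""
    return NS_A * sigma + NS_B * sigma**3 + NS_C * sigma**5


def iterate(f, sigma: float, t: int) -> float:
    """Apply the scalar map f to sigma t times, i.e. f^t(sigma)."""
    for _ in range(t):
        sigma = f(sigma)
    return sigma


def high_pass(f_p, f_s, k_p: int, total: int = 5):
    """Compose k_p Promotion steps followed by (total - k_p) Suppression steps.

    f_p and f_s are placeholders supplied by the caller; the actual Pion
    polynomials are defined in the paper and the optimizer source.
    """
    def f(sigma: float) -> float:
        return iterate(f_s, iterate(f_p, sigma, k_p), total - k_p)
    return f


if __name__ == "__main__":
    # A tiny "tail" singular value ends up at order 1 after five Muon steps,
    # which is the uniform whitening behaviour that panel (a) visualizes.
    for sigma in (0.01, 0.1, 0.5, 1.0):
        print(f"sigma={sigma:<5} -> f_NS^5 = {iterate(f_ns, sigma, 5):.3f}")
```

Running the script prints how each input singular value moves after five Muon steps; swapping real f_p and f_s maps into `high_pass` would give the corresponding curve for panel (d).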

Abstract

Muon (MomentUm Orthogonalized by Newton–Schulz) is a matrix-aware optimizer that leverages Newton–Schulz (NS) iterations to enforce spectral gradient orthogonalization by driving all singular values of the momentum matrix toward 1. While this uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, we show that it can lead to fundamental limitations beyond pretraining in two increasingly important regimes: (i) cross-modality vision-language-action (VLA) training, where inherently low-rank action-module gradients cause amplification of noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization inherited from prior training make whitening unstable. To address these challenges, we propose Pion (sPectral hIgh-pass Optimization on momeNtum), a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion + Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost. Extensive experiments demonstrate consistent gains over Muon and AdamW across both VLA and RLVR regimes. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across ℓ₁-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures, e.g., reaching a 100% success rate on LIBERO Object at 1,500 training steps with VLA-Adapter, vs. 97.0% for Muon and only 32.2% for AdamW. In RLVR post-training on Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K, while Muon collapses to zero.

What's in this repo

Pion/
├── VLA/                       # Vision-Language-Action experiments
│   ├── VLAAdapter/            # VLA-Adapter
│   │   └── pion_optim/        # Muon / DefaultPion / LowRankMuon
│   ├── VLANeXt/               # VLANeXt
│   │   └── pion_optim/        # Muon / DefaultPion
│   └── openpi/                # π0.5 on real Franka FR3
│       └── src/openpi/training/muon_optim.py
│                              # MuonAdamW / DefaultPionAdamW
└── RL/                        # RLVR experiments
    └── verl/                  # GRPO + GMPO on Qwen3-1.7B / 4B, GSM8K + MATH
        └── verl/utils/muon.py
                               # MuonAdamW / DefaultPionAdamW / PerHeadMuonAdamW / PerHeadPionAdamW

Across all sub-repos we maintain the same five optimizer families, each paired into a base form and an AdamW-fused form. Which form a sub-repo ships depends on whether its training framework can hold multiple torch.optim.Optimizer instances at once:

VLA-Adapter / VLANeXt drive several optimizers in the same training loop (one Muon / Pion instance per modality bucket plus a torch.optim.AdamW for the 1-D / embedding / output-head bucket), so they ship the base classes and let the trainer call step() on each.
openpi and verl are wrapped by frameworks (openpi's Trainer, verl's Hydra + FSDP2 config) that expose only a single optimizer slot per model; on those we ship the AdamW-fused variants, which apply the Muon / Pion polynomial to ndim ≥ 2 parameters and AdamW to ndim < 2 parameters inside one step() call.
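The ndim-based routing described for the AdamW-fused variants can be sketched as follows. This is a stdlib-only illustration that works on shape tuples, not the code in `muon.py` or `muon_optim.py`; the function name `split_by_ndim` and the parameter names in the example are hypothetical. With real tensors, `p.ndim` would play the role of `len(shape)`.

```python
from typing import Dict, List, Tuple


def split_by_ndim(
    shapes: Dict[str, Tuple[int, ...]],
) -> Tuple[List[str], List[str]]:
    """Route parameters by dimensionality.

    Parameters with ndim >= 2 go to the matrix (Muon / Pion) bucket and
    parameters with ndim < 2 go to the AdamW bucket, mirroring the rule
    stated above for the fused classes.
    """
    matrix_names: List[str] = []
    adamw_names: List[str] = []
    for name, shape in shapes.items():
        if len(shape) >= 2:
            matrix_names.append(name)
        else:
            adamw_names.append(name)
    return matrix_names, adamw_names


# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "attn.q_proj.weight": (2048, 2048),
    "attn.q_proj.bias": (2048,),
    "norm.weight": (2048,),
    "mlp.up_proj.weight": (6144, 2048),
}
matrix_names, adamw_names = split_by_ndim(shapes)
```

A fused optimizer would then apply the matrix update to the first bucket and a standard AdamW update to the second inside the same `step()` call, which is why a single optimizer slot is sufficient.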

Each sub-repo only ships the variants its recipes actually use (see the tree above).

Algorithm	Base class	AdamW-fused class
Muon (NS on the whole matrix)	`Muon`	`MuonAdamW`
Muon (NS on per attention head)	—	`PerHeadMuonAdamW`
Pion (high-pass NS on the whole matrix)	`DefaultPion`	`DefaultPionAdamW`
Pion (high-pass NS on per attention head)	—	`PerHeadPionAdamW`
LowRankMuon	`LowRankMuon`	—
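The difference between the whole-matrix and per-head rows above can be illustrated with a small stdlib sketch. It assumes that heads are stacked along the row axis of the update (num_heads × head_dim rows), which may not match the layout used in this repository, and it uses a Frobenius normalization as a stand-in for the NS / high-pass NS polynomial. The names `apply_per_head` and `frobenius_normalize` are hypothetical.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]


def apply_per_head(
    update: Matrix, num_heads: int, fn: Callable[[Matrix], Matrix]
) -> Matrix:
    """Split the rows of `update` into num_heads equal blocks, apply fn to
    each block independently, and stack the results back in order."""
    rows = len(update)
    if rows % num_heads != 0:
        raise ValueError("row count must be divisible by num_heads")
    head_dim = rows // num_heads
    out: Matrix = []
    for h in range(num_heads):
        block = update[h * head_dim:(h + 1) * head_dim]
        out.extend(fn(block))
    return out


def frobenius_normalize(block: Matrix) -> Matrix:
    """Stand-in for the matrix polynomial: scale a block to unit Frobenius norm."""
    norm = math.sqrt(sum(x * x for row in block for x in row)) or 1.0
    return [[x / norm for x in row] for row in block]


# Two heads whose updates differ in scale by a factor of 100.
update = [[3.0, 4.0], [0.0, 0.0], [0.03, 0.04], [0.0, 0.0]]
per_head = apply_per_head(update, num_heads=2, fn=frobenius_normalize)
whole = frobenius_normalize(update)
```

In this toy case the per-head path normalizes each head on its own, so the small-scale head is not dominated by the large one, whereas the whole-matrix path keeps the 100× gap between them.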

Each sub-repo is a pruned, vendored copy of an upstream training codebase with the Pion optimizer wired in and three drop-in run scripts (run_adamw.sh, run_muon.sh, run_pion.sh). See each sub-repo's README.md for full environment setup, data preparation and run commands.

Getting Started

The optimizers are not packaged as a top-level library; they live next to the training code that uses them. Pick your task and follow the sub-repo README:

Sub-repo	Backbone / task
`VLA/VLAAdapter`	VLA-Adapter
`VLA/VLANeXt`	VLANeXt
`VLA/openpi`	π0.5 on Franka FR3
`RL/verl`	GRPO / GMPO on Qwen3-1.7B / 4B with GSM8K + MATH

Inside each sub-repo:

pion_optim/ (for VLA-Adapter / VLANeXt) or */utils/muon.py / */training/muon_optim.py (for verl / openpi) contains the optimizer implementations.
scripts/run_adamw.sh, scripts/run_muon.sh, scripts/run_pion.sh are the three drop-in launchers.

Citation

If you find this work useful, please consider citing:

@article{fan2026rethinking,
  title={Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR},
  author={Fan, Chongyu and Liu, Gaowen and Hong, Mingyi and Kompella, Ramana Rao and Liu, Sijia},
  journal={arXiv preprint arXiv:2605.19282},
  year={2026}
}

Acknowledgements

This codebase builds on the excellent Muon optimizer, Flash-Muon, VLA-Adapter, VLANeXt, openpi, and verl.

Contributors

Chongyu Fan
