Chongyu Fan† Gaowen Liu‡ Mingyi Hong¶ Ramana Rao Kompella‡ Sijia Liu†,§
†Michigan State University ‡Cisco ¶University of Minnesota §IBM Research
This is the official code repository for the paper "Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR", which introduces Pion (sPectral hIgh-pass Optimization on momeNtum) -- a drop-in replacement for Muon designed for regimes such as vision-language-action (VLA) training and reinforcement learning with verifiable rewards (RLVR). See the project page for more.
Figure panels: (a) Muon NS, (b) Promotion $f_p$, (c) Suppression $f_s$, (d) High-pass NS.

Visualization of $f(\sigma)$ over $\sigma \in [0, 1]$, with $f(\sigma) = \sigma$ shown as the identity reference. (a) $f_{\mathrm{NS}}^{t}$ denotes Muon's NS iteration applied $t$ times. (b) $f_p^{t}$ denotes the Promotion polynomial $f_p$ applied $t$ times. (c) $f_s^{t}$ denotes the Suppression polynomial $f_s$ applied $t$ times. (d) Pion's high-pass NS iteration: $f_s^{k_s} \circ f_p^{k_p}$ applies $k_p$ Promotion steps followed by $k_s = 5 - k_p$ Suppression steps.
Muon (MomentUm Orthogonalized by Newton–Schulz) is a matrix-aware optimizer that uses Newton–Schulz (NS) iterations to orthogonalize the momentum matrix, driving all of its singular values toward 1. While this uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, we show that it can lead to fundamental limitations beyond pretraining in two increasingly important regimes: (i) cross-modality vision-language-action (VLA) training, where inherently low-rank action-module gradients cause amplification of noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization inherited from prior training make whitening unstable.

To address these challenges, we propose Pion (sPectral hIgh-pass Optimization on momeNtum), a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion + Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost.

Extensive experiments demonstrate consistent gains over Muon and AdamW across both VLA and RLVR regimes. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across ℓ1-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures, e.g., reaching a 100% success rate on LIBERO Object at 1,500 training steps with VLA-Adapter, vs. 97.0% for Muon and only 32.2% for AdamW. In RLVR post-training on Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K, while Muon collapses to zero.
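To make the whitening behaviour described above concrete, the sketch below tracks a single singular value through repeated NS steps. It is an illustration only: it assumes the quintic coefficients of the public Muon reference implementation (the values used in this repository may differ), and it does not reproduce Pion's Promotion or Suppression polynomials, which are not listed here. The names `ns_step` and `ns_iterate` are hypothetical.

```python
# Scalar view of Muon's Newton-Schulz (NS) iteration.
# For the odd quintic X <- a*X + b*(X X^T) X + c*(X X^T)^2 X, each singular
# value sigma of X evolves as sigma <- a*sigma + b*sigma**3 + c*sigma**5.
# ASSUMPTION: coefficients come from the public Muon reference implementation;
# this repository may use different values.
MUON_NS_COEFFS = (3.4445, -4.7750, 2.0315)


def ns_step(sigma, coeffs=MUON_NS_COEFFS):
    """Apply one NS polynomial step to a single singular value."""
    a, b, c = coeffs
    return a * sigma + b * sigma**3 + c * sigma**5


def ns_iterate(sigma, steps=5, coeffs=MUON_NS_COEFFS):
    """Apply the NS polynomial `steps` times to a single singular value."""
    for _ in range(steps):
        sigma = ns_step(sigma, coeffs)
    return sigma


if __name__ == "__main__":
    # Singular values in [0, 1], as after normalizing the momentum matrix.
    for sigma in (0.01, 0.1, 0.5, 1.0):
        print(f"sigma={sigma:<5} -> after 5 NS steps: {ns_iterate(sigma):.3f}")
```

Under these assumed coefficients, even a tail value such as 0.01 ends up in the same rough band around 1 as the dominant values, which is the tail amplification that a high-pass iteration is meant to avoid.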
Pion/
├── VLA/                        # Vision-Language-Action experiments
│   ├── VLAAdapter/             # VLA-Adapter
│   │   └── pion_optim/         # Muon / DefaultPion / LowRankMuon
│   ├── VLANeXt/                # VLANeXt
│   │   └── pion_optim/         # Muon / DefaultPion
│   └── openpi/                 # π0.5 on real Franka FR3
│       └── src/openpi/training/muon_optim.py
│                               # MuonAdamW / DefaultPionAdamW
└── RL/                         # RLVR experiments
    └── verl/                   # GRPO + GMPO on Qwen3-1.7B / 4B, GSM8K + MATH
        └── verl/utils/muon.py
                                # MuonAdamW / DefaultPionAdamW / PerHeadMuonAdamW / PerHeadPionAdamW
Across all sub-repos we maintain the same five optimizer families,
each offered in a base form, an AdamW-fused form, or both. Which form
a sub-repo ships depends on whether its training framework can hold
multiple `torch.optim.Optimizer` instances at once:
- `VLA-Adapter` / `VLANeXt` drive several optimizers in the same training loop (one Muon / Pion instance per modality bucket plus a `torch.optim.AdamW` for the 1-D / embedding / output-head bucket), so they ship the base classes and let the trainer call `step()` on each.
- `openpi` and `verl` are wrapped by frameworks (openpi's `Trainer`, verl's Hydra + FSDP2 config) that expose only a single optimizer slot per model; on those we ship the AdamW-fused variants, which apply the Muon / Pion polynomial to `ndim ≥ 2` parameters and AdamW to `ndim < 2` parameters inside one `step()` call.
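The `ndim` routing rule used by the AdamW-fused variants can be sketched as follows. This is a minimal illustration, not the code shipped in the sub-repos: `route_params` and `FakeParam` are hypothetical names, and the stand-in objects only carry the `.ndim` attribute that a real `torch.Tensor` also exposes.

```python
from collections import namedtuple

# Stand-in for a tensor: only the .ndim attribute matters for routing.
# ASSUMPTION: real parameters expose .ndim the same way torch tensors do.
FakeParam = namedtuple("FakeParam", ["ndim"])


def route_params(named_params):
    """Split (name, param) pairs into a matrix bucket and an AdamW bucket.

    Parameters with ndim >= 2 go to the Muon / Pion bucket; parameters with
    ndim < 2 (biases, norm scales) go to the AdamW bucket.
    """
    matrix_bucket, adamw_bucket = [], []
    for name, param in named_params:
        if param.ndim >= 2:
            matrix_bucket.append(name)
        else:
            adamw_bucket.append(name)
    return matrix_bucket, adamw_bucket


if __name__ == "__main__":
    params = [
        ("attn.q_proj.weight", FakeParam(ndim=2)),
        ("attn.q_proj.bias", FakeParam(ndim=1)),
        ("norm.weight", FakeParam(ndim=1)),
    ]
    matrix_names, adamw_names = route_params(params)
    print("Muon / Pion:", matrix_names)
    print("AdamW:", adamw_names)
```

The sketch covers only the dimensionality rule. In the multi-optimizer setups, embeddings and output heads are also placed in the AdamW bucket even though they are 2-D, which would need an additional name-based filter.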
Each sub-repo only ships the variants its recipes actually use (see the tree above).
| Algorithm | Base class | AdamW-fused class |
|---|---|---|
| Muon (NS on the whole matrix) | `Muon` | `MuonAdamW` |
| Muon (NS per attention head) | — | `PerHeadMuonAdamW` |
| Pion (high-pass NS on the whole matrix) | `DefaultPion` | `DefaultPionAdamW` |
| Pion (high-pass NS per attention head) | — | `PerHeadPionAdamW` |
| LowRankMuon | `LowRankMuon` | — |
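The difference between the whole-matrix and per-head rows in the table above can be illustrated with a toy sketch. It assumes that attention heads occupy contiguous, equal-sized blocks of rows in the weight matrix, which may not match the layout used in this repository, and it substitutes a simple Frobenius normalization for the real NS update. `apply_per_head` and `normalize_block` are hypothetical names.

```python
def apply_per_head(matrix, num_heads, update_fn):
    """Apply update_fn independently to each head's block of rows.

    ASSUMPTION: heads are contiguous, equal-sized row blocks of `matrix`
    (a list of rows); the actual head layout in the repo may differ.
    """
    rows = len(matrix)
    if rows % num_heads != 0:
        raise ValueError("row count must be divisible by num_heads")
    head_dim = rows // num_heads
    out = []
    for h in range(num_heads):
        block = matrix[h * head_dim:(h + 1) * head_dim]
        out.extend(update_fn(block))
    return out


def normalize_block(block):
    """Toy stand-in for the matrix update: scale to unit Frobenius norm."""
    norm = sum(x * x for row in block for x in row) ** 0.5
    if norm == 0.0:
        return [list(row) for row in block]
    return [[x / norm for x in row] for row in block]


if __name__ == "__main__":
    # Two heads with very different scales, two rows each.
    momentum = [[3.0, 4.0], [0.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
    print(apply_per_head(momentum, num_heads=2, update_fn=normalize_block))
```

Because each block is processed on its own, a head with small-magnitude momentum is not rescaled relative to a head with large-magnitude momentum. With tensors, the same effect is typically obtained by reshaping to a `(num_heads, head_dim, d_in)` batch and running the update over the batch dimension.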
Each sub-repo is a pruned, vendored copy of an upstream training
codebase with the Pion optimizer wired in and three drop-in run scripts
(run_adamw.sh, run_muon.sh, run_pion.sh). See each sub-repo's
README.md for full environment setup, data preparation and run
commands.
The optimizers are not packaged as a top-level library; they live next to the training code that uses them. Pick your task and follow the sub-repo README:
| Sub-repo | Backbone / task |
|---|---|
| `VLA/VLAAdapter` | VLA-Adapter |
| `VLA/VLANeXt` | VLANeXt |
| `VLA/openpi` | π0.5 on Franka FR3 |
| `RL/verl` | GRPO / GMPO on Qwen3-1.7B / 4B with GSM8K + MATH |
Inside each sub-repo:
- `pion_optim/` (for `VLA-Adapter` / `VLANeXt`) or `*/utils/muon.py` / `*/training/muon_optim.py` (for `verl` / `openpi`) contains the optimizer implementations.
- `scripts/run_adamw.sh`, `scripts/run_muon.sh`, `scripts/run_pion.sh` are the three drop-in launchers.
If you find this work useful, please consider citing:
@article{fan2026rethinking,
title={Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR},
author={Fan, Chongyu and Liu, Gaowen and Hong, Mingyi and Kompella, Ramana Rao and Liu, Sijia},
journal={arXiv preprint arXiv:2605.19282},
year={2026}
}

This codebase builds on the excellent Muon optimizer, Flash-Muon, VLA-Adapter, VLANeXt, openpi, and verl.



