Fix weight decay #6

Open
Niccolo-Ajroldi wants to merge 2 commits into athms:main from Niccolo-Ajroldi:fix_weight_decay

Niccolo-Ajroldi commented Dec 5, 2024

Contributor

tl;dr: applies weight decay only to a subset of parameters, excluding Mamba's `A_log` and `D`, as well as biases and normalization-layer parameters.
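As a rough illustration of the idea (not the code in this PR), parameters can be split into two optimizer groups by name and shape. The function name `split_decay_groups`, the `NO_DECAY_LEAF_NAMES` set, and the "`ndim <= 1` means bias or normalization parameter" heuristic are assumptions made for this sketch:

```python
from typing import Any, Dict, Iterable, List, Tuple

# Assumption for this sketch: parameters whose last name component is one of
# these never receive weight decay (Mamba's A_log and D, plus biases).
NO_DECAY_LEAF_NAMES = {"A_log", "D", "bias"}


def split_decay_groups(
    named_params: Iterable[Tuple[str, Any]],
    weight_decay: float,
) -> List[Dict[str, Any]]:
    """Split (name, parameter) pairs into decay / no-decay optimizer groups.

    Each parameter only needs an `ndim` attribute and, optionally,
    `requires_grad`, so the result of `model.named_parameters()` fits.
    """
    decay, no_decay = [], []
    for name, param in named_params:
        # Frozen parameters are not optimized at all.
        if not getattr(param, "requires_grad", True):
            continue
        leaf = name.rsplit(".", 1)[-1]
        # Heuristic (assumption): 0-D and 1-D tensors are biases or
        # normalization scales, so they are excluded together with the
        # explicitly named Mamba parameters.
        if leaf in NO_DECAY_LEAF_NAMES or param.ndim <= 1:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

With PyTorch, the returned list could be passed directly as parameter groups, for example `torch.optim.AdamW(split_decay_groups(model.named_parameters(), 0.1), lr=1e-3)`; the learning rate and decay values here are placeholders.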

Fixes #5

For a more thorough discussion, see #5

Zymrael commented Dec 25, 2024

Collaborator

Thanks! Similar changes to weight decay should actually also be applied to other operator primitives, depending on your objective. Can you report if you observe differences in scores (even a simple representative task is ok) with and without your PR?

Niccolo-Ajroldi commented Feb 9, 2025

Contributor Author

@Zymrael sorry, I have no bandwidth to run those tests.

Development

Successfully merging this pull request may close these issues.

[bug 🐛] weight decay incorrectly applied to LayerNorm and Mamba A, D parameters