Fork: Model Organism Interp

This is a research fork of decoderesearch/SAELens, repurposed as the model-organism-interp project. It uses the upstream SAELens library as a foundation and adds an analysis pipeline for studying how fine-tuning shifts SAE feature usage in quirked model organisms (e.g. Gemma 3 1B IT variants trained to talk about submarines in military contexts, or to express Italian-food preferences).

What this fork adds on top of upstream:

scripts/model_organism_interp_analysis/ — full pipeline (per-MO feature analysis, sibling-diff variants, LLM-based feature judging via OpenRouter, cross-MO noise floors, paper plots). See scripts/model_organism_interp_analysis/README.md for the full overview and command reference.
uv-based dependency management (uv.lock, pyproject.toml rewrite) with a CUDA symlink fix-up script at scripts/fix_cuda_libs.sh.
CLAUDE.md — project guidance for Claude Code.

Setup

On a new machine, install dependencies with uv sync, then run the CUDA fix-up script once:

uv sync
bash scripts/fix_cuda_libs.sh   # symlinks system CUDA .so files into the venv

Auth

Both credentials are required before running the analysis: Gemma 3 is a gated model, so downloading it needs a Hugging Face token, and the LLM judge calls OpenRouter. Copy .env.example to .env and fill in both keys:

HF_TOKEN=<your_token>
OPENROUTER_API_KEY=<your_key>
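Before starting a long run, it can help to confirm that both keys are actually filled in. The following is a minimal sketch, not part of the fork: the helper names parse_env and missing_keys are hypothetical, and it assumes .env uses plain KEY=VALUE lines as shown above. How the pipeline itself loads .env is not specified here.

```python
from pathlib import Path

# Key names taken from the .env example above.
REQUIRED_KEYS = ("HF_TOKEN", "OPENROUTER_API_KEY")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still a <placeholder>."""
    missing = []
    for key in required:
        value = values.get(key, "")
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(key)
    return missing


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    problems = missing_keys(parse_env(text))
    print("missing:", problems if problems else "none")
```

Run it from the repository root. A key still set to a placeholder such as <your_key> is reported as missing.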

Run the full pipeline

bash scripts/model_organism_interp_analysis/run_binary_pipeline.sh

For per-step commands, judge options, regeneration flags, sibling pipeline, plots, and exports, see scripts/model_organism_interp_analysis/README.md.

Serve results locally

python3 -m http.server 8080 --directory results
# then open e.g.: http://localhost:8080/military_submarine_binary/runs/<run>_feature_analysis.html
#                 http://localhost:8080/italian_food_binary/runs/<run>_feature_analysis.html
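To avoid typing run names by hand, the report URLs can be listed from the results directory. This is a small sketch under the assumption that reports follow the results/<organism>/runs/<run>_feature_analysis.html layout implied by the example URLs above. The function name report_urls is hypothetical and not part of the fork.

```python
from pathlib import Path


def report_urls(results_dir="results", port=8080):
    """Build localhost URLs for every *_feature_analysis.html report found."""
    root = Path(results_dir)
    urls = []
    # Assumed layout: <results_dir>/<organism>/runs/<run>_feature_analysis.html
    for report in sorted(root.glob("*/runs/*_feature_analysis.html")):
        rel = report.relative_to(root).as_posix()
        urls.append(f"http://localhost:{port}/{rel}")
    return urls


if __name__ == "__main__":
    for url in report_urls():
        print(url)
```

The port must match the one passed to http.server. If the results directory does not exist yet, the script prints nothing.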

The upstream SAELens README follows below, describing the underlying library.

SAE Lens

SAELens exists to help researchers:

Train sparse autoencoders.
Analyse sparse autoencoders / research mechanistic interpretability.
Generate insights which make it easier to create safe and aligned AI systems.

SAELens inference works with any PyTorch-based model, not just TransformerLens. While we provide deep integration with TransformerLens via HookedSAETransformer, SAEs can be used with Hugging Face Transformers, NNsight, or any other framework by extracting activations and passing them to the SAE's encode() and decode() methods.
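The framework-agnostic flow described above can be sketched in plain PyTorch. Everything here is illustrative: ToySAE is a hypothetical, randomly initialised stand-in that only mirrors the encode()/decode() interface (it is not the SAELens class), and the tiny nn.Sequential model stands in for a real language model. The point is the pattern: capture activations with a forward hook, then pass them through encode() and decode().

```python
import torch
from torch import nn


class ToySAE(nn.Module):
    """Hypothetical stand-in exposing encode()/decode(); not the SAELens SAE."""

    def __init__(self, d_in, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_in, d_sae)
        self.dec = nn.Linear(d_sae, d_in)

    def encode(self, x):
        # ReLU keeps feature activations non-negative.
        return torch.relu(self.enc(x))

    def decode(self, features):
        return self.dec(features)


torch.manual_seed(0)

# Toy "model": the output of its first layer plays the role of the hooked activations.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

captured = {}


def save_output(module, inputs, output):
    # Forward hooks receive (module, inputs, output); store a detached copy.
    captured["acts"] = output.detach()


handle = model[0].register_forward_hook(save_output)
with torch.no_grad():
    model(torch.randn(2, 5, 8))  # (batch, sequence, d_model_in)
handle.remove()

sae = ToySAE(d_in=16, d_sae=64)
with torch.no_grad():
    features = sae.encode(captured["acts"])  # (2, 5, 64) feature activations
    reconstruction = sae.decode(features)    # (2, 5, 16), same shape as the input

print(features.shape, reconstruction.shape)
```

With a real pre-trained SAE, the hooked layer and the activation width would need to match the hook point and input dimension the SAE was trained on.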

Please refer to the documentation for information on how to:

Download and analyse pre-trained sparse autoencoders.
Train your own sparse autoencoders.
Generate feature dashboards with the SAE-Vis Library.

SAE Lens is the result of many contributors working collectively to improve humanity's understanding of neural networks, many of whom are motivated by a desire to safeguard humanity from risks posed by artificial intelligence.

This library is maintained by Joseph Bloom, Curt Tigges, Anthony Duong and David Chanin.

Loading Pre-trained SAEs.

Pre-trained SAEs for various models can be imported via SAE Lens. See this page for a list of all SAEs.

Migrating to SAELens v6

The new v6 update is a major refactor to SAELens and changes the way training code is structured. Check out the migration guide for more details.

Tutorials

Join the Slack!

Feel free to join the Open Source Mechanistic Interpretability Slack for support!

Other SAE Projects

dictionary-learning: An SAE training library that focuses on having hackable code.
Sparsify: A lean SAE training library focused on TopK SAEs.
Overcomplete: SAE training library focused on vision models.
SAE-Vis: A library for visualizing SAE features, works with SAELens.
SAEBench: A suite of LLM SAE benchmarks, works with SAELens.

Citation

Please cite the package as follows:

@misc{bloom2024saetrainingcodebase,
   title = {SAELens},
   author = {Bloom, Joseph and Tigges, Curt and Duong, Anthony and Chanin, David},
   year = {2024},
   howpublished = {\url{https://github.com/decoderesearch/SAELens}},
}
