Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Authors: Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, Christopher Potts
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Therefore, we introduce AXBENCH, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means, perform the best. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, Stanford University; ²Pr(AI)²R Group. Correspondence to: Zhengxuan Wu <EMAIL>, Aryaman Arora <EMAIL>. |
| Pseudocode | No | The paper describes methods such as DiffMean, PCA, LAT, Linear Probe, SSV, ReFT-r1, and SAE using mathematical formulas and textual explanations (e.g., equations 1-15), but it does not present them in structured pseudocode or algorithm blocks with explicit control flow statements. |
| Open Source Code | Yes | github.com/stanfordnlp/axbench. We open-source all of our datasets and trained dictionaries at https://huggingface.co/pyvene. |
| Open Datasets | Yes | We synthetically generate training and validation datasets (see 3.1) for 500 concepts, which we release as CONCEPT500. [...] We additionally release training and evaluation datasets for all 16K concepts in Gemma Scope as the CONCEPT16K dataset suite. |
| Dataset Splits | Yes | We construct a small training dataset $D_{\text{train}} = \{(x^{+}_{c,i}, y^{+})\}_{i=1}^{n/2} \cup \{(x^{-}_{c,i}, y^{-})\}_{i=1}^{n/2}$ with $n$ examples and a concept detection evaluation dataset $D_{\text{concept}}$ of the same structure and harder examples, where $y^{+}$ and $y^{-}$ are binary labels indicating whether the concept $c$ is present. We set $n = 144$ for our main experiments. [...] For each concept, we include 144 examples for training and 72 samples for evaluating concept detection. |
| Hardware Specification | No | The paper mentions evaluating methods on "Gemma-2-2B and 9B" models, but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud configurations) used to run the experiments. |
| Software Dependencies | No | The paper mentions using "pyvene" and "PyTorch" as well as "sklearn.decomposition.PCA" and "AdamW", but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | To ensure a fair comparison, we perform separate hyperparameter-tuning for each method that requires training. For each method, we conduct separate hyperparameter-tuning on a small CONCEPT10 Dataset containing training and testing datasets only for 10 concepts. [...] Table 8 and Table 9 show hyperparameter settings for methods that require training. [...] We minimise the loss with AdamW with a linear scheduler for all methods that require training. |
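
The training-set layout quoted under Dataset Splits (n/2 positive and n/2 negative examples per concept, with n = 144) can be illustrated with a minimal Python sketch. The function name `build_concept_dataset` and the placeholder strings are hypothetical and are not taken from the AxBench codebase; the sketch only shows the balanced structure, not how the paper generates its examples.

```python
from typing import List, Tuple

Example = Tuple[str, int]  # (text, binary label: 1 = concept present, 0 = absent)


def build_concept_dataset(
    positives: List[str], negatives: List[str], n: int = 144
) -> List[Example]:
    """Return n labelled examples: n/2 positive followed by n/2 negative."""
    if n % 2 != 0:
        raise ValueError("n must be even so the two halves are balanced")
    half = n // 2
    if len(positives) < half or len(negatives) < half:
        raise ValueError("need at least n/2 examples of each class")
    pos = [(text, 1) for text in positives[:half]]
    neg = [(text, 0) for text in negatives[:half]]
    return pos + neg


# Placeholder texts stand in for real concept-bearing and concept-free examples.
train = build_concept_dataset(
    [f"positive example {i}" for i in range(72)],
    [f"negative example {i}" for i in range(72)],
)
```

With the default `n = 144`, `train` holds 72 examples of each label, matching the per-concept training size reported in the table.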