Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Improved Representation Steering for Language Models
Authors: Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D Manning, Chris Potts
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train three parameterizations of RePS and evaluate them on AXBENCH, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting while promoting interpretability and minimizing parameter count. |
| Researcher Affiliation | Academia | Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D. Manning, Christopher Potts; Stanford University |
| Pseudocode | No | The paper describes the RePS training objectives using mathematical formulas in Section 3.3, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states in the 'Open access to data and code' checklist: 'Yes, and we also provide our source code and data for reproducing all results. We will release our codebase upon publication.' This indicates future release, not concrete access at the time of publication. |
| Open Datasets | Yes | We adapt CONCEPT500 from AXBENCH to evaluate various methods. CONCEPT500 consists of four subsets, each containing paired training data for 500 concepts curated based on auto-interpreted SAE features from different Gemma-2 models. These concept lists are available at https://www.neuronpedia.org. Additionally, Appendix T lists licenses for 'AXBENCH datasets', 'Alpaca-Eval v1.0 [Li et al., 2023] dataset', 'Dolly-15K [Conover et al., 2023] dataset', 'GSM8K [Cobbe et al., 2021] dataset', and 'Code-Alpaca dataset'. |
| Dataset Splits | Yes | Formally, each subset of the CONCEPT500 dataset consists of n pairs of input instruction and response in natural language, $\mathcal{D}_{\text{AXBENCH}} = \{(x_i, y^c)\}_{i=1}^{n/2} \cup \{(x_j, y)\}_{j=1}^{n/2}$, where $y^c$ and $y$ denote responses with and without the steering concept $c$, and $n = 144$. ... In total, we have 72 training pairs for each subset. ... For each concept seen during training, we randomly sample 10 instructions from Alpaca-Eval ... We partition these 10 instructions into two equally-sized sets, selecting the best factor from one set and evaluating it on the holdout set. |
| Hardware Specification | Yes | Our experiments with Gemma-2 models are conducted on nodes equipped with NVIDIA RTX A6000 (49.1 GB), NVIDIA A100-SXM4-80GB (81.9 GB), or NVIDIA H200 (143.8 GB) GPUs. |
| Software Dependencies | No | The paper mentions using the 'Huggingface transformers library' and 'DSPy [Khattab et al., 2024] and MIPRO [Opsahl-Ong et al., 2024]' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | To ensure a fair comparison of these training objectives, we perform budget-controlled hyperparameter-tuning experiments for each objective and method pair with a small development set. For each experiment, we perform grid search optimizing for the best combination of intervening layers, batch size, learning rate, epoch number, and dropout rate. ... Table 5 and Table 7 provide detailed hyperparameter search grids and specific settings for batch size, LR, epochs, and dropout rate. |
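
The held-out factor selection quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `select_factor_with_holdout`, the `score_fn` callback, the `seed` parameter, and the use of a mean score are all hypothetical. It relies only on what the row states: the sampled instructions are partitioned into two equally sized sets, the best steering factor is chosen on one set, and that factor is evaluated on the other.

```python
import random
from statistics import mean
from typing import Callable, Sequence, Tuple


def select_factor_with_holdout(
    instructions: Sequence[str],
    factors: Sequence[float],
    score_fn: Callable[[str, float], float],  # hypothetical scorer: (instruction, factor) -> score
    seed: int = 0,
) -> Tuple[float, float]:
    """Pick the best factor on one half of the instructions, score it on the other half."""
    if len(instructions) % 2 != 0:
        raise ValueError("need an even number of instructions for two equal halves")

    # Shuffle a copy so the caller's sequence is left untouched.
    rng = random.Random(seed)
    shuffled = list(instructions)
    rng.shuffle(shuffled)

    half = len(shuffled) // 2
    selection_set, holdout_set = shuffled[:half], shuffled[half:]

    # Choose the factor with the highest mean score on the selection half.
    best_factor = max(
        factors,
        key=lambda f: mean(score_fn(x, f) for x in selection_set),
    )

    # Report the chosen factor's mean score on the holdout half only.
    holdout_score = mean(score_fn(x, best_factor) for x in holdout_set)
    return best_factor, holdout_score
```

With 10 instructions, as in the quoted setup, each half holds 5 instructions. Keeping the holdout half out of the `max` step is what prevents the reported score from being inflated by the selection itself.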