Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
DISCO: Disentangled Communication Steering for Large Language Models
Authors: Max Torop, Aria Masoomi, Masih Eskandar, Jennifer Dy
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that query and value spaces exhibit linear concept discriminability, with a higher portion doing so compared to attention head outputs. We empirically show that DISCO Steering achieves the best performance in 13/16 experiments across multiple datasets and baselines in LLaMA 3.1 8B and Gemma 2 9B. |
| Researcher Affiliation | Academia | Max Torop, Aria Masoomi, Masih Eskandar, Jennifer Dy. Northeastern University. Correspondence to EMAIL |
| Pseudocode | Yes | Algorithm 1 DISCO-LAYER (STEER Q/V REPRESENTATIONS IN A LAYER) |
| Open Source Code | Yes | Our code is publicly available at https://github.com/MaxTorop/DISCO. |
| Open Datasets | Yes | TruthfulQA [26] (TQA) consists of questions designed to elicit responses which reflect common myths, misconceptions or conspiracy theories (e.g., "What UFO report was later proved to involve aliens?"). Each question is paired with lists of correct and incorrect answers, as well as a single "Best Correct Answer" which represents the most helpful correct answer. ... This version of TruthfulQA is released under an Apache 2.0 license and can be found at https://github.com/sylinrl/TruthfulQA/blob/main/TruthfulQA.csv. The Power, Wealth and Corr datasets come from the Model-Written Evaluation suite [35]. Questions in these datasets present choices, where answers entail either a preference for or aversion to the behavior in question. ... These datasets are released under a CC BY 4.0 license. Power-Seeking and Wealth-Seeking, which were formatted in Cao et al. [9], can be found at https://github.com/CaoYuanpu/BiPO/tree/main/data, while Corrigibility can be found at https://github.com/anthropics/evals/blob/main/advanced-ai-risk/human_generated_evals/corrigible-less-HHH.jsonl. |
| Dataset Splits | Yes | We split each dataset into train/validation/test sets where train corresponds to the positive and negative examples used for steering vector estimation (see App. D for details on our data splits). App. D.1 TruthfulQA: The new version of TruthfulQA contains 791 questions which we split into training/validation/testing sets of 376/171/243. App. D.2 Power-Seeking, Corrigibility and Wealth-Seeking: We create training/validation/testing splits for each dataset. For Power-Seeking we partition 840 questions into 115/102/623, for Wealth-Seeking we partition 822 questions into 105/105/612 and for Corrigibility 350 questions (following manual inspection we filtered 1 out, for which the question consisted only of the number 0, from an initial set of 351) into 70/101/179. |
| Hardware Specification | Yes | Each experiment is run on one NVIDIA A6000 (48GB) or A100 (80GB) GPU. An NVIDIA A6000 (48GB) was used to obtain these numbers. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA x.x). |
| Experiment Setup | Yes | We select α for each method from a set of over 20 values (see App. F). For attention head based methods (ITI, DISCO) all searches are done using sets of top k heads, where k is a hyperparameter. For DISCO-QV, we use the k values found for DISCO-V and DISCO-Q. For the layer based methods, we search using both the most discriminative layer and all layers. We determine α, k and the best layer using the validation set. We report mean scores over samples for all metrics and use GPT-4o as the LLM Judge [20]. We use a temperature of 0 for all steering methods and the LLM Judge. App. F Hyperparameter Search and Selected Values: For layer based methods... For attention-head based methods... We select a final k and α with the best performance. ... All hyperparameters found for LLaMA 3.1 8B are shown in Table 7, those for Gemma 2 9B are shown in Table 8. ... LLaMA 3.1 8B (Batch Size: 15) ... Gemma 2 9B (Batch Size: 3) |
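
The Experiment Setup row describes choosing α and k by validation performance. A minimal sketch of that kind of grid search follows, under stated assumptions: the function names, the grid values, and the toy scoring function are all hypothetical placeholders, not the paper's code or its actual search grid. In practice the scoring function would steer the model with the given α and top-k heads and return a mean validation metric.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def select_alpha_and_k(
    alphas: Iterable[float],
    ks: Iterable[int],
    validation_score: Callable[[float, int], float],
) -> Tuple[float, int, float]:
    """Return the (alpha, k, score) triple with the highest validation score."""
    best = None
    for alpha, k in product(alphas, ks):
        score = validation_score(alpha, k)
        # Ties keep the earliest configuration in grid order.
        if best is None or score > best[2]:
            best = (alpha, k, score)
    if best is None:
        raise ValueError("empty search grid")
    return best


def toy_validation_score(alpha: float, k: int) -> float:
    # Hypothetical stand-in for steering the model and scoring the validation
    # set; it peaks at alpha = 2.0, k = 16 purely for illustration.
    return -((alpha - 2.0) ** 2) - 0.01 * (k - 16) ** 2


# Placeholder grids (assumptions): 24 alpha values and three head counts.
alpha_grid = [0.5 * i for i in range(1, 25)]
k_grid = [8, 16, 32]

best_alpha, best_k, best_score = select_alpha_and_k(
    alpha_grid, k_grid, toy_validation_score
)
print(best_alpha, best_k)
```

Because selection uses only the validation score, the test split stays untouched until the final evaluation, which matches the train/validation/test separation listed in the Dataset Splits row.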