Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The Elicitation Game: Evaluating Capability Elicitation Techniques

Authors: Felix Hofstätter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, Francis Rhys Ward

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we evaluate the effectiveness of capability elicitation techniques by intentionally training model organisms: language models with hidden capabilities that are revealed by a password. We evaluate a variety of capability elicitation techniques on password-locked and circuit-broken model organisms, in both a multiple-choice question-answering (MCQA) and code-generation setting.
Researcher Affiliation Collaboration 1: Independent; 2: Imperial College London. Correspondence to: Felix Hofstätter <EMAIL>.
Pseudocode No The paper includes mathematical formulas for LCB and Lretain but does not present any clearly labeled pseudocode or algorithm blocks. Descriptions of methods are given in paragraph form.
Open Source Code Yes We publish our code and results at https://github.com/Felhof/sandbagging-elicitation
Open Datasets Yes MCQA. We use the multiple-choice question-answering Weapons of Mass Destruction Proxy (WMDP) and Massive Multitask Language Understanding (MMLU) benchmarks. The WMDP benchmark consists of questions on knowledge that is useful for bioterrorism, chemical... (Li et al., 2024b). MMLU, on the other hand, evaluates world knowledge and problem-solving ability across diverse subjects, including STEM, humanities, and social sciences (Hendrycks et al., 2020). Code-generation. Additionally, we use the open-ended code generation task APPS, where each problem includes a natural language description, test cases, and solutions (Hendrycks et al., 2021).
Dataset Splits Yes We use 3668 and 1402 samples from the training partitions of WMDP and MMLU, respectively. 50% of the samples are to train model organisms, 10% to evaluate model organisms, 25% to train elicitation techniques (if needed), 5% to evaluate elicitation quality, and a final 10% as a test set (WMDP example shown below). For APPS, we use 3000 samples to train the model organisms, 1000 to train elicitation techniques, 500 for evaluation, and 200 for testing.
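The percentage-based splits reported above can be sanity-checked with a short sketch. This is not the authors' code; the function name and split labels are illustrative, and simple rounding is assumed (the WMDP fractions happen to round to exactly 3668 samples).

```python
# Hypothetical helper to turn a sample count and split fractions
# into per-split sample counts; labels are illustrative only.
def split_counts(n_samples, fractions):
    """Return {split_name: sample_count} for fractions that sum to 1."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return {name: round(n_samples * frac) for name, frac in fractions.items()}

# WMDP fractions as described in the paper's split scheme.
wmdp_fractions = {
    "train_model_organisms": 0.50,
    "eval_model_organisms": 0.10,
    "train_elicitation": 0.25,
    "eval_elicitation": 0.05,
    "test": 0.10,
}

counts = split_counts(3668, wmdp_fractions)
print(counts)  # e.g. 1834 samples to train model organisms
```

For MMLU's 1402 samples the same fractions do not divide evenly, so the paper's exact per-split counts would depend on an unstated rounding choice.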
Hardware Specification Yes We use https://www.runpod.io/ to rent GPUs, and we typically use A40s.
Software Dependencies No The paper mentions using the `steering-vectors` library but does not provide specific version numbers for this or any other key software components or programming languages used for the experiments.
Experiment Setup Yes We set temperature to 0 for all experiments. Appendix B, D: Table 6. Mistral-base & Mistral-instruct hyperparameters for WMDP circuit-breaking. (Learning rate 1e-4, Effective batch size (batch size x gradient accumulation steps) 16 (1 x 16), Epochs 1, alpha 5., k 32, c_r_max 4.875, c_s_min 0.125)... and similar tables (Table 7-13) for other models and tasks, specifying learning rates, batch sizes, epochs, and other parameters.
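The hyperparameters quoted from Table 6 can be collected into a single config object for reference. This is a minimal sketch, not the authors' training code: the class and field names are assumptions, and only the Mistral/WMDP circuit-breaking values listed above are filled in.

```python
# Illustrative config mirroring the paper's Table 6 (Mistral, WMDP
# circuit-breaking); field names are hypothetical, values are quoted.
from dataclasses import dataclass

@dataclass
class CircuitBreakingConfig:
    learning_rate: float = 1e-4
    batch_size: int = 1
    grad_accumulation_steps: int = 16  # effective batch size = 1 x 16
    epochs: int = 1
    alpha: float = 5.0
    k: int = 32
    c_r_max: float = 4.875
    c_s_min: float = 0.125
    temperature: float = 0.0  # temperature is 0 for all experiments

cfg = CircuitBreakingConfig()
print(cfg.batch_size * cfg.grad_accumulation_steps)  # effective batch size: 16
```

Tables 7-13 would supply analogous values for the other models and tasks.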